Mailing List Archive: Deprecate Schemaless Mode?

Deprecate Schemaless Mode?

marcuseagan at gmail

Aug 3, 2020, 8:32 AM

Post #1 of 16 (799 views)

Community,

There are many of us that have had to deal with the pain of managing the
schemaless mode of operation in Solr. I'm curious to get others' thoughts
about how well it is working for them and if they would like to continue to
use it.

I for one don't think Schemaless works as intended and favor deprecating it
and replacing it with something more usable, but I am sure others have
thoughts here.

Is anyone on this list using schemaless mode in production or have you
tried to?

A preliminary discussion has occurred in this Jira ticket:
https://issues.apache.org/jira/browse/SOLR-14701

Thank you all,

Marcus Eagan

Re: Deprecate Schemaless Mode? [ In reply to ]

gerlowskija at gmail

Aug 3, 2020, 10:41 AM

Post #2 of 16 (798 views)

> Is anyone on this list using schemaless mode in production or have you tried to?

Schemaless mode is one of a group of Solr features present for
convenience but not intended for production usage. It's in the same
boat as "bin/post", and SolrCell, and others. These features do cause
headaches when users ignore the documented restrictions and use them
for more than prototyping. But at the same time they're super
valuable for these sorts of demo-ing or getting-started use cases. An
easy getting-started experience is important, and schemaless et al.
serve a mostly positive role in that.

I think we'd better serve our users if we left schemaless
in/undeprecated, and instead focused on making it harder to
(unknowingly) use them in ways contrary to community recommendations.
Add louder warnings in the documentation (where not already present).
Add warnings to the Solr logs the first time these features are used.
Disable them by default (where that makes sense). Taken to the
extreme, we could even add a section into Solr's response that lists
non-production features used in serving a given request.

There are lots of ways to address the "feature X is trappy" problem
without removing X altogether.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: Deprecate Schemaless Mode? [ In reply to ]

tomasflobbe at gmail

Aug 3, 2020, 11:04 AM

Post #3 of 16 (798 views)

Agree with Jason. It's useful for prototyping and developing. I remember
seeing some warnings about it (in the logs?), but maybe we need more?

Re: Deprecate Schemaless Mode? [ In reply to ]

jan.asf at cominvent

Aug 3, 2020, 11:31 AM

Post #4 of 16 (798 views)

I’m against deprecating it.

Can we rename the feature as SchemaGuessing or FieldGuessing mode? That would set expectations right from the start.

You may want to ask the user community too, but ask if they use it in development, and if they like it, since it is not made for prod use :)

Jan Høydahl

Re: Deprecate Schemaless Mode? [ In reply to ]

gus.heck at gmail

Aug 3, 2020, 11:39 AM

Post #5 of 16 (798 views)

I almost never use schemaless mode (better named "schema guessing mode")
and I would never recommend it for use beyond prototyping. The primary use
I see for it is to throw a bunch of data at it to get a starting point for
a schema. Say, for example, you want to see what Tika is going to produce
for metadata before solidifying what you will and will not rely on. I think
the ability to suggest a schema is valuable and shouldn't go away. I'm all
for not having it be the default configuration, however, and I really like
the suggestions linked in the ticket for features that consider a number of
documents before trying to guess the schema. If we implement one of
those I'd be for deprecation and eventual removal, but not before.
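
To make the data-order concern concrete, here is a minimal Python sketch
comparing a guess made from the first value seen with a guess made from a
sample of values. It is an illustration only, not Solr's actual
implementation: the type names, the widening order, and the function names
are all assumptions made up for this example.

```python
def guess_type(value: str) -> str:
    """Guess a field type from one string value (hypothetical rules)."""
    try:
        int(value)
        return "long"
    except ValueError:
        pass
    try:
        float(value)
        return "double"
    except ValueError:
        return "text"


# Assumed widening order: the least specific type seen wins.
_RANK = {"long": 0, "double": 1, "text": 2}


def guess_from_first(values):
    """Commit to whatever type the first value suggests."""
    return guess_type(values[0])


def guess_from_sample(values):
    """Look at every value and keep the widest type needed."""
    return max((guess_type(v) for v in values), key=_RANK.__getitem__)


values = ["42", "17", "3.5", "n/a"]
first_guess = guess_from_first(values)
sample_guess = guess_from_sample(values)
print(first_guess, sample_guess)
```

With this input the first-value guess commits to a numeric type that later
values do not fit, while the sampled guess falls back to text. Reordering
the input changes the first-value result but not the sampled one.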

The ticket contains a suggestion of adding a catch-all '*' dynamic field,
but we should make sure to indicate that that ALSO is not typically good
for production use, because one garbage (or malicious) document can explode
the number of fields in the index. It can also mean that forgetting to add a
properly typed field goes unnoticed until much further down the development
cycle (i.e. not caught until a user tries to sort on it and gets 1, 10, 11,
2, ...), and it causes dev churn due to data silently indexed into typo
variants, etc.
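
The "1, 10, 11, 2" ordering is what any lexicographic sort produces when
numbers are stored as text. A small Python sketch of the effect (plain
Python lists, not Solr; the sample values are made up):

```python
values = ["2", "10", "1", "11"]

# Sorting as strings compares character by character, so "10" < "2".
as_strings = sorted(values)

# Sorting with a numeric key gives the order users actually expect.
as_numbers = sorted(values, key=int)

print(as_strings)
print(as_numbers)
```

A field that falls through to a text type behaves like the first sort, and
nothing fails loudly until someone looks at the sorted results.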

Perhaps we should distribute more than one pre-baked config set and
label none of them as "default"? I'd suggest maybe

- guessing-proto --> our current _default possibly refined, for
prototyping
- dynamic-proto --> a schema based on dynamic fields with a * default to
text-general as an alternative prototyping tool less dependent on data
order, but requiring more editing
- managed-min --> A base on which to build a production quality managed
schema
- static-min --> A base on which to build a production quality classic
(non-managed) schema

Also +1 to renaming the feature away from "Schemaless" to "Schema Guessing"

-Gus

--
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)

Re: Deprecate Schemaless Mode? [ In reply to ]

anshum at anshumgupta

Aug 3, 2020, 11:39 AM

Post #6 of 16 (798 views)

+1 Jason.

Here's some context on how this came into being.

Users find it difficult to understand and create a basic schema when just
trying out Solr. This mode was supposed to help them bootstrap, and once
they had a better understanding of how things worked, they'd tune it before
using the schema in production.
This did improve the OTB experience for new users, but a lot of people
abused this convenience and used it in production, causing issues.

As Jason mentioned, we'd better serve our users if we left this feature for
the getting-started experience and added warnings (in UI and responses?) so
users would know what they are doing when they take this to production.

This feature isn't trappy unless people use it in ways it was not intended
to be used in. We just need to warn and educate people better.

--
Anshum Gupta

Re: Deprecate Schemaless Mode? [ In reply to ]

marcuseagan at gmail

Aug 3, 2020, 11:44 AM

Post #7 of 16 (798 views)

I know a person using it in production today. It's causing problems. They
could abandon Solr altogether. It seems like a schema creation wizard is
the right getting-started motion if we know that schemaless doesn't do what
people think it does. It's misleading. It's also a false representation of
how easy it is to get started when compared to other solutions on the
market. If schemaless is about supporting new use/adoption, it should
actually help that more than hurt it.

That's why I raised it. Re-branding this feature is like pig-lipsticking in
my mind, but you all have more experience than me and are committers. I
will defer to you for now. I am in favor of renaming the feature as the
minimum change that should happen.

Schemaless mode makes sense in a world where schemas are largely opaque
like IoT-telemetry or server logs. When you are searching data primarily
for human consumption, I think it is just a headache in a bottle. In the
cases of CSV and TSV, customers know the schema. I like to approach
designing software such that no one ever needs to talk to me. No
firefighting consulting is necessary, and you can skim the docs and proceed
safely. I understand others may not feel that way, but it is the future of
software.

I encourage everyone here to try the newer search systems that have been
released and are growing rapidly to inform your opinions on this topic. I
am doing that because it is the concrete poured to build the common ground
of the future.

--
Marcus Eagan

Re: Deprecate Schemaless Mode? [ In reply to ]

marcuseagan at gmail

Aug 3, 2020, 12:05 PM

Post #8 of 16 (798 views)

Furthermore, just to be clear, I opened a discussion about deprecating and
not replacing schemaless mode for two reasons:

(1) the pain it has inflicted on Solr users and the reputation of Solr;
deprecation logs speak volumes.
(2) to get a better understanding of what engineers and others in the
community use Schemaless for, to inform the design of its replacement.

At no point would I argue that a feature like Schemaless is unnecessary. It
was the first way I used Solr (the second time around; the first time I
tried it I built my company using Elasticsearch because of other issues). I
am of the opinion that "Schemaless Mode" has done more harm to Solr than
good in my limited experience with the feature. Heck, *I've only been
consulting for a week and it has already come up*. I acknowledge a very
small sample size.

I am curious as to your thoughts on these points. There are not lots of
people getting started with Solr today relative to the other solutions on
the market regardless of what you might assume. I am here to see if I can
change that through a shift in how we approach user experience and the
knowledge requisite to operate a production cluster. I hope no one takes
offense at my challenging how some community members think about what is a
good feature vs. what is a bad one.

Marcus

--
Marcus Eagan

Re: Deprecate Schemaless Mode? [ In reply to ]

marcuseagan at gmail

Aug 3, 2020, 12:50 PM

Post #9 of 16 (791 views)

Typo: I meant deprecate vs. remove, which we obviously cannot do.

On Mon, Aug 3, 2020 at 12:05 Marcus Eagan <marcuseagan@gmail.com> wrote:

> Furthermore, just to be clear, I opened a discussion about deprecating and
> not replacing schemaless mode for two reasons:
>
> (1) the pain it has inflicted on Solr users and reputation of Solr —
> deprecation logs speak volumes.
> (2) to get a better understanding of what engineers and others in the
> community use Schemaless for to inform the design of its replacement.
>
> At no point would I argue that a feature like Schemaless is unnecessary.
> It was the first way I used Solr (the second time around, the first time I
> tried it I built my company using Elasticsearch because of other issues). I
> am of the opinion that "Schemaless Mode" has done more harm to Solr than
> good in my limited experience with the feature. Heck, *I've only been
> consulting for a week and it has already come up*. I acknowledge a very
> small sample size.
>
> I am curious as to your thoughts on these points. There are not lots of
> people getting started with Solr today relative to the other solutions on
> the market regardless of what you might assume. I am here to see if I can
> change that through a shift in how we approach user experience and the
> knowledge requisite to operate a production cluster. I hope no one takes
> offense to me challenging how some community members think about what is a
> good feature vs what is a bad one.
>
> Marcus
>
>
>
>
> On Mon, Aug 3, 2020 at 11:44 AM Marcus Eagan <marcuseagan@gmail.com>
> wrote:
>
>> I know a person using it in production today. It's causing problems. They
>> could abandon Solr altogether. It seems like a schema creation wizard is
>> the right getting started motion if we know that schemaless doesn't do what
>> people think it does. It's misleading. It's also a false representation of
>> how easy it is to get started when compared to other solutions on the
>> market. If schemaless is about support new use/adoption, it should actually
>> help that more than hurt it.
>>
>> That's why I raised it. Re-branding this feature is like pig-lipsticking
>> in my mind, but you all have more experience than me and are committers. I
>> will defer to you for now. I am in favor on re-naming the feature as the
>> minimum change that should happen.
>>
>> Schemaless mode makes sense in a world where schemas are largely opaque
>> like IoT-telemetry or server logs. When you are searching data primarily
>> for human consumption, I think it is just a headache in a bottle. In the
>> cases of CSV and TSV, customers know the schema. I like to approach
>> designing software such that no one ever needs to talk to me. No
>> firefighting consulting is necessary, and you can skim the docs and proceed
>> safely. I understand others may not feel that way, but it is the future of
>> software.
>>
>> I encourage everyone here to try the newer search systems that have been
>> released and are growing rapidly to inform your opinions on this topic. I
>> am doing that because it is the concrete poured to build the common ground
>> of the future.
>>
>> On Mon, Aug 3, 2020 at 11:40 AM Anshum Gupta <anshum@anshumgupta.net>
>> wrote:
>>
>>> +1 Jason.
>>>
>>> Here's some context on how this came into being.
>>>
>>> Users find it difficult to understand and create a basic schema when
>>> just trying out Solr. This mode was supposed to help them bootstrap, and
>>> one they had a better understanding of how things worked, they'd tune it
>>> before using the schema in production.
>>> This did improve the OTB experience for new users, but a lot of people
>>> abused this convenience and used this in production causing issues.
>>>
>>> As Jason mentioned, we'd better serve our users if we left this feature
>>> for the getting started experience and add warnings (in UI and responses?)
>>> so users would know what they are doing when they take this to production.
>>>
>>> This feature isn't trappy unless people use it in ways it was not
>>> intended to be used in. We just need to warn and educate people better.
>>>
>>> On Mon, Aug 3, 2020 at 10:41 AM Jason Gerlowski <gerlowskija@gmail.com>
>>> wrote:
>>>
>>>> > Is anyone on this list using schemaless mode in production or have
>>>> you tried to?
>>>>
>>>> Schemaless mode is one of a group of Solr features present for
>>>> convenience but not intended for production usage. It's in the same
>>>> boat as "bin/post", and SolrCell, and others. These features do cause
>>>> headaches when users ignore the documented restrictions and use them
>>>> for more than prototyping. But at the same time they're super
>>>> valuable for these sort of demo-ing or getting-started use cases. An
>>>> easy getting-started experience is important, and schemaless et al
>>>> serve a mostly positive role in that.
>>>>
>>>> I think we'd better serve our users if we left schemaless
>>>> in/undeprecated, and instead focused on making it harder to
>>>> (unknowingly) use them in ways contrary to community recommendations.
>>>> Add louder warnings in the documentation (where not already present).
>>>> Add warnings to the Solr logs the first time these features are used.
>>>> Disable them by default (where that makes sense). Taken to the
>>>> extreme, we could even add a section into Solr's response that lists
>>>> non-production features used in serving a given request.
>>>>
>>>> There are lots of ways to address the "feature X is trappy" problem
>>>> without removing X together.
>>>>
>>>> On Mon, Aug 3, 2020 at 11:33 AM Marcus Eagan <marcuseagan@gmail.com>
>>>> wrote:
>>>> >
>>>> > Community,
>>>> >
>>>> > There are many of us that have had to deal with the pain of managing
>>>> the schemaless mode of operation in Solr. I'm curious to get others
>>>> thoughts about how well it is working for them and if they would like to
>>>> continue to use it.
>>>> >
>>>> > I for one don't think Schemaless works as intended and favor
>>>> deprecating it and replacing it with some more usable but I am sure others
>>>> have thoughts here.
>>>> >
>>>> > Is anyone on this list using schemaless mode in production or have
>>>> you tried to?
>>>> >
>>>> > A preliminary discussion has occurred in this Jira ticket:
>>>> https://issues.apache.org/jira/browse/SOLR-14701
>>>> >
>>>> > Thank you all,
>>>> >
>>>> > Marcus Eagan
>>>> >
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>>
>>>>
>>>
>>> --
>>> Anshum Gupta
>>>

--
Marcus Eagan

Re: Deprecate Schemaless Mode? [ In reply to ]

erickerickson at gmail

Aug 3, 2020, 2:03 PM

Post #10 of 16 (791 views)

Putting this up top so people will read it ;) Perhaps this is all just overthinking. Is the crux of the matter that schemaless is the default? Would it suffice to make it something that had to be explicitly enabled, rather than be something in solrconfig? In essence, flip the current way we do things, where we can _disable_ schemaless via "bin/solr config -c mycollection -p 8983 -action set-user-property -property update.autoCreateFields -value false", and instead have it off by default and require that people _enable_ it when desired?
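
For reference, the same user property can also be set over HTTP through Solr's Config API rather than the bin/solr wrapper. A minimal sketch, assuming a local node on port 8983 and a collection named mycollection (both borrowed from the command above); the helper name is made up for illustration, and the request is only built here, not sent:

```python
import json
import urllib.request


def build_autocreate_request(collection, enabled, base_url="http://localhost:8983/solr"):
    """Build (but do not send) a Config API request that sets update.autoCreateFields.

    Hypothetical helper; base_url and collection are assumptions for a local setup.
    """
    payload = {"set-user-property": {"update.autoCreateFields": str(enabled).lower()}}
    return urllib.request.Request(
        url=f"{base_url}/{collection}/config",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_autocreate_request("mycollection", enabled=False)
print(req.full_url)
print(req.data.decode("utf-8"))
# To actually apply it against a running node: urllib.request.urlopen(req)
```

Flipping the default, as suggested above, would simply mean that the call people have to learn is the one with enabled=True.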

I think my antipathy is rooted in the fact that OOB, Solr enables schemaless. New users then have to find, buried somewhere in the 1,500 pages of a ref guide they can’t search, a caution that you shouldn’t take Solr to production as it’s configured OOB. It’s far too easy to miss. At least if we required that people explicitly enable it they’d have some incentive to look at https://lucene.apache.org/solr/guide/8_5/schemaless-mode.html where we call out not using it in production. Currently there isn’t any incentive to understand anything about schemaless before blithely going to production.

OK, on to my antipathy, some of which directly contradicts the above….

Just because we have other “getting started” tools that aren’t recommended for production isn’t a justification for keeping something as problematic as schemaless. ExtractingRequestHandler is probably the closest in that it can unexpectedly blow up down the road. bin/post is reasonably safe, just inefficient.

Gus’s point about implementing something before removing it is well taken, but we can deprecate it immediately without removing it. Gus’s point about dynamic fields not being found until later in the cycle is well taken, but not enough to persuade me.
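
The late-discovery problem Gus describes (a field that was never given a proper type and is only noticed when a sort comes back as 1, 10, 11, 2) is easy to reproduce outside Solr. A small illustration in plain Python, not Solr code: numbers that end up stored as strings sort lexicographically.

```python
values = ["1", "2", "10", "11"]  # numbers that were indexed as strings

as_strings = sorted(values)           # lexicographic order
as_numbers = sorted(values, key=int)  # what the user expected

print(as_strings)  # ['1', '10', '11', '2']
print(as_numbers)  # ['1', '2', '10', '11']
```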

I’m not enthusiastic about multiple getting started schemas. The whole motivation behind schemaless is that the user doesn’t need to know about schemas to get started. By providing multiple “getting started” schemas we require them to become aware of schemas again.

Sorry, Anshum, but "This feature isn't trappy unless people use it in ways it was not intended" is not persuasive at all. If we have such intentions, we should enforce them, though I don’t quite know how. How are users supposed to understand whether a given use of a feature is or is not intended?

All that said, maybe we could rethink the approach. My two objections are:
1> schemaless, by updating the schema based on a very small sample set, is very susceptible to failing early and often
2> Constantly updating the config in ZK and reloading the collections seems very hard to get right.
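
Objection 1 can be illustrated with a toy model. This is not Solr's actual field-guessing code, just a sketch under the assumption that a field's type is fixed by the first value seen and that later values must parse as that type; guess_type, index_all and the sample documents are invented for the example.

```python
def guess_type(value):
    """Guess a field type from a single string value (toy rules, not Solr's)."""
    for name, parser in (("long", int), ("double", float)):
        try:
            parser(value)
            return name
        except ValueError:
            pass
    return "text"


def index_all(docs):
    """Lock each field's type on first sight; collect docs that no longer fit."""
    schema, rejected = {}, []
    parsers = {"long": int, "double": float, "text": str}
    for doc in docs:
        try:
            for field, value in doc.items():
                ftype = schema.setdefault(field, guess_type(value))
                parsers[ftype](value)  # raises ValueError on a type mismatch
        except ValueError:
            rejected.append(doc)
    return schema, rejected


docs = [{"price": "10"}, {"price": "12.5"}, {"price": "call us"}]
schema, rejected = index_all(docs)
print(schema)    # {'price': 'long'}
print(rejected)  # [{'price': '12.5'}, {'price': 'call us'}]
```

With a sample of one, the very next document already fails, which is the "early and often" part.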

So I can imagine a “getting started” mode that indexed to the glob field while creating a schema. Ideally, it would be necessary to enable it specifically rather than have it be the default. I’d imagine this being coupled with some kind of “export schema” button. So the process would be
> start Solr with -Dsolr.learningmode.config=some_config_name.
> index a bunch of documents, perhaps prototyping the search app on the dynamic glob field.
> The admin UI should have a big, intrusive banner saying “RUNNING IN LEARNING MODE” with instructions on what to do next.
> In that mode there’d need to be a “save schema” button or something. What I’d like that to do would be examine the index and write a new schema somewhere. If this was the mode, then you’d be able to run it any time.

> On Aug 3, 2020, at 2:39 PM, Gus Heck <gus.heck@gmail.com> wrote:
>
> I almost never use schemaless mode (better named "schema guessing mode") and I would never recommend it for use beyond prototyping. The primary use I see for it is to throw a bunch of data at it to get a starting point for a schema... say for example you want to see what tika's going to produce for metadata before solidifying what you will and will not rely on. I think the ability to suggest a schema is valuable and shouldn't go away. I'm all for not having it be the default configuration however, and I really like the suggestions linked in the ticket for features that consider a number of documents before trying to guess the schema and if we implement one of those I'd be for deprecation and eventual removal, but not before.
>
> The ticket contains a suggestion of adding a catch all '*' dynamic field, but we should make sure to indicate that that ALSO is not typically good for production use because one garbage (or malicious) document can explode the number of fields in the index, or cause cases where forgetting to add a properly typed field makes it much further down the development cycle before getting caught. (i.e. not caught until a user tries to sort on it and gets 1, 10, 11, 2,... ), and dev churn due to data silently indexed into typo variants.... etc.
>
> Perhaps we should distribute more than one pre-baked config set and label none of them as "default"? I'd suggest maybe
> • guessing-proto --> our current _default possibly refined, for prototyping
> • dynamic-proto --> a schema based on dynamic fields with a * default to text-general as an alternative prototyping tool less dependent on data order, but requiring more editing
> • managed-min --> A base on which to build a production quality managed schema
> • static-min --> A base on which to build a production quality classic (non-managed) schema
> Also +1 to renaming the feature away from "Schemaless" to "Schema Guessing"
>
> -Gus
>
> --
> http://www.needhamsoftware.com (work)
> http://www.the111shift.com (play)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: Deprecate Schemaless Mode? [ In reply to ]

gus.heck at gmail

Aug 3, 2020, 4:03 PM

Post #11 of 16 (791 views)

On Mon, Aug 3, 2020 at 5:03 PM Erick Erickson <erickerickson@gmail.com>
wrote:

> Gus’s point about implementing something before removing it is well taken,
> but we can deprecate it immediately without removing it. Gus’s point about
> dynamic fields not being found until later in the cycle is well taken, but
> not enough to persuade me.

Fair enough :)

> I’m not enthusiastic about multiple getting started schemas. The whole
> motivation behind schemaless is that the user doesn’t need to know about
> schemas to get started. By providing multiple “getting started” schemas we
> require them to become aware of schemas again.

Here's my theory (which may or may not be persuasive :) )

My thinking in that suggestion is that the majority of the problem is due
to the fact that people new to a technology will tend to latch onto the
defaults that come with something as being something that should be held
onto until you have a good reason to change it. This is reasonable because
changing things you don't understand willy nilly is often a road to pain.
And people DO want a safe starting point and we should give it to them
because it makes their life easier once they get a little further down the
road, but this is not compatible with the easy-start schemaless mode.
Looking at https://lucene.apache.org/solr/guide/8_5/solr-tutorial.html I
see that the initial tutorial experience is fully scripted, and the user
won't likely notice if they are told to ignore _default or guessing-proto
in favor of the tech products config set... BUT when they do get to the
point of looking at the config name they'll see the more descriptive name.
So rather than seeing "_default" and thinking "Ah ha! Here's something I
can take as gospel and not change until I have a reason!" they'll see
"guessing-proto" or "dynamic-proto" and say "Hunh, I wonder what that
means?" which is a good question for them to ask I think.

The concept of a default carries a strong bias toward not touching it (IMHO)
which will be wrong most of the time no matter what we give them as a
default. If something must be a default I'd favor a non-managed,
non-dynamic, non-guessing minimal schema with the required fields, and an
id field, maybe a _text_ field, and a comment pointing to the section of
the ref guide where they can copy and paste in all the stuff that's
currently in our base schema as example (things like the text_ga type), IF
they want it. I get really tired of seeing mile long schemas that have a
ton of unused stuff that is retained because people didn't know if they
needed it or not...

Note that not having some default would break back compat on bin/solr, but
changing the default is also a break of sorts.

>
> All that said, maybe we could rethink the approach. My two objections are:
> 1> schemaless, by updating the schema based on a very small sample set is
> very susceptible to failing early and often
> 2> Constantly updating the config in ZK and reloading the collections
> seems very hard to get right.
>

I have for some time thought the inability to upload and download a config
(or files within a config) via the web UI was a gap. But I found it easier
to write https://plugins.gradle.org/plugin/com.needhamsoftware.solr-gradle than
add that feature to the UI :)

> So I can imagine a “getting started” mode that indexed to the glob field
> while creating a schema. Ideally, it would be necessary to enable it
> specifically rather than have it be the default. I’d imagine this being
> coupled with some kind of “export schema” button. So the process would be
> > start Solr with -Dsolr.learningmode.config=some_config_name.
> > index a bunch of documents, perhaps prototyping the search app on the
> dynamic glob field.
> > The admin UI should have a big, intrusive banner saying “RUNNING IN
> LEARNING MODE” with instructions on what to do next.
> > In that mode there’d need to be a “save schema” button or something.
> What I’d like that to do would be examine the index and write a new schema
> somewhere. If this was the mode, then you’d be able to run it any time.
>

+1 for anything that makes a round-trip of working with the schema easier,
but not really a fan of learning mode.

>
>
>

Re: Deprecate Schemaless Mode? [ In reply to ]

erickerickson at gmail

Aug 4, 2020, 5:27 AM

Post #12 of 16 (788 views)

Having the admin UI allow uploads may not be secure. When I had a similar idea a long time ago it got shot down, see the discussion at: https://issues.apache.org/jira/browse/SOLR-5287.

I _think_ this is a different issue if the configs have to be residing on the system, not coming in from outside, just FYI...


Re: Deprecate Schemaless Mode? [ In reply to ]

gus.heck at gmail

Aug 4, 2020, 6:29 AM

Post #13 of 16 (787 views)

Interesting read. Might have changed now that we have authentication
capabilities... but let's not thread jack :)

On Tue, Aug 4, 2020 at 8:28 AM Erick Erickson <erickerickson@gmail.com>
wrote:

> Having the admin UI allow uploads may not be secure. When I had a similar
> idea a long time ago it got shot down, see the discussion at:
> https://issues.apache.org/jira/browse/SOLR-5287.
>
> I _think_ this is a different issue if the configs have to be residing on
> the system, not coming in from outside, just FYI...
>

--
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)

Re: Deprecate Schemaless Mode? [ In reply to ]

jan.asf at cominvent

Aug 4, 2020, 11:24 AM

Post #14 of 16 (787 views)

Learning mode won’t work if you have 10 existing collections and want to create #11. We could rather have a SchemaLearningUpdateHandler so people could explicitly post documents to, say, /schema-guess to modify the schema. We could even have this implicit. Then the _default config would have just _root_, id and a few more, and if you want guessing you first send a number of docs to the /schema-guess endpoint and then inspect in the schema browser what you got. That handler could support a param &reset=true which would wipe the schema to start guessing from scratch.
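
A rough sketch of how such a guessing endpoint might behave, under stated assumptions: SchemaGuesser, its widening order (long, then double, then text) and the reset flag are all hypothetical and only mirror the idea of posting a batch of docs to /schema-guess. None of this is an existing Solr API.

```python
WIDENING = ["long", "double", "text"]  # assumed order, narrowest first
PARSERS = {"long": int, "double": float, "text": str}


def narrowest_type(value):
    """Return the narrowest type in WIDENING that can parse the value."""
    for name in WIDENING:
        try:
            PARSERS[name](value)
            return name
        except ValueError:
            continue
    return "text"


class SchemaGuesser:
    """Accumulates a guessed schema over many documents instead of the first one."""

    def __init__(self):
        self.fields = {}

    def post(self, docs, reset=False):
        if reset:  # mirrors the proposed &reset=true: start from scratch
            self.fields = {}
        for doc in docs:
            for field, value in doc.items():
                seen = self.fields.get(field, WIDENING[0])
                guess = narrowest_type(value)
                # Keep whichever type is wider so earlier docs still fit.
                self.fields[field] = max(seen, guess, key=WIDENING.index)
        return dict(self.fields)


guesser = SchemaGuesser()
print(guesser.post([{"price": "10"}, {"price": "12.5"}]))  # {'price': 'double'}
print(guesser.post([{"price": "call us"}]))                # {'price': 'text'}
print(guesser.post([{"qty": "3"}], reset=True))            # {'qty': 'long'}
```

Because the guess is only inspected after a whole batch has been seen, the result depends less on which document happened to arrive first.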

Jan Høydahl

> 4. aug. 2020 kl. 15:30 skrev Gus Heck <gus.heck@gmail.com>:
>
> Interesting read. Might have changed now that we have authentication capabilities... but let's not thread jack :)
>
>> On Tue, Aug 4, 2020 at 8:28 AM Erick Erickson <erickerickson@gmail.com> wrote:
>> Having the admin UI allow uploads may not be secure. When I had a similar idea a long time ago it got shot down, see the discussion at: https://issues.apache.org/jira/browse/SOLR-5287.
>>
>> I _think_ this is a different issue if the configs have to be residing on the system, not coming in from outside, just FYI...
>>

Re: Deprecate Schemaless Mode? [ In reply to ]

dsmiley at apache

Aug 4, 2020, 10:01 PM

Post #15 of 16 (785 views)

Thanks for starting this thread Marcus! For a historical note, the current
_default configSet being "data driven" (aka "schemaless", a worse name) is
largely because of SOLR-10272
<https://issues.apache.org/jira/browse/SOLR-10272> Maybe I should have
fought harder against it then. I threatened to veto but I was placated by
it being easily disabled. And it's true; you can disable it, and there are
some loud warnings on the CLI so... yeah.

I think my views most align with Gus. The name "default" is suggestive of
good settings you ought to change only if you know what you are doing. Perhaps
there simply can be no reasonable "default" for a search platform. There
might be "basic minimal blah blah" etc. that _is_ the default choice if you
don't specify it but naming the configSet itself as "default" gives too
much blessing to it. I've seen too many configs with tons of stuff that
were there because it was inherited, and then it's hard to guess what's
_actually_ being used. Alexandre Rafalovitch had done some great work in
figuring out how to minimize configs. There's more to do there.

I'd be happy to see basically any change though; even a simple change from
opt-out to opt-in to "data driven" URPs. I don't like the status quo.

BTW I've also seen people try to take "bin/solr -e cloud" to production :-(
"Hey look, this is how a tutorial told me to run SolrCloud" (so the logic
goes).

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley

On Tue, Aug 4, 2020 at 2:24 PM Jan Høydahl <jan.asf@cominvent.com> wrote:

> Learning mode won’t work if you have 10 existing collections and want to
> create #11. We could rather have a SchemaLearningUpdateHandler so people
> could explicitly post documents to say /schema-guess to modify the schema.
> We could even have this implicit. Then the _default config would have just
> _root_, id and a few more, and if you want guessing you first send a number
> of docs to /schema-guess endpoint and then inspect in schema browser what
> you got. That handler could support a param &reset=true which would wipe
> the schema to start guessing from scratch.
>
> Jan Høydahl
>
> 4. aug. 2020 kl. 15:30 skrev Gus Heck <gus.heck@gmail.com>:
>
> Interesting read. Might have changed now that we have authentication
> capabilities... but let's not thread jack :)
>
> On Tue, Aug 4, 2020 at 8:28 AM Erick Erickson <erickerickson@gmail.com>
> wrote:
>
>> Having the admin UI allow uploads may not be secure. When I had a similar
>> idea a long time ago it got shot down, see the discussion at:
>> https://issues.apache.org/jira/browse/SOLR-5287.
>>
>> I _think_ this is a different issue if the configs have to be residing on
>> the system, not coming in from outside, just FYI...
>>
>> > On Aug 3, 2020, at 7:03 PM, Gus Heck <gus.heck@gmail.com> wrote:
>> >
>> >
>> >
>> > On Mon, Aug 3, 2020 at 5:03 PM Erick Erickson <erickerickson@gmail.com>
>> wrote:
>> > Gus’s point about implementing something before removing it is well
>> taken, but we can deprecate it immediately without removing it. Gus’s point
>> about dynamic fields not being found until later in the cycle is well
>> taken, but not enough to persuade me.
>> >
>> > Fair enough :)
>> >
>> > I’m not enthusiastic about multiple getting started schemas. The whole
>> motivation behind schemaless is that the user doesn’t need to know about
>> schemas to get started. By providing multiple “getting started” schemas we
>> require them to become aware of schemas again.
>> >
>> > Here's my theory (which may or may not be persuasive :) )
>> >
>> > My thinking in that suggestion is that the majority of the problem is
>> due to the fact that people new to a technology will tend to latch onto the
>> defaults that come with something as being something that should be held
>> onto until you have a good reason to change it. This is reasonable because
>> changing things you don't understand willy nilly is often a road to pain.
>> And people DO want a safe starting point and we should give it to them
>> because it makes their life easier once they get a little further down the
>> road, but this is not compatible with the easy-start schemaless mode.
>> Looking at https://lucene.apache.org/solr/guide/8_5/solr-tutorial.html I
>> see that the initial tutorial experience is fully scripted, and the user
>> won't likely notice if they are told to ignore _default or guessing-proto
>> in favor of the techproducts config set... BUT when they do get to the
>> point of looking at the config name they'll see the more descriptive name.
>> So rather than seeing "_default" and thinking "Ah ha! Here's something I
>> can take as gospel and not change until I have a reason!" they'll see
>> "guessing-proto" or "dynamic-proto" and say "Hunh, I wonder what that
>> means?" which is a good question for them to ask I think.
>> >
>> > The concept of a default creates a strong bias toward not touching it
>> (IMHO), which will be wrong most of the time no matter what we give them as
>> a default. If something must be a default I'd favor a non-managed,
>> non-dynamic, non-guessing minimal schema with the required fields, and an
>> id field, maybe a _text_ field, and a comment pointing to the section of
>> the ref guide where they can copy and paste in all the stuff that's
>> currently in our base schema as example (things like the text_ga type), IF
>> they want it. I get really tired of seeing mile long schemas that have a
>> ton of unused stuff that is retained because people didn't know if they
>> needed it or not...
>> >
>> > Note that not having some default would break back compat on bin/solr,
>> but changing the default is also a break of sorts.
>> >
>> >
>> > All that said, maybe we could rethink the approach. My two objections
>> are:
>> > 1> schemaless, by updating the schema based on a very small sample set
>> is very susceptible to failing early and often
>> > 2> Constantly updating the config in ZK and reloading the collections
>> seems very hard to get right.
>> >
>> > I have for some time thought the inability to upload and download a
>> config (or files within a config) via the web UI was a gap. But I found it
>> easier to write
>> https://plugins.gradle.org/plugin/com.needhamsoftware.solr-gradle than
>> add that feature to the UI :)
>> >
>> > So I can imagine a “getting started” mode that indexed to the glob
>> field while creating a schema. Ideally, it would be necessary to enable it
>> specifically rather than have it be the default. I’d imagine this being
>> coupled with some kind of “export schema” button. So the process would be
>> > > start Solr with -Dsolr.learningmode.confg=some_config_name.
>> > > index a bunch of documents, perhaps prototyping the search app on the
>> dynamic glob field.
>> > > The admin UI should have a big, intrusive banner saying “RUNNING IN
>> LEARNING MODE” with instructions on what to do next.
>> > > In that mode there’d need to be a “save schema” button or something.
>> What I’d like that to do would be examine the index and write a new schema
>> somewhere. If this was the mode, then you’d be able to run it any time.
>> >
>> > +1 for anything that makes a round-trip of working with the schema
>> easier, but not really a fan of learning mode.
>> >
>> >
>> >
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>>
>
> --
> http://www.needhamsoftware.com (work)
> http://www.the111shift.com (play)
>
>

Re: Deprecate Schemaless Mode? [ In reply to ]

arafalov at gmail

Aug 5, 2020, 9:17 AM

Post #16 of 16 (783 views)

As David said, I did a lot of breaking apart of default configuration
and it is a bit of a mess in there. (if anybody wants to review the
breakdown for Solr 6:
https://www.slideshare.net/arafalov/rebuilding-solr-6-examples-layer-by-layer-lucenesolrrevolution-2016,
slide 19 is the kicker)

I certainly agree with others who said that it is very hard for a
user to figure out what a 'production' schema should look like, so
they just keep the one we give, including the schemaless part and all.
This seems to crop up on the User list over and over again.

My +1 is for SOLR-11741 (Offline training mode) and for it being an
explicit configuration to let users define their own
chain/type-widening sequence. So, the user would throw a subset (or
all) of the data at a separate end-point and receive back the
suggested schema addition commands to support the data. Perhaps this
learning mode should not live in a default schema either but in a
kitchen sink one that also has all the extra type definitions
(separate discussion, especially since DIH and 5 DIH schemas are going
away as well).
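
To make the idea concrete, here is a rough sketch of what an offline
pass over sample documents could look like. It is only an illustration
and not the SOLR-11741 design: the function names (guess_type, widen,
suggest_schema) and the widening rules are assumptions made up for this
example, and it assumes the target schema already defines the field
types boolean, plong, pdouble, pdate and text_general. Each returned
dict has the shape of a Schema API "add-field" command, which a user
could review before sending anything to Solr.

```python
from datetime import datetime

# Fallback type when values for a field disagree (assumed to exist in the schema).
FALLBACK = "text_general"


def guess_type(value):
    """Map one JSON value to a candidate field type name."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "plong"
    if isinstance(value, float):
        return "pdouble"
    if isinstance(value, str):
        try:
            # Only this one ISO-8601 UTC layout is recognised in this sketch.
            datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
            return "pdate"
        except ValueError:
            return FALLBACK
    return FALLBACK


def widen(current, new):
    """Combine two guesses; the rules here are illustrative assumptions."""
    if current == new:
        return current
    if {current, new} == {"plong", "pdouble"}:
        return "pdouble"
    return FALLBACK


def suggest_schema(docs, skip=("id",)):
    """Scan every sample document, then return add-field command dicts."""
    types, multi = {}, set()
    for doc in docs:
        for name, value in doc.items():
            if name in skip:
                continue
            if isinstance(value, list):
                multi.add(name)
                values = value
            else:
                values = [value]
            for v in values:
                if v is None:
                    continue
                guessed = guess_type(v)
                types[name] = widen(types[name], guessed) if name in types else guessed
    return [
        {"add-field": {"name": n, "type": t, "multiValued": n in multi}}
        for n, t in sorted(types.items())
    ]


if __name__ == "__main__":
    import json

    sample = [
        {"id": "1", "price": 10, "name": "Widget", "in_stock": True},
        {"id": "2", "price": 12.5, "tags": ["a", "b"],
         "released": "2020-08-05T09:17:00Z"},
    ]
    for command in suggest_schema(sample):
        print(json.dumps(command))
```

Because the whole sample is scanned before any command is produced, a
field that looks like an integer in the first document and a decimal in
the second ends up with the wider type, instead of the schema being
fixed from the first value seen.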

Regards,
Alex.

On Wed, 5 Aug 2020 at 01:01, David Smiley <dsmiley@apache.org> wrote:
>
> Thanks for starting this thread Marcus! For a historical note, the current _default configSet being "data driven" (aka "schemaless", a worse name) is largely because of SOLR-10272. Maybe I should have fought harder against it then. I threatened to veto but I was placated by it being easily disabled. And it's true; you can disable it, and there are some loud warnings on the CLI so... yeah.
>
> I think my views most align with Gus's. The name "default" is suggestive of good settings you ought to change only if you know what you are doing. Perhaps there simply can be no reasonable "default" for a search platform. There might be "basic minimal blah blah" etc. that _is_ the default choice if you don't specify it, but naming the configSet itself as "default" gives too much blessing to it. I've seen too many configs with tons of stuff that were there because it was inherited, and then it's hard to guess what's _actually_ being used. Alexandre Rafalov had done some great work in figuring out how to minimize configs. There's more to do there.
>
> I'd be happy to see basically any change though; even a simple change from opt-out to opt-in to "data driven" URPs. I don't like the status quo.
>
> BTW I've also seen people try to take "bin/solr -e cloud" to production :-( "Hey look, this is how a tutorial told me to run SolrCloud" (so the logic goes).
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Tue, Aug 4, 2020 at 2:24 PM Jan Høydahl <jan.asf@cominvent.com> wrote:
>>
>> Learning mode won’t work if you have 10 existing collections and want to create #11. We could rather have a SchemaLearningUpdateHandler so people could explicitly post documents to, say, /schema-guess to modify the schema. We could even have this implicit. Then the _default config would have just _root_, id and a few more, and if you want guessing you first send a number of docs to the /schema-guess endpoint and then inspect in the schema browser what you got. That handler could support a param &reset=true which would wipe the schema to start guessing from scratch.
>>
>> Jan Høydahl
>>
>> 4. aug. 2020 kl. 15:30 skrev Gus Heck <gus.heck@gmail.com>:
>>
>> Interesting read. Might have changed now that we have authentication capabilities... but let's not thread jack :)
>>
>> On Tue, Aug 4, 2020 at 8:28 AM Erick Erickson <erickerickson@gmail.com> wrote:
>>>
>>> Having the admin UI allow uploads may not be secure. When I had a similar idea a long time ago it got shot down, see the discussion at: https://issues.apache.org/jira/browse/SOLR-5287.
>>>
>>> I _think_ this is a different issue if the configs have to be residing on the system, not coming in from outside, just FYI...
>>>
>>> > On Aug 3, 2020, at 7:03 PM, Gus Heck <gus.heck@gmail.com> wrote:
>>> >
>>> >
>>> >
>>> > On Mon, Aug 3, 2020 at 5:03 PM Erick Erickson <erickerickson@gmail.com> wrote:
>>> > Gus’s point about implementing something before removing it is well taken, but we can deprecate it immediately without removing it. Gus’s point about dynamic fields not being found until later in the cycle is well taken, but not enough to persuade me.
>>> >
>>> > Fair enough :)
>>> >
>>> > I’m not enthusiastic about multiple getting started schemas. The whole motivation behind schemaless is that the user doesn’t need to know about schemas to get started. By providing multiple “getting started” schemas we require them to become aware of schemas again.
>>> >
>>> > Here's my theory (which may or may not be persuasive :) )
>>> >
>>> > My thinking in that suggestion is that the majority of the problem is due to the fact that people new to a technology will tend to latch onto the defaults that come with something as being something that should be held onto until you have a good reason to change it. This is reasonable because changing things you don't understand willy nilly is often a road to pain. And people DO want a safe starting point and we should give it to them because it makes their life easier once they get a little further down the road, but this is not compatible with the easy-start schemaless mode. Looking at https://lucene.apache.org/solr/guide/8_5/solr-tutorial.html I see that the initial tutorial experience is fully scripted, and the user won't likely notice if they are told to ignore _default or guessing-proto in favor of the techproducts config set... BUT when they do get to the point of looking at the config name they'll see the more descriptive name. So rather than seeing "_default" and thinking "Ah ha! Here's something I can take as gospel and not change until I have a reason!" they'll see "guessing-proto" or "dynamic-proto" and say "Hunh, I wonder what that means?" which is a good question for them to ask I think.
>>> >
>>> > The concept of a default creates a strong bias toward not touching it (IMHO), which will be wrong most of the time no matter what we give them as a default. If something must be a default I'd favor a non-managed, non-dynamic, non-guessing minimal schema with the required fields, and an id field, maybe a _text_ field, and a comment pointing to the section of the ref guide where they can copy and paste in all the stuff that's currently in our base schema as example (things like the text_ga type), IF they want it. I get really tired of seeing mile long schemas that have a ton of unused stuff that is retained because people didn't know if they needed it or not...
>>> >
>>> > Note that not having some default would break back compat on bin/solr, but changing the default is also a break of sorts.
>>> >
>>> >
>>> > All that said, maybe we could rethink the approach. My two objections are:
>>> > 1> schemaless, by updating the schema based on a very small sample set is very susceptible to failing early and often
>>> > 2> Constantly updating the config in ZK and reloading the collections seems very hard to get right.
>>> >
>>> > I have for some time thought the inability to upload and download a config (or files within a config) via the web UI was a gap. But I found it easier to write https://plugins.gradle.org/plugin/com.needhamsoftware.solr-gradle than add that feature to the UI :)
>>> >
>>> > So I can imagine a “getting started” mode that indexed to the glob field while creating a schema. Ideally, it would be necessary to enable it specifically rather than have it be the default. I’d imagine this being coupled with some kind of “export schema” button. So the process would be
>>> > > start Solr with -Dsolr.learningmode.confg=some_config_name.
>>> > > index a bunch of documents, perhaps prototyping the search app on the dynamic glob field.
>>> > > The admin UI should have a big, intrusive banner saying “RUNNING IN LEARNING MODE” with instructions on what to do next.
>>> > > In that mode there’d need to be a “save schema” button or something. What I’d like that to do would be examine the index and write a new schema somewhere. If this was the mode, then you’d be able to run it any time.
>>> >
>>> > +1 for anything that makes a round-trip of working with the schema easier, but not really a fan of learning mode.
>>> >
>>> >
>>> >
>>>
>>>
>>>
>>
>>
>> --
>> http://www.needhamsoftware.com (work)
>> http://www.the111shift.com (play)
