Mailing List Archive

[DISCUSS] ConfigSet ZK to file system fallback
Summary: I've been contemplating a simple enhancement to how SolrCloud
resolves files in a configSet: when a file isn't in ZooKeeper, fall back
to the same-named configSet on the file system (which SolrCloud normally
ignores today). A further fallback to _default on the file system could
be useful as well. ZK remains the only mutable space: edits to a schema,
configOverlay.json, or whatever always go there.
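
To make the lookup order concrete, here is a minimal sketch in Python rather than Solr's own Java. It is not Solr code: ZooKeeper is modeled as a plain mapping, and the function name, the mapping keys, and the <configSet>/conf/<file> directory layout are assumptions made for illustration only.

```python
from pathlib import Path
from typing import Mapping, Optional, Tuple


def resolve_config_file(
    config_set: str,
    file_name: str,
    zk_files: Mapping[str, bytes],
    fs_configsets_dir: Path,
) -> Optional[Tuple[str, bytes]]:
    """Return (source, content) from the first layer holding the file, else None."""
    # Layer 1: ZooKeeper, modeled here as a mapping keyed by "<configSet>/<file>".
    zk_key = f"{config_set}/{file_name}"
    if zk_key in zk_files:
        return ("zk", zk_files[zk_key])

    # Layer 2: the same-named configSet on the file system.
    # Layer 3: the _default configSet on the file system.
    # The "<configSet>/conf/<file>" layout is an assumption of this sketch.
    for layer in (config_set, "_default"):
        candidate = fs_configsets_dir / layer / "conf" / file_name
        if candidate.is_file():
            return (f"fs:{layer}", candidate.read_bytes())
    return None
```

Returning the source alongside the content is a design choice of the sketch; it would make it easier to see which layer a given file actually came from.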

My primary motivation is allowing for upgrades to plugins, configs, or Solr
itself to be easier in some scenarios (certainly not all!). Imagine that
you've got configOverlay.json (with some handlers defined) & params.json &
schema.xml in ZK, and solrconfig.xml on the file system, plus some partial
xml file of schema field types that is "xi:include"-ed by schema.xml.
Assume that a custom Solr Docker image is used including custom plugins,
and with this configSet baked in. One day you add some new token filters,
add a new Lucene merge policy, and remove some outdated update request
processor. You make the plugin code changes, update the xi:included field
types, edit solrconfig.xml, build all of this into your latest company
Solr Docker image, and deploy it using Kubernetes. Those changes can
be safe to deploy without touching any ZK-resident configSet. Other
changes might not be (e.g. removing a field type that is still referenced,
or changing text analysis so incompatibly that a re-index is required),
but my point is that some are, and this would make them easier.

An additional motivation is storing large relatively static common
resources on the file system. Where I work, I've got over a gig of them
:-). This can be worked around with solr.allow.unsafe.resourceloading=true
but... it'd be nice to not have to resort to that.

Another benefit would be to make it easier to separate one's own
configuration from that of the _default configSet you took from Solr when
starting a new project. Resolving differences and then doing Solr upgrades
was a common task for me, both as a consultant and in my own Solr upgrades.
Granted this is possible today but perhaps if this overlay was
emphasized/embraced more, it would lead to this outcome. It's still a
problem that a bare-bones solrconfig.xml & schema.xml are either too
bare-bones or say too much, and it's a separate issue for Solr to improve
that.

A related, probably secondary issue: If the SolrCloud configSet ZK node were
to be optional instead of required (thus assume the configSet is entirely
on the file system), it would bring other benefits. It would allow users
to use the "file store" or some network mounted storage (NFS) as the
configSet location. It would accelerate experimentation with SolrCloud in
docker locally. The biggest PITA anyone notices when first exploring
SolrCloud is that configs are fundamentally not on the file system despite
you seeing them there; it's all in ZK. And there's no super convenient way
to edit the configuration, not even a web UI. Using the file system for
configSets would be especially nice when doing local SolrCloud
experimentation in Docker, eliminating an annoying configSet deployment
step.

I plan to file an issue of course but I think this deserved a dev list
discussion.

I know the new package manager could help with my primary motivating
use-case, but I think there are too many obstacles there, at least at
present. A file system fallback is a simple thing by comparison.

Question: Does the k8s Solr Operator do anything to make configSet &
plugin upgrades better?

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley
Re: [DISCUSS] ConfigSet ZK to file system fallback
There is a lot in here ;-).

With the caveat that I don’t have the recent experience that many of you do with massive Solr clusters, I think that we need to commit to fewer, not more, ways of maintaining the supporting resources that these clusters need. I’d like to see ways of managing our Solr clusters that encourage easy change and experimentation, and encourage us to separate the physical layer (version of Solr, networking setup, packages used) from the logical layer (individual collections and their supporting code and resources).

I think the configSet was a huge jump forward. My workflow is to think:
1) What’s unusual about this Solr setup? What does the physical layer need to be? Special package? Special code? Build a Docker image.
2) Fire up a three node Solr cluster, wait till it’s up and responsive via checking APIs.
3) Now think about my specific use case. What collections do I need? Is it just 1, or is it 5 or 10 collections? Are they on the same configSet or different ones? Great, zip up the configSet and pop it into Solr via APIs.
4) Create the collections in the shapes I need with the APIs, and now start iterating on what I need to do. Use the APIs to create fields, or set up different ParamSets.
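
Steps 3 and 4 might look roughly like the following Python sketch. It zips a configSet's conf directory in memory and builds the request URLs for the Configsets API (action=UPLOAD) and the Collections API (action=CREATE). The helper names and the example base URL are assumptions of the sketch, and it deliberately stops short of sending any request.

```python
import io
import zipfile
from pathlib import Path
from urllib.parse import urlencode


def zip_config_set(conf_dir: Path) -> bytes:
    """Zip the contents of a configSet's conf directory into an in-memory archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(conf_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to conf_dir so solrconfig.xml sits at the zip root.
                zf.write(path, path.relative_to(conf_dir).as_posix())
    return buf.getvalue()


def upload_url(base_url: str, config_name: str) -> str:
    """URL for uploading a zipped configSet via the Configsets API."""
    query = urlencode({"action": "UPLOAD", "name": config_name})
    return f"{base_url}/admin/configs?{query}"


def create_collection_url(
    base_url: str,
    collection: str,
    config_name: str,
    num_shards: int,
    replication_factor: int,
) -> str:
    """URL for creating a collection that uses an already uploaded configSet."""
    query = urlencode(
        {
            "action": "CREATE",
            "name": collection,
            "numShards": num_shards,
            "replicationFactor": replication_factor,
            "collection.configName": config_name,
        }
    )
    return f"{base_url}/admin/collections?{query}"
```

The zip bytes would be sent as the body of a POST to the upload URL with Content-Type application/octet-stream, followed by a request to the create URL. With a base URL such as http://localhost:8983/solr (an assumed local node), both URLs can be pasted straight into curl.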

However, with configSets we only did half the job, because we still don’t have a single well understood way of handling Jars and other resources. We have many ways of doing it, which generates constant user confusion and contributes to the perspective that “Solr is hard to use”.

Right now, across the Solr landscape I can think of many ways of adding “external” files to my Solr:

1) Classic ./lib as a place to put things.
2) The new-to-me solr.allow.unsafe.resourceloading=true approach
3) The userfiles directory in Solr accessed by streaming expressions load function.
4) The “package store” for packages located in the file store
5) The blob store .system concept from before the package store
6) The LTR feature store (which I guess is backed by ZK but could be on the disk as well through more hoops...)
7) Layering stuff in directly via Docker build files

These are each a little different, with varying levels of support.

Let’s figure out how we can include a resource that is 10 KB, 1 MB or 1 GB and not have to think about ZooKeeper or any of the other implementation details of backing that. Let’s figure out where the package manager is letting us down and keep working on it.



> On Jan 22, 2021, at 12:16 AM, David Smiley <dsmiley@apache.org> wrote:
_______________________
Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com <http://www.opensourceconnections.com/> | My Free/Busy <http://tinyurl.com/eric-cal>
Co-Author: Apache Solr Enterprise Search Server, 3rd Ed <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
This e-mail and all contents, including attachments, is considered to be Company Confidential unless explicitly stated otherwise, regardless of whether attachments are marked as such.
Re: [DISCUSS] ConfigSet ZK to file system fallback
I'm in agreement with Eric here that fewer ways (or at least a clearer
default way) of supplying resources would be better. Additionally, it
should be easy to specify that this resource that I've shared should be
loaded on a per SolrCore or per node basis (or even better per collection
present on the node, accessible under a standard name to replicas belonging
to that collection?). Beyond the simplest single-collection install with a
few shards, there are not many cases where you want a 1GB resource
duplicated in memory across N cores running on the same node, though
obviously there are ample cases where the 10k stop words file is meant to
differ across collections.
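
One way to picture the scoping idea is a cache keyed by both a scope and a resource name, as in the Python sketch below. Nothing here is an existing Solr API: the class, the scope keys ("node", or a collection or core name), and the loader callback are all hypothetical.

```python
from typing import Callable, Dict, Tuple


class ScopedResourceCache:
    """Load each resource once per scope key and share it afterwards."""

    def __init__(self, loader: Callable[[str], bytes]) -> None:
        self._loader = loader
        self._cache: Dict[Tuple[str, str], bytes] = {}
        self.loads = 0  # counts how many times the loader actually ran

    def get(self, name: str, scope_key: str) -> bytes:
        # scope_key is hypothetical: "node" shares one copy across every core
        # on the node, while a collection or core name keeps separate copies.
        key = (scope_key, name)
        if key not in self._cache:
            self._cache[key] = self._loader(name)
            self.loads += 1
        return self._cache[key]
```

With this shape, every core asking for the 1GB resource under the "node" scope gets the same object, while a stop words file requested under two different collection names is loaded twice. The sketch is not thread-safe; a real implementation would need locking around the load.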

As it stands Eric's list seems like something that should be in the
documentation somewhere just so people can properly troubleshoot where
something they don't expect to be loaded is getting loaded from, or why
their attempts to load something new aren't working... especially if it
were ordered to show the precedence of these options.

As for ease of editing configurations, I've long felt that this should be
possible via the admin UI though there's been much worry about security
implications there. Personally, I think that those concerns are resolvable,
but have not found time to make that case. Aside from that I think we need
to support tooling to enable easy management of config sets rather than
expanding the possible number of places the configurations might get loaded
from.

Several years ago I wrote a plugin for gradle that is very very basic, but
after some configuration so that it can see zookeeper, it will happily pull
configs down and push them up for you which is convenient for keeping
configs under version control during development. There's LOTS to improve
there, most especially adding support to manage multiple configs at a time,
and I had hoped that folks would use it and have suggestions,
contributions, but I've got no indication that anyone but me uses it
(https://github.com/nsoft/solr-gradle).

-Gus

On Fri, Jan 22, 2021 at 8:19 AM Eric Pugh <epugh@opensourceconnections.com>
wrote:

--
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)
Re: [DISCUSS] ConfigSet ZK to file system fallback
An aspect that would be interesting to consider IMO is upgrades and
configuration changes.
For example, a collection in use across a Solr version upgrade might require
different configuration (config set) with the old and new Solr versions.
Solr itself can require changes in config across updates.

Backward compatibility is the usual answer (the new code continues working
with the old config that can be updated once all nodes have been deployed)
but this imposes constraints on new code.
If there was a way for the new Solr code to "magically" use a different
config set for the collection (and for Solr config in general) there would
be more freedom to add or change features, change default behavior across
Solr versions etc.

Ilan

On Sat 23 Jan 2021 at 22:22, Gus Heck <gus.heck@gmail.com> wrote:

Re: [DISCUSS] ConfigSet ZK to file system fallback
I'm not entirely sure how to react to the feedback. Maybe in listing
multiple benefits and a follow-on proposal, I inadvertently opened doors to
distracting points. I know I can be guilty of scope creep. My proposal
has no impact on where JARs go, so let's not discuss lib directories,
the package store, or LTR's feature store, none of which my proposal is
related to, ok? My proposal doesn't even add a new configuration place
that doesn't already exist.

Let me try to express this proposal from a different angle that I
think is clearer and more motivating than the first:

Each physical Solr node (perhaps a Docker image) is composed of Solr's
code, perhaps some plugin code too, and some configuration files with some
settings. Baked into any code are settings with a default value. There
are trivial primitive settings like an integer for "maxMergeAtOnce" on
TieredMergePolicy, and there are more aggregate settings, like what the
default MergePolicy is. Sometimes the default changes from one release to
the next, or new settings get added or go away (albeit rarely). Let's just
consider SolrCloud.

... Let's say you need to make a settings change. ...

For changes specified in solrconfig.xml (generalizable to any file in the
configSet, really), you MUST deploy this to ZooKeeper. That sucks when the
configuration might only make sense for some nodes. Most likely you are
doing an upgrade in which you can't simply change the Solr nodes in an
instant, but perhaps some nodes are simply different (different hardware?
-- SSDs vs HDDs). Upgrades can be orchestrated but it's more complex when
there is ZK resident configuration, and it will impose annoying
restrictions on the underlying code (i.e. back-compat concerns). By having
a "physical layer configuration" (borrowing Eric's terminology), we can tie
some settings to this layer while still having a higher level layer. I
proposed one way of doing this; I'd be happy to discuss others.
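
As a rough illustration of layering at the level of individual settings, here is a Python sketch. It is not how Solr resolves configuration today. The layer names, the rule that the cluster (ZK) layer is consulted before the node (image) layer, and every setting value except the name maxMergeAtOnce are assumptions of the sketch.

```python
from collections import ChainMap
from typing import Any, Mapping


def effective_settings(
    cluster_layer: Mapping[str, Any],
    node_layer: Mapping[str, Any],
    code_defaults: Mapping[str, Any],
) -> ChainMap:
    """Layer settings so lookups try cluster, then node, then code defaults."""
    # Copy each layer so later changes to the inputs do not leak into the view.
    return ChainMap(dict(cluster_layer), dict(node_layer), dict(code_defaults))


def source_of(settings: ChainMap, key: str) -> str:
    """Name the layer that supplies `key`, which helps when troubleshooting."""
    for label, layer in zip(("cluster", "node", "defaults"), settings.maps):
        if key in layer:
            return label
    raise KeyError(key)


# Illustrative values only; none of these are real cluster settings.
settings = effective_settings(
    cluster_layer={"autoSoftCommitMaxTime": 1000},
    node_layer={"maxMergeAtOnce": 20},
    code_defaults={"maxMergeAtOnce": 10, "mergePolicy": "TieredMergePolicy"},
)
```

A setting that only makes sense for some nodes would live in the node layer and simply be absent from the cluster layer, so each node sees its own value while cluster-wide settings still come from ZK.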

I'd like to extend the same argument to solr.xml, a node level
configuration file. Here, at least there is already _some_ flexibility --
you can supply solr.xml with the physical layer (the Docker image) *OR* in
ZooKeeper. But IMO it's not ideal because it's either-or. Some
configuration might make sense at the physical node level, and some at the
cluster level. Ideally IMO, we'd have a way to blend both such that the
deployer chooses where the configuration makes sense based on their cluster.

WDYT?

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Sun, Jan 24, 2021 at 6:08 AM Ilan Ginzburg <ilansolr@gmail.com> wrote:

> An aspect that would be interesting to consider IMO is upgrade and
> configuration changes.
> For example a collection in use across Solr version upgrade might require
> different configuration (config set) with the old and new Solr versions.
> Solr itself can require changes in config across updates.
>
> Backward compatibility is the usual answer (the new code continues working
> with the old config that can be updated once all nodes have been deployed)
> but this imposes constraints on new code.
> If there was a way for the new Solr code to "magically" use a different
> config set for the collection (and for Solr config in general) there would
> be more freedom to add or change features, change default behavior across
> Solr versions etc.
>
> Ilan
>
> On Sat 23 Jan 2021 at 22:22, Gus Heck <gus.heck@gmail.com> wrote:
>
>> I'm in agreement with Eric here that fewer ways (or at least a clearer
>> default way) of supplying resources would be better. Additionally, it
>> should be easy to specify that this resource that I've shared should be
>> loaded on a per SolrCore or per node basis (or even better per collection
>> present on the node, accessible under a standard name to replicas belonging
>> to that collection?). There are not many cases, beyond the simplest
>> single-collection install with few shards, where you want a 1GB resource
>> to be duplicated in memory across N cores running on the same node, though
>> obviously there are ample cases where the 10k stop words file is meant to
>> differ across collections.
>>
>> As it stands Eric's list seems like something that should be in the
>> documentation somewhere just so people can properly troubleshoot where
>> something they don't expect to be loaded is getting loaded from, or why
>> their attempts to load something new aren't working... especially if it
>> were ordered to show the precedence of these options.
>>
>> As for ease of editing configurations, I've long felt that this should be
>> possible via the admin UI though there's been much worry about security
>> implications there. Personally, I think that those concerns are resolvable,
>> but have not found time to make that case. Aside from that I think we need
>> to support tooling to enable easy management of config sets rather than
>> expanding the possible number of places the configurations might get loaded
>> from.
>>
>> Several years ago I wrote a plugin for gradle that is very very basic,
>> but after some configuration so that it can see zookeeper, it will happily
>> pull configs down and push them up for you which is convenient for keeping
>> configs under version control during development. There's LOTS to improve
>> there, most especially adding support to manage multiple configs at a time,
>> and I had hoped that folks would use it and have suggestions,
>> contributions, but I've got no indication that anyone but me uses it. (
>> https://github.com/nsoft/solr-gradle)
>>
>> -Gus
>>
>> On Fri, Jan 22, 2021 at 8:19 AM Eric Pugh <
>> epugh@opensourceconnections.com> wrote:
>>
>>> There is a lot in here ;-).
>>>
>>> With the caveat that I don’t have the recent experience that many of you
>>> do with massive Solr clusters, I think that we need to commit to fewer, not
>>> more, ways of maintaining the supporting resources that these clusters
>>> need. I’d like to see ways of managing our Solr clusters that encourage
>>> easy change and experimentation, and encourage us to separate the physical
>>> layer (version of Solr, networking setup, packages used) from the logical
>>> layer (individual collections and their supporting code and resources).
>>>
>>> I think the configSet was a huge jump forward. My workflow is to
>>> think:
>>> 1) What’s unusual about this Solr setup? What does the physical layer
>>> need to be? Special package? Special code? Build a Docker image.
>>> 2) Fire up a three node Solr cluster, wait till it’s up and responsive
>>> via checking APIs.
>>> 3) Now think about my specific use case. What collections do I need?
>>> Is it just 1, or is it 5 or 10 collections? Are they on the same configSet
>>> or different ones? Great, zip up the configSet and pop it into Solr via APIs.
>>>
>>> 4) Create the collections in the shapes I need with the APIs, and now
>>> start iterating on what I need to do. Use the APIs to create fields, or
>>> set up different ParamSets.
>>>
>>> However, with configSets we only did half the job, because we still
>>> don’t have a single well understood way of handling Jars and other
>>> resources. We have many ways of doing it, which generates constant user
>>> confusion and contributes to the perspective that “Solr is hard to use”.
>>>
>>> Right now, across the Solr landscape I can think of many ways of adding
>>> “external” files to my Solr:
>>>
>>> 1) Classic ./lib as a place to put things.
>>> 2) The new to me solr.allow.unsafe.resourceloading=true approach
>>> 3) The userfiles directory in Solr accessed by streaming expressions
>>> load function.
>>> 4) The “package store” for packages located in file store
>>> 5) The blob store .system concept from before the package store
>>> 6) The LTR feature store (which I guess is backed by ZK but could be on
>>> the disk as well through more hoops...)
>>> 7) Layering stuff in directly via Docker build files
>>>
>>> These are each a little different, with varying levels of support.
>>>
>>> Let’s figure out how we can include a resource that is 10 KB, 1 MB or 1
>>> GB and not have to think about ZooKeeper or any of the other implementation
>>> details of backing that. Let’s figure out where the package manager is
>>> letting us down and keep working on it.
>>>
>>>
>>>
>>> On Jan 22, 2021, at 12:16 AM, David Smiley <dsmiley@apache.org> wrote:
>>>
>>> Summary: I've been contemplating a simple enhancement to how SolrCloud
>>> resolves files in a configSet: when a file isn't in ZooKeeper, fallback
>>> resolution to the same-named configset on the file system (which normally
>>> is ignored in SolrCloud today). A further fallback to _default on the
>>> filesystem could be useful as well. The mutable space is always ZK if you
>>> edit a schema or configOverlay.json or whatever.
>>>
>>> My primary motivation is allowing for upgrades to plugins, configs, or
>>> Solr itself to be easier in some scenarios (certainly not all!). Imagine
>>> that you've got configOverlay.json (with some handlers defined) &
>>> params.json & schema.xml in ZK, and solrconfig.xml on the file system, plus
>>> some partial xml file of schema field types that is "xi:include"-ed by
>>> schema.xml. Assume that a custom Solr Docker image is used including
>>> custom plugins, and with this configSet baked in. One day you add some new
>>> token filters, add a new Lucene merge policy, and remove some outdated
>>> update request processor. You do plugin code changes and xi:included
>>> field type changes and edit solrconfig.xml, and build this into your latest
>>> company Solr Docker image, and you get it deployed using Kubernetes. Those
>>> changes can be safe to deploy without touching any ZK resident configSet.
>>> Other changes might not be (e.g. removing a field type that is referenced,
>>> etc. or doing changes to analyzed text that are too incompatible requiring
>>> a re-index) but my point is that some are, and this would be easier.
>>>
>>> An additional motivation is storing large relatively static common
>>> resources on the file system. Where I work, I've got over a gig of them
>>> :-). This can be worked around with solr.allow.unsafe.resourceloading=true
>>> but... it'd be nice to not have to resort to that.
>>>
>>> Another benefit would be to make it easier to separate one's own
>>> configuration with that of the _default configSet you took from Solr when
>>> starting a new project. Resolving differences and then doing Solr upgrades
>>> was a common task I had to do as a consultant and my own Solr upgrades.
>>> Granted this is possible today but perhaps if this overlay was
>>> emphasized/embraced more, it would lead to this outcome. It's still a
>>> problem that a bare-bones solrconfig.xml & schema.xml are either too
>>> bare-bones or say too much, and it's a separate issue for Solr to improve
>>> that.
>>>
>>> Probably secondary related issue: If the SolrCloud configSet ZK node
>>> were to be optional instead of required (thus assume the configSet is
>>> entirely on the file system), it would bring other benefits. It would
>>> allow users to use the "file store" or some network mounted storage (NFS)
>>> as the configSet location. It would accelerate experimentation with
>>> SolrCloud in docker locally. The biggest PITA anyone notices when first
>>> exploring SolrCloud is that configs are fundamentally not on the file
>>> system despite you seeing them there; it's all in ZK. And there's no super
>>> convenient way to edit the configuration, not even a web UI. Using the
>>> file system for configSets would be especially nice when doing local
>>> SolrCloud experimentation in Docker, eliminating an annoying configSet
>>> deployment step.
>>>
>>> I plan to file an issue of course but I think this deserved a dev list
>>> discussion.
>>>
>>> I know the new package manager could help with my primary motivating
>>> use-case, but I think at present there are too many obstacles there, at
>>> least at present. A file system fallback is a simple thing by comparison.
>>>
>>> Question: Does the k8s Solr Operator do anything to make configSet &
>>> plugin upgrades better?
>>>
>>> ~ David Smiley
>>> Apache Lucene/Solr Search Developer
>>> http://www.linkedin.com/in/davidwsmiley
>>>
>>>
>>> _______________________
>>> *Eric Pugh **| *Founder & CEO | OpenSource Connections, LLC | 434.466.1467
>>> | http://www.opensourceconnections.com | My Free/Busy
>>> <http://tinyurl.com/eric-cal>
>>> Co-Author: Apache Solr Enterprise Search Server, 3rd Ed
>>> <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
>>> This e-mail and all contents, including attachments, is considered to be
>>> Company Confidential unless explicitly stated otherwise, regardless
>>> of whether attachments are marked as such.
>>>
>>>
>>
>> --
>> http://www.needhamsoftware.com (work)
>> http://www.the111shift.com (play)
>>
>
Re: [DISCUSS] ConfigSet ZK to file system fallback [ In reply to ]
Thanks for bringing this up, David. I thought about this same situation
before, but I think I never convinced myself one way or the other :p. As I
mentioned in many other emails, I think the infrastructure and the node
configuration (such as solr.xml) need to be local (at least, need to be
able to be local and not forced onto ZooKeeper) for various reasons.
The same reasons exist for configsets: safe upgrades, or possible
node-specific configuration, as you mentioned. But configsets have another
layer of complexity in my mind, which is that you don't know where you'll
need them... because you don't (necessarily) know where replicas of a
collection are going to be created. True, this is not a problem in the
Docker image situation you are describing, or if handled with care, but how
can Solr make sure of it?

But I think it's a valuable feature to explore. Maybe the configset needs
to exist in ZooKeeper and have some sort of flag (similar to secure=true)
where it could say "local=true", and then make Solr instances fail to start
if the configset is not present or something? Otherwise the collection
creation and replica addition operations may need to know where configsets
are present, etc. I'm wondering if this mix you are proposing of some files
in ZooKeeper and some files local wouldn't complicate things too much...
not sure.

Tomás

On Mon, Jan 25, 2021 at 3:15 PM David Smiley <dsmiley@apache.org> wrote:

> I'm not entirely sure how to react to the feedback. Maybe in listing
> multiple benefits and a follow-on proposal, I inadvertently opened doors to
> distracting points. I know I can be guilty of scope creep. My proposal
> has no impact on where JARs go, and so let's not discuss lib directories,
> the package store, or LTR's feature store either which my proposal is not
> related to, ok? My proposal doesn't even add a new configuration place
> that doesn't already exist.
>
> Let me try to express this proposal through a different angle / lens that
> I think is more clear and motivating than the first:
>
> Each physical Solr node (perhaps a Docker image) is composed of Solr's
> code, perhaps some plugin code too, and some configuration files with some
> settings. Baked into any code are settings with a default value. There
> are trivial primitive settings like an integer for "maxMergeAtOnce" on
> TieredMergePolicy, and there are more aggregate settings, like what the
> default MergePolicy is. Sometimes the default changes from one release to
> the next, or new settings get added or go away (albeit rarely). Let's just
> consider SolrCloud.
>
> ... Let's say you need to make a settings change. ...
>
> For changes specified in solrconfig.xml (generalizable to any file in the
> configSet, really), you MUST deploy this to ZooKeeper. That sucks when the
> configuration might only make sense for some nodes. Most likely you are
> doing an upgrade in which you can't simply change the Solr nodes in an
> instant, but perhaps some nodes are simply different (different hardware?
> -- SSDs vs HDDs). Upgrades can be orchestrated but it's more complex when
> there is ZK resident configuration, and it will impose annoying
> restrictions on the underlying code (i.e. back-compat concerns). By having
> a "physical layer configuration" (borrowing Eric's terminology), we can tie
> some settings to this layer while still having a higher level layer. I
> proposed one way of doing this; I'd be happy to discuss others.
>
> I'd like to extend the same argument to solr.xml, a node level
> configuration file. Here, at least there is already _some_ flexibility --
> you can supply solr.xml with the physical layer (the Docker image) *OR* in
> ZooKeeper. But IMO it's not ideal because it's either-or.. Some
> configuration might make sense with the physical node, and some at the
> cluster node. Ideally IMO, we'd have a way to blend both such that the
> deployer chooses where the configuration makes sense based on their cluster.
>
> WDYT?
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
Re: [DISCUSS] ConfigSet ZK to file system fallback [ In reply to ]
On Tue, Jan 26, 2021 at 1:27 PM Tomás Fernández Löbbe <tomasflobbe@gmail.com>
wrote:

> Thanks for bringing this up, David. I thought about this same situation
> before, but I think I never convinced myself in one way or another :p. As I
> mentioned in many other emails, I think the infrastructure and the node
> configuration (such as solr.xml) needs to be local (at least, needs to be
> able to be local and not forced on ZooKeeper) for various reasons.
>

I agree 100%. I think the key part there is having *choice* for each
configuration element, and not one dictated by Solr as to what belongs
where. The implementation of it needn't be complicated; it's a
straightforward idea to have the same format with a conceptual layering /
aggregation of them.


> The same reasons exist for configsets: safe upgrades, or possible
> node-specific configuration, as you mentioned. But Configsets have another
> layer of complexity in my mind, which is, you don't know where you'll need
> them... because you don't (necessarily) know where replicas of a collection
> are going to be created. True that this is not a problem in the Docker
> image situation you are describing, or if handled with care, but how can
> Solr make sure of it?
>

Ehh; I am not suggesting that configSets belong local, which would be a
step backwards -- we put them in ZK for a reason right now :-) I'm
suggesting we have *both* for the same configSet, where the deployer can
choose which element is node resident vs cluster/ZK resident. Thanks to
existing Solr features like configOverlay.json and/or XML xi:include plus
one small addition of fallback resolution of configSet files from ZK to the
local node, we'd get this ability. (see my first email).
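
To make the lookup order concrete, here is a minimal Python sketch of the
fallback described above: ZK first, then the same-named configSet on the
local node, then the local _default. It is only an illustration and not
Solr code. The function name, the dict-based stand-ins for ZK and the file
system, and the "products" configSet are all hypothetical.

```python
from typing import Dict, Optional, Tuple

# Hypothetical stand-ins: each store maps configSet name -> {file name -> content}.
Store = Dict[str, Dict[str, str]]


def resolve_config_file(
    config_set: str, file_name: str, zk: Store, local: Store
) -> Optional[Tuple[str, str]]:
    """Return (source, content) from the first layer that has the file, else None."""
    layers = [
        ("zk:" + config_set, zk.get(config_set, {})),        # cluster/ZK resident
        ("local:" + config_set, local.get(config_set, {})),  # same-named, node resident
        ("local:_default", local.get("_default", {})),       # last-resort fallback
    ]
    for source, files in layers:
        if file_name in files:
            return source, files[file_name]
    return None


# Example layout (assumed): mutable files in ZK, the rest baked into the image.
zk = {"products": {"schema.xml": "<schema/>", "params.json": "{}"}}
local = {
    "products": {"solrconfig.xml": "<config/>", "fieldtypes.xml": "<types/>"},
    "_default": {"stopwords.txt": "a\nthe\n"},
}
```

Returning the source alongside the content is a deliberate choice in this
sketch: it lets a caller log or expose which layer a file actually came from.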

We have a very limited ability to accomplish the broad idea today -- Java
system properties with variable substitution in our files. But of course
it's very limited what you can do with that, and it feels abusive to push
it too far. It's fine for individual tunables (e.g. an integer) but not
more aggregate things like a complete MergePolicy configuration or an
analysis chain in a schema.
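
For readers unfamiliar with that mechanism, the following Python sketch
mimics the ${name:default} placeholder style used in Solr config files. It
is a simplified illustration, not Solr's implementation; the property name
solr.merge.maxMergeAtOnce is made up for the example, and treating a missing
value with no default as an error is an assumption of this sketch.

```python
import re
from typing import Mapping

# Matches ${name} or ${name:default}.
_PLACEHOLDER = re.compile(r"\$\{([^}:]+)(?::([^}]*))?\}")


def substitute(text: str, props: Mapping[str, str]) -> str:
    """Replace each placeholder with its property value, else its default."""

    def replace(match: "re.Match[str]") -> str:
        name, default = match.group(1), match.group(2)
        if name in props:
            return props[name]
        if default is not None:
            return default
        raise KeyError(f"no value and no default for property {name!r}")

    return _PLACEHOLDER.sub(replace, text)


# Hypothetical property name, used only for illustration.
snippet = '<int name="maxMergeAtOnce">${solr.merge.maxMergeAtOnce:10}</int>'
```

This shows why the approach suits a single tunable such as an integer: the
placeholder can only stand in for a string, not for a whole block of
configuration.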

We have another vaguely similar thing conceptually in Solr today --
ImplicitPlugins.json. Probably only a few of you have heard of it. It's
baked into solr-core's JAR. Take a look at it. What if it were a file
that a deployer could easily replace on the node, e.g. to reduce SolrCore
load time or for security or to add something that a company wants all
SolrCores to have? That is along the lines of what this email thread is
about: How can a Solr cluster deployer make settings changes (to include
registering new plugins) that are either specific to a node and/or should
be so for an entire cluster without each ZK resident configSet having the
config element? *We can come up with ideas but most importantly I want to
validate the notion that this is a desirable thing.* I think we agree,
Tomás, but I'm unsure about Eric & Gus and anyone else for that matter.


> But I think it's a valuable feature to explore. Maybe the configset needs
> to exist in ZooKeeper and have some sort of flag (similar to secure=true)
> where it could say "local=true", and then fail Solr instances to start if
> the configset is not present or something? Otherwise the collection
> creation and replica addition operations may need to know where configsets
> are present, etc. I'm wondering if this mix you are proposing of some files
> in ZooKeeper and some files local wouldn't complicate things too much...
> not sure.
>

I hope my answer above clarifies. It seems you are exploring the ideas of
the latter part of my proposal that I started with "Probably secondary
related issue" (fully file system only configSets)... but I regret adding
this part because apparently it's too distracting from my primary discussion
point.

~ David

>
Re: [DISCUSS] ConfigSet ZK to file system fallback [ In reply to ]
> Ehh; I am not suggesting that configSets belong local, which would be a
> step backwards -- we put them in ZK for a reason right now :-) I'm
> suggesting we have both for the same configSet, where the deployer can
> choose which element is node resident vs cluster/ZK resident. Thanks to
> existing Solr features like configOverlay.json and/or XML xi:include plus
> one small addition of fallback resolution of configSet files from ZK to the
> local node, we'd get this ability. (see my first email).

To be clear, I didn't suggest we move all configsets to be local. I'm just
saying that having a local configset has those issues I mentioned.

The point I was trying to make is that having a single configset load
from both local and ZK may be confusing for the user and cause issues that
may be difficult to track down: Which file is Solr really reading right
now? Is it the local one or the remote one? Is there a local one on a node
or not? Is it being correctly overridden? How do I ensure that I always
have a local version of a file to override the remote?

So, I'm thinking that if we want to support this feature, a cleaner
approach could be to just have a type of configset that's defined as
"local", and then it belongs to the local filesystem. We can just prevent a
node from starting if it's supposed to have a configset that it doesn't
have. That way it's 100% clear where a config file is being read from, etc.
Maybe "configOverlay.json" is an exception and should live in ZooKeeper (and
never locally) for the config API to work, but having just "default to
local when a file is not in ZooKeeper" just confuses things IMO.
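
A rough Python sketch of that startup check, under loudly stated
assumptions: the "local" flag, the shape of the configset metadata, and both
function names are hypothetical, and nothing here corresponds to an existing
Solr API. It only illustrates failing fast when a configset marked local is
absent from the node.

```python
from typing import Dict, Iterable, List


def missing_local_configsets(
    zk_configsets: Dict[str, Dict[str, bool]], local_names: Iterable[str]
) -> List[str]:
    """List configsets flagged local=true in ZK that are absent on this node."""
    present = set(local_names)
    return sorted(
        name
        for name, flags in zk_configsets.items()
        if flags.get("local", False) and name not in present
    )


def check_startup(
    zk_configsets: Dict[str, Dict[str, bool]], local_names: Iterable[str]
) -> None:
    """Refuse to start (raise) if any configset marked local is missing."""
    missing = missing_local_configsets(zk_configsets, local_names)
    if missing:
        raise RuntimeError(
            "local configsets missing on this node: " + ", ".join(missing)
        )


# Assumed metadata: two configsets are marked local, one lives fully in ZK.
zk_configsets = {
    "products": {"local": True},
    "logs": {"local": False},
    "users": {"local": True},
}
```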

On Tue, Jan 26, 2021 at 8:38 PM David Smiley <dsmiley@apache.org> wrote:

> On Tue, Jan 26, 2021 at 1:27 PM Tomás Fernández Löbbe <
> tomasflobbe@gmail.com> wrote:
>
>> Thanks for bringing this up, David. I thought about this same situation
>> before, but I think I never convinced myself in one way or another :p. As I
>> mentioned in many other emails, I think the infrastructure and the node
>> configuration (such as solr.xml) needs to be local (at least, needs to be
>> able to be local and not forced on ZooKeeper) for various reasons.
>>
>
> I agree 100%. I think the key part there is having *choice* for each
> configuration element, and not one dictated by Solr as to what belongs
> where. The implementation of it needn't be complicated; it's a
> straight-forward idea to have the same format with conceptual layer /
> aggregation of them.
>
>
>> The same reasons exist for configsets: safe upgrades, or possible
>> node-specific configuration, as you mentioned. But Configsets have another
>> layer of complexity in my mind, which is, you don't know where you'll need
>> them... because you don't (necessarily) know where replicas of a collection
>> are going to be created. True that this is not a problem in the Docker
>> image situation you are describing, or if handled with care, but how can
>> Solr make sure of it?
>>
>
> Ehh; I am not suggesting that configSets belong local, which would be a
> step backwards -- we put them in ZK for a reason right now :-) I'm
> suggesting we have *both* for the same configSet, where the deployer can
> choose which element is node resident vs cluster/ZK resident. Thanks to
> existing Solr features like configOverlay.json and/or XML xi:include plus
> one small addition of fallback resolution of configSet files from ZK to the
> local node, we'd get this ability. (see my first email).
>
> We have a very limited ability to accomplish the broad idea today -- Java
> system properties with variable substitution in our files. But of course
> it's very limited what you can do with that, and it feels abusive to push
> it too far. It's fine for individual tunables (e.g. an integer) but not
> more aggregate things like a complete MergePolicy configuration or an
> analysis chain in a schema.
>
> We have another vaguely similar thing conceptually in Solr today --
> ImplicitPlugins.json. Probably only a few of you have heard of it. It's
> baked into solr-core's JAR. Take a look at it. What if it were a file
> that a deployer could easily replace on the node, e.g. to reduce SolrCore
> load time or for security or to add something that a company wants all
> SolrCores to have? That is along the lines of what this email thread is
> about: How can a Solr cluster deployer make settings changes (to include
> registering new plugins) that are either specific to a node and/or should
> be so for an entire cluster without each ZK resident configSet having the
> config element? *We can come up with ideas but most importantly I want
> to validate the notion that this is a desirable thing. *I think we
> agree, Thomas, but I'm unsure about Eric & Gus and anyone else for that
> matter.
>
>
>> But I think it's a valuable feature to explore. Maybe the configset needs
>> to exist in ZooKeeper and have some sort of flag (similar to secure=true)
>> where it could say "local=true", and then fail Solr instances to start if
>> the configset is not present or something? Otherwise the collection
>> creation and replica addition operations may need to know where configsets
>> are present, etc. I'm wondering if this mix you are proposing of some files
>> in ZooKeeper and some files local wouldn't complicate things too much...
>> not sure.
>>
>
> I hope my answer above clarifies. It seems you are exploring the ideas of
> the latter part of my proposal that I started with "Probably secondary
> related issue" (fully file system only configSets)... but I regret adding
> this part because apparently it's too distracting to my primary discussion
> point.
>
> ~ David
>
>>
Re: [DISCUSS] ConfigSet ZK to file system fallback [ In reply to ]
It sounds like the issue is that we need both a "per node" config and a
"per collection" config. This could all be in ZooKeeper, with a clear,
well-documented precedence order (node wins) for any attributes that
overlap. It would even make sense to have names for nodes that are not
literal machine URLs, so that one could move a node to a different
machine: a node goes down (and is listed as down by ZooKeeper), a node
comes up claiming that name, and if the name belongs to a down node, bingo,
the new node gets the same config as the old node. A new node coming up and
finding the name taken by a live node could wait for N ticks before giving
up, or could fail immediately.

Node names could be supplied at startup, or assigned automatically...

Probably want to have a default node config, and the ability to write
configs for node names that don't (yet) exist...

Just a thought... sounds good to me because a view of ZK still shows you
all the configurations; ZK is still the one source of truth. What I don't
want is multiple sources of truthishness.
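
A rough sketch of the "node wins" precedence, assuming configuration can be
flattened to string key/value attributes and that a node name without its
own entry falls back to the default node config. All names are hypothetical
and the maps merely stand in for data that would live in ZooKeeper.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; the maps stand in for ZooKeeper-resident config.
final class NodePrecedence {

    /**
     * Starts from the per-collection attributes, then applies the per-node
     * attributes on top so that the node wins wherever the two overlap.
     */
    static Map<String, String> effective(Map<String, String> collectionConfig,
                                         Map<String, Map<String, String>> nodeConfigs,
                                         Map<String, String> defaultNodeConfig,
                                         String nodeName) {
        Map<String, String> result = new LinkedHashMap<>(collectionConfig);
        // A node name with no config of its own uses the default node config.
        result.putAll(nodeConfigs.getOrDefault(nodeName, defaultNodeConfig));
        return result;
    }
}
```

Because the config is keyed by node name rather than by machine URL, a
replacement machine that claims the same name resolves to the same
attributes.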

On Thu, Feb 4, 2021 at 12:23 PM Tomás Fernández Löbbe <tomasflobbe@gmail.com>
wrote:

> > Ehh; I am not suggesting that configSets belong local, which would be a
> step backwards -- we put them in ZK for a reason right now :-) I'm
> suggesting we have both for the same configSet, where the deployer can
> choose which element is node resident vs cluster/ZK resident. Thanks to
> existing Solr features like configOverlay.json and/or XML xi:include plus
> one small addition of fallback resolution of configSet files from ZK to the
> local node, we'd get this ability. (see my first email).
>
> To be clear, I didn't suggest we move all configsets to be local. I'm just
> saying that having a local configset has those issues I mentioned.
>

--
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)
Re: [DISCUSS] ConfigSet ZK to file system fallback [ In reply to ]
Hi,

I can see the need for such flexibility, but I'm also worried that we would complicate things and make debugging harder.

If I'm not mistaken, the current logic in ZkResourceLoader is to first look in ZK, and if it is not found, look in local disk(?).

I'd prefer it to be an explicit fallback or resolution order instead of hardcoded magick.
That is, you would be able to configure a configset search path such as ["local", "zk", "somethingelse"]. This would make the resource loader prefer local files even if they exist in ZK.
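
A minimal sketch of such a search path, assuming each source can be modelled as a map from file name to content. The class name and the in-memory maps are hypothetical stand-ins; this is not how ZkResourceLoader works today.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch; in-memory maps stand in for ZK, local disk, etc.
final class SearchPathResolver {

    /**
     * Walks the configured search path in order and returns the name of the
     * first source that contains the file, or empty if no source has it.
     */
    static Optional<String> resolveSource(List<String> searchPath,
                                          Map<String, Map<String, String>> sources,
                                          String fileName) {
        for (String source : searchPath) {
            Map<String, String> files = sources.get(source);
            if (files != null && files.containsKey(fileName)) {
                return Optional.of(source);
            }
        }
        return Optional.empty();
    }
}
```

Returning the winning source name, rather than only the content, would let a node report which copy of a file it is actually reading. With the path ["local", "zk"], a stopwords.txt present in both sources resolves to "local".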

Longer term it would be nice to isolate ZK away and make config sources fully pluggable.
You could then address a too large resource explicitly such as <filter .. words="filestore://stopwords-en.txt"> or local:/stopwords-en.txt etc.

Jan

Re: [DISCUSS] ConfigSet ZK to file system fallback [ In reply to ]
>
> I'd prefer it being an explicit fallback or resolution order instead of
> hardcoded magick.
> I.e. able to configure a configset search path such as ["local", "zk",
> "somethingelse"]. This would make resource loader prefer local files even
> if they exist in ZK.
>
>
This actually winds up being a convention vs configuration thing I think.
Complexity is situational sometimes. Having a fallback convention means
that when you are experienced, you can walk into any install and know
what's going on, but if you are new, there's a learning curve. On the other
hand if we allow resolution order to be configurable, then one never knows
what's going on until you've first got an answer to "how's it been
configured". This can sometimes be a little simpler for first timers,
except that certain configurations might be a bad idea, and then they won't
see the rope until they are all tangled up in it. So for a consultant, or
a new hire with experience, the convention path is simpler, because one
always looks at specific things in a specific order that is already known. The
configuration path is only sometimes simpler for new users.

However, what I'd propose is that we have a precedence order for the
"levels" of configuration, and a single "source" for configuration. If
we need to make that source configurable, so be it, but all "primary
configuration" should come from a single source for a given cluster.

To put it another way I'm not fond of any "fallback" in where config comes
from.

By "primary configuration" I mean the Solr-specific XML/JSON/whatever.
The primary configuration could of course point to resources required
elsewhere, but those should be things like JAR files or SSO systems,
whereas the configuration artifacts that are Solr-specific should come from
one source.


--
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)
Re: [DISCUSS] ConfigSet ZK to file system fallback [ In reply to ]
On Thu, Feb 4, 2021 at 12:23 PM Tomás Fernández Löbbe <tomasflobbe@gmail.com>
wrote:

> The point I was trying to make is that, having a single configset loading
> from both, local and zk may be confusing for the user and cause issues that
> may be difficult to track: Which file is Solr really reading right now? is
> it the local one or the remote one? Is there a local one in a node or not?
> is it being correctly overridden? How do I ensure that I always have a
> local version of a file to override the remote?
>

Fair point -- it is less clear than today. I suppose anything we come up
with will be :-)


> So, I'm thinking that if we want to support this feature, a cleaner
> approach could be to just have a type of configset that's defined as
> "local", and then it belongs to the local filesystem. We can just prevent a
> node from starting if it's supposed to have a configset that doesn't have.
> It's 100% clear where a config file is being read from, etc. Maybe the
> "configOverlay.json" is an exception and should live in ZooKeeper (and
> never locally) for the config API to work, but having just "default to
> local when a file is not in ZooKeeper" just confuses things IMO.
>

Hmmm, okay. While I agree configOverlay.json & params.json would always
belong in ZK... for the rest, it's debatable. Can we get the schema there
too if it's a "managed schema"? What about resource files (e.g.
synonyms)? Whatever the answers are there, it would solve my primary
motivation -- an easier upgrade path, at least where I work.

I spoke with Ilan a couple of weeks ago about this and he proposed an
interesting idea: put a simple version number on the configSet, and let
it live in either ZK or locally. The greater version number determines which
copy wins; the other is ignored. This is somewhat similar to your idea.
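
A small sketch of that selection rule, assuming integer versions and
assuming (this is not stated above) that ZK wins a tie. The names are
hypothetical.

```java
import java.util.OptionalInt;

// Hypothetical sketch of "greater version wins"; tie-breaking is an assumption.
final class ConfigSetVersionPick {

    enum Location { ZK, LOCAL, NONE }

    /** Picks the copy with the greater version; an absent copy never wins. */
    static Location pick(OptionalInt zkVersion, OptionalInt localVersion) {
        if (!zkVersion.isPresent() && !localVersion.isPresent()) {
            return Location.NONE;
        }
        if (!localVersion.isPresent()) {
            return Location.ZK;
        }
        if (!zkVersion.isPresent()) {
            return Location.LOCAL;
        }
        // Assumption: on equal versions, prefer the ZK copy.
        return localVersion.getAsInt() > zkVersion.getAsInt() ? Location.LOCAL : Location.ZK;
    }
}
```

Note that this picks a whole configSet from one place or the other; it does
not mix individual files from both.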

Still... I'd prefer some way to establish defaults for specific
configuration elements that live on the node, while letting the other
aspects continue to reside in ZK (or have the option of resolving local as
well). In my mind, this is just about making Solr's existing defaults in
the code become configurable. It's a different way of looking at things
than saying where this or that file lives. For example, imagine a node
resident default configSet that is effectively the default that all
configSets are overlaid on top of. Field types, analyzers, merge
policies, request handlers -- it could define whatever is felt to be needed.
Then the ZK part is what is specific to a configSet for a given search app,
and it doesn't need to specify the organization-wide settings. My original
proposal doesn't quite do this directly because I thought of a cheap hack
in concert with some other Solr features that'd suffice for my aims. But
maybe I should propose more explicitly a node-local default configSet,
designed to make setting defaults simple/easy in one place and specific to
a node. One might call this configSet inheritance. I think it would lead
to configSets that are simpler to read/maintain because they would only
contain what an app needs, and not the organization-wide needs and/or Solr
defaults. WDYT?
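
To make the inheritance idea concrete, here is a rough sketch, assuming a
configSet can be reduced to named sections (field types, request handlers,
...) that each map an element name to its definition. The model and names
are hypothetical; real configSets are XML/JSON files, not maps.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: node-resident defaults overlaid by the ZK configSet.
final class ConfigSetInheritance {

    /**
     * Copies the node-local defaults, then applies the ZK-resident configSet
     * on top, section by section, so app-specific elements override or extend
     * the organization-wide ones.
     */
    static Map<String, Map<String, String>> overlay(Map<String, Map<String, String>> nodeDefaults,
                                                    Map<String, Map<String, String>> zkConfigSet) {
        Map<String, Map<String, String>> merged = new LinkedHashMap<>();
        nodeDefaults.forEach((section, entries) ->
            merged.put(section, new LinkedHashMap<>(entries)));
        zkConfigSet.forEach((section, entries) ->
            merged.computeIfAbsent(section, s -> new LinkedHashMap<>()).putAll(entries));
        return merged;
    }
}
```

Under this model the ZK part only needs to list what the search app adds or
changes; everything else comes from the node-resident default.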