https://issues.apache.org/jira/browse/SOLR-14245
There was a *production outage* at *odd hours* at my (and Noble's) client,
caused by the change above, made in Solr 8.5 onwards by *Andrzej Bialecki*.
In short, there is a bug in Solr where a replica gets "null" as its
node_name (upon invocation of a collection API command). On the rare
occasions where we encountered this in the past, the replica
would be unavailable but the system would work fine overall. However, this
change (which introduces strict validation while *reading*
Replica objects) now means that if such a situation arises (where one of
Solr's own APIs results in node_name being null in a state.json), all
SolrJ clients and all Solr nodes will break (possibly crash, and
not start back up).
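
To make the difference concrete, here is a rough sketch of the two read
behaviours. It is not Solr's actual code: the class ReplicaStateSketch, its
Entry type, and the methods readStrict and readLenient are hypothetical names
used only for illustration, and state.json is reduced to a map of replica
name to properties.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical illustration only; this is not Solr's Replica or ClusterState code.
final class ReplicaStateSketch {

    // Simplified stand-in for one replica entry read from state.json.
    static final class Entry {
        final String name;
        final String nodeName;

        Entry(String name, String nodeName) {
            this.name = name;
            this.nodeName = nodeName;
        }
    }

    // Strict read: a single replica with a null node_name aborts the whole read,
    // so every caller that loads the cluster state fails.
    static List<Entry> readStrict(Map<String, Map<String, String>> replicas) {
        List<Entry> out = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> e : replicas.entrySet()) {
            String nodeName = e.getValue().get("node_name");
            Objects.requireNonNull(nodeName, "node_name is null for replica " + e.getKey());
            out.add(new Entry(e.getKey(), nodeName));
        }
        return out;
    }

    // Lenient read: the bad replica is skipped (treated as unavailable)
    // and the rest of the cluster state is still usable.
    static List<Entry> readLenient(Map<String, Map<String, String>> replicas) {
        List<Entry> out = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> e : replicas.entrySet()) {
            String nodeName = e.getValue().get("node_name");
            if (nodeName == null) {
                continue; // only this replica is lost
            }
            out.add(new Entry(e.getKey(), nodeName));
        }
        return out;
    }
}
```

With one healthy replica and one whose node_name is null, readLenient returns
the healthy one, while readStrict throws and returns nothing at all.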
This change was rushed in, *without any discussion or review*, without
adequate testing of the failures it causes on existing systems where the
cluster state is messed up but the system is still running, and *without any
consideration for the impact on users*.
Noble and I are of the opinion that this change should be *reverted
immediately*, considering the impact on users. However, there is *strong
disagreement on Andrzej's part*.
*Mistakes* happen, but *doubling down on them irrationally* [1] will
destroy the reputation of the project, let alone the peace of mind of those
who are running Solr in production.
Does anyone have any thoughts or opinions?
[1] -
https://issues.apache.org/jira/browse/SOLR-14245?focusedCommentId=17346758&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17346758