Mailing List Archive

Doubling down on our mistakes?
https://issues.apache.org/jira/browse/SOLR-14245

There was a *production outage* at *odd hours* at my (and Noble's) client,
due to this above change in Solr 8.5 onwards by *Andrzej Bialecki*.

In short, there is some bug in Solr where a replica gets "null" as the
node_name (upon invocation of a collection API command). On the rare
occasions where we encountered such situations in the past, the replica
would be unavailable and the system would work fine overall. However, this
change (which introduces strict validation of errors while *reading*
Replica objects) now means that if such a situation arises (where one of
Solr's own APIs leaves node_name null in a state.json), all
SolrJ clients and all Solr nodes will go for a toss (possibly crash, and
not start back up).
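
To make the difference between strict and lenient validation on read
concrete, here is a minimal sketch. It is not Solr's actual Replica code:
the class ReplicaEntry, the methods parseStrict and parseLenient, and the
plain Map used for the properties are all hypothetical stand-ins for a
replica entry read from state.json.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class ReplicaValidationSketch {

    /** Hypothetical stand-in for one replica entry read from state.json. */
    static final class ReplicaEntry {
        final String name;
        final String nodeName; // null only when parsed leniently
        final boolean usable;

        ReplicaEntry(String name, String nodeName, boolean usable) {
            this.name = name;
            this.nodeName = nodeName;
            this.usable = usable;
        }
    }

    /** Strict read: a null node_name throws, so the caller cannot load the state at all. */
    static ReplicaEntry parseStrict(String name, Map<String, Object> props) {
        String nodeName = (String) props.get("node_name");
        Objects.requireNonNull(nodeName, "'node_name' must not be null");
        return new ReplicaEntry(name, nodeName, true);
    }

    /** Lenient read: a null node_name is reported, and only this replica is marked unusable. */
    static ReplicaEntry parseLenient(String name, Map<String, Object> props) {
        String nodeName = (String) props.get("node_name");
        if (nodeName == null) {
            System.err.println("WARN: replica " + name + " has a null node_name; marking it unusable");
            return new ReplicaEntry(name, null, false);
        }
        return new ReplicaEntry(name, nodeName, true);
    }

    public static void main(String[] args) {
        Map<String, Object> corrupted = new HashMap<>();
        corrupted.put("node_name", null); // the corrupted state described above

        ReplicaEntry lenient = parseLenient("core_node1", corrupted);
        System.out.println("lenient read, usable = " + lenient.usable);

        try {
            parseStrict("core_node1", corrupted);
        } catch (NullPointerException e) {
            System.out.println("strict read failed: " + e.getMessage());
        }
    }
}
```

With the lenient variant, one bad replica stays unavailable while the rest
of the state can still be read, which matches the pre-8.5 behaviour
described above. With the strict variant, the exception propagates to
whoever is reading the state.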

This change was rushed in, *without any discussions or review*, without
extensive testing for the failures it will cause on existing systems where
cluster state is messed up but system is running, and *without any
consideration for the impact on users*.

Noble and I are of the opinion that this change should be *reverted
immediately*, considering the impact to users. However, there is *strong
disagreement on Andrzej's part*.

*Mistakes* happen, but *doubling down on them irrationally* [1] will
destroy the reputation of the project, let alone the peace of mind of those
who are running Solr in production.

Does someone have any thoughts or opinions?

[1] -
https://issues.apache.org/jira/browse/SOLR-14245?focusedCommentId=17346758&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17346758

Re: Doubling down on our mistakes?
I have some thoughts, if I may.... please stop acting so quickly and *violently* at every annoyance you have around here. Your trigger finger and attitude, -Ishan- are disruptive to _community_ and lack the basic tactful pleasantries.

Calling a respected, longtime badass committer and all-around incredible ninja (*bows, respect, @ab) "irrational" is itself acting to destroy the reputation of the project and the civility of a public and forever-archived mailing list. Not to mention the other comments that I personally have taken offense to in reading your and Noble's "attack" this morning. My feelings are hurt, and I'm greatly disappointed in your words, quick attacking off the cuff regularly rude (IMO) because you happened to have a bad day. It is not his fault your cluster went down... it is the lack of testing and care taken on *YOUR SIDE*.

Peace,
Erik


> On May 18, 2021, at 7:26 AM, Ishan Chattopadhyaya <ichattopadhyaya@gmail.com> wrote:
>
> https://issues.apache.org/jira/browse/SOLR-14245
>
> There was a production outage at odd hours at my (and Noble's) client, due to this above change in Solr 8.5 onwards by Andrzej Bialecki.
>
> In short, there is some bug in Solr where a replica gets "null" as the node_name (upon invocation of a collection API command). On the rare occasions where we encountered such situations in the past, the replica would be unavailable and the system would work fine overall. However, this change (which introduces strict validation of errors while *reading* Replica objects) now means that if such a situation arises (where one of Solr's own APIs leaves node_name null in a state.json), all SolrJ clients and all Solr nodes will go for a toss (possibly crash, and not start back up).
>
> This change was rushed in, without any discussions or review, without extensive testing for the failures it will cause on existing systems where cluster state is messed up but system is running, and without any consideration for the impact on users.
>
> Noble and I are of the opinion that this change should be reverted immediately, considering the impact to users. However, there is strong disagreement on Andrzej's part.
>
> Mistakes happen, but doubling down on them irrationally [1] will destroy the reputation of the project, let alone the peace of mind of those who are running Solr in production.
>
> Does someone have any thoughts or opinions?
>
> [1] - https://issues.apache.org/jira/browse/SOLR-14245?focusedCommentId=17346758&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17346758

Re: Doubling down on our mistakes?
Ishan, as I pointed out in Jira, I don’t care for you implying that I have evil intentions. I also resent your implication that I’m behaving irrationally or don’t care for the users. Those of you who are interested may read the comments in Jira and judge for themselves.

You conveniently don’t mention that I WITHDREW my objection, and instead proposed a lenient validation (but validation nonetheless!). It’s easy to scream “revert! revert!” but it actually takes some consideration to properly address the original purpose of this change - that is, detecting and avoiding the corruption of replica state. Let’s focus on this and not on pointing fingers.

As for the production outage - I’m sorry this happened to you. As I hope you and Noble and others are sorry for other inadvertently introduced bugs, which I’m sure brought down many clusters at inconvenient hours...


Re: Doubling down on our mistakes?
Hi AB,
Please accept my apologies for the heated discussion. The objective
was not that at all.

I saw the real damage that was caused to our client. It was
devastating. We were a little worried about the same happening to
another user who might upgrade.

So, I suggested a revert.

Whatever happened after that was due to the stress the situation has
put us in. We asked our client to upgrade from 8.4 to 8.8 and the
cluster had a meltdown.

Please let's forget about what has transpired in the JIRA and let's
get back to saving the next user from such a meltdown:

1) Warn our users against upgrading from 8.4 (if they have not already done it)
2) Revert this change and do a break-fix release
3) Fix the actual bug that caused the null node_name in the first place

regards
Noble Paul


On Tue, May 18, 2021 at 10:22 PM Andrzej Białecki <ab@getopt.org> wrote:
>
> Ishan, as I pointed out in Jira, I don’t care for you implying that I have evil intentions. I also resent your implication that I’m behaving irrationally or don’t care for the users. Those of you who are interested may read the comments in Jira and judge for themselves.
>
> You conveniently don’t mention that I WITHDREW my objection, and instead proposed a lenient validation (but validation nonetheless!). It’s easy to scream “revert! revert!” but it actually takes some consideration to properly address the original purpose of this change - that is, detecting and avoiding the corruption of replica state. Let’s focus on this and not on pointing fingers.
>
> As for the production outage - I’m sorry this happened to you. As I hope you and Noble and others are sorry for other inadvertently introduced bugs, which I’m sure brought down many clusters at inconvenient hours...



Re: Doubling down on our mistakes?
I apologize for the harsh words, and personally to Andrzej for hurting your
feelings. I had no such intentions.

> You conveniently don’t mention that I WITHDREW my objection, and instead
> proposed a lenient validation (but validation nonetheless!).
Yes, let me mention that you agreed in principle to reduce the impact of
the change (even though not to revert it completely). I welcome that and thank
you for that. By the time you replied on JIRA, I had already sent this mail.

> I see no urgency at all in this matter. This can be handled as day-to-day
> bug fixing as usual.
I think this requires an immediate notification to all users to be aware of
this situation before upgrading. Also, an immediate breakfix should be
helpful for them.

> My feelings are hurt, and I'm greatly disappointed in your words, quick
> attacking off the cuff regularly rude (IMO) because you happened to have a
> bad day.
I apologize.

How I saw things is that we have a commitment to our users to give them
good quality software that they can rely on. My intention was not to attack
Andrzej personally, but to bring about collective awareness regarding this
problem: that we, as a community, don't care enough for our users. We need
to get better at testing, get better at reviews, better at benchmarks, etc.
Individually, we all have the best of intentions, and obviously so does
Andrzej. However, we need to get better, and I wanted this to be a starting
point in that conversation. Clearly, I got carried away and I apologize for
that.


Re: Doubling down on our mistakes?
To everyone following this e-mail thread.

The Project Management Committees have discussed the matter and would like to draw attention to "Statement from the Solr and Lucene PMC regarding recent Code of Conduct violations" posted to this list today, and linked below:

https://lists.apache.org/thread.html/r9875b53aeaebca8678ee0127562d8a35c7938906fbd318ac17ba011d%40%3Cdev.solr.apache.org%3E

Jan Høydahl
Solr PMC Chair

> On 21 May 2021, at 05:52, David Smiley <dsmiley@apache.org> wrote:
>
> I removed dev@lucene.apache.org from my response here. Please everyone do the same and don't email both Lucene & Solr at the same time. I recall that's an old best practice / rule in general -- never address an email to more than one list.
>
> I agree 100% with Erick. It's shameful and looks bad on our community and it's just so not necessary. It's a clear code-of-conduct violation. I hope Andrzej is "okay" emotionally; I'd be a mess in his shoes. At least the apologies are very reasonable to me; I was expecting Ishan/Noble to dig their heels in (as I witnessed some months ago) and I'm relieved not to see that.
>
> The internal complexity of Solr (esp. SolrCloud) is very high; it's difficult to make changes and not have some worry that maybe a change has some ill effect. Yet we can't simply not touch it. The irony here is that the change in question was targeted directly at improving the quality of Solr; I love those types of changes, honestly.
>
> Perhaps Solr getting its own Docker images as part of the project may lead to automated Solr-upgrade testing to catch compatibility bugs? Maybe that might be done at the K8S Solr Operator level integration tests since I'm guessing the Operator facilitates upgrades already?
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
> On Tue, May 18, 2021 at 8:54 AM Ishan Chattopadhyaya <ichattopadhyaya@gmail.com> wrote:
> I apologize for the harsh words, and personally to Andrzej for hurting your feelings. I had no such intentions.
>
> > You conveniently don’t mention that I WITHDREW my objection, and instead proposed a lenient validation (but validation nonetheless!).
> Yes, let me mention that you agreed in principle to reduce the impact of the change (even though not to revert it completely). I welcome that and thank you for that. By the time you replied on JIRA, I had already sent this mail.
>
> > I see no urgency at all in this matter. This can be handled as day-to-day bug fixing as usual.
> I think this requires an immediate notification to all users to be aware of this situation before upgrading. Also, an immediate breakfix should be helpful for them.
>
> > My feelings are hurt, and I'm greatly disappointed in your words, quick attacking off the cuff regularly rude (IMO) because you happened to have a bad day.
> I apologize.
>
> How I saw things is that we have a commitment to our users to give them good quality software that they can rely on. My intention was not to attack Andrzej personally, but to bring about collective awareness regarding this problem: that we, as a community, don't care enough for our users. We need to get better at testing, get better at reviews, better at benchmarks, etc. Individually, we all have the best of intentions, and obviously so does Andrzej. However, we need to get better, and I wanted this to be a starting point in that conversation. Clearly, I got carried away and I apologize for that.