Mailing List Archive

OverseerStatusTest recent failures
I encountered a failure from OverseerStatusTest locally. According to our
test failure trends, this guy only just recently started failing ~4-5% of
the time, but previously was fine. Only master branch.

http://fucit.org/solr-jenkins-reports/history-trend-of-recent-failures.html#series/org.apache.solr.cloud.OverseerStatusTest.test

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley
Re: OverseerStatusTest recent failures
Thank you David for reporting this.

This seems to be due to my recent changes. I can reproduce the failure
locally and will look at it tomorrow.

With the distributed cluster state updates, I've introduced randomization in
tests between Overseer-based cluster state updates and distributed cluster
state updates. This failure seems to happen in the distributed state update
case. I suspect it is due to Overseer returning fewer stats than the test
expects (which is expected behavior: Overseer cannot return stats about
cluster state updates if it does not handle cluster state updates).

The following line in the logs indicates that the run is using distributed
cluster state updates:
972874 INFO (jetty-launcher-8973-thread-2) [ ]
o.a.s.c.DistributedClusterStateUpdater Creating
DistributedClusterStateUpdater with useDistributedStateUpdate=true. Solr
will be using distributed cluster state updates.
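
A minimal sketch of how one might check a captured test log for this marker. The function name is hypothetical, and the "=false" variant of the marker is an assumption (only the "=true" line appears above). Whitespace is collapsed first because the log line may be wrapped, as it is in this email.

```python
import re

# Marker text taken from the log line quoted above. The "false" alternative
# is an assumption; only "true" is shown in the thread.
MODE_PATTERN = re.compile(r"useDistributedStateUpdate=(true|false)")


def uses_distributed_updates(log_text):
    """Return True/False for the first mode marker found, or None if absent."""
    # Collapse all whitespace so a marker on a wrapped line is still searchable.
    flattened = " ".join(log_text.split())
    match = MODE_PATTERN.search(flattened)
    if match is None:
        return None
    return match.group(1) == "true"


sample = """972874 INFO (jetty-launcher-8973-thread-2) [ ]
o.a.s.c.DistributedClusterStateUpdater Creating
DistributedClusterStateUpdater with useDistributedStateUpdate=true. Solr
will be using distributed cluster state updates."""

print(uses_distributed_updates(sample))
```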

Ilan


On Sat, Feb 20, 2021 at 3:00 PM David Smiley <dsmiley@apache.org> wrote:

> I encountered a failure from OverseerStatusTest locally. According to our
> test failure trends, this guy only just recently started failing ~4-5% of
> the time, but previously was fine. Only master branch.
>
>
> http://fucit.org/solr-jenkins-reports/history-trend-of-recent-failures.html#series/org.apache.solr.cloud.OverseerStatusTest.test
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
Re: OverseerStatusTest recent failures
Indeed the issue is due to my changes.

In OverseerStatusCmd I skipped some stat collection when running in
distributed cluster state updates mode because I thought these stats were
related only to cluster state updates.
That was too aggressive: some of the stats are related to the Collection API.

When running in distributed cluster state updates mode, I will skip only the
stats related to the cluster state updater and restore the Collection API
stats. Otherwise, all stats are returned.
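
The intended behavior can be sketched roughly as follows. This is an illustration only, not the OverseerStatusCmd code: the stat keys and the function name are hypothetical placeholders, and the real keys may differ.

```python
# Hypothetical stat keys, for illustration only.
CLUSTER_STATE_STATS = {"cluster_state_operations", "cluster_state_queue"}


def select_stats(all_stats, distributed_updates):
    """Return the stats to report for the given cluster state update mode."""
    if not distributed_updates:
        # Overseer handles cluster state updates: report everything.
        return dict(all_stats)
    # Distributed mode: drop only the cluster state updater stats and keep
    # the rest, including the Collection API stats.
    return {k: v for k, v in all_stats.items() if k not in CLUSTER_STATE_STATS}


example = {
    "cluster_state_operations": 12,
    "cluster_state_queue": 0,
    "collection_operations": 3,
}
print(select_stats(example, distributed_updates=True))
```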

Tomorrow...

Ilan

On Sun, Feb 21, 2021 at 12:22 AM Ilan Ginzburg <ilansolr@gmail.com> wrote:

> Thank you David for reporting this.
>
> This seems to be due to my recent changes. I can reproduce the failure
> locally and will look at it tomorrow.
>
> With the distributed cluster state updates, I've introduced randomization in
> tests between Overseer-based cluster state updates and distributed cluster
> state updates. This failure seems to happen in the distributed state update
> case. I suspect it is due to Overseer returning fewer stats than the test
> expects (which is expected behavior: Overseer cannot return stats about
> cluster state updates if it does not handle cluster state updates).
>
> The following line in the logs indicates that the run is using distributed
> cluster state updates:
> 972874 INFO (jetty-launcher-8973-thread-2) [ ]
> o.a.s.c.DistributedClusterStateUpdater Creating
> DistributedClusterStateUpdater with useDistributedStateUpdate=true. Solr
> will be using distributed cluster state updates.
>
> Ilan
Re: OverseerStatusTest recent failures
Interesting. Do you have a guess as to why the failures there are ~5% and
not 100% reproducible?

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Sat, Feb 20, 2021 at 6:41 PM Ilan Ginzburg <ilansolr@gmail.com> wrote:

> Indeed the issue is due to my changes.
>
> In OverseerStatusCmd I skipped some stat collection when running in
> distributed cluster state updates mode because I thought these stats were
> related only to cluster state updates.
> That was too aggressive: some of the stats are related to the Collection API.
>
> When running in distributed cluster state updates mode, I will skip only the
> stats related to the cluster state updater and restore the Collection API
> stats. Otherwise, all stats are returned.
>
> Tomorrow...
>
> Ilan
Re: OverseerStatusTest recent failures
Yes Marcus, this is the commit.

David, I would have expected 50% failures, as 50% of the runs use
distributed updates. I’ll try to understand better as I fix the issue.
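
To illustrate the 50% expectation, here is a small simulation. It assumes (hypothetically) that every run in distributed mode fails, every run in Overseer mode passes, and the mode is picked with probability 0.5 per run; the function name and seed are made up for the example.

```python
import random


def simulate_failure_rate(runs, seed=0, p_distributed=0.5):
    """Fraction of simulated runs that pick distributed mode (and so fail)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    failures = sum(1 for _ in range(runs) if rng.random() < p_distributed)
    return failures / runs


rate = simulate_failure_rate(10_000)
print(f"simulated failure rate: {rate:.3f}")
```

With enough runs the simulated rate settles near 0.5, which is why a ~5% rate looked surprising.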

Ilan

On Sun 21 Feb 2021 at 06:17, David Smiley <dsmiley@apache.org> wrote:

> Interesting. Do you have a guess as to why the failures there are ~5% and
> not 100% reproducible?
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
Re: OverseerStatusTest recent failures
Searching my Jenkins folder for failures of this test (label:jenkins
"FAILED: org.apache.solr.cloud.OverseerStatusTest.test"), 26 emails match.
Searching for all Jenkins master build emails since the first failure
email found above (2 days ago), I see 40 messages.
26 out of 40 (65%) is not far from the expected 50% failure rate.
I believe the ratio in the graph you sent, David (currently at 5.7%), is
averaged over a week and covers all branches (I did some other stats on
Jenkins emails that tend to confirm this assumption).
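
A quick worked check of these numbers, as a sketch. The counts (26, 40) and the 5.7% figure come from this email; treating each master run as an independent coin flip, and treating the 26 failures as the only ones in the graph's window, are simplifying assumptions.

```python
from math import comb

failures, runs = 26, 40           # counts from the Jenkins emails
observed = failures / runs        # observed failure rate on master

# Chance of at least 26 failures in 40 runs if each run fails with p = 0.5
# (upper tail of a binomial distribution).
tail = sum(comb(runs, k) for k in range(failures, runs + 1)) / 2**runs

# How many runs the graph's window would need to contain for 26 failures
# to show up as 5.7% (assumes no other failures in that window).
implied_runs = failures / 0.057

print(f"observed rate: {observed:.2f}")
print(f"P(at least {failures} failures of {runs} at p=0.5): {tail:.3f}")
print(f"runs implied by a 5.7% rate: {implied_runs:.0f}")
```

The tail probability comes out at about 0.04, so 26 of 40 is on the high side of what a 50% rate would produce but not wildly off. A window of several hundred runs across a week and all branches would be enough to dilute the master-only failures down to a single-digit percentage.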

On Sun, Feb 21, 2021 at 10:53 AM Ilan Ginzburg <ilansolr@gmail.com> wrote:

> Yes Marcus, this is the commit.
>
> David, I would have expected 50% failures, as 50% of the runs use
> distributed updates. I’ll try to understand better as I fix the issue.
>
> Ilan
Re: OverseerStatusTest recent failures
I have fixed the issue. A PR is out:
https://github.com/apache/lucene-solr/pull/2410/files
Most of the work was documenting which stats are actually returned. Now
OverseerStatusCmd has more comment lines than code lines.

Will merge it shortly.

Ilan



On Sun, Feb 21, 2021 at 6:05 PM Ilan Ginzburg <ilansolr@gmail.com> wrote:

> Searching my Jenkins folder for failures of this test (label:jenkins
> "FAILED: org.apache.solr.cloud.OverseerStatusTest.test"), 26 emails match.
> Searching for all Jenkins master build emails since the first failure
> email found above (2 days ago), I see 40 messages.
> 26 out of 40 (65%) is not far from the expected 50% failure rate.
> I believe the ratio in the graph you sent, David (currently at 5.7%), is
> averaged over a week and covers all branches (I did some other stats on
> Jenkins emails that tend to confirm this assumption).
Re: OverseerStatusTest recent failures
Ah; that makes total sense; thanks.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Sun, Feb 21, 2021 at 12:06 PM Ilan Ginzburg <ilansolr@gmail.com> wrote:

> Searching my Jenkins folder for failures of this test (label:jenkins
> "FAILED: org.apache.solr.cloud.OverseerStatusTest.test"), 26 emails match.
> Searching for all Jenkins master build emails since the first failure
> email found above (2 days ago), I see 40 messages.
> 26 out of 40 (65%) is not far from the expected 50% failure rate.
> I believe the ratio in the graph you sent, David (currently at 5.7%), is
> averaged over a week and covers all branches (I did some other stats on
> Jenkins emails that tend to confirm this assumption).