Mailing List Archive

Re: [JENKINS] Lucene-9.x-MacOSX (64bit/jdk-18) - Build # 978 - Unstable!
A test timed out. I've beasted with the same settings but can't
reproduce. Either JVM bug somewhere or cosmic interference...

Dawid

On Wed, Aug 24, 2022 at 3:32 AM Policeman Jenkins Server
<jenkins@thetaphi.de> wrote:
>
> Build: https://jenkins.thetaphi.de/job/Lucene-9.x-MacOSX/978/
> Java: 64bit/jdk-18 -XX:+UseCompressedOops -XX:+UseSerialGC
>
> 2 tests failed.
> FAILED: org.apache.lucene.analysis.ko.TestKoreanReadingFormFilter.testRandomData
>
> Error Message:
> java.lang.Exception: Test abandoned because suite timeout was reached.
>
> Stack Trace:
> java.lang.Exception: Test abandoned because suite timeout was reached.
> at __randomizedtesting.SeedInfo.seed([9AA6F3EBA279C5BA]:0)
>
>
> FAILED: org.apache.lucene.analysis.ko.TestKoreanReadingFormFilter.classMethod
>
> Error Message:
> java.lang.Exception: Suite timeout exceeded (>= 7200000 msec).
>
> Stack Trace:
> java.lang.Exception: Suite timeout exceeded (>= 7200000 msec).
> at __randomizedtesting.SeedInfo.seed([9AA6F3EBA279C5BA]:0)
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: builds-unsubscribe@lucene.apache.org
> For additional commands, e-mail: builds-help@lucene.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: [JENKINS] Lucene-9.x-MacOSX (64bit/jdk-18) - Build # 978 - Unstable! [ In reply to ]
Hi Dawid, I looked at this and also https://github.com/apache/lucene/issues/7687

If you look at the instances and how sporadic they are, the problem
could be caused by TimeoutSuite using wall-clock time in
com.carrotsearch.randomizedtesting? Especially in virtual machines,
wall-clock time can be extremely inaccurate when you spin them up,
then there's a big correction (via NTP or VM agent).

I have no proof this is what is happening, except to say, I think it
would be better if randomizedtesting used monotonic time (nanoTime)
rather than wall-clock time (currentTimeMillis). It would make it more
robust.
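
To make the suggestion concrete, here is a minimal sketch of a timeout
check built on System.nanoTime(), next to the wall-clock style check it
would replace. This is not the randomizedtesting implementation: the
class MonotonicDeadline and its methods are hypothetical names made up
for illustration. Only the 7200000 msec figure comes from the report
above.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: a timeout deadline measured on the monotonic clock. */
final class MonotonicDeadline {
  private final long startNanos;
  private final long timeoutNanos;

  /** Starts the deadline now, using the monotonic clock. */
  MonotonicDeadline(long timeout, TimeUnit unit) {
    this(System.nanoTime(), unit.toNanos(timeout));
  }

  /** Lets a caller (or a test) supply the starting nanoTime reading. */
  MonotonicDeadline(long startNanos, long timeoutNanos) {
    this.startNanos = startNanos;
    this.timeoutNanos = timeoutNanos;
  }

  boolean isExpired() {
    return isExpiredAt(System.nanoTime());
  }

  boolean isExpiredAt(long nowNanos) {
    // Only differences of nanoTime readings are meaningful, so subtract
    // first and compare the elapsed time; this also survives wrap-around.
    return nowNanos - startNanos >= timeoutNanos;
  }

  /** The wall-clock style check: a clock step changes the result directly. */
  static boolean wallClockExpired(long startMillis, long nowMillis, long timeoutMillis) {
    return nowMillis - startMillis >= timeoutMillis;
  }

  public static void main(String[] args) {
    // 7200000 msec is the suite timeout quoted in the Jenkins report.
    MonotonicDeadline suite = new MonotonicDeadline(7_200_000L, TimeUnit.MILLISECONDS);
    System.out.println("expired right after start: " + suite.isExpired());
  }
}
```

With the wall-clock variant, a forward step of a few hours one minute
into the run would report a timeout immediately. The monotonic variant
only looks at elapsed time, so it should be unaffected by such a step.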


On Wed, Aug 24, 2022 at 4:48 AM Dawid Weiss <dawid.weiss@gmail.com> wrote:
>
> A test timed out. I've beasted with the same settings but can't
> reproduce. Either JVM bug somewhere or cosmic interference...
>
> Dawid
Re: [JENKINS] Lucene-9.x-MacOSX (64bit/jdk-18) - Build # 978 - Unstable! [ In reply to ]
Damn. I know about it but never had it happen to me. You're right in
that it could be a reason and it's definitely one of the aspects I can
take off the checklist. It looks strange because those timeouts are
fairly high - the time correction would indeed have to be significant
for this to fail (and in the middle of the process?!). Anyway, I'll
look into this - thanks for the pointer!

Dawid

On Wed, Aug 24, 2022 at 1:39 PM Robert Muir <rcmuir@gmail.com> wrote:
>
> Hi Dawid, I looked at this and also https://github.com/apache/lucene/issues/7687
>
> If you look at the instances and how sporadic they are, the problem
> could be caused by TimeoutSuite using wall-clock time in
> com.carrotsearch.randomizedtesting? Especially in virtual machines,
> wall-clock time can be extremely inaccurate when you spin them up,
> then there's a big correction (via NTP or VM agent).
>
> I have no proof this is what is happening, except to say, I think it
> would be better if randomizedtesting used monotonic time (nanoTime)
> rather than wall-clock time (currentTimeMillis). It would make it more
> robust.
Re: [JENKINS] Lucene-9.x-MacOSX (64bit/jdk-18) - Build # 978 - Unstable! [ In reply to ]
If we look at the 7687 issue, there are definitely some failures that
can be explained by unruly tests randomly behaving badly. But a few of
those (such as simple stemmer tests) look suspicious to me.
I've fought the issue with my own tests (non-Java) and it's amazing how
much stuff can break if it relies on wall-clock time and the clock
gets stepped. I'm talking about basic, mature, 20-year-old C code too :)
It is also surprising how large these clock corrections can be with
virtual machines.

To really confirm it, we'd need "system logs" as well to correlate the
NTP activity with the failure. With VirtualBox Jenkins builds, I do
this by enabling a serial console to file and configuring syslog to log
to /dev/console. This "system log file" is then just another artifact
that Jenkins saves away for debugging. That's how I found the problem
in my own tests.
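
As a complement to system logs, a process can also notice clock steps
itself by comparing how far the wall clock and the monotonic clock
moved between two samples. The following is only a sketch under that
assumption; ClockStepDetector and its tolerance parameter are
hypothetical names, not part of Lucene or randomizedtesting.

```java
/** Hypothetical sketch: flags wall-clock steps by comparing clock deltas. */
final class ClockStepDetector {
  private final long toleranceMillis;
  private long lastWallMillis;
  private long lastMonoNanos;

  ClockStepDetector(long toleranceMillis, long wallMillis, long monoNanos) {
    this.toleranceMillis = toleranceMillis;
    this.lastWallMillis = wallMillis;
    this.lastMonoNanos = monoNanos;
  }

  /**
   * Returns the apparent step in milliseconds since the previous sample
   * (positive means the wall clock jumped forward, negative backwards),
   * or 0 if both clocks agree within the tolerance.
   */
  long sample(long wallMillis, long monoNanos) {
    long wallDelta = wallMillis - lastWallMillis;
    long monoDelta = (monoNanos - lastMonoNanos) / 1_000_000L;
    lastWallMillis = wallMillis;
    lastMonoNanos = monoNanos;
    long drift = wallDelta - monoDelta;
    // Note: a paused or suspended VM may also show up as a mismatch here.
    return Math.abs(drift) > toleranceMillis ? drift : 0L;
  }

  public static void main(String[] args) throws InterruptedException {
    ClockStepDetector detector =
        new ClockStepDetector(500L, System.currentTimeMillis(), System.nanoTime());
    // A few samples for demonstration; a real watcher would keep running.
    for (int i = 0; i < 3; i++) {
      Thread.sleep(100L);
      long step = detector.sample(System.currentTimeMillis(), System.nanoTime());
      if (step != 0L) {
        System.err.println("wall clock stepped by about " + step + " ms");
      }
    }
  }
}
```

Logging the reported step with a timestamp into the build output would
give something to line up against NTP or VM agent activity in the
system log.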

On Wed, Aug 24, 2022 at 9:08 AM Dawid Weiss <dawid.weiss@gmail.com> wrote:
>
> Damn. I know about it but never had it happen to me. You're right in
> that it could be a reason and it's definitely one of the aspects I can
> take off the checklist. It looks strange because those timeouts are
> fairly high - the time correction would indeed have to be significant
> for this to fail (and in the middle of the process?!). Anyway, I'll
> look into this - thanks for the pointer!
>
> Dawid
Re: [JENKINS] Lucene-9.x-MacOSX (64bit/jdk-18) - Build # 978 - Unstable! [ In reply to ]
Hi,

This is the macOS VirtualBox VM. It often has time shifts caused by
VirtualBox, and the NTP daemon of OS X is bullshit (no chrony).

Actually, earlier versions of macOS had a bug in the OS libc that made
apps segfault on backwards jumps of wall time; it was fixed a few years
ago. Now it looks like only Gradle/Java sometimes hangs because of
this. macOS and backwards-jumping time do not fit well! Maybe that's a
reason why Apple does not like their OS virtualized :-) Their bullshit
kernel only works on 100% Intel CPUs with all hardware behaving
exactly in order with respect to time.

Uwe

On 24.08.2022 at 15:07, Dawid Weiss wrote:
> Damn. I know about it but never had it happen to me. You're right in
> that it could be a reason and it's definitely one of the aspects I can
> take off the checklist. It looks strange because those timeouts are
> fairly high - the time correction would indeed have to be significant
> for this to fail (and in the middle of the process?!). Anyway, I'll
> look into this - thanks for the pointer!
>
> Dawid
--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: uwe@thetaphi.de


Re: [JENKINS] Lucene-9.x-MacOSX (64bit/jdk-18) - Build # 978 - Unstable! [ In reply to ]
On Wed, Aug 24, 2022 at 11:40 AM Uwe Schindler <uwe@thetaphi.de> wrote:
>
> Hi,
>
> This is the macOS VirtualBox VM. It often has time shifts caused by
> VirtualBox, and the NTP daemon of OS X is bullshit (no chrony).
>
> Actually, earlier versions of macOS had a bug in the OS libc that made
> apps segfault on backwards jumps of wall time; it was fixed a few years
> ago. Now it looks like only Gradle/Java sometimes hangs because of
> this. macOS and backwards-jumping time do not fit well! Maybe that's a
> reason why Apple does not like their OS virtualized :-) Their bullshit
> kernel only works on 100% Intel CPUs with all hardware behaving
> exactly in order with respect to time.
>

Honestly, some of it is VirtualBox, too. Once you eliminate or work
around wall-clock time and deal only with monotonic time, there can
still be annoying issues with monotonic time itself. With a Linux
guest, you'll see strange stuff, such as the kernel's soft lockup
detector tripping a lot when this happens. There are corresponding
errors printed in the vbox logging too. I set VBOX_RELEASE_LOG_DEST to
allow archiving the VirtualBox VM log for Jenkins pickup along with
other logs: it helps with debugging shit like this. For Linux guests, I
basically exhausted all possible kernel clock sources and found that
the virtualized kvm-clock one, which is selected by default, is the
best by far. I'm guessing macOS may not support this, which probably
makes things worse there. I found in my environment that, for Linux
guests, the remaining timer issues can be greatly improved with
'vboxmanage setextradata <vmid> VBoxInternal/TM/TSCModeSwitchAllowed 0'.
Don't ask me what it does :)
