Mailing List Archive

Patch to change murmurhash implementation slightly
Hey all,

I've been experimenting with fixing some low-hanging performance fruit in
the Elasticsearch codebase, and came across the fact that the MurmurHash
implementation used by BytesRef.hashCode() reads 4 bytes per loop
iteration (likely an artifact of 32-bit architectures, which matter less
and less). I made a small fix to read 8 bytes per loop iteration instead;
I expected a very small impact (2-3% CPU or so over an indexing run in
Elasticsearch), but got a pretty nontrivial throughput improvement across
a few indexing benchmarks.
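For context, the hash in question has the shape of murmurhash3_x86_32, whose body loop consumes 4 bytes per iteration. A from-memory sketch of that shape (not the exact Lucene code):

```java
public class MurmurSketch {
  // murmurhash3_x86_32, written out to show the 4-bytes-per-iteration body loop.
  public static int hash(byte[] data, int offset, int len, int seed) {
    final int c1 = 0xcc9e2d51, c2 = 0x1b873593;
    int h1 = seed;
    final int roundedEnd = offset + (len & 0xfffffffc); // body: full 4-byte blocks

    for (int i = offset; i < roundedEnd; i += 4) {
      // Little-endian load of 4 bytes -- one hash round per 4 bytes of input.
      int k1 = (data[i] & 0xff) | ((data[i + 1] & 0xff) << 8)
             | ((data[i + 2] & 0xff) << 16) | (data[i + 3] << 24);
      k1 *= c1;
      k1 = Integer.rotateLeft(k1, 15);
      k1 *= c2;
      h1 ^= k1;
      h1 = Integer.rotateLeft(h1, 13);
      h1 = h1 * 5 + 0xe6546b64;
    }

    // Tail: 0-3 leftover bytes.
    int k1 = 0;
    switch (len & 3) {
      case 3: k1 = (data[roundedEnd + 2] & 0xff) << 16; // fall through
      case 2: k1 |= (data[roundedEnd + 1] & 0xff) << 8; // fall through
      case 1: k1 |= data[roundedEnd] & 0xff;
              k1 *= c1;
              k1 = Integer.rotateLeft(k1, 15);
              k1 *= c2;
              h1 ^= k1;
    }

    // Finalization mix.
    h1 ^= len;
    h1 ^= h1 >>> 16;
    h1 *= 0x85ebca6b;
    h1 ^= h1 >>> 13;
    h1 *= 0xc2b2ae35;
    h1 ^= h1 >>> 16;
    return h1;
  }
}
```

The patch keeps this structure but widens the body loop's loads.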

I tried running Lucene-only benchmarks and managed to run the example
from https://github.com/mikemccand/luceneutil - but I couldn't figure out
how to run the indexing benchmarks or how to interpret their results.

Could someone help me in running the benchmarks for the attached patch?

Cheers,
Thomas
Re: Patch to change murmurhash implementation slightly [ In reply to ]
I did a quick run with your patch, but since I turned on CMS
(ConcurrentMergeScheduler) as well as TieredMergePolicy, I'm not sure how
fair the comparison is. Here's the result:
Candidate:
Indexer: indexing done (890209 msec); total 33332620 docs
Indexer: waitForMerges done (71622 msec)
Indexer: finished (961877 msec)
Baseline:
Indexer: indexing done (909706 msec); total 33332620 docs
Indexer: waitForMerges done (54775 msec)
Indexer: finished (964528 msec)

For a more accurate comparison I guess it's better to use one of the
LogMergePolicy implementations and turn off CMS? If you want to run it
yourself, you can find the lines I quoted in the log file.

Patrick

On Mon, Apr 24, 2023 at 12:34 PM Thomas Dullien
<thomas.dullien@elastic.co.invalid> wrote:

Re: Patch to change murmurhash implementation slightly [ In reply to ]
Hey,

I know nothing about what comparisons are fair or not :-).

Could you share a command line for running indexing benchmarks? That'd
already get me started...

Cheers,
Thomas


On Tue, 25 Apr 2023, 07:58 Patrick Zhai, <zhai7631@gmail.com> wrote:

Re: Patch to change murmurhash implementation slightly [ In reply to ]
I think the results of the benchmark will depend on the properties of
the indexed terms. For English Wikipedia (luceneutil) the average word
length is around 5 bytes, so this optimization may not do much.

On Tue, Apr 25, 2023 at 1:58 AM Patrick Zhai <zhai7631@gmail.com> wrote:

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: Patch to change murmurhash implementation slightly [ In reply to ]
For a truly "pure" indexing test I usually use a single thread for
indexing, and SerialMergeScheduler (using that single thread to also do
single-threaded merging). It makes the indexing take forever lol but it
produces "comparable" results.
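In code, that pure-indexing setup is roughly the following (a sketch against the Lucene API; luceneutil drives this through its own benchmark config, so treat the surrounding class and method names as illustrative):

```java
// Sketch of an IndexWriterConfig for reproducible single-threaded runs.
// Requires the lucene-core and lucene-analysis-common jars on the classpath.
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LogDocMergePolicy;
import org.apache.lucene.index.SerialMergeScheduler;

class PureIndexingConfig {
  static IndexWriterConfig pureIndexingConfig() {
    IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
    // Merges run on the (single) indexing thread itself: slow, but comparable runs.
    iwc.setMergeScheduler(new SerialMergeScheduler());
    // Deterministic merge policy instead of TieredMergePolicy.
    iwc.setMergePolicy(new LogDocMergePolicy());
    return iwc;
  }
}
```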

But ... this sounds like a great change anyway? Do we really need to gate
it on benchmark results? Do we think there could be a downside, e.g.
slower indexing on (the dwindling) 32-bit CPUs?

Mike McCandless

http://blog.mikemccandless.com


On Tue, Apr 25, 2023 at 7:39 AM Robert Muir <rcmuir@gmail.com> wrote:

Re: Patch to change murmurhash implementation slightly [ In reply to ]
I think from my perspective it has nothing to do with CPUs being
32-bit or 64-bit, and more to do with the average length of terms in
most languages being smaller than 8. For the languages with longer
word lengths, it's usually because of complex morphology that most users
would stem away. So doing 4 bytes at a time seems optimal IMO.
Languages from nature don't care about your CPU.

On Tue, Apr 25, 2023 at 8:52 AM Michael McCandless
<lucene@mikemccandless.com> wrote:
Re: Patch to change murmurhash implementation slightly [ In reply to ]
Is an average word length <= 4 realistic, though? I mean, even the English
wiki corpus averages ~5, which would require two calls to the Lucene layer
instead of one, i.e. extra layers of virtual dispatch that are unnecessary.

You're not going to pay any extra cycles for reading 8 bytes instead of 4,
so the cost of doing so will be the same - while speeding up the cases
where 4 isn't quite enough?

Cheers,
Thomas

On Tue, Apr 25, 2023 at 3:07 PM Robert Muir <rcmuir@gmail.com> wrote:

Re: Patch to change murmurhash implementation slightly [ In reply to ]
If a word is of length 5, processing 8 bytes at a time isn't going to
speed anything up. There aren't 8 bytes to process.

On Tue, Apr 25, 2023 at 9:17 AM Thomas Dullien
<thomas.dullien@elastic.co.invalid> wrote:
Re: Patch to change murmurhash implementation slightly [ In reply to ]
Ah, I see what you mean.

You are correct -- the change will not speed up a 5-byte word, but it
*will* speed up all words of 8+ bytes, at no cost to the shorter words.

On Tue, Apr 25, 2023 at 3:20 PM Robert Muir <rcmuir@gmail.com> wrote:

Re: Patch to change murmurhash implementation slightly [ In reply to ]
Well, there is some cost, as it must add additional checks to see if the
input is longer than 8 bytes -- in your patch, additional loops. It
increases the method size and may impact inlining and other things. Also,
we can't forget about correctness: if the hash function does the wrong
thing, it could slow everything to a crawl.
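To make the tradeoff concrete: an 8-byte-per-iteration variant that keeps the output identical to the 4-byte murmurhash3_x86_32 has to look something like the following (a sketch, not the actual patch). Note the extra leftover-block branch and the longer method body -- exactly the kind of added checks in question:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

public class Murmur8Sketch {
  private static final VarHandle LONG_LE =
      MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.LITTLE_ENDIAN);

  // Reads 8 bytes per iteration but still performs the standard 4-byte
  // murmur rounds, so the result matches the 4-byte-loop version.
  public static int hash(byte[] data, int offset, int len, int seed) {
    final int c1 = 0xcc9e2d51, c2 = 0x1b873593;
    int h1 = seed;
    int i = offset;
    final int end = offset + len;

    for (; i + 8 <= end; i += 8) {
      long k = (long) LONG_LE.get(data, i); // one 8-byte load ...
      int lo = (int) k;                     // ... split into two 4-byte rounds
      int hi = (int) (k >>> 32);
      lo *= c1; lo = Integer.rotateLeft(lo, 15); lo *= c2;
      h1 ^= lo; h1 = Integer.rotateLeft(h1, 13); h1 = h1 * 5 + 0xe6546b64;
      hi *= c1; hi = Integer.rotateLeft(hi, 15); hi *= c2;
      h1 ^= hi; h1 = Integer.rotateLeft(h1, 13); h1 = h1 * 5 + 0xe6546b64;
    }

    // Extra branch the 4-byte loop doesn't need: a leftover full 4-byte block.
    if (i + 4 <= end) {
      int k1 = (data[i] & 0xff) | ((data[i + 1] & 0xff) << 8)
             | ((data[i + 2] & 0xff) << 16) | (data[i + 3] << 24);
      k1 *= c1; k1 = Integer.rotateLeft(k1, 15); k1 *= c2;
      h1 ^= k1; h1 = Integer.rotateLeft(h1, 13); h1 = h1 * 5 + 0xe6546b64;
      i += 4;
    }

    // Tail: 0-3 leftover bytes, same as the 4-byte version.
    int k1 = 0;
    switch (end - i) {
      case 3: k1 = (data[i + 2] & 0xff) << 16; // fall through
      case 2: k1 |= (data[i + 1] & 0xff) << 8; // fall through
      case 1: k1 |= data[i] & 0xff;
              k1 *= c1; k1 = Integer.rotateLeft(k1, 15); k1 *= c2;
              h1 ^= k1;
    }

    h1 ^= len;
    h1 ^= h1 >>> 16; h1 *= 0x85ebca6b; h1 ^= h1 >>> 13; h1 *= 0xc2b2ae35; h1 ^= h1 >>> 16;
    return h1;
  }
}
```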

On Tue, Apr 25, 2023 at 9:56 AM Thomas Dullien
<thomas.dullien@elastic.co> wrote:
>> >> >>
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: Patch to change murmurhash implementation slightly [ In reply to ]
Hey,

I am pretty confident about correctness. The change passes both the Lucene and
ES regression tests, and a careful reading of the code leaves me pretty certain
that the output is identical. If you want me to randomly test the result for a
few hundred million random strings, I'm happy to do that too; if you have other
suggestions for correctness testing, let me know.
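As a sketch of the randomized check described above, the code below compares a standard 4-bytes-per-iteration MurmurHash3 (x86_32, the scheme used by Lucene's StringHelper) against a hypothetical 8-bytes-per-iteration rewrite. The class and method names (`MurmurDiffTest`, `murmur4`, `murmur8`) are illustrative and are not the actual patch:

```java
import java.util.Random;

// Differential correctness test: a classic 4-bytes-per-iteration
// MurmurHash3 (x86_32) versus an 8-bytes-per-iteration variant that
// must produce bit-identical output for every input length.
public class MurmurDiffTest {
  private static final int C1 = 0xcc9e2d51, C2 = 0x1b873593;

  // One 32-bit block mix step, shared by both variants.
  private static int mix(int h1, int k1) {
    k1 *= C1; k1 = Integer.rotateLeft(k1, 15); k1 *= C2;
    h1 ^= k1; h1 = Integer.rotateLeft(h1, 13);
    return h1 * 5 + 0xe6546b64;
  }

  // Handle the 0-3 trailing bytes at index i, then the fmix32 finalizer.
  private static int tailAndFinish(byte[] d, int i, int len, int h1) {
    int k1 = 0;
    switch (len & 3) {
      case 3: k1 = (d[i + 2] & 0xff) << 16;  // fall through
      case 2: k1 |= (d[i + 1] & 0xff) << 8;  // fall through
      case 1: k1 |= d[i] & 0xff;
        k1 *= C1; k1 = Integer.rotateLeft(k1, 15); k1 *= C2; h1 ^= k1;
    }
    h1 ^= len;
    h1 ^= h1 >>> 16; h1 *= 0x85ebca6b;
    h1 ^= h1 >>> 13; h1 *= 0xc2b2ae35;
    return h1 ^ (h1 >>> 16);
  }

  static int murmur4(byte[] d, int len, int seed) {  // reference: 4 bytes/iter
    int h1 = seed, i = 0;
    for (; i + 4 <= len; i += 4) {
      int k1 = (d[i] & 0xff) | (d[i + 1] & 0xff) << 8
             | (d[i + 2] & 0xff) << 16 | d[i + 3] << 24;  // little-endian load
      h1 = mix(h1, k1);
    }
    return tailAndFinish(d, i, len, h1);
  }

  static int murmur8(byte[] d, int len, int seed) {  // candidate: 8 bytes/iter
    int h1 = seed, i = 0;
    for (; i + 8 <= len; i += 8) {
      long w = 0;
      for (int b = 7; b >= 0; b--) w = (w << 8) | (d[i + b] & 0xffL);
      h1 = mix(h1, (int) w);           // low 4 bytes first (little-endian)
      h1 = mix(h1, (int) (w >>> 32));  // then the high 4 bytes
    }
    if (i + 4 <= len) {                // at most one leftover 4-byte block
      int k1 = (d[i] & 0xff) | (d[i + 1] & 0xff) << 8
             | (d[i + 2] & 0xff) << 16 | d[i + 3] << 24;
      h1 = mix(h1, k1);
      i += 4;
    }
    return tailAndFinish(d, i, len, h1);
  }

  public static void main(String[] args) {
    Random r = new Random(42);
    for (int trial = 0; trial < 1_000_000; trial++) {
      byte[] d = new byte[r.nextInt(64)];  // exercise every length mod 8
      r.nextBytes(d);
      if (murmur4(d, d.length, 0) != murmur8(d, d.length, 0)) {
        throw new AssertionError("mismatch at length " + d.length);
      }
    }
    System.out.println("4-byte and 8-byte variants agree");
  }
}
```

The key invariant is that the 8-byte loop consumes whole words in the same little-endian block order as the 4-byte loop, so the two variants stay bit-identical at every length.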

The change does increase the method size and may impact inlining - but so does
literally any code change, particularly in a JIT'ed environment, where the
placement of code (and hence things like instruction-cache conflicts) depends
on the precise history of execution. The way I understand it, one deals with
this by benchmarking and measuring.

FWIW, several indexing-heavy ES benchmarks show a noticeable improvement in
indexing speed - this is why I was asking about a broad range of Lucene
benchmarks: to verify that the improvement holds for Lucene-only workloads, too.

Let me know what data you'd like to see to decide whether this patch is a
good idea, and if there is consensus among the Lucene committers that those
are reasonable criteria, I'll work on producing that data.

Cheers,
Thomas



On Tue, Apr 25, 2023 at 4:02 PM Robert Muir <rcmuir@gmail.com> wrote:

> well there is some cost, as it must add additional checks to see if
> its longer than 8. in your patch, additional loops. it increases the
> method size and may impact inlining and other things. also we can't
> forget about correctness, if the hash function does the wrong thing it
> could slow everything to a crawl.
Re: Patch to change murmurhash implementation slightly [ In reply to ]
sure, but "if length > 8 return 1" might pass these same tests too,
yet cause a ton of hash collisions.

I just think if you want to optimize for super-long strings, there
should be a unit test.

Re: Patch to change murmurhash implementation slightly [ In reply to ]
Hey,

There are unit tests in TestStringHelper.java that cover strings of length
greater than 8, and my change passes them. Could you explain what you'd like
to see tested?

Cheers,
Thomas

Re: Patch to change murmurhash implementation slightly [ In reply to ]
There is literally one test string, and it's all-ASCII. That test won't fail
even if all the shifts and masks are wrong.

About the inlining, i'm not talking about cpu stuff, i'm talking about
java. There are limits to the size of methods that get inlined (e.g.
-XX:MaxInlineSize). If we make this method enormous like this, it may
have performance consequences.

We still haven't seen any performance gain from this. Elasticsearch
putting huge unique IDs into indexed terms doesn't count.
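Whether a given method actually gets inlined need not be guesswork; HotSpot can report both its thresholds and its per-call-site decisions. A sketch of how one might check (the flags are standard HotSpot options; `MyIndexingBenchmark` and the classpath are placeholders for whatever benchmark driver is used):

```shell
# Show this JVM's actual inlining thresholds. By default MaxInlineSize
# (~35 bytecodes) gates cold call sites and FreqInlineSize (~325 bytecodes)
# gates hot ones; defaults can vary by JVM version and platform.
java -XX:+PrintFlagsFinal -version | grep -E 'MaxInlineSize|FreqInlineSize'

# Then watch the JIT's actual inlining decisions under a real workload;
# "MyIndexingBenchmark" is a placeholder for whatever driver you run.
java -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining \
     -cp lucene-core.jar:. MyIndexingBenchmark 2>&1 | grep -i murmurhash3
```

This would show directly whether the 110-bytecode and 263-bytecode versions land on different sides of the hot-method threshold, rather than arguing about it in the abstract.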

Re: Patch to change murmurhash implementation slightly [ In reply to ]
Hey,

I offered to run a large number of random-string hashes to ensure that the
output is the same pre- and post-change. I can add an arbitrary number of
such tests to TestStringHelper.java; just specify the number you wish.
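To make that randomized check concrete, here is a small sketch of the idea in Python (these are my own pure-Python models of the two loop strategies, with made-up names, not the actual Lucene Java code): a 4-byte-per-iteration MurmurHash3 x86_32 loop and an 8-byte-per-iteration variant that feeds the two 32-bit halves of each read through the same round function, compared on inputs around the length-4 and length-8 boundaries and on random bytes (which routinely include values >= 0x80, where Java's signed bytes would bite).

```python
import random

# Constants of MurmurHash3 x86_32 (public-domain reference algorithm).
C1, C2, MASK = 0xCC9E2D51, 0x1B873593, 0xFFFFFFFF


def _mix(h, k):
    # One murmur3 round: mix a 32-bit block k into the running hash h.
    k = (k * C1) & MASK
    k = ((k << 15) | (k >> 17)) & MASK  # ROTL32(k, 15)
    k = (k * C2) & MASK
    h ^= k
    h = ((h << 13) | (h >> 19)) & MASK  # ROTL32(h, 13)
    return (h * 5 + 0xE6546B64) & MASK


def _finish(h, data, done):
    # Tail handling (1-3 leftover bytes) plus the fmix32 finalizer.
    k = 0
    tail = data[done:]
    if len(tail) >= 3:
        k ^= tail[2] << 16
    if len(tail) >= 2:
        k ^= tail[1] << 8
    if len(tail) >= 1:
        k ^= tail[0]
        k = (k * C1) & MASK
        k = ((k << 15) | (k >> 17)) & MASK
        k = (k * C2) & MASK
        h ^= k
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK
    return h ^ (h >> 16)


def murmur3_4byte(data, seed=0):
    # Baseline: read 4 bytes per loop iteration.
    h = seed
    n4 = len(data) & ~3
    for i in range(0, n4, 4):
        h = _mix(h, int.from_bytes(data[i:i + 4], "little"))
    return _finish(h, data, n4)


def murmur3_8byte(data, seed=0):
    # Candidate: read 8 bytes per iteration, then run the same round
    # function on the low and high 32-bit halves; output must not change.
    h = seed
    n8 = len(data) & ~7
    for i in range(0, n8, 8):
        k64 = int.from_bytes(data[i:i + 8], "little")
        h = _mix(h, k64 & MASK)
        h = _mix(h, k64 >> 32)
    n4 = len(data) & ~3
    for i in range(n8, n4, 4):  # at most one leftover 4-byte block
        h = _mix(h, int.from_bytes(data[i:i + 4], "little"))
    return _finish(h, data, n4)


if __name__ == "__main__":
    # Sanity-check against a widely cited community vector.
    assert murmur3_4byte(b"hello") == 0x248BFA47
    rng = random.Random(42)
    # Every length 0..32 (covers the length-8 boundary and all tail
    # cases), plus random longer inputs with arbitrary byte values.
    for n in list(range(33)) + [rng.randrange(9, 256) for _ in range(500)]:
        s = bytes(rng.randrange(256) for _ in range(n))
        assert murmur3_4byte(s) == murmur3_8byte(s), s
    print("4-byte and 8-byte loops agree")
```

The equivalent Java test would do the same differential comparison between the old and patched StringHelper methods.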

If your worry is that my change breaches the inlining bytecode limit: did
you check whether the old version was inlineable or not? The new version is
263 bytecode instructions; the old version was 110. The default inlining
limit appears to be 35 bytecode instructions on cursory checking (I may be
wrong on this, though), so I don't think it was ever inlineable in default
configs.

On your statement "we haven't seen performance gains": the starting point
of this thread was a friendly request to please point me to instructions
for running a broad range of Lucene indexing benchmarks, so that I can
gather data for further discussion. From my perspective, we haven't even
gathered any data yet, so obviously we haven't seen any gains.

Cheers,
Thomas

On Tue, Apr 25, 2023 at 4:27 PM Robert Muir <rcmuir@gmail.com> wrote:

> There is literally one string, all-ascii. This won't fail if all the
> shifts and masks are wrong.
>
> About the inlining, i'm not talking about cpu stuff, i'm talking about
> java. There are limits to the size of methods that get inlined (e.g.
> -XX:MaxInlineSize). If we make this method enormous like this, it may
> have performance consequences.
>
> We still haven't seen any performance gain from this. Elasticsearch
> putting huge unique IDs into indexed terms doesnt count.
>
> On Tue, Apr 25, 2023 at 10:25 AM Thomas Dullien
> <thomas.dullien@elastic.co> wrote:
> >
> > Hey,
> >
> > so there are unit tests in TestStringHelper.java that test strings of
> length greater than 8, and my change passes them. Could you explain what
> you want tested?
> >
> > Cheers,
> > Thomas
> >
> > On Tue, Apr 25, 2023 at 4:21 PM Robert Muir <rcmuir@gmail.com> wrote:
> >>
> >> sure, but "if length > 8 return 1" might pass these same tests too,
> >> yet cause a ton of hash collisions.
> >>
> >> I just think if you want to optimize for super-long strings, there
> >> should be a unit test.
> >>
> >> On Tue, Apr 25, 2023 at 10:20 AM Thomas Dullien
> >> <thomas.dullien@elastic.co> wrote:
> >> >
> >> > Hey,
> >> >
> >> > I am pretty confident about correctness. The change passes both
> Lucene and ES regression tests and my careful reading of the code is pretty
> certain that the output is the same. If you want me to randomly test the
> result for a few hundred million random strings, I'm happy to do that, too,
> if you have other suggestions for correctness testing, let me know.
> >> >
> >> > The change does increase the method size and may impact inlining -
> but so does literally any code change, particularly in a JIT'ed environment
> where placement of code (and hence things like instruction cache conflicts)
> depend on the precise history of execution. The way I understand it, one
> deals with this by benchmarking and measuring.
> >> >
> >> > FWIW, several indexing-heavy ES benchmarks show a noticeable
> improvement in indexing speed - this is why I was asking about a broad
> range of Lucene benchmarks; to verify that this is indeed the case for
> Lucene-only, too.
> >> >
> >> > Let me know what data you'd like to see to decide whether this patch
> is a good idea, and if there is consensus among the Lucene committers that
> those are reasonable criteria, I'll work on producing that data.
> >> >
> >> > Cheers,
> >> > Thomas
> >> >
> >> >
> >> >
> >> > On Tue, Apr 25, 2023 at 4:02 PM Robert Muir <rcmuir@gmail.com> wrote:
> >> >>
> >> >> well there is some cost, as it must add additional checks to see if
> >> >> its longer than 8. in your patch, additional loops. it increases the
> >> >> method size and may impact inlining and other things. also we can't
> >> >> forget about correctness, if the hash function does the wrong thing
> it
> >> >> could slow everything to a crawl.
> >> >>
> >> >> On Tue, Apr 25, 2023 at 9:56 AM Thomas Dullien
> >> >> <thomas.dullien@elastic.co> wrote:
> >> >> >
> >> >> > Ah, I see what you mean.
> >> >> >
> >> >> > You are correct -- the change will not speed up a 5-byte word, but
> it *will* speed up all 8+-byte words, at no cost to the shorter words.
> >> >> >
> >> >> > On Tue, Apr 25, 2023 at 3:20 PM Robert Muir <rcmuir@gmail.com>
> wrote:
> >> >> >>
> >> >> >> if a word is of length 5, processing 8 bytes at a time isn't
> going to
> >> >> >> speed anything up. there aren't 8 bytes to process.
> >> >> >>
> >> >> >> On Tue, Apr 25, 2023 at 9:17 AM Thomas Dullien
> >> >> >> <thomas.dullien@elastic.co.invalid> wrote:
> >> >> >> >
> >> >> >> > Is average word length <= 4 realistic though? I mean, even the
> english wiki corpus has ~5, which would require two calls to the lucene
> layer instead of one; e.g. multiple layers of virtual dispatch that are
> unnecessary?
> >> >> >> >
> >> >> >> > You're not going to pay any cycles for reading 8 bytes instead
> of 4 bytes, so the cost of doing so will be the same - while speeding up in
> cases where 4 isn't quite enough?
> >> >> >> >
> >> >> >> > Cheers,
> >> >> >> > Thomas
> >> >> >> >
> >> >> >> > On Tue, Apr 25, 2023 at 3:07 PM Robert Muir <rcmuir@gmail.com>
> wrote:
> >> >> >> >>
> >> >> >> >> i think from my perspective it has nothing to do with cpus
> being
> >> >> >> >> 32-bit or 64-bit and more to do with the average length of
> terms in
> >> >> >> >> most languages being smaller than 8. for the languages with
> longer
> >> >> >> >> word length, its usually because of complex morphology that
> most users
> >> >> >> >> would stem away. so doing 4 bytes at a time seems optimal IMO.
> >> >> >> >> languages from nature don't care about your cpu.
> >> >> >> >>
> >> >> >> >> On Tue, Apr 25, 2023 at 8:52 AM Michael McCandless
> >> >> >> >> <lucene@mikemccandless.com> wrote:
> >> >> >> >> >
> >> >> >> >> > For a truly "pure" indexing test I usually use a single
> thread for indexing, and SerialMergeScheduler (using that single thread to
> also do single-threaded merging). It makes the indexing take forever lol
> but it produces "comparable" results.
> >> >> >> >> >
> >> >> >> >> > But ... this sounds like a great change anyway? Do we
> really need to gate it on benchmark results? Do we think there could be a
> downside e.g. slower indexing on (the dwindling) 32 bit CPUs?
> >> >> >> >> >
> >> >> >> >> > Mike McCandless
> >> >> >> >> >
> >> >> >> >> > http://blog.mikemccandless.com
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> > On Tue, Apr 25, 2023 at 7:39 AM Robert Muir <
> rcmuir@gmail.com> wrote:
> >> >> >> >> >>
> >> >> >> >> >> I think the results of the benchmark will depend on the
> properties of
> >> >> >> >> >> the indexed terms. For english wikipedia (luceneutil) the
> average word
> >> >> >> >> >> length is around 5 bytes so this optimization may not do
> much.
> >> >> >> >> >>
> >> >> >> >> >> On Tue, Apr 25, 2023 at 1:58 AM Patrick Zhai <
> zhai7631@gmail.com> wrote:
> >> >> >> >> >> >
> >> >> >> >> >> > I did a quick run with your patch, but since I turned on
> the CMS as well as TieredMergePolicy I'm not sure how fair the comparison
> is. Here's the result:
> >> >> >> >> >> > Candidate:
> >> >> >> >> >> > Indexer: indexing done (890209 msec); total 33332620 docs
> >> >> >> >> >> > Indexer: waitForMerges done (71622 msec)
> >> >> >> >> >> > Indexer: finished (961877 msec)
> >> >> >> >> >> > Baseline:
> >> >> >> >> >> > Indexer: indexing done (909706 msec); total 33332620 docs
> >> >> >> >> >> > Indexer: waitForMerges done (54775 msec)
> >> >> >> >> >> > Indexer: finished (964528 msec)
> >> >> >> >> >> >
> >> >> >> >> >> > For more accurate comparison I guess it's better to use
> LogxxMergePolicy and turn off CMS? If you want to run it yourself you can
> find the lines I quoted from the log file.
> >> >> >> >> >> >
> >> >> >> >> >> > Patrick
> >> >> >> >> >> >
> >> >> >> >> >> > On Mon, Apr 24, 2023 at 12:34 PM Thomas Dullien <
> thomas.dullien@elastic.co.invalid> wrote:
> >> >> >> >> >> >>
> >> >> >> >> >> >> Hey all,
> >> >> >> >> >> >>
> >> >> >> >> >> >> I've been experimenting with fixing some low-hanging
> performance fruit in the ElasticSearch codebase, and came across the fact
> that the MurmurHash implementation that is used by ByteRef.hashCode() is
> reading 4 bytes per loop iteration (which is likely an artifact from 32-bit
> architectures, which are ever-less-important). I made a small fix to change
> the implementation to read 8 bytes per loop iteration; I expected a very
> small impact (2-3% CPU or so over an indexing run in ElasticSearch), but
> got a pretty nontrivial throughput improvement over a few indexing
> benchmarks.
> >> >> >> >> >> >>
> >> >> >> >> >> >> I tried running Lucene-only benchmarks, and succeeded in
> running the example from https://github.com/mikemccand/luceneutil - but I
> couldn't figure out how to run indexing benchmarks and how to interpret the
> results.
> >> >> >> >> >> >>
> >> >> >> >> >> >> Could someone help me in running the benchmarks for the
> attached patch?
> >> >> >> >> >> >>
> >> >> >> >> >> >> Cheers,
> >> >> >> >> >> >> Thomas
> >> >> >> >> >> >>
Re: Patch to change murmurhash implementation slightly [ In reply to ]
I don't think we need a ton of random strings. But if you want to
optimize for strings of length 8 or more, at a minimum there should be very
simple tests ensuring correctness for some boundary conditions (e.g. a
string of length exactly 8). I would also strongly recommend testing
non-ASCII input, since Java is a language with signed integer types, so it
may be susceptible to bugs where the input bytes have the "sign bit" set.

IMO this could be 2 simple unit tests.

Usually, with these kinds of algorithms, you can also find
published "test vectors" that are intended to seek out the corner cases. If
these exist for MurmurHash, we should fold them in too.

On Tue, Apr 25, 2023 at 11:08 AM Thomas Dullien
<thomas.dullien@elastic.co> wrote:
>
> Hey,
>
> I offered to run a large number of random-string-hashes to ensure that the output is the same pre- and post-change. I can add an arbitrary number of such tests to TestStringHelper.java, just specify the number you wish.
>
> If your worry is that my change breaches the inlining bytecode limit: Did you check whether the old version was inlineable or not? The new version is 263 bytecode instructions, the old version was 110. The default inlining limit appears to be 35 bytecode instructions on cursory checking (I may be wrong on this, though), so I don't think it was ever inlineable in default configs.
>
> On your statement "we haven't seen performance gains" -- the starting point of this thread was a friendly request to please point me to instructions for running a broad range of Lucene indexing benchmarks, so I can gather data for further discussion; from my perspective, we haven't even gathered any data, so obviously we haven't seen any gains.
>
> Cheers,
> Thomas
Re: Patch to change murmurhash implementation slightly [ In reply to ]
Hey,

OK, I've done some digging: unfortunately, the MurmurHash3 project does not
publish official test vectors; see the following URLs:
https://github.com/aappleby/smhasher/issues/6
https://github.com/multiformats/go-multihash/issues/135#issuecomment-791178958
There is a link to a pastebin entry in the first issue, which leads to
https://pastebin.com/kkggV9Vx

Now, the test vectors in that pastebin match neither the output of
pre-change Lucene's murmur3 nor the output of the Python mmh3 package.
That said, pre-change Lucene and the mmh3 package agree with each other,
just not with the published list.

There *are* test vectors in the source code of the mmh3 Python package,
which I could use; I could also cook up a set of bespoke ones, or do both
(I share the concern about 8-byte boundaries and signedness):
https://github.com/hajimes/mmh3/blob/3bf1e5aef777d701305c1be7ad0550e093038902/test_mmh3.py#L75
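In the meantime, here is a compact pure-Python model of MurmurHash3 x86_32 (my own transcription of the public-domain reference algorithm, with a made-up function name, not Lucene's Java code), together with a handful of the widely circulated community vectors that mmh3 and pre-change Lucene agree on; hash values are shown unsigned:

```python
def murmurhash3_x86_32(data: bytes, seed: int = 0) -> int:
    """Pure-Python MurmurHash3 x86_32 (reference algorithm, unsigned)."""
    M = 0xFFFFFFFF
    h = seed
    nblocks = len(data) & ~3
    # Body: one round per full 4-byte little-endian block.
    for i in range(0, nblocks, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * 0xCC9E2D51) & M
        k = ((k << 15) | (k >> 17)) & M  # ROTL32(k, 15)
        k = (k * 0x1B873593) & M
        h ^= k
        h = ((h << 13) | (h >> 19)) & M  # ROTL32(h, 13)
        h = (h * 5 + 0xE6546B64) & M
    # Tail: 1-3 leftover bytes (note: applied even if the bytes are zero).
    if len(data) > nblocks:
        k = int.from_bytes(data[nblocks:], "little")
        k = (k * 0xCC9E2D51) & M
        k = ((k << 15) | (k >> 17)) & M
        k = (k * 0x1B873593) & M
        h ^= k
    # Finalization: length XOR, then the fmix32 avalanche.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & M
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & M
    return h ^ (h >> 16)


# Community vectors (not an official list):
assert murmurhash3_x86_32(b"") == 0x00000000
assert murmurhash3_x86_32(b"", seed=1) == 0x514E28B7
assert murmurhash3_x86_32(b"hello") == 0x248BFA47
```

If the patched 8-byte loop reproduces vectors like these, plus a batch of random inputs checked against the old 4-byte loop, that should address the boundary and sign-bit concerns raised earlier in the thread.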

Cheers,
Thomas

On Tue, Apr 25, 2023 at 5:15 PM Robert Muir <rcmuir@gmail.com> wrote:

> i dont think we need a ton of random strings. But if you want to
> optimize for strings of length 8, at a minimum there should be very
> simple tests ensuring correctness for some boundary conditions (e.g.
> string of length exactly 8). i would also strongly recommend testing
> non-ascii since java is a language with signed integer types so it may
> be susceptible to bugs where the input bytes have the "sign bit" set.
>
> IMO this could be 2 simple unit tests.
>
> usually at least with these kinds of algorithms you can also find
> published "test vectors" that intend to seek out the corner cases. if
> these exist for murmurhash, we should fold them in too.
>
> On Tue, Apr 25, 2023 at 11:08?AM Thomas Dullien
> <thomas.dullien@elastic.co> wrote:
> >
> > Hey,
> >
> > I offered to run a large number of random-string-hashes to ensure that
> the output is the same pre- and post-change. I can add an arbitrary number
> of such tests to TestStringHelper.java, just specify the number you wish.
> >
> > If your worry is that my change breaches the inlining bytecode limit:
> Did you check whether the old version was inlineable or not? The new
> version is 263 bytecode instructions, the old version was 110. The default
> inlining limit appears to be 35 bytecode instructions on cursory checking
> (I may be wrong on this, though), so I don't think it was ever inlineable
> in default configs.
> >
> > On your statement "we haven't seen performance gains" -- the starting
> point of this thread was a friendly request to please point me to
> instructions for running a broad range of Lucene indexing benchmarks, so I
> can gather data for further discussion; from my perspective, we haven't
> even gathered any data, so obviously we haven't seen any gains.
> >
> > Cheers,
> > Thomas
> >
> > On Tue, Apr 25, 2023 at 4:27?PM Robert Muir <rcmuir@gmail.com> wrote:
> >>
> >> There is literally one string, all-ascii. This won't fail if all the
> >> shifts and masks are wrong.
> >>
> >> About the inlining, i'm not talking about cpu stuff, i'm talking about
> >> java. There are limits to the size of methods that get inlined (e.g.
> >> -XX:MaxInlineSize). If we make this method enormous like this, it may
> >> have performance consequences.
> >>
> >> We still haven't seen any performance gain from this. Elasticsearch
> >> putting huge unique IDs into indexed terms doesnt count.
> >>
> >> On Tue, Apr 25, 2023 at 10:25?AM Thomas Dullien
> >> <thomas.dullien@elastic.co> wrote:
> >> >
> >> > Hey,
> >> >
> >> > so there are unit tests in TestStringHelper.java that test strings of
> length greater than 8, and my change passes them. Could you explain what
> you want tested?
> >> >
> >> > Cheers,
> >> > Thomas
> >> >
> >> > On Tue, Apr 25, 2023 at 4:21?PM Robert Muir <rcmuir@gmail.com> wrote:
> >> >>
> >> >> sure, but "if length > 8 return 1" might pass these same tests too,
> >> >> yet cause a ton of hash collisions.
> >> >>
> >> >> I just think if you want to optimize for super-long strings, there
> >> >> should be a unit test.
> >> >>
> >> >> On Tue, Apr 25, 2023 at 10:20?AM Thomas Dullien
> >> >> <thomas.dullien@elastic.co> wrote:
> >> >> >
> >> >> > Hey,
> >> >> >
> >> >> > I am pretty confident about correctness. The change passes both
> Lucene and ES regression tests and my careful reading of the code is pretty
> certain that the output is the same. If you want me to randomly test the
> result for a few hundred million random strings, I'm happy to do that, too,
> if you have other suggestions for correctness testing, let me know.
> >> >> >
> >> >> > The change does increase the method size and may impact inlining -
> but so does literally any code change, particularly in a JIT'ed environment
> where placement of code (and hence things like instruction cache conflicts)
> depend on the precise history of execution. The way I understand it, one
> deals with this by benchmarking and measuring.
> >> >> >
> >> >> > FWIW, several indexing-heavy ES benchmarks show a noticeable
> improvement in indexing speed - this is why I was asking about a broad
> range of Lucene benchmarks; to verify that this is indeed the case for
> Lucene-only, too.
> >> >> >
> >> >> > Let me know what data you'd like to see to decide whether this
> patch is a good idea, and if there is consensus among the Lucene committers
> that those are reasonable criteria, I'll work on producing that data.
> >> >> >
> >> >> > Cheers,
> >> >> > Thomas
> >> >> >
> >> >> >
> >> >> >
> >> >> > On Tue, Apr 25, 2023 at 4:02?PM Robert Muir <rcmuir@gmail.com>
> wrote:
> >> >> >>
> >> >> >> well there is some cost, as it must add additional checks to see
> if
> >> >> >> its longer than 8. in your patch, additional loops. it increases
> the
> >> >> >> method size and may impact inlining and other things. also we
> can't
> >> >> >> forget about correctness, if the hash function does the wrong
> thing it
> >> >> >> could slow everything to a crawl.
> >> >> >>
> >> >> >> On Tue, Apr 25, 2023 at 9:56?AM Thomas Dullien
> >> >> >> <thomas.dullien@elastic.co> wrote:
> >> >> >> >
> >> >> >> > Ah, I see what you mean.
> >> >> >> >
> >> >> >> > You are correct -- the change will not speed up a 5-byte word,
> but it *will* speed up all 8+-byte words, at no cost to the shorter words.
> >> >> >> >
> >> >> >> > On Tue, Apr 25, 2023 at 3:20?PM Robert Muir <rcmuir@gmail.com>
> wrote:
> >> >> >> >>
> >> >> >> >> if a word is of length 5, processing 8 bytes at a time isn't
> going to
> >> >> >> >> speed anything up. there aren't 8 bytes to process.
> >> >> >> >>
> >> >> >> >> On Tue, Apr 25, 2023 at 9:17?AM Thomas Dullien
> >> >> >> >> <thomas.dullien@elastic.co.invalid> wrote:
> >> >> >> >> >
> >> >> >> >> > Is average word length <= 4 realistic though? I mean, even
> the english wiki corpus has ~5, which would require two calls to the lucene
> layer instead of one; e.g. multiple layers of virtual dispatch that are
> unnecessary?
> >> >> >> >> >
> >> >> >> >> > You're not going to pay any cycles for reading 8 bytes
> instead of 4 bytes, so the cost of doing so will be the same - while
> speeding up in cases where 4 isn't quite enough?
> >> >> >> >> >
> >> >> >> >> > Cheers,
> >> >> >> >> > Thomas
> >> >> >> >> >
> >> >> >> >> > On Tue, Apr 25, 2023 at 3:07?PM Robert Muir <
> rcmuir@gmail.com> wrote:
> >> >> >> >> >>
> >> >> >> >> >> i think from my perspective it has nothing to do with cpus
> being
> >> >> >> >> >> 32-bit or 64-bit and more to do with the average length of
> terms in
> >> >> >> >> >> most languages being smaller than 8. for the languages with
> longer
> >> >> >> >> >> word length, its usually because of complex morphology that
> most users
> >> >> >> >> >> would stem away. so doing 4 bytes at a time seems optimal
> IMO.
> >> >> >> >> >> languages from nature don't care about your cpu.
> >> >> >> >> >>
> >> >> >> >> >> On Tue, Apr 25, 2023 at 8:52?AM Michael McCandless
> >> >> >> >> >> <lucene@mikemccandless.com> wrote:
> >> >> >> >> >> >
> >> >> >> >> >> > For a truly "pure" indexing test I usually use a single
> thread for indexing, and SerialMergeScheduler (using that single thread to
> also do single-threaded merging). It makes the indexing take forever lol
> but it produces "comparable" results.
> >> >> >> >> >> >
> >> >> >> >> >> > But ... this sounds like a great change anyway? Do we
> really need to gate it on benchmark results? Do we think there could be a
> downside e.g. slower indexing on (the dwindling) 32 bit CPUs?
> >> >> >> >> >> >
> >> >> >> >> >> > Mike McCandless
> >> >> >> >> >> >
> >> >> >> >> >> > http://blog.mikemccandless.com
> >> >> >> >> >> >
> >> >> >> >> >> >
> >> >> >> >> >> > On Tue, Apr 25, 2023 at 7:39?AM Robert Muir <
> rcmuir@gmail.com> wrote:
> >> >> >> >> >> >>
> >> >> >> >> >> >> I think the results of the benchmark will depend on the
> properties of
> >> >> >> >> >> >> the indexed terms. For english wikipedia (luceneutil)
> the average word
> >> >> >> >> >> >> length is around 5 bytes so this optimization may not do
> much.
> >> >> >> >> >> >>
> >> >> >> >> >> >> On Tue, Apr 25, 2023 at 1:58?AM Patrick Zhai <
> zhai7631@gmail.com> wrote:
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > I did a quick run with your patch, but since I turned
> on the CMS as well as TieredMergePolicy I'm not sure how fair the
> comparison is. Here's the result:
> >> >> >> >> >> >> > Candidate:
> >> >> >> >> >> >> > Indexer: indexing done (890209 msec); total 33332620
> docs
> >> >> >> >> >> >> > Indexer: waitForMerges done (71622 msec)
> >> >> >> >> >> >> > Indexer: finished (961877 msec)
> >> >> >> >> >> >> > Baseline:
> >> >> >> >> >> >> > Indexer: indexing done (909706 msec); total 33332620
> docs
> >> >> >> >> >> >> > Indexer: waitForMerges done (54775 msec)
> >> >> >> >> >> >> > Indexer: finished (964528 msec)
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > For more accurate comparison I guess it's better to
> use LogxxMergePolicy and turn off CMS? If you want to run it yourself you
> can find the lines I quoted from the log file.
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > Patrick
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > On Mon, Apr 24, 2023 at 12:34 PM Thomas Dullien <
> thomas.dullien@elastic.co.invalid> wrote:
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> Hey all,
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> I've been experimenting with fixing some low-hanging
> performance fruit in the ElasticSearch codebase, and came across the fact
> that the MurmurHash implementation that is used by ByteRef.hashCode() is
> reading 4 bytes per loop iteration (which is likely an artifact from 32-bit
> architectures, which are ever-less-important). I made a small fix to change
> the implementation to read 8 bytes per loop iteration; I expected a very
> small impact (2-3% CPU or so over an indexing run in ElasticSearch), but
> got a pretty nontrivial throughput improvement over a few indexing
> benchmarks.
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> I tried running Lucene-only benchmarks, and succeeded
> in running the example from https://github.com/mikemccand/luceneutil -
> but I couldn't figure out how to run indexing benchmarks and how to
> interpret the results.
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> Could someone help me in running the benchmarks for
> the attached patch?
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> Cheers,
> >> >> >> >> >> >> >> Thomas
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >>
> ---------------------------------------------------------------------
> >> >> >> >> >> >> >> To unsubscribe, e-mail:
> dev-unsubscribe@lucene.apache.org
> >> >> >> >> >> >> >> For additional commands, e-mail:
> dev-help@lucene.apache.org
> >> >> >> >> >> >>
> >> >> >> >> >> >>
> >> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >>
>
Re: Patch to change murmurhash implementation slightly [ In reply to ]
Hey all,

ok, attached is a second patch that adds some unit tests; I am happy to add
more.
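For readers curious what the boundary-condition tests discussed below look like in practice, here is a rough Python sketch. It is an illustrative reimplementation of MurmurHash3 x86_32, not Lucene's actual StringHelper code: a 4-bytes-per-iteration reference next to an 8-bytes-per-read variant that must produce identical output, exercised on length-exactly-8 inputs and on bytes with the sign bit set.

```python
# Illustrative MurmurHash3 x86_32 (all arithmetic mod 2**32).
M32 = 0xFFFFFFFF
C1, C2 = 0xCC9E2D51, 0x1B873593

def rotl32(x, r):
    return ((x << r) | (x >> (32 - r))) & M32

def mix_k1(k1):
    return (rotl32((k1 * C1) & M32, 15) * C2) & M32

def mix_h1(h1, k1):
    h1 ^= k1
    return (rotl32(h1, 13) * 5 + 0xE6546B64) & M32

def fmix(h1, length):
    h1 ^= length
    h1 ^= h1 >> 16
    h1 = (h1 * 0x85EBCA6B) & M32
    h1 ^= h1 >> 13
    h1 = (h1 * 0xC2B2AE35) & M32
    return h1 ^ (h1 >> 16)

def tail_mix(h1, tail):
    # Fold up to 3 trailing bytes into the hash state.
    k1 = 0
    for j, b in enumerate(tail):
        k1 |= b << (8 * j)
    return h1 ^ mix_k1(k1) if tail else h1

def murmur3_32(data, seed=0):
    """Reference: consume 4 bytes per loop iteration."""
    h1, n = seed, len(data) // 4
    for i in range(n):
        h1 = mix_h1(h1, mix_k1(int.from_bytes(data[4*i:4*i+4], "little")))
    return fmix(tail_mix(h1, data[4*n:]), len(data))

def murmur3_32_wide(data, seed=0):
    """Same hash: read 8 bytes at a time, split into two 32-bit lanes."""
    h1, i = seed, 0
    while len(data) - i >= 8:
        block = int.from_bytes(data[i:i+8], "little")
        h1 = mix_h1(h1, mix_k1(block & M32))   # low 32-bit lane
        h1 = mix_h1(h1, mix_k1(block >> 32))   # high 32-bit lane
        i += 8
    if len(data) - i >= 4:
        h1 = mix_h1(h1, mix_k1(int.from_bytes(data[i:i+4], "little")))
        i += 4
    return fmix(tail_mix(h1, data[i:]), len(data))

# Boundary conditions from this thread: every length 0..32 (covering the
# 8-byte boundary) and bytes >= 0x80, which would expose Java
# sign-extension bugs.
for n in range(33):
    for payload in (bytes(range(n)), bytes(0xFF - b for b in range(n))):
        assert murmur3_32(payload) == murmur3_32_wide(payload)
print("4-byte and 8-byte variants agree on all test inputs")
```

The seed, helper names, and test inputs are my own choices for illustration; the point is only that a wide-read variant can be checked byte-length by byte-length against the 4-byte reference.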

This brings me back to my original question: I'd like to run some pretty
thorough benchmarking on Lucene, both for this change and for possible
other future changes, largely focused on indexing performance. What are
good command lines to do so? What are good corpora?
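In the meantime, the "Indexer:" log lines quoted earlier in the thread can already be turned into comparable throughput numbers; a small sketch (the log-line format is assumed from the lines quoted in this thread):

```python
import re

# Convert luceneutil-style "Indexer: indexing done (N msec); total M docs"
# lines into docs/sec so candidate and baseline runs can be compared.
LINE = re.compile(r"indexing done \((\d+) msec\); total (\d+) docs")

def docs_per_sec(log_line: str) -> float:
    msec, docs = map(int, LINE.search(log_line).groups())
    return docs / (msec / 1000.0)

candidate = docs_per_sec("Indexer: indexing done (890209 msec); total 33332620 docs")
baseline  = docs_per_sec("Indexer: indexing done (909706 msec); total 33332620 docs")
print(f"candidate: {candidate:,.0f} docs/s")
print(f"baseline:  {baseline:,.0f} docs/s")
print(f"speedup:   {candidate / baseline - 1:+.2%}")
```

On Patrick's numbers this works out to roughly a +2.2% pure-indexing speedup, though as he notes the merge-policy and CMS settings make the comparison rough.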

Cheers,
Thomas

On Tue, Apr 25, 2023 at 6:04 PM Thomas Dullien <thomas.dullien@elastic.co>
wrote:

> Hey,
>
> ok, I've done some digging: Unfortunately, MurmurHash3 does not publish
> official test vectors, see the following URLs:
> https://github.com/aappleby/smhasher/issues/6
>
> https://github.com/multiformats/go-multihash/issues/135#issuecomment-791178958
> There is a link to a pastebin entry in the first issue, which leads to
> https://pastebin.com/kkggV9Vx
>
> Now, the test vectors in that pastebin do not match either the output of
> pre-change Lucene's murmur3, nor the output of the Python mmh3 package.
> That said, the pre-change Lucene and the mmh3 package agree, just not with
> the published list.
>
> There *are* test vectors in the source code for the mmh3 python package,
> which I could use, or cook up a set of bespoke ones, or both (I share the
> concern about 8-byte boundaries and signedness).
>
> https://github.com/hajimes/mmh3/blob/3bf1e5aef777d701305c1be7ad0550e093038902/test_mmh3.py#L75
>
> Cheers,
> Thomas
>
> On Tue, Apr 25, 2023 at 5:15 PM Robert Muir <rcmuir@gmail.com> wrote:
>
>> i dont think we need a ton of random strings. But if you want to
>> optimize for strings of length 8, at a minimum there should be very
>> simple tests ensuring correctness for some boundary conditions (e.g.
>> string of length exactly 8). i would also strongly recommend testing
>> non-ascii since java is a language with signed integer types so it may
>> be susceptible to bugs where the input bytes have the "sign bit" set.
>>
>> IMO this could be 2 simple unit tests.
>>
>> usually at least with these kinds of algorithms you can also find
>> published "test vectors" that intend to seek out the corner cases. if
>> these exist for murmurhash, we should fold them in too.
>>
>> On Tue, Apr 25, 2023 at 11:08 AM Thomas Dullien
>> <thomas.dullien@elastic.co> wrote:
>> >
>> > Hey,
>> >
>> > I offered to run a large number of random-string-hashes to ensure that
>> the output is the same pre- and post-change. I can add an arbitrary number
>> of such tests to TestStringHelper.java, just specify the number you wish.
>> >
>> > If your worry is that my change breaches the inlining bytecode limit:
>> Did you check whether the old version was inlineable or not? The new
>> version is 263 bytecode instructions, the old version was 110. The default
>> inlining limit appears to be 35 bytecode instructions on cursory checking
>> (I may be wrong on this, though), so I don't think it was ever inlineable
>> in default configs.
>> >
>> > On your statement "we haven't seen performance gains" -- the starting
>> point of this thread was a friendly request to please point me to
>> instructions for running a broad range of Lucene indexing benchmarks, so I
>> can gather data for further discussion; from my perspective, we haven't
>> even gathered any data, so obviously we haven't seen any gains.
>> >
>> > Cheers,
>> > Thomas
>> >
>> > On Tue, Apr 25, 2023 at 4:27 PM Robert Muir <rcmuir@gmail.com> wrote:
>> >>
>> >> There is literally one string, all-ascii. This won't fail if all the
>> >> shifts and masks are wrong.
>> >>
>> >> About the inlining, i'm not talking about cpu stuff, i'm talking about
>> >> java. There are limits to the size of methods that get inlined (e.g.
>> >> -XX:MaxInlineSize). If we make this method enormous like this, it may
>> >> have performance consequences.
>> >>
>> >> We still haven't seen any performance gain from this. Elasticsearch
>> >> putting huge unique IDs into indexed terms doesnt count.
>> >>
>> >> On Tue, Apr 25, 2023 at 10:25 AM Thomas Dullien
>> >> <thomas.dullien@elastic.co> wrote:
>> >> >
>> >> > Hey,
>> >> >
>> >> > so there are unit tests in TestStringHelper.java that test strings
>> of length greater than 8, and my change passes them. Could you explain what
>> you want tested?
>> >> >
>> >> > Cheers,
>> >> > Thomas
>> >> >
>> >> > On Tue, Apr 25, 2023 at 4:21 PM Robert Muir <rcmuir@gmail.com>
>> wrote:
>> >> >>
>> >> >> sure, but "if length > 8 return 1" might pass these same tests too,
>> >> >> yet cause a ton of hash collisions.
>> >> >>
>> >> >> I just think if you want to optimize for super-long strings, there
>> >> >> should be a unit test.
>> >> >>
>> >> >> On Tue, Apr 25, 2023 at 10:20 AM Thomas Dullien
>> >> >> <thomas.dullien@elastic.co> wrote:
>> >> >> >
>> >> >> > Hey,
>> >> >> >
>> >> >> > I am pretty confident about correctness. The change passes both
>> Lucene and ES regression tests and my careful reading of the code is pretty
>> certain that the output is the same. If you want me to randomly test the
>> result for a few hundred million random strings, I'm happy to do that, too,
>> if you have other suggestions for correctness testing, let me know.
>> >> >> >
>> >> >> > The change does increase the method size and may impact inlining
>> - but so does literally any code change, particularly in a JIT'ed
>> environment where placement of code (and hence things like instruction
>> cache conflicts) depend on the precise history of execution. The way I
>> understand it, one deals with this by benchmarking and measuring.
>> >> >> >
>> >> >> > FWIW, several indexing-heavy ES benchmarks show a noticeable
>> improvement in indexing speed - this is why I was asking about a broad
>> range of Lucene benchmarks; to verify that this is indeed the case for
>> Lucene-only, too.
>> >> >> >
>> >> >> > Let me know what data you'd like to see to decide whether this
>> patch is a good idea, and if there is consensus among the Lucene committers
>> that those are reasonable criteria, I'll work on producing that data.
>> >> >> >
>> >> >> > Cheers,
>> >> >> > Thomas
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > On Tue, Apr 25, 2023 at 4:02 PM Robert Muir <rcmuir@gmail.com>
>> wrote:
>> >> >> >>
>> >> >> >> well there is some cost, as it must add additional checks to see
>> if
>> >> >> >> its longer than 8. in your patch, additional loops. it increases
>> the
>> >> >> >> method size and may impact inlining and other things. also we
>> can't
>> >> >> >> forget about correctness, if the hash function does the wrong
>> thing it
>> >> >> >> could slow everything to a crawl.
>> >> >> >>
>> >> >> >> On Tue, Apr 25, 2023 at 9:56 AM Thomas Dullien
>> >> >> >> <thomas.dullien@elastic.co> wrote:
>> >> >> >> >
>> >> >> >> > Ah, I see what you mean.
>> >> >> >> >
>> >> >> >> > You are correct -- the change will not speed up a 5-byte word,
>> but it *will* speed up all 8+-byte words, at no cost to the shorter words.
>> >> >> >> >
>> >> >> >> > On Tue, Apr 25, 2023 at 3:20 PM Robert Muir <rcmuir@gmail.com>
>> wrote:
>> >> >> >> >>
>> >> >> >> >> if a word is of length 5, processing 8 bytes at a time isn't
>> going to
>> >> >> >> >> speed anything up. there aren't 8 bytes to process.
>> >> >> >> >>
>> >> >> >> >> On Tue, Apr 25, 2023 at 9:17 AM Thomas Dullien
>> >> >> >> >> <thomas.dullien@elastic.co.invalid> wrote:
>> >> >> >> >> >
>> >> >> >> >> > Is average word length <= 4 realistic though? I mean, even
>> the english wiki corpus has ~5, which would require two calls to the lucene
>> layer instead of one; e.g. multiple layers of virtual dispatch that are
>> unnecessary?
>> >> >> >> >> >
>> >> >> >> >> > You're not going to pay any cycles for reading 8 bytes
>> instead of 4 bytes, so the cost of doing so will be the same - while
>> speeding up in cases where 4 isn't quite enough?
>> >> >> >> >> >
>> >> >> >> >> > Cheers,
>> >> >> >> >> > Thomas
>> >> >> >> >> >
>> >> >> >> >> > On Tue, Apr 25, 2023 at 3:07 PM Robert Muir <
>> rcmuir@gmail.com> wrote:
>> >> >> >> >> >>
>> >> >> >> >> >> i think from my perspective it has nothing to do with cpus
>> being
>> >> >> >> >> >> 32-bit or 64-bit and more to do with the average length of
>> terms in
>> >> >> >> >> >> most languages being smaller than 8. for the languages
>> with longer
>> >> >> >> >> >> word length, its usually because of complex morphology
>> that most users
>> >> >> >> >> >> would stem away. so doing 4 bytes at a time seems optimal
>> IMO.
>> >> >> >> >> >> languages from nature don't care about your cpu.
>> >> >> >> >> >>
>> >> >> >> >> >> On Tue, Apr 25, 2023 at 8:52 AM Michael McCandless
>> >> >> >> >> >> <lucene@mikemccandless.com> wrote:
>> >> >> >> >> >> >
>> >> >> >> >> >> > For a truly "pure" indexing test I usually use a single
>> thread for indexing, and SerialMergeScheduler (using that single thread to
>> also do single-threaded merging). It makes the indexing take forever lol
>> but it produces "comparable" results.
>> >> >> >> >> >> >
>> >> >> >> >> >> > But ... this sounds like a great change anyway? Do we
>> really need to gate it on benchmark results? Do we think there could be a
>> downside e.g. slower indexing on (the dwindling) 32 bit CPUs?
>> >> >> >> >> >> >
>> >> >> >> >> >> > Mike McCandless
>> >> >> >> >> >> >
>> >> >> >> >> >> > http://blog.mikemccandless.com
>> >> >> >> >> >> >
>> >> >> >> >> >> >
>> >> >> >> >> >> > On Tue, Apr 25, 2023 at 7:39 AM Robert Muir <
>> rcmuir@gmail.com> wrote:
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> I think the results of the benchmark will depend on the
>> properties of
>> >> >> >> >> >> >> the indexed terms. For english wikipedia (luceneutil)
>> the average word
>> >> >> >> >> >> >> length is around 5 bytes so this optimization may not
>> do much.
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> On Tue, Apr 25, 2023 at 1:58 AM Patrick Zhai <
>> zhai7631@gmail.com> wrote:
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> > I did a quick run with your patch, but since I turned
>> on the CMS as well as TieredMergePolicy I'm not sure how fair the
>> comparison is. Here's the result:
>> >> >> >> >> >> >> > Candidate:
>> >> >> >> >> >> >> > Indexer: indexing done (890209 msec); total 33332620
>> docs
>> >> >> >> >> >> >> > Indexer: waitForMerges done (71622 msec)
>> >> >> >> >> >> >> > Indexer: finished (961877 msec)
>> >> >> >> >> >> >> > Baseline:
>> >> >> >> >> >> >> > Indexer: indexing done (909706 msec); total 33332620
>> docs
>> >> >> >> >> >> >> > Indexer: waitForMerges done (54775 msec)
>> >> >> >> >> >> >> > Indexer: finished (964528 msec)
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> > For more accurate comparison I guess it's better to
>> use LogxxMergePolicy and turn off CMS? If you want to run it yourself you
>> can find the lines I quoted from the log file.
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> > Patrick
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> > On Mon, Apr 24, 2023 at 12:34 PM Thomas Dullien <
>> thomas.dullien@elastic.co.invalid> wrote:
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> Hey all,
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> I've been experimenting with fixing some low-hanging
>> performance fruit in the ElasticSearch codebase, and came across the fact
>> that the MurmurHash implementation that is used by ByteRef.hashCode() is
>> reading 4 bytes per loop iteration (which is likely an artifact from 32-bit
>> architectures, which are ever-less-important). I made a small fix to change
>> the implementation to read 8 bytes per loop iteration; I expected a very
>> small impact (2-3% CPU or so over an indexing run in ElasticSearch), but
>> got a pretty nontrivial throughput improvement over a few indexing
>> benchmarks.
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> I tried running Lucene-only benchmarks, and
>> succeeded in running the example from
>> https://github.com/mikemccand/luceneutil - but I couldn't figure out how
>> to run indexing benchmarks and how to interpret the results.
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> Could someone help me in running the benchmarks for
>> the attached patch?
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> Cheers,
>> >> >> >> >> >> >> >> Thomas
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >>
>> >> >> >> >> >> >>
>> >> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >> >>
>>
>
Re: Patch to change murmurhash implementation slightly [ In reply to ]
I would recommend some non-English tests. Non-Latin scripts (CJK, Arabic, Hebrew) will have longer byte strings because of UTF-8 encoding. German has large compound words.
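A quick back-of-the-envelope illustration of that point (the sample terms below are my own picks, chosen only to show per-script UTF-8 byte counts as seen by a byte-level hash):

```python
# Byte lengths for terms in a few scripts: non-Latin scripts easily
# exceed 8 bytes even for short words, so they would hit an
# 8-bytes-per-iteration loop more often than English text.
terms = {
    "english":  "search",                 # ASCII: 1 byte/char
    "german":   "Donaudampfschifffahrt",  # long compound, still ASCII
    "russian":  "поиск",                  # Cyrillic: 2 bytes/char
    "arabic":   "استرجاع",                # Arabic: 2 bytes/char
    "hebrew":   "אחזור",                  # Hebrew: 2 bytes/char
    "japanese": "検索",                   # CJK: 3 bytes/char
}
for lang, term in terms.items():
    raw = term.encode("utf-8")
    print(f"{lang:9s} {len(term):2d} chars -> {len(raw):2d} bytes")
```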

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/ (my blog)

> On Apr 25, 2023, at 10:57 AM, Thomas Dullien <thomas.dullien@elastic.co.INVALID> wrote:
>
> Hey all,
>
> ok, attached is a second patch that adds some unit tests; I am happy to add more.
>
> This brings me back to my original question: I'd like to run some pretty thorough benchmarking on Lucene, both for this change and for possible other future changes, largely focused on indexing performance. What are good command lines to do so? What are good corpora?
>
> Cheers,
> Thomas
>
> On Tue, Apr 25, 2023 at 6:04 PM Thomas Dullien <thomas.dullien@elastic.co> wrote:
>> Hey,
>>
>> ok, I've done some digging: Unfortunately, MurmurHash3 does not publish official test vectors, see the following URLs:
>> https://github.com/aappleby/smhasher/issues/6
>> https://github.com/multiformats/go-multihash/issues/135#issuecomment-791178958
>> There is a link to a pastebin entry in the first issue, which leads to https://pastebin.com/kkggV9Vx
>>
>> Now, the test vectors in that pastebin do not match either the output of pre-change Lucene's murmur3, nor the output of the Python mmh3 package. That said, the pre-change Lucene and the mmh3 package agree, just not with the published list.
>>
>> There *are* test vectors in the source code for the mmh3 python package, which I could use, or cook up a set of bespoke ones, or both (I share the concern about 8-byte boundaries and signedness).
>> https://github.com/hajimes/mmh3/blob/3bf1e5aef777d701305c1be7ad0550e093038902/test_mmh3.py#L75
>>
>> Cheers,
>> Thomas
>>
>> On Tue, Apr 25, 2023 at 5:15 PM Robert Muir <rcmuir@gmail.com> wrote:
>>> i dont think we need a ton of random strings. But if you want to
>>> optimize for strings of length 8, at a minimum there should be very
>>> simple tests ensuring correctness for some boundary conditions (e.g.
>>> string of length exactly 8). i would also strongly recommend testing
>>> non-ascii since java is a language with signed integer types so it may
>>> be susceptible to bugs where the input bytes have the "sign bit" set.
>>>
>>> IMO this could be 2 simple unit tests.
>>>
>>> usually at least with these kinds of algorithms you can also find
>>> published "test vectors" that intend to seek out the corner cases. if
>>> these exist for murmurhash, we should fold them in too.
>>>
>>> On Tue, Apr 25, 2023 at 11:08 AM Thomas Dullien
>>> <thomas.dullien@elastic.co> wrote:
>>> >
>>> > Hey,
>>> >
>>> > I offered to run a large number of random-string-hashes to ensure that the output is the same pre- and post-change. I can add an arbitrary number of such tests to TestStringHelper.java, just specify the number you wish.
>>> >
>>> > If your worry is that my change breaches the inlining bytecode limit: Did you check whether the old version was inlineable or not? The new version is 263 bytecode instructions, the old version was 110. The default inlining limit appears to be 35 bytecode instructions on cursory checking (I may be wrong on this, though), so I don't think it was ever inlineable in default configs.
>>> >
>>> > On your statement "we haven't seen performance gains" -- the starting point of this thread was a friendly request to please point me to instructions for running a broad range of Lucene indexing benchmarks, so I can gather data for further discussion; from my perspective, we haven't even gathered any data, so obviously we haven't seen any gains.
>>> >
>>> > Cheers,
>>> > Thomas
>>> >
>>> > On Tue, Apr 25, 2023 at 4:27 PM Robert Muir <rcmuir@gmail.com> wrote:
>>> >>
>>> >> There is literally one string, all-ascii. This won't fail if all the
>>> >> shifts and masks are wrong.
>>> >>
>>> >> About the inlining, i'm not talking about cpu stuff, i'm talking about
>>> >> java. There are limits to the size of methods that get inlined (e.g.
>>> >> -XX:MaxInlineSize). If we make this method enormous like this, it may
>>> >> have performance consequences.
>>> >>
>>> >> We still haven't seen any performance gain from this. Elasticsearch
>>> >> putting huge unique IDs into indexed terms doesnt count.
>>> >>
>>> >> On Tue, Apr 25, 2023 at 10:25 AM Thomas Dullien
>>> >> <thomas.dullien@elastic.co> wrote:
>>> >> >
>>> >> > Hey,
>>> >> >
>>> >> > so there are unit tests in TestStringHelper.java that test strings of length greater than 8, and my change passes them. Could you explain what you want tested?
>>> >> >
>>> >> > Cheers,
>>> >> > Thomas
>>> >> >
>>> >> > On Tue, Apr 25, 2023 at 4:21 PM Robert Muir <rcmuir@gmail.com> wrote:
>>> >> >>
>>> >> >> sure, but "if length > 8 return 1" might pass these same tests too,
>>> >> >> yet cause a ton of hash collisions.
>>> >> >>
>>> >> >> I just think if you want to optimize for super-long strings, there
>>> >> >> should be a unit test.
>>> >> >>
>>> >> >> On Tue, Apr 25, 2023 at 10:20 AM Thomas Dullien
>>> >> >> <thomas.dullien@elastic.co> wrote:
>>> >> >> >
>>> >> >> > Hey,
>>> >> >> >
>>> >> >> > I am pretty confident about correctness. The change passes both Lucene and ES regression tests and my careful reading of the code is pretty certain that the output is the same. If you want me to randomly test the result for a few hundred million random strings, I'm happy to do that, too, if you have other suggestions for correctness testing, let me know.
>>> >> >> >
>>> >> >> > The change does increase the method size and may impact inlining - but so does literally any code change, particularly in a JIT'ed environment where placement of code (and hence things like instruction cache conflicts) depend on the precise history of execution. The way I understand it, one deals with this by benchmarking and measuring.
>>> >> >> >
>>> >> >> > FWIW, several indexing-heavy ES benchmarks show a noticeable improvement in indexing speed - this is why I was asking about a broad range of Lucene benchmarks; to verify that this is indeed the case for Lucene-only, too.
>>> >> >> >
>>> >> >> > Let me know what data you'd like to see to decide whether this patch is a good idea, and if there is consensus among the Lucene committers that those are reasonable criteria, I'll work on producing that data.
>>> >> >> >
>>> >> >> > Cheers,
>>> >> >> > Thomas
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > On Tue, Apr 25, 2023 at 4:02 PM Robert Muir <rcmuir@gmail.com> wrote:
>>> >> >> >>
>>> >> >> >> well there is some cost, as it must add additional checks to see if
>>> >> >> >> its longer than 8. in your patch, additional loops. it increases the
>>> >> >> >> method size and may impact inlining and other things. also we can't
>>> >> >> >> forget about correctness, if the hash function does the wrong thing it
>>> >> >> >> could slow everything to a crawl.
>>> >> >> >>
>>> >> >> >> On Tue, Apr 25, 2023 at 9:56 AM Thomas Dullien
>>> >> >> >> <thomas.dullien@elastic.co> wrote:
>>> >> >> >> >
>>> >> >> >> > Ah, I see what you mean.
>>> >> >> >> >
>>> >> >> >> > You are correct -- the change will not speed up a 5-byte word, but it *will* speed up all 8+-byte words, at no cost to the shorter words.
>>> >> >> >> >
>>> >> >> >> > On Tue, Apr 25, 2023 at 3:20 PM Robert Muir <rcmuir@gmail.com> wrote:
>>> >> >> >> >>
>>> >> >> >> >> if a word is of length 5, processing 8 bytes at a time isn't going to
>>> >> >> >> >> speed anything up. there aren't 8 bytes to process.
>>> >> >> >> >>
>>> >> >> >> >> On Tue, Apr 25, 2023 at 9:17 AM Thomas Dullien
>>> >> >> >> >> <thomas.dullien@elastic.co.invalid> wrote:
>>> >> >> >> >> >
>>> >> >> >> >> > Is average word length <= 4 realistic though? I mean, even the english wiki corpus has ~5, which would require two calls to the lucene layer instead of one; e.g. multiple layers of virtual dispatch that are unnecessary?
>>> >> >> >> >> >
>>> >> >> >> >> > You're not going to pay any cycles for reading 8 bytes instead of 4 bytes, so the cost of doing so will be the same - while speeding up in cases where 4 isn't quite enough?
>>> >> >> >> >> >
>>> >> >> >> >> > Cheers,
>>> >> >> >> >> > Thomas
>>> >> >> >> >> >
>>> >> >> >> >> > On Tue, Apr 25, 2023 at 3:07 PM Robert Muir <rcmuir@gmail.com> wrote:
>>> >> >> >> >> >>
>>> >> >> >> >> >> i think from my perspective it has nothing to do with cpus being
>>> >> >> >> >> >> 32-bit or 64-bit and more to do with the average length of terms in
>>> >> >> >> >> >> most languages being smaller than 8. for the languages with longer
>>> >> >> >> >> >> word length, its usually because of complex morphology that most users
>>> >> >> >> >> >> would stem away. so doing 4 bytes at a time seems optimal IMO.
>>> >> >> >> >> >> languages from nature don't care about your cpu.
>>> >> >> >> >> >>
>>> >> >> >> >> >> On Tue, Apr 25, 2023 at 8:52 AM Michael McCandless
>>> >> >> >> >> >> <lucene@mikemccandless.com> wrote:
>>> >> >> >> >> >> >
>>> >> >> >> >> >> > For a truly "pure" indexing test I usually use a single thread for indexing, and SerialMergeScheduler (using that single thread to also do single-threaded merging). It makes the indexing take forever lol but it produces "comparable" results.
>>> >> >> >> >> >> >
>>> >> >> >> >> >> > But ... this sounds like a great change anyway? Do we really need to gate it on benchmark results? Do we think there could be a downside e.g. slower indexing on (the dwindling) 32 bit CPUs?
>>> >> >> >> >> >> >
>>> >> >> >> >> >> > Mike McCandless
>>> >> >> >> >> >> >
>>> >> >> >> >> >> > http://blog.mikemccandless.com
>>> >> >> >> >> >> >
>>> >> >> >> >> >> >
>>> >> >> >> >> >> > On Tue, Apr 25, 2023 at 7:39 AM Robert Muir <rcmuir@gmail.com> wrote:
>>> >> >> >> >> >> >>
>>> >> >> >> >> >> >> I think the results of the benchmark will depend on the properties of
>>> >> >> >> >> >> >> the indexed terms. For english wikipedia (luceneutil) the average word
>>> >> >> >> >> >> >> length is around 5 bytes so this optimization may not do much.
>>> >> >> >> >> >> >>
>>> >> >> >> >> >> >> On Tue, Apr 25, 2023 at 1:58 AM Patrick Zhai <zhai7631@gmail.com> wrote:
>>> >> >> >> >> >> >> >
>>> >> >> >> >> >> >> > I did a quick run with your patch, but since I turned on the CMS as well as TieredMergePolicy I'm not sure how fair the comparison is. Here's the result:
>>> >> >> >> >> >> >> > Candidate:
>>> >> >> >> >> >> >> > Indexer: indexing done (890209 msec); total 33332620 docs
>>> >> >> >> >> >> >> > Indexer: waitForMerges done (71622 msec)
>>> >> >> >> >> >> >> > Indexer: finished (961877 msec)
>>> >> >> >> >> >> >> > Baseline:
>>> >> >> >> >> >> >> > Indexer: indexing done (909706 msec); total 33332620 docs
>>> >> >> >> >> >> >> > Indexer: waitForMerges done (54775 msec)
>>> >> >> >> >> >> >> > Indexer: finished (964528 msec)
>>> >> >> >> >> >> >> >
>>> >> >> >> >> >> >> > For more accurate comparison I guess it's better to use LogxxMergePolicy and turn off CMS? If you want to run it yourself you can find the lines I quoted from the log file.
>>> >> >> >> >> >> >> >
>>> >> >> >> >> >> >> > Patrick
>>> >> >> >> >> >> >> >
>>> >> >> >> >> >> >> > On Mon, Apr 24, 2023 at 12:34 PM Thomas Dullien <thomas.dullien@elastic.co.invalid> wrote:
>>> >> >> >> >> >> >> >>
>>> >> >> >> >> >> >> >> Hey all,
>>> >> >> >> >> >> >> >>
>>> >> >> >> >> >> >> >> I've been experimenting with fixing some low-hanging performance fruit in the ElasticSearch codebase, and came across the fact that the MurmurHash implementation that is used by ByteRef.hashCode() is reading 4 bytes per loop iteration (which is likely an artifact from 32-bit architectures, which are ever-less-important). I made a small fix to change the implementation to read 8 bytes per loop iteration; I expected a very small impact (2-3% CPU or so over an indexing run in ElasticSearch), but got a pretty nontrivial throughput improvement over a few indexing benchmarks.
>>> >> >> >> >> >> >> >>
>>> >> >> >> >> >> >> >> I tried running Lucene-only benchmarks, and succeeded in running the example from https://github.com/mikemccand/luceneutil - but I couldn't figure out how to run indexing benchmarks and how to interpret the results.
>>> >> >> >> >> >> >> >>
>>> >> >> >> >> >> >> >> Could someone help me in running the benchmarks for the attached patch?
>>> >> >> >> >> >> >> >>
>>> >> >> >> >> >> >> >> Cheers,
>>> >> >> >> >> >> >> >> Thomas
>>> >> >> >> >> >> >> >>
>>> >> >> >> >> >> >>
>>> >> >> >> >> >> >>
>>> >> >> >> >> >>
>>> >> >> >> >> >>
> <murmur-tests.patch>
Re: Patch to change murmurhash implementation slightly [ In reply to ]
Hey there,

reviving this thread. To clarify: In order to show this patch is worth
doing, I should index a bunch of natural-language documents (whichever
language that is) and show that the patch brings a performance benefit?

(Just clarifying, because at least inside ElasticSearch for the logs
use-case, it turns out that it does provide a performance benefit -- but I
want to make sure I understand what the Lucene community wishes to see as
"evidence" this is worth pursuing :-)

Cheers,
Thomas

On Tue, Apr 25, 2023 at 8:14 PM Walter Underwood <wunder@wunderwood.org>
wrote:

> I would recommend some non-English tests. Non-Latin scripts (CJK, Arabic,
> Hebrew) will have longer byte strings because of UTF8. German has large
> compound words.
>
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
> On Apr 25, 2023, at 10:57 AM, Thomas Dullien
> <thomas.dullien@elastic.co.INVALID> wrote:
>
> Hey all,
>
> ok, attached is a second patch that adds some unit tests; I am happy to
> add more.
>
> This brings me back to my original question: I'd like to run some pretty
> thorough benchmarking on Lucene, both for this change and for possible
> other future changes, largely focused on indexing performance. What are
> good command lines to do so? What are good corpora?
>
> Cheers,
> Thomas
>
> On Tue, Apr 25, 2023 at 6:04 PM Thomas Dullien <thomas.dullien@elastic.co>
> wrote:
>
>> Hey,
>>
>> ok, I've done some digging: Unfortunately, MurmurHash3 does not publish
>> official test vectors, see the following URLs:
>> https://github.com/aappleby/smhasher/issues/6
>>
>> https://github.com/multiformats/go-multihash/issues/135#issuecomment-791178958
>> There is a link to a pastebin entry in the first issue, which leads to
>> https://pastebin.com/kkggV9Vx
>>
>> Now, the test vectors in that pastebin do not match either the output of
>> pre-change Lucene's murmur3, nor the output of the Python mmh3 package.
>> That said, the pre-change Lucene and the mmh3 package agree, just not with
>> the published list.
>>
>> There *are* test vectors in the source code for the mmh3 python package,
>> which I could use, or cook up a set of bespoke ones, or both (I share the
>> concern about 8-byte boundaries and signedness).
>>
>> https://github.com/hajimes/mmh3/blob/3bf1e5aef777d701305c1be7ad0550e093038902/test_mmh3.py#L75
>>
>> Cheers,
>> Thomas
>>
>> On Tue, Apr 25, 2023 at 5:15 PM Robert Muir <rcmuir@gmail.com> wrote:
>>
>>> i dont think we need a ton of random strings. But if you want to
>>> optimize for strings of length 8, at a minimum there should be very
>>> simple tests ensuring correctness for some boundary conditions (e.g.
>>> string of length exactly 8). i would also strongly recommend testing
>>> non-ascii since java is a language with signed integer types so it may
>>> be susceptible to bugs where the input bytes have the "sign bit" set.
>>>
>>> IMO this could be 2 simple unit tests.
>>>
>>> usually at least with these kinds of algorithms you can also find
>>> published "test vectors" that intend to seek out the corner cases. if
>>> these exist for murmurhash, we should fold them in too.
>>>
>>> On Tue, Apr 25, 2023 at 11:08 AM Thomas Dullien
>>> <thomas.dullien@elastic.co> wrote:
>>> >
>>> > Hey,
>>> >
>>> > I offered to run a large number of random-string-hashes to ensure that
>>> the output is the same pre- and post-change. I can add an arbitrary number
>>> of such tests to TestStringHelper.java, just specify the number you wish.
>>> >
>>> > If your worry is that my change breaches the inlining bytecode limit:
>>> Did you check whether the old version was inlineable or not? The new
>>> version is 263 bytecode instructions, the old version was 110. The default
>>> inlining limit appears to be 35 bytecode instructions on cursory checking
>>> (I may be wrong on this, though), so I don't think it was ever inlineable
>>> in default configs.
>>> >
>>> > On your statement "we haven't seen performance gains" -- the starting
>>> point of this thread was a friendly request to please point me to
>>> instructions for running a broad range of Lucene indexing benchmarks, so I
>>> can gather data for further discussion; from my perspective, we haven't
>>> even gathered any data, so obviously we haven't seen any gains.
>>> >
>>> > Cheers,
>>> > Thomas
>>> >
>>> > On Tue, Apr 25, 2023 at 4:27 PM Robert Muir <rcmuir@gmail.com> wrote:
>>> >>
>>> >> There is literally one string, all-ascii. This won't fail if all the
>>> >> shifts and masks are wrong.
>>> >>
>>> >> About the inlining, i'm not talking about cpu stuff, i'm talking about
>>> >> java. There are limits to the size of methods that get inlined (e.g.
>>> >> -XX:MaxInlineSize). If we make this method enormous like this, it may
>>> >> have performance consequences.
>>> >>
>>> >> We still haven't seen any performance gain from this. Elasticsearch
>>> >> putting huge unique IDs into indexed terms doesnt count.
>>> >>
>>> >> On Tue, Apr 25, 2023 at 10:25 AM Thomas Dullien
>>> >> <thomas.dullien@elastic.co> wrote:
>>> >> >
>>> >> > Hey,
>>> >> >
>>> >> > so there are unit tests in TestStringHelper.java that test strings
>>> of length greater than 8, and my change passes them. Could you explain what
>>> you want tested?
>>> >> >
>>> >> > Cheers,
>>> >> > Thomas
>>> >> >
>>> >> > On Tue, Apr 25, 2023 at 4:21 PM Robert Muir <rcmuir@gmail.com>
>>> wrote:
>>> >> >>
>>> >> >> sure, but "if length > 8 return 1" might pass these same tests too,
>>> >> >> yet cause a ton of hash collisions.
>>> >> >>
>>> >> >> I just think if you want to optimize for super-long strings, there
>>> >> >> should be a unit test.
>>> >> >>
>>> >> >> On Tue, Apr 25, 2023 at 10:20 AM Thomas Dullien
>>> >> >> <thomas.dullien@elastic.co> wrote:
>>> >> >> >
>>> >> >> > Hey,
>>> >> >> >
>>> >> >> > I am pretty confident about correctness. The change passes both
>>> Lucene and ES regression tests and my careful reading of the code is pretty
>>> certain that the output is the same. If you want me to randomly test the
>>> result for a few hundred million random strings, I'm happy to do that, too,
>>> if you have other suggestions for correctness testing, let me know.
>>> >> >> >
>>> >> >> > The change does increase the method size and may impact inlining
>>> - but so does literally any code change, particularly in a JIT'ed
>>> environment where placement of code (and hence things like instruction
>>> cache conflicts) depend on the precise history of execution. The way I
>>> understand it, one deals with this by benchmarking and measuring.
>>> >> >> >
>>> >> >> > FWIW, several indexing-heavy ES benchmarks show a noticeable
>>> improvement in indexing speed - this is why I was asking about a broad
>>> range of Lucene benchmarks; to verify that this is indeed the case for
>>> Lucene-only, too.
>>> >> >> >
>>> >> >> > Let me know what data you'd like to see to decide whether this
>>> patch is a good idea, and if there is consensus among the Lucene committers
>>> that those are reasonable criteria, I'll work on producing that data.
>>> >> >> >
>>> >> >> > Cheers,
>>> >> >> > Thomas
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > On Tue, Apr 25, 2023 at 4:02 PM Robert Muir <rcmuir@gmail.com>
>>> wrote:
>>> >> >> >>
>>> >> >> >> well there is some cost, as it must add additional checks to
>>> see if
>>> >> >> >> its longer than 8. in your patch, additional loops. it
>>> increases the
>>> >> >> >> method size and may impact inlining and other things. also we
>>> can't
>>> >> >> >> forget about correctness, if the hash function does the wrong
>>> thing it
>>> >> >> >> could slow everything to a crawl.
>>> >> >> >>
>>> >> >> >> On Tue, Apr 25, 2023 at 9:56 AM Thomas Dullien
>>> >> >> >> <thomas.dullien@elastic.co> wrote:
>>> >> >> >> >
>>> >> >> >> > Ah, I see what you mean.
>>> >> >> >> >
>>> >> >> >> > You are correct -- the change will not speed up a 5-byte
>>> word, but it *will* speed up all 8+-byte words, at no cost to the shorter
>>> words.
>>> >> >> >> >
>>> >> >> >> > On Tue, Apr 25, 2023 at 3:20 PM Robert Muir <rcmuir@gmail.com>
>>> wrote:
>>> >> >> >> >>
>>> >> >> >> >> if a word is of length 5, processing 8 bytes at a time isn't
>>> going to
>>> >> >> >> >> speed anything up. there aren't 8 bytes to process.
>>> >> >> >> >>
>>> >> >> >> >> On Tue, Apr 25, 2023 at 9:17 AM Thomas Dullien
>>> >> >> >> >> <thomas.dullien@elastic.co.invalid> wrote:
>>> >> >> >> >> >
>>> >> >> >> >> > Is average word length <= 4 realistic though? I mean, even
>>> the english wiki corpus has ~5, which would require two calls to the lucene
>>> layer instead of one; e.g. multiple layers of virtual dispatch that are
>>> unnecessary?
>>> >> >> >> >> >
>>> >> >> >> >> > You're not going to pay any cycles for reading 8 bytes
>>> instead of 4 bytes, so the cost of doing so will be the same - while
>>> speeding up in cases where 4 isn't quite enough?
>>> >> >> >> >> >
>>> >> >> >> >> > Cheers,
>>> >> >> >> >> > Thomas
>>> >> >> >> >> >
>>> >> >> >> >> > On Tue, Apr 25, 2023 at 3:07 PM Robert Muir <
>>> rcmuir@gmail.com> wrote:
>>> >> >> >> >> >>
>>> >> >> >> >> >> i think from my perspective it has nothing to do with
>>> cpus being
>>> >> >> >> >> >> 32-bit or 64-bit and more to do with the average length
>>> of terms in
>>> >> >> >> >> >> most languages being smaller than 8. for the languages
>>> with longer
>>> >> >> >> >> >> word length, its usually because of complex morphology
>>> that most users
>>> >> >> >> >> >> would stem away. so doing 4 bytes at a time seems optimal
>>> IMO.
>>> >> >> >> >> >> languages from nature don't care about your cpu.
>>> >> >> >> >> >>
>>> >> >> >> >> >> On Tue, Apr 25, 2023 at 8:52 AM Michael McCandless
>>> >> >> >> >> >> <lucene@mikemccandless.com> wrote:
>>> >> >> >> >> >> >
>>> >> >> >> >> >> > For a truly "pure" indexing test I usually use a single
>>> thread for indexing, and SerialMergeScheduler (using that single thread to
>>> also do single-threaded merging). It makes the indexing take forever lol
>>> but it produces "comparable" results.
>>> >> >> >> >> >> >
>>> >> >> >> >> >> > But ... this sounds like a great change anyway? Do we
>>> really need to gate it on benchmark results? Do we think there could be a
>>> downside e.g. slower indexing on (the dwindling) 32 bit CPUs?
>>> >> >> >> >> >> >
>>> >> >> >> >> >> > Mike McCandless
>>> >> >> >> >> >> >
>>> >> >> >> >> >> > http://blog.mikemccandless.com
>>> >> >> >> >> >> >
>>> >> >> >> >> >> >
>>> >> >> >> >> >> > On Tue, Apr 25, 2023 at 7:39 AM Robert Muir <
>>> rcmuir@gmail.com> wrote:
>>> >> >> >> >> >> >>
>>> >> >> >> >> >> >> I think the results of the benchmark will depend on
>>> the properties of
>>> >> >> >> >> >> >> the indexed terms. For english wikipedia (luceneutil)
>>> the average word
>>> >> >> >> >> >> >> length is around 5 bytes so this optimization may not
>>> do much.
>>> >> >> >> >> >> >>
>>> >> >> >> >> >> >> On Tue, Apr 25, 2023 at 1:58 AM Patrick Zhai <
>>> zhai7631@gmail.com> wrote:
>>> >> >> >> >> >> >> >
>>> >> >> >> >> >> >> > I did a quick run with your patch, but since I
>>> turned on the CMS as well as TieredMergePolicy I'm not sure how fair the
>>> comparison is. Here's the result:
>>> >> >> >> >> >> >> > Candidate:
>>> >> >> >> >> >> >> > Indexer: indexing done (890209 msec); total 33332620
>>> docs
>>> >> >> >> >> >> >> > Indexer: waitForMerges done (71622 msec)
>>> >> >> >> >> >> >> > Indexer: finished (961877 msec)
>>> >> >> >> >> >> >> > Baseline:
>>> >> >> >> >> >> >> > Indexer: indexing done (909706 msec); total 33332620
>>> docs
>>> >> >> >> >> >> >> > Indexer: waitForMerges done (54775 msec)
>>> >> >> >> >> >> >> > Indexer: finished (964528 msec)
>>> >> >> >> >> >> >> >
>>> >> >> >> >> >> >> > For more accurate comparison I guess it's better to
>>> use LogxxMergePolicy and turn off CMS? If you want to run it yourself you
>>> can find the lines I quoted from the log file.
>>> >> >> >> >> >> >> >
>>> >> >> >> >> >> >> > Patrick
>>> >> >> >> >> >> >> >
>>> >> >> >> >> >> >> > On Mon, Apr 24, 2023 at 12:34 PM Thomas Dullien <
>>> thomas.dullien@elastic.co.invalid> wrote:
>>> >> >> >> >> >> >> >>
>>> >> >> >> >> >> >> >> Hey all,
>>> >> >> >> >> >> >> >>
>>> >> >> >> >> >> >> >> I've been experimenting with fixing some
>>> low-hanging performance fruit in the ElasticSearch codebase, and came
>>> across the fact that the MurmurHash implementation that is used by
>>> ByteRef.hashCode() is reading 4 bytes per loop iteration (which is likely
>>> an artifact from 32-bit architectures, which are ever-less-important). I
>>> made a small fix to change the implementation to read 8 bytes per loop
>>> iteration; I expected a very small impact (2-3% CPU or so over an indexing
>>> run in ElasticSearch), but got a pretty nontrivial throughput improvement
>>> over a few indexing benchmarks.
>>> >> >> >> >> >> >> >>
>>> >> >> >> >> >> >> >> I tried running Lucene-only benchmarks, and
>>> succeeded in running the example from
>>> https://github.com/mikemccand/luceneutil - but I couldn't figure out
>>> how to run indexing benchmarks and how to interpret the results.
>>> >> >> >> >> >> >> >>
>>> >> >> >> >> >> >> >> Could someone help me in running the benchmarks for
>>> the attached patch?
>>> >> >> >> >> >> >> >>
>>> >> >> >> >> >> >> >> Cheers,
>>> >> >> >> >> >> >> >> Thomas
>>> >> >> >> >> >> >> >>
>>> >> >> >> >> >> >> >>
>>> ---------------------------------------------------------------------
>>> >> >> >> >> >> >> >> To unsubscribe, e-mail:
>>> dev-unsubscribe@lucene.apache.org
>>> >> >> >> >> >> >> >> For additional commands, e-mail:
>>> dev-help@lucene.apache.org
>>> >> >> >> >> >> >>
>>> >> >> >> >> >> >>
>>> ---------------------------------------------------------------------
>>> >> >> >> >> >> >> To unsubscribe, e-mail:
>>> dev-unsubscribe@lucene.apache.org
>>> >> >> >> >> >> >> For additional commands, e-mail:
>>> dev-help@lucene.apache.org
>>> >> >> >> >> >> >>
>>> >> >> >> >> >>
>>> >> >> >> >> >>
>>> ---------------------------------------------------------------------
>>> >> >> >> >> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> >> >> >> >> >> For additional commands, e-mail:
>>> dev-help@lucene.apache.org
>>> >> >> >> >> >>
>>>
>> <murmur-tests.patch>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>
>
Re: Patch to change murmurhash implementation slightly [ In reply to ]
Hey all,

another data point: There's a diagram with the relevant distributions of
word lengths in various languages here:

https://www.reddit.com/r/languagelearning/comments/h9eao2/average_word_length_of_languages_in_europe_except/

While English is close to the 8-byte limit, average word length in German
is 11+ bytes, and Mongolian and Finnish will likewise be 11+ bytes. I'll
gather some averages over the various Wikipedia indices.
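
As a rough sketch of what I mean by "gather some averages" (crude non-letter
splitting rather than a real Lucene analyzer, so the absolute numbers will be
approximate; since the hash sees UTF-8 bytes, I count bytes, not characters):

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

/** Rough average-token-byte-length counter; not a real Lucene analyzer. */
final class AvgTokenLen {
  // Crude tokenizer: split on any run of non-letter characters.
  private static final Pattern NON_LETTER = Pattern.compile("[^\\p{L}]+");

  static double averageUtf8Bytes(String text) {
    long tokens = 0, bytes = 0;
    for (String tok : NON_LETTER.split(text)) {
      if (tok.isEmpty()) continue; // split() can yield a leading empty string
      tokens++;
      bytes += tok.getBytes(StandardCharsets.UTF_8).length; // byte length, not char count
    }
    return tokens == 0 ? 0.0 : (double) bytes / tokens;
  }
}
```

Run over a plain-text dump of each Wikipedia, something like this should give
the per-language averages I mentioned.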

Cheers,
Thomas

Re: Patch to change murmurhash implementation slightly [ In reply to ]
chart is wrong, average word length for english is like 5.

On Fri, Aug 25, 2023 at 9:35 AM Thomas Dullien
<thomas.dullien@elastic.co.invalid> wrote:
>
> Hey all,
>
> another data point: There's a diagram with the relevant distributions of word lengths in various languages here:
>
> https://www.reddit.com/r/languagelearning/comments/h9eao2/average_word_length_of_languages_in_europe_except/
>
> While English is close to the 8-byte limit, average word length in German is 11+ bytes, and Mongolian and Finnish will likewise be 11+ bytes. I'll gather some averages over the various Wikipedia indices.
>
> Cheers,
> Thomas
>
> On Thu, Aug 24, 2023 at 2:09 PM Thomas Dullien <thomas.dullien@elastic.co> wrote:
>>
>> Hey there,
>>
>> reviving this thread. To clarify: In order to show this patch is worth doing, I should index a bunch of natural-language documents (whichever language that is) and show that the patch brings a performance benefit?
>>
>> (Just clarifying, because at least inside ElasticSearch for the logs use-case, it turns out that it does provide a performance benefit -- but I want to make sure I understand what the Lucene community wishes to see as "evidence" this is worth pursuing :-)
>>
>> Cheers,
>> Thomas
>>
>> On Tue, Apr 25, 2023 at 8:14 PM Walter Underwood <wunder@wunderwood.org> wrote:
>>>
>>> I would recommend some non-English tests. Non-Latin scripts (CJK, Arabic, Hebrew) will have longer byte strings because of UTF8. German has large compound words.
>>>
>>> wunder
>>> Walter Underwood
>>> wunder@wunderwood.org
>>> http://observer.wunderwood.org/ (my blog)
>>>
>>> On Apr 25, 2023, at 10:57 AM, Thomas Dullien <thomas.dullien@elastic.co.INVALID> wrote:
>>>
>>> Hey all,
>>>
>>> ok, attached is a second patch that adds some unit tests; I am happy to add more.
>>>
>>> This brings me back to my original question: I'd like to run some pretty thorough benchmarking on Lucene, both for this change and for possible other future changes, largely focused on indexing performance. What are good command lines to do so? What are good corpora?
>>>
>>> Cheers,
>>> Thomas
>>>
>>> On Tue, Apr 25, 2023 at 6:04 PM Thomas Dullien <thomas.dullien@elastic.co> wrote:
>>>>
>>>> Hey,
>>>>
>>>> ok, I've done some digging: Unfortunately, MurmurHash3 does not publish official test vectors, see the following URLs:
>>>> https://github.com/aappleby/smhasher/issues/6
>>>> https://github.com/multiformats/go-multihash/issues/135#issuecomment-791178958
>>>> There is a link to a pastebin entry in the first issue, which leads to https://pastebin.com/kkggV9Vx
>>>>
>>>> Now, the test vectors in that pastebin do not match either the output of pre-change Lucene's murmur3, nor the output of the Python mmh3 package. That said, the pre-change Lucene and the mmh3 package agree, just not with the published list.
>>>>
>>>> There *are* test vectors in the source code for the mmh3 python package, which I could use, or cook up a set of bespoke ones, or both (I share the concern about 8-byte boundaries and signedness).
>>>> https://github.com/hajimes/mmh3/blob/3bf1e5aef777d701305c1be7ad0550e093038902/test_mmh3.py#L75
>>>>
>>>> Cheers,
>>>> Thomas
>>>>
>>>> On Tue, Apr 25, 2023 at 5:15 PM Robert Muir <rcmuir@gmail.com> wrote:
>>>>>
>>>>> i dont think we need a ton of random strings. But if you want to
>>>>> optimize for strings of length 8, at a minimum there should be very
>>>>> simple tests ensuring correctness for some boundary conditions (e.g.
>>>>> string of length exactly 8). i would also strongly recommend testing
>>>>> non-ascii since java is a language with signed integer types so it may
>>>>> be susceptible to bugs where the input bytes have the "sign bit" set.
>>>>>
>>>>> IMO this could be 2 simple unit tests.
>>>>>
>>>>> usually at least with these kinds of algorithms you can also find
>>>>> published "test vectors" that intend to seek out the corner cases. if
>>>>> these exist for murmurhash, we should fold them in too.
>>>>>
>>>>> On Tue, Apr 25, 2023 at 11:08 AM Thomas Dullien
>>>>> <thomas.dullien@elastic.co> wrote:
>>>>> >
>>>>> > Hey,
>>>>> >
>>>>> > I offered to run a large number of random-string-hashes to ensure that the output is the same pre- and post-change. I can add an arbitrary number of such tests to TestStringHelper.java, just specify the number you wish.
>>>>> >
>>>>> > If your worry is that my change breaches the inlining bytecode limit: Did you check whether the old version was inlineable or not? The new version is 263 bytecode instructions, the old version was 110. The default inlining limit appears to be 35 bytecode instructions on cursory checking (I may be wrong on this, though), so I don't think it was ever inlineable in default configs.
>>>>> >
>>>>> > On your statement "we haven't seen performance gains" -- the starting point of this thread was a friendly request to please point me to instructions for running a broad range of Lucene indexing benchmarks, so I can gather data for further discussion; from my perspective, we haven't even gathered any data, so obviously we haven't seen any gains.
>>>>> >
>>>>> > Cheers,
>>>>> > Thomas
>>>>> >
>>>>> > On Tue, Apr 25, 2023 at 4:27 PM Robert Muir <rcmuir@gmail.com> wrote:
>>>>> >>
>>>>> >> There is literally one string, all-ascii. This won't fail if all the
>>>>> >> shifts and masks are wrong.
>>>>> >>
>>>>> >> About the inlining, i'm not talking about cpu stuff, i'm talking about
>>>>> >> java. There are limits to the size of methods that get inlined (e.g.
>>>>> >> -XX:MaxInlineSize). If we make this method enormous like this, it may
>>>>> >> have performance consequences.
>>>>> >>
>>>>> >> We still haven't seen any performance gain from this. Elasticsearch
>>>>> >> putting huge unique IDs into indexed terms doesnt count.
>>>>> >>
>>>>> >> On Tue, Apr 25, 2023 at 10:25 AM Thomas Dullien
>>>>> >> <thomas.dullien@elastic.co> wrote:
>>>>> >> >
>>>>> >> > Hey,
>>>>> >> >
>>>>> >> > so there are unit tests in TestStringHelper.java that test strings of length greater than 8, and my change passes them. Could you explain what you want tested?
>>>>> >> >
>>>>> >> > Cheers,
>>>>> >> > Thomas
>>>>> >> >
>>>>> >> > On Tue, Apr 25, 2023 at 4:21 PM Robert Muir <rcmuir@gmail.com> wrote:
>>>>> >> >>
>>>>> >> >> sure, but "if length > 8 return 1" might pass these same tests too,
>>>>> >> >> yet cause a ton of hash collisions.
>>>>> >> >>
>>>>> >> >> I just think if you want to optimize for super-long strings, there
>>>>> >> >> should be a unit test.
>>>>> >> >>
>>>>> >> >> On Tue, Apr 25, 2023 at 10:20 AM Thomas Dullien
>>>>> >> >> <thomas.dullien@elastic.co> wrote:
>>>>> >> >> >
>>>>> >> >> > Hey,
>>>>> >> >> >
>>>>> >> >> > I am pretty confident about correctness. The change passes both Lucene and ES regression tests and my careful reading of the code is pretty certain that the output is the same. If you want me to randomly test the result for a few hundred million random strings, I'm happy to do that, too, if you have other suggestions for correctness testing, let me know.
>>>>> >> >> >
>>>>> >> >> > The change does increase the method size and may impact inlining - but so does literally any code change, particularly in a JIT'ed environment where placement of code (and hence things like instruction cache conflicts) depend on the precise history of execution. The way I understand it, one deals with this by benchmarking and measuring.
>>>>> >> >> >
>>>>> >> >> > FWIW, several indexing-heavy ES benchmarks show a noticeable improvement in indexing speed - this is why I was asking about a broad range of Lucene benchmarks; to verify that this is indeed the case for Lucene-only, too.
>>>>> >> >> >
>>>>> >> >> > Let me know what data you'd like to see to decide whether this patch is a good idea, and if there is consensus among the Lucene committers that those are reasonable criteria, I'll work on producing that data.
>>>>> >> >> >
>>>>> >> >> > Cheers,
>>>>> >> >> > Thomas
>>>>> >> >> >
>>>>> >> >> >
>>>>> >> >> >
>>>>> >> >> > On Tue, Apr 25, 2023 at 4:02 PM Robert Muir <rcmuir@gmail.com> wrote:
>>>>> >> >> >>
>>>>> >> >> >> well there is some cost, as it must add additional checks to see if
>>>>> >> >> >> its longer than 8. in your patch, additional loops. it increases the
>>>>> >> >> >> method size and may impact inlining and other things. also we can't
>>>>> >> >> >> forget about correctness, if the hash function does the wrong thing it
>>>>> >> >> >> could slow everything to a crawl.
>>>>> >> >> >>
>>>>> >> >> >> On Tue, Apr 25, 2023 at 9:56 AM Thomas Dullien
>>>>> >> >> >> <thomas.dullien@elastic.co> wrote:
>>>>> >> >> >> >
>>>>> >> >> >> > Ah, I see what you mean.
>>>>> >> >> >> >
>>>>> >> >> >> > You are correct -- the change will not speed up a 5-byte word, but it *will* speed up all 8+-byte words, at no cost to the shorter words.
>>>>> >> >> >> >
>>>>> >> >> >> > On Tue, Apr 25, 2023 at 3:20 PM Robert Muir <rcmuir@gmail.com> wrote:
>>>>> >> >> >> >>
>>>>> >> >> >> >> if a word is of length 5, processing 8 bytes at a time isn't going to
>>>>> >> >> >> >> speed anything up. there aren't 8 bytes to process.
>>>>> >> >> >> >>
>>>>> >> >> >> >> On Tue, Apr 25, 2023 at 9:17 AM Thomas Dullien
>>>>> >> >> >> >> <thomas.dullien@elastic.co.invalid> wrote:
>>>>> >> >> >> >> >
>>>>> >> >> >> >> > Is average word length <= 4 realistic though? I mean, even the english wiki corpus has ~5, which would require two calls to the lucene layer instead of one; e.g. multiple layers of virtual dispatch that are unnecessary?
>>>>> >> >> >> >> >
>>>>> >> >> >> >> > You're not going to pay any cycles for reading 8 bytes instead of 4 bytes, so the cost of doing so will be the same - while speeding up in cases where 4 isn't quite enough?
>>>>> >> >> >> >> >
>>>>> >> >> >> >> > Cheers,
>>>>> >> >> >> >> > Thomas
>>>>> >> >> >> >> >
>>>>> >> >> >> >> > On Tue, Apr 25, 2023 at 3:07 PM Robert Muir <rcmuir@gmail.com> wrote:
>>>>> >> >> >> >> >>
>>>>> >> >> >> >> >> i think from my perspective it has nothing to do with cpus being
>>>>> >> >> >> >> >> 32-bit or 64-bit and more to do with the average length of terms in
>>>>> >> >> >> >> >> most languages being smaller than 8. for the languages with longer
>>>>> >> >> >> >> >> word length, its usually because of complex morphology that most users
>>>>> >> >> >> >> >> would stem away. so doing 4 bytes at a time seems optimal IMO.
>>>>> >> >> >> >> >> languages from nature don't care about your cpu.
>>>>> >> >> >> >> >>
>>>>> >> >> >> >> >> On Tue, Apr 25, 2023 at 8:52 AM Michael McCandless
>>>>> >> >> >> >> >> <lucene@mikemccandless.com> wrote:
>>>>> >> >> >> >> >> >
>>>>> >> >> >> >> >> > For a truly "pure" indexing test I usually use a single thread for indexing, and SerialMergeScheduler (using that single thread to also do single-threaded merging). It makes the indexing take forever lol but it produces "comparable" results.
>>>>> >> >> >> >> >> >
>>>>> >> >> >> >> >> > But ... this sounds like a great change anyway? Do we really need to gate it on benchmark results? Do we think there could be a downside e.g. slower indexing on (the dwindling) 32 bit CPUs?
>>>>> >> >> >> >> >> >
>>>>> >> >> >> >> >> > Mike McCandless
>>>>> >> >> >> >> >> >
>>>>> >> >> >> >> >> > http://blog.mikemccandless.com
>>>>> >> >> >> >> >> >
>>>>> >> >> >> >> >> >
>>>>> >> >> >> >> >> > On Tue, Apr 25, 2023 at 7:39 AM Robert Muir <rcmuir@gmail.com> wrote:
>>>>> >> >> >> >> >> >>
>>>>> >> >> >> >> >> >> I think the results of the benchmark will depend on the properties of
>>>>> >> >> >> >> >> >> the indexed terms. For english wikipedia (luceneutil) the average word
>>>>> >> >> >> >> >> >> length is around 5 bytes so this optimization may not do much.
>>>>> >> >> >> >> >> >>
>>>>> >> >> >> >> >> >> On Tue, Apr 25, 2023 at 1:58 AM Patrick Zhai <zhai7631@gmail.com> wrote:
>>>>> >> >> >> >> >> >> >
>>>>> >> >> >> >> >> >> > I did a quick run with your patch, but since I turned on the CMS as well as TieredMergePolicy I'm not sure how fair the comparison is. Here's the result:
>>>>> >> >> >> >> >> >> > Candidate:
>>>>> >> >> >> >> >> >> > Indexer: indexing done (890209 msec); total 33332620 docs
>>>>> >> >> >> >> >> >> > Indexer: waitForMerges done (71622 msec)
>>>>> >> >> >> >> >> >> > Indexer: finished (961877 msec)
>>>>> >> >> >> >> >> >> > Baseline:
>>>>> >> >> >> >> >> >> > Indexer: indexing done (909706 msec); total 33332620 docs
>>>>> >> >> >> >> >> >> > Indexer: waitForMerges done (54775 msec)
>>>>> >> >> >> >> >> >> > Indexer: finished (964528 msec)
>>>>> >> >> >> >> >> >> >
>>>>> >> >> >> >> >> >> > For more accurate comparison I guess it's better to use LogxxMergePolicy and turn off CMS? If you want to run it yourself you can find the lines I quoted from the log file.
>>>>> >> >> >> >> >> >> >
>>>>> >> >> >> >> >> >> > Patrick
>>>>> >> >> >> >> >> >> >
>>>>> >> >> >> >> >> >> > On Mon, Apr 24, 2023 at 12:34 PM Thomas Dullien <thomas.dullien@elastic.co.invalid> wrote:
>>>>> >> >> >> >> >> >> >>
>>>>> >> >> >> >> >> >> >> Hey all,
>>>>> >> >> >> >> >> >> >>
>>>>> >> >> >> >> >> >> >> I've been experimenting with fixing some low-hanging performance fruit in the ElasticSearch codebase, and came across the fact that the MurmurHash implementation that is used by ByteRef.hashCode() is reading 4 bytes per loop iteration (which is likely an artifact from 32-bit architectures, which are ever-less-important). I made a small fix to change the implementation to read 8 bytes per loop iteration; I expected a very small impact (2-3% CPU or so over an indexing run in ElasticSearch), but got a pretty nontrivial throughput improvement over a few indexing benchmarks.
>>>>> >> >> >> >> >> >> >>
>>>>> >> >> >> >> >> >> >> I tried running Lucene-only benchmarks, and succeeded in running the example from https://github.com/mikemccand/luceneutil - but I couldn't figure out how to run indexing benchmarks and how to interpret the results.
>>>>> >> >> >> >> >> >> >>
>>>>> >> >> >> >> >> >> >> Could someone help me in running the benchmarks for the attached patch?
>>>>> >> >> >> >> >> >> >>
>>>>> >> >> >> >> >> >> >> Cheers,
>>>>> >> >> >> >> >> >> >> Thomas
>>>>> >> >> >> >> >> >> >>
>>>
>>> <murmur-tests.patch>
>>>
>>>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: Patch to change murmurhash implementation slightly [ In reply to ]
Hey all,

apologies if the chart is incorrect. Anyhow, I think the more important
questions are:

1) Which benchmarks does the Lucene community have that y'all would like to
see an improvement on before accepting this (or any other future)
performance patches?

I'm guessing the reason why the patch improves http_log performance is
because that benchmark indexes many IP addresses, which tend to be 9-15
bytes in length. That does not strike me as an atypical workload.
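The 9-15 byte figure is easy to sanity-check: a dotted quad is 7 bytes at minimum ("0.0.0.0") and 15 at maximum ("255.255.255.255"), and for uniformly random octets the expected length works out to roughly 13.3 bytes — comfortably past the 8-byte mark. A quick sketch (uniform octets are an assumption; real http_log traffic will skew differently, and the helper name is illustrative):

```python
import random

def ipv4_len(a: int, b: int, c: int, d: int) -> int:
    """Byte length of the dotted-quad string for one IPv4 address."""
    return len(f"{a}.{b}.{c}.{d}")

assert ipv4_len(0, 0, 0, 0) == 7            # shortest possible
assert ipv4_len(255, 255, 255, 255) == 15   # longest possible

# Per octet: P(1 digit) = 10/256, P(2) = 90/256, P(3) = 156/256,
# so E[digits] = (10 + 180 + 468) / 256 ≈ 2.57 and
# E[address length] = 4 * 2.57 + 3 dots ≈ 13.3 bytes.
random.seed(0)
lengths = [ipv4_len(*(random.randrange(256) for _ in range(4)))
           for _ in range(100_000)]
avg = sum(lengths) / len(lengths)
assert 13.1 < avg < 13.5, avg
```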

I've also done some quick experiments to estimate the average UTF-8 word
size of languages in non-ASCII scripts (for example Hindi), and it seems to
me the average word size is 2-3 times larger than English because most
Indic characters encode to 3 bytes in UTF-8. The following excerpt from Hindi
Wikipedia is 2242 bytes, but just 146 words:
==== begin hindi-example.txt ====
[Devanagari text not preserved by the archive encoding: 2242 bytes of Hindi prose interspersed with {{sfn|...}} citation templates such as {{sfn|Ludden|2002|p = 68}} and {{sfn|Asher|Talbot|2008|p = 53}}]
==== end hindi-example.txt ====

cat hindi-example.txt | wc -w
146
2242 divided by 146 yields an average word length of ~15 bytes, so I'd be
surprised if the average word length of Hindi Wikipedia were below 12 bytes.
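The same estimate can be scripted for any corpus. The helper below is illustrative and uses whitespace tokenization only, so wiki markup such as the {{sfn|...}} templates above would inflate the average somewhat:

```python
def avg_utf8_word_len(text: str) -> float:
    """Average length, in UTF-8 bytes, of whitespace-separated words."""
    words = text.split()
    return sum(len(w.encode("utf-8")) for w in words) / len(words)

# ASCII text: one byte per character.
assert avg_utf8_word_len("the quick brown fox") == 4.0

# Devanagari code points each take 3 bytes in UTF-8, so even a short
# Hindi word clears the 8-byte boundary: "भारत" is 4 code points, 12 bytes.
assert len("भारत".encode("utf-8")) == 12
```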

Do y'all wish for me to create another benchmark for indexing Indic scripts
and large corpora of IPv4 and IPv6 addresses (both things that seem not
very niche), and if the patch shows improvement there, will y'all accept it?

2) It'd be great if there was some public documentation about "what we want
to see from a performance improvement before we accept it". I personally
find the discussion around this patch to be bewilderingly long, and I am
happy to help work on such a guideline with others -- IMO having a clear
set of hurdles is preferable to the back-and-forth we've had so far?

Cheers,
Thomas





On Fri, Aug 25, 2023 at 3:38 PM Robert Muir <rcmuir@gmail.com> wrote:

> chart is wrong, average word length for english is like 5.
>
> On Fri, Aug 25, 2023 at 9:35 AM Thomas Dullien
> <thomas.dullien@elastic.co.invalid> wrote:
> >
> > Hey all,
> >
> > another data point: There's a diagram with the relevant distributions of
> word lengths in various languages here:
> >
> >
> https://www.reddit.com/r/languagelearning/comments/h9eao2/average_word_length_of_languages_in_europe_except/
> >
> > While English is close to the 8-byte limit, average word length in
> German is 11+ bytes, and Mongolian and Finnish will likewise be 11+ bytes.
> I'll gather some averages over the various Wikipedia indices.
> >
> > Cheers,
> > Thomas
> >
> >> On Thu, Aug 24, 2023 at 2:09 PM Thomas Dullien <
> thomas.dullien@elastic.co> wrote:
> >>
> >> Hey there,
> >>
> >> reviving this thread. To clarify: In order to show this patch is worth
> doing, I should index a bunch of natural-language documents (whichever
> language that is) and show that the patch brings a performance benefit?
> >>
> >> (Just clarifying, because at least inside ElasticSearch for the logs
> use-case, it turns out that it does provide a performance benefit -- but I
> want to make sure I understand what the Lucene community wishes to see as
> "evidence" this is worth pursuing :-)
> >>
> >> Cheers,
> >> Thomas
> >>
> >>> On Tue, Apr 25, 2023 at 8:14 PM Walter Underwood <wunder@wunderwood.org>
> wrote:
> >>>
> >>> I would recommend some non-English tests. Non-Latin scripts (CJK,
> Arabic, Hebrew) will have longer byte strings because of UTF8. German has
> large compound words.
> >>>
> >>> wunder
> >>> Walter Underwood
> >>> wunder@wunderwood.org
> >>> http://observer.wunderwood.org/ (my blog)
> >>>
> >>> On Apr 25, 2023, at 10:57 AM, Thomas Dullien <
> thomas.dullien@elastic.co.INVALID> wrote:
> >>>
> >>> Hey all,
> >>>
> >>> ok, attached is a second patch that adds some unit tests; I am happy
> to add more.
> >>>
> >>> This brings me back to my original question: I'd like to run some
> pretty thorough benchmarking on Lucene, both for this change and for
> possible other future changes, largely focused on indexing performance.
> What are good command lines to do so? What are good corpora?
> >>>
> >>> Cheers,
> >>> Thomas
> >>>
> >>> [...]
Re: Patch to change murmurhash implementation slightly [ In reply to ]
Hi Thomas,

Thank you for the hard work thus far. I'm excited to see if the community
can benefit from the work. The best way to use the lucene bench is to run
the baseline and candidate branches as described here
<https://github.com/mikemccand/luceneutil#preparing-the-benchmark-candidates>
.

I can help you with it and even submit an update to the benchmark repo as
needed if we find that we can improve some of the steps there to make it
easier for onlookers. Have you already tried setting up luceneutil?

- Marcus



On Fri, Aug 25, 2023 at 6:34 PM Thomas Dullien
<thomas.dullien@elastic.co.invalid> wrote:

> Hey all,
>
> apologies if the chart is incorrect. Anyhow, I think the more important
> questions are:
>
> 1) Which benchmarks does the Lucene community have that y'all would like
> to see an improvement on before accepting this (or any other future)
> performance patches?
>
> I'm guessing the reason why the patch improves http_log performance is
> because that benchmark indexes many IP addresses, which tend to be 9-15
> bytes in length. That does not strike me as an atypical workload.
>
> I've also done some quick experiments to estimate the average UTF-8 word
> size of languages in non-ASCII scripts (for example Hindi), and it seems to
> me the average word size is 2-3 times larger than english because most
> indic characters will encode to 2-3 bytes. The following excerpt from Hindi
> Wikipedia is 2242 bytes, but just 146 words
> ==== begin hindi-example.txt ====
> [Devanagari text not preserved by the archive encoding: 2242 bytes of Hindi prose interspersed with {{sfn|...}} citation templates]
> ==== end hindi-example.txt ====
>
> cat hindi-example.txt | wc -w
> 146
> 2242 divided by 146 yields a word length of ~15 bytes, so I'd be surprised
> if average word length of Hindi wikipedia was below 12 bytes.
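The bytes-per-word arithmetic above is easy to reproduce; here is a hedged Python sketch (the short Devanagari sample below is an illustrative assumption, not the Wikipedia excerpt):

```python
def avg_word_bytes(text: str) -> float:
    # Mirror the email's arithmetic: total UTF-8 bytes (including
    # separators) divided by the `wc -w` whitespace-word count.
    return len(text.encode("utf-8")) / len(text.split())

# ASCII stays near one byte per character
assert avg_word_bytes("the quick brown fox") == 19 / 4  # 4.75 bytes/word
# Devanagari code points each encode to 3 UTF-8 bytes
assert avg_word_bytes(
    "\u0928\u092e\u0938\u094d\u0924\u0947 \u0926\u0941\u0928\u093f\u092f\u093e"
) == 37 / 2
```

On a real corpus this would be run over the tokenizer's output rather than raw whitespace splits, since analysis chains change term lengths.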
>
> Do y'all wish for me to create another benchmark for indexing indic
> scripts and large corpora of IPv4 and IPv6 addresses (both things that seem
> not very niche), and if the patch shows improvement there, will y'all
> accept it?
>
> 2) It'd be great if there was some public documentation about "what we
> want to see from a performance improvement before we accept it". I
> personally find the discussion around this patch to be bewilderingly long,
> and I am happy to help work on such a guideline with others -- IMO having a
> clear set of hurdles is preferable to the back-and-forth we've had so far?
>
> Cheers,
> Thomas
>
>
>
>
>
> On Fri, Aug 25, 2023 at 3:38 PM Robert Muir <rcmuir@gmail.com> wrote:
>
>> chart is wrong, average word length for english is like 5.
>>
>> On Fri, Aug 25, 2023 at 9:35 AM Thomas Dullien
>> <thomas.dullien@elastic.co.invalid> wrote:
>> >
>> > Hey all,
>> >
>> > another data point: There's a diagram with the relevant distributions
>> of word lengths in various languages here:
>> >
>> >
>> https://www.reddit.com/r/languagelearning/comments/h9eao2/average_word_length_of_languages_in_europe_except/
>> >
>> > While English is close to the 8-byte limit, average word length in
>> German is 11+ bytes, and Mongolian and Finnish will likewise be 11+ bytes.
>> I'll gather some averages over the various Wikipedia indices.
>> >
>> > Cheers,
>> > Thomas
>> >
>> > On Thu, Aug 24, 2023 at 2:09 PM Thomas Dullien <
>> thomas.dullien@elastic.co> wrote:
>> >>
>> >> Hey there,
>> >>
>> >> reviving this thread. To clarify: In order to show this patch is worth
>> doing, I should index a bunch of natural-language documents (whichever
>> language that is) and show that the patch brings a performance benefit?
>> >>
>> >> (Just clarifying, because at least inside ElasticSearch for the logs
>> use-case, it turns out that it does provide a performance benefit -- but I
>> want to make sure I understand what the Lucene community wishes to see as
>> "evidence" this is worth pursuing :-)
>> >>
>> >> Cheers,
>> >> Thomas
>> >>
>> >> On Tue, Apr 25, 2023 at 8:14 PM Walter Underwood <
>> wunder@wunderwood.org> wrote:
>> >>>
>> >>> I would recommend some non-English tests. Non-Latin scripts (CJK,
>> Arabic, Hebrew) will have longer byte strings because of UTF8. German has
>> large compound words.
>> >>>
>> >>> wunder
>> >>> Walter Underwood
>> >>> wunder@wunderwood.org
>> >>> http://observer.wunderwood.org/ (my blog)
>> >>>
>> >>> On Apr 25, 2023, at 10:57 AM, Thomas Dullien <
>> thomas.dullien@elastic.co.INVALID> wrote:
>> >>>
>> >>> Hey all,
>> >>>
>> >>> ok, attached is a second patch that adds some unit tests; I am happy
>> to add more.
>> >>>
>> >>> This brings me back to my original question: I'd like to run some
>> pretty thorough benchmarking on Lucene, both for this change and for
>> possible other future changes, largely focused on indexing performance.
>> What are good command lines to do so? What are good corpora?
>> >>>
>> >>> Cheers,
>> >>> Thomas
>> >>>
>> >>> On Tue, Apr 25, 2023 at 6:04 PM Thomas Dullien <
>> thomas.dullien@elastic.co> wrote:
>> >>>>
>> >>>> Hey,
>> >>>>
>> >>>> ok, I've done some digging: Unfortunately, MurmurHash3 does not
>> publish official test vectors, see the following URLs:
>> >>>> https://github.com/aappleby/smhasher/issues/6
>> >>>>
>> https://github.com/multiformats/go-multihash/issues/135#issuecomment-791178958
>> >>>> There is a link to a pastebin entry in the first issue, which leads
>> to https://pastebin.com/kkggV9Vx
>> >>>>
>> >>>> Now, the test vectors in that pastebin do not match either the
>> output of pre-change Lucene's murmur3, nor the output of the Python mmh3
>> package. That said, the pre-change Lucene and the mmh3 package agree, just
>> not with the published list.
>> >>>>
>> >>>> There *are* test vectors in the source code for the mmh3 python
>> package, which I could use, or cook up a set of bespoke ones, or both (I
>> share the concern about 8-byte boundaries and signedness).
>> >>>>
>> https://github.com/hajimes/mmh3/blob/3bf1e5aef777d701305c1be7ad0550e093038902/test_mmh3.py#L75
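In the absence of official vectors, one way to generate bespoke boundary-condition tests is from an independent reference implementation. Below is a minimal pure-Python sketch of MurmurHash3 x86_32 (a hedged illustration, not Lucene's StringHelper; the two empty-input values are the only externally assumed vectors):

```python
def murmur3_x86_32(data: bytes, seed: int = 0) -> int:
    """MurmurHash3 x86_32: 4-byte little-endian blocks, a 1-3 byte
    tail, then the fmix32 avalanche."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & 0xFFFFFFFF
    nblocks = len(data) & ~3
    for i in range(0, nblocks, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF  # rotl32(k, 15)
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
        h = ((h << 13) | (h >> 19)) & 0xFFFFFFFF  # rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF
    # tail: trailing bytes that do not fill a 4-byte block
    k, tail = 0, data[nblocks:]
    if len(tail) >= 3:
        k ^= tail[2] << 16
    if len(tail) >= 2:
        k ^= tail[1] << 8
    if len(tail) >= 1:
        k ^= tail[0]
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
    # finalization (fmix32)
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h

# commonly quoted empty-input vectors
assert murmur3_x86_32(b"") == 0x00000000
assert murmur3_x86_32(b"", seed=1) == 0x514E28B7
# boundary lengths around an 8-byte read, with the "sign bit" set
for n in (7, 8, 9):
    assert 0 <= murmur3_x86_32(b"\xff" * n) <= 0xFFFFFFFF
```

Lengths 7, 8, and 9 with high-bit bytes cover exactly the boundary and signedness concerns raised earlier in the thread.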
>> >>>>
>> >>>> Cheers,
>> >>>> Thomas
>> >>>>
>> >>>> On Tue, Apr 25, 2023 at 5:15 PM Robert Muir <rcmuir@gmail.com>
>> wrote:
>> >>>>>
>> >>>>> i dont think we need a ton of random strings. But if you want to
>> >>>>> optimize for strings of length 8, at a minimum there should be very
>> >>>>> simple tests ensuring correctness for some boundary conditions (e.g.
>> >>>>> string of length exactly 8). i would also strongly recommend testing
>> >>>>> non-ascii since java is a language with signed integer types so it
>> may
>> >>>>> be susceptible to bugs where the input bytes have the "sign bit"
>> set.
>> >>>>>
>> >>>>> IMO this could be 2 simple unit tests.
>> >>>>>
>> >>>>> usually at least with these kinds of algorithms you can also find
>> >>>>> published "test vectors" that intend to seek out the corner cases.
>> if
>> >>>>> these exist for murmurhash, we should fold them in too.
>> >>>>>
>> >>>>> On Tue, Apr 25, 2023 at 11:08 AM Thomas Dullien
>> >>>>> <thomas.dullien@elastic.co> wrote:
>> >>>>> >
>> >>>>> > Hey,
>> >>>>> >
>> >>>>> > I offered to run a large number of random-string-hashes to ensure
>> that the output is the same pre- and post-change. I can add an arbitrary
>> number of such tests to TestStringHelper.java, just specify the number you
>> wish.
>> >>>>> >
>> >>>>> > If your worry is that my change breaches the inlining bytecode
>> limit: Did you check whether the old version was inlineable or not? The new
>> version is 263 bytecode instructions, the old version was 110. The default
>> inlining limit appears to be 35 bytecode instructions on cursory checking
>> (I may be wrong on this, though), so I don't think it was ever inlineable
>> in default configs.
>> >>>>> >
>> >>>>> > On your statement "we haven't seen performance gains" -- the
>> starting point of this thread was a friendly request to please point me to
>> instructions for running a broad range of Lucene indexing benchmarks, so I
>> can gather data for further discussion; from my perspective, we haven't
>> even gathered any data, so obviously we haven't seen any gains.
>> >>>>> >
>> >>>>> > Cheers,
>> >>>>> > Thomas
>> >>>>> >
>> >>>>> > On Tue, Apr 25, 2023 at 4:27 PM Robert Muir <rcmuir@gmail.com>
>> wrote:
>> >>>>> >>
>> >>>>> >> There is literally one string, all-ascii. This won't fail if all
>> the
>> >>>>> >> shifts and masks are wrong.
>> >>>>> >>
>> >>>>> >> About the inlining, i'm not talking about cpu stuff, i'm talking
>> about
>> >>>>> >> java. There are limits to the size of methods that get inlined
>> (e.g.
>> >>>>> >> -XX:MaxInlineSize). If we make this method enormous like this,
>> it may
>> >>>>> >> have performance consequences.
>> >>>>> >>
>> >>>>> >> We still haven't seen any performance gain from this.
>> Elasticsearch
>> >>>>> >> putting huge unique IDs into indexed terms doesnt count.
>> >>>>> >>
>> >>>>> >> On Tue, Apr 25, 2023 at 10:25 AM Thomas Dullien
>> >>>>> >> <thomas.dullien@elastic.co> wrote:
>> >>>>> >> >
>> >>>>> >> > Hey,
>> >>>>> >> >
>> >>>>> >> > so there are unit tests in TestStringHelper.java that test
>> strings of length greater than 8, and my change passes them. Could you
>> explain what you want tested?
>> >>>>> >> >
>> >>>>> >> > Cheers,
>> >>>>> >> > Thomas
>> >>>>> >> >
>> >>>>> >> > On Tue, Apr 25, 2023 at 4:21 PM Robert Muir <rcmuir@gmail.com>
>> wrote:
>> >>>>> >> >>
>> >>>>> >> >> sure, but "if length > 8 return 1" might pass these same
>> tests too,
>> >>>>> >> >> yet cause a ton of hash collisions.
>> >>>>> >> >>
>> >>>>> >> >> I just think if you want to optimize for super-long strings,
>> there
>> >>>>> >> >> should be a unit test.
>> >>>>> >> >>
>> >>>>> >> >> On Tue, Apr 25, 2023 at 10:20 AM Thomas Dullien
>> >>>>> >> >> <thomas.dullien@elastic.co> wrote:
>> >>>>> >> >> >
>> >>>>> >> >> > Hey,
>> >>>>> >> >> >
>> >>>>> >> >> > I am pretty confident about correctness. The change passes
>> both Lucene and ES regression tests and my careful reading of the code is
>> pretty certain that the output is the same. If you want me to randomly test
>> the result for a few hundred million random strings, I'm happy to do that,
>> too, if you have other suggestions for correctness testing, let me know.
>> >>>>> >> >> >
>> >>>>> >> >> > The change does increase the method size and may impact
>> inlining - but so does literally any code change, particularly in a JIT'ed
>> environment where placement of code (and hence things like instruction
>> cache conflicts) depend on the precise history of execution. The way I
>> understand it, one deals with this by benchmarking and measuring.
>> >>>>> >> >> >
>> >>>>> >> >> > FWIW, several indexing-heavy ES benchmarks show a
>> noticeable improvement in indexing speed - this is why I was asking about a
>> broad range of Lucene benchmarks; to verify that this is indeed the case
>> for Lucene-only, too.
>> >>>>> >> >> >
>> >>>>> >> >> > Let me know what data you'd like to see to decide whether
>> this patch is a good idea, and if there is consensus among the Lucene
>> committers that those are reasonable criteria, I'll work on producing that
>> data.
>> >>>>> >> >> >
>> >>>>> >> >> > Cheers,
>> >>>>> >> >> > Thomas
>> >>>>> >> >> >
>> >>>>> >> >> >
>> >>>>> >> >> >
>> >>>>> >> >> > On Tue, Apr 25, 2023 at 4:02 PM Robert Muir <
>> rcmuir@gmail.com> wrote:
>> >>>>> >> >> >>
>> >>>>> >> >> >> well there is some cost, as it must add additional checks
>> to see if
>> >>>>> >> >> >> its longer than 8. in your patch, additional loops. it
>> increases the
>> >>>>> >> >> >> method size and may impact inlining and other things. also
>> we can't
>> >>>>> >> >> >> forget about correctness, if the hash function does the
>> wrong thing it
>> >>>>> >> >> >> could slow everything to a crawl.
>> >>>>> >> >> >>
>> >>>>> >> >> >> On Tue, Apr 25, 2023 at 9:56 AM Thomas Dullien
>> >>>>> >> >> >> <thomas.dullien@elastic.co> wrote:
>> >>>>> >> >> >> >
>> >>>>> >> >> >> > Ah, I see what you mean.
>> >>>>> >> >> >> >
>> >>>>> >> >> >> > You are correct -- the change will not speed up a 5-byte
>> word, but it *will* speed up all 8+-byte words, at no cost to the shorter
>> words.
>> >>>>> >> >> >> >
>> >>>>> >> >> >> > On Tue, Apr 25, 2023 at 3:20 PM Robert Muir <
>> rcmuir@gmail.com> wrote:
>> >>>>> >> >> >> >>
>> >>>>> >> >> >> >> if a word is of length 5, processing 8 bytes at a time
>> isn't going to
>> >>>>> >> >> >> >> speed anything up. there aren't 8 bytes to process.
>> >>>>> >> >> >> >>
>> >>>>> >> >> >> >> On Tue, Apr 25, 2023 at 9:17 AM Thomas Dullien
>> >>>>> >> >> >> >> <thomas.dullien@elastic.co.invalid> wrote:
>> >>>>> >> >> >> >> >
>> >>>>> >> >> >> >> > Is average word length <= 4 realistic though? I mean,
>> even the english wiki corpus has ~5, which would require two calls to the
>> lucene layer instead of one; e.g. multiple layers of virtual dispatch that
>> are unnecessary?
>> >>>>> >> >> >> >> >
>> >>>>> >> >> >> >> > You're not going to pay any cycles for reading 8
>> bytes instead of 4 bytes, so the cost of doing so will be the same - while
>> speeding up in cases where 4 isn't quite enough?
>> >>>>> >> >> >> >> >
>> >>>>> >> >> >> >> > Cheers,
>> >>>>> >> >> >> >> > Thomas
>> >>>>> >> >> >> >> >
>> >>>>> >> >> >> >> > On Tue, Apr 25, 2023 at 3:07 PM Robert Muir <
>> rcmuir@gmail.com> wrote:
>> >>>>> >> >> >> >> >>
>> >>>>> >> >> >> >> >> i think from my perspective it has nothing to do
>> with cpus being
>> >>>>> >> >> >> >> >> 32-bit or 64-bit and more to do with the average
>> length of terms in
>> >>>>> >> >> >> >> >> most languages being smaller than 8. for the
>> languages with longer
>> >>>>> >> >> >> >> >> word length, its usually because of complex
>> morphology that most users
>> >>>>> >> >> >> >> >> would stem away. so doing 4 bytes at a time seems
>> optimal IMO.
>> >>>>> >> >> >> >> >> languages from nature don't care about your cpu.
>> >>>>> >> >> >> >> >>
>> >>>>> >> >> >> >> >> On Tue, Apr 25, 2023 at 8:52 AM Michael McCandless
>> >>>>> >> >> >> >> >> <lucene@mikemccandless.com> wrote:
>> >>>>> >> >> >> >> >> >
>> >>>>> >> >> >> >> >> > For a truly "pure" indexing test I usually use a
>> single thread for indexing, and SerialMergeScheduler (using that single
>> thread to also do single-threaded merging). It makes the indexing take
>> forever lol but it produces "comparable" results.
>> >>>>> >> >> >> >> >> >
>> >>>>> >> >> >> >> >> > But ... this sounds like a great change anyway?
>> Do we really need to gate it on benchmark results? Do we think there could
>> be a downside e.g. slower indexing on (the dwindling) 32 bit CPUs?
>> >>>>> >> >> >> >> >> >
>> >>>>> >> >> >> >> >> > Mike McCandless
>> >>>>> >> >> >> >> >> >
>> >>>>> >> >> >> >> >> > http://blog.mikemccandless.com
>> >>>>> >> >> >> >> >> >
>> >>>>> >> >> >> >> >> >
>> >>>>> >> >> >> >> >> > On Tue, Apr 25, 2023 at 7:39 AM Robert Muir <
>> rcmuir@gmail.com> wrote:
>> >>>>> >> >> >> >> >> >>
>> >>>>> >> >> >> >> >> >> I think the results of the benchmark will depend
>> on the properties of
>> >>>>> >> >> >> >> >> >> the indexed terms. For english wikipedia
>> (luceneutil) the average word
>> >>>>> >> >> >> >> >> >> length is around 5 bytes so this optimization may
>> not do much.
>> >>>>> >> >> >> >> >> >>
>> >>>>> >> >> >> >> >> >> On Tue, Apr 25, 2023 at 1:58 AM Patrick Zhai <
>> zhai7631@gmail.com> wrote:
>> >>>>> >> >> >> >> >> >> >
>> >>>>> >> >> >> >> >> >> > I did a quick run with your patch, but since I
>> turned on the CMS as well as TieredMergePolicy I'm not sure how fair the
>> comparison is. Here's the result:
>> >>>>> >> >> >> >> >> >> > Candidate:
>> >>>>> >> >> >> >> >> >> > Indexer: indexing done (890209 msec); total
>> 33332620 docs
>> >>>>> >> >> >> >> >> >> > Indexer: waitForMerges done (71622 msec)
>> >>>>> >> >> >> >> >> >> > Indexer: finished (961877 msec)
>> >>>>> >> >> >> >> >> >> > Baseline:
>> >>>>> >> >> >> >> >> >> > Indexer: indexing done (909706 msec); total
>> 33332620 docs
>> >>>>> >> >> >> >> >> >> > Indexer: waitForMerges done (54775 msec)
>> >>>>> >> >> >> >> >> >> > Indexer: finished (964528 msec)
>> >>>>> >> >> >> >> >> >> >
>> >>>>> >> >> >> >> >> >> > For more accurate comparison I guess it's
>> better to use LogxxMergePolicy and turn off CMS? If you want to run it
>> yourself you can find the lines I quoted from the log file.
>> >>>>> >> >> >> >> >> >> >
>> >>>>> >> >> >> >> >> >> > Patrick
>> >>>>> >> >> >> >> >> >> >
>> >>>>> >> >> >> >> >> >> > On Mon, Apr 24, 2023 at 12:34 PM Thomas Dullien
>> <thomas.dullien@elastic.co.invalid> wrote:
>> >>>>> >> >> >> >> >> >> >>
>> >>>>> >> >> >> >> >> >> >> Hey all,
>> >>>>> >> >> >> >> >> >> >>
>> >>>>> >> >> >> >> >> >> >> I've been experimenting with fixing some
>> low-hanging performance fruit in the ElasticSearch codebase, and came
>> across the fact that the MurmurHash implementation that is used by
>> ByteRef.hashCode() is reading 4 bytes per loop iteration (which is likely
>> an artifact from 32-bit architectures, which are ever-less-important). I
>> made a small fix to change the implementation to read 8 bytes per loop
>> iteration; I expected a very small impact (2-3% CPU or so over an indexing
>> run in ElasticSearch), but got a pretty nontrivial throughput improvement
>> over a few indexing benchmarks.
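The core observation behind such a change can be sketched without touching the hash math: a 64-bit little-endian read decomposes into the same two 32-bit block values that two 4-byte reads would produce, so the block stream fed to the mixing rounds is unchanged (a hypothetical illustration, not the attached patch):

```python
def blocks_4(data: bytes) -> list[int]:
    # classic layout: one 32-bit little-endian block per loop iteration
    return [int.from_bytes(data[i:i + 4], "little")
            for i in range(0, len(data) & ~3, 4)]

def blocks_8(data: bytes) -> list[int]:
    # patched layout: read 64 bits per iteration, split into two
    # 32-bit lanes; a leftover full 4-byte block is read on its own
    out, n, i = [], len(data) & ~3, 0
    while i + 8 <= n:
        w = int.from_bytes(data[i:i + 8], "little")
        out.append(w & 0xFFFFFFFF)  # low lane  = bytes i..i+3
        out.append(w >> 32)         # high lane = bytes i+4..i+7
        i += 8
    if i < n:
        out.append(int.from_bytes(data[i:i + 4], "little"))
    return out

# identical block streams for every input length => identical hash
for s in (b"", b"abc", b"abcd", b"abcdefg", b"abcdefgh", b"abcdefghi"):
    assert blocks_4(s) == blocks_8(s)
```

Because the block stream is unchanged, the 1-3 byte tail handling and the finalization stay untouched, which is why the output can remain bit-identical to the 4-byte loop.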
>> >>>>> >> >> >> >> >> >> >>
>> >>>>> >> >> >> >> >> >> >> I tried running Lucene-only benchmarks, and
>> succeeded in running the example from
>> https://github.com/mikemccand/luceneutil - but I couldn't figure out how
>> to run indexing benchmarks and how to interpret the results.
>> >>>>> >> >> >> >> >> >> >>
>> >>>>> >> >> >> >> >> >> >> Could someone help me in running the
>> benchmarks for the attached patch?
>> >>>>> >> >> >> >> >> >> >>
>> >>>>> >> >> >> >> >> >> >> Cheers,
>> >>>>> >> >> >> >> >> >> >> Thomas
>> >>>>> >> >> >> >> >> >> >>
>> >>>>> >> >> >> >> >> >> >>
>> >>>>> >> >> >> >> >> >>
>> >>>>> >> >> >> >> >> >>
>> >>>>> >> >> >> >> >> >>
>> >>>>> >> >> >> >> >>
>> >>>>> >> >> >> >> >>
>> >>>>> >> >> >> >> >>
>> >>>
>> >>> <murmur-tests.patch>
>> >>>
>> >>>
>>
>>
>>

--
Marcus Eagan
Re: Patch to change murmurhash implementation slightly [ In reply to ]
Thomas,

Also, is it possible to open this patch as a pull request in GitHub?

I guess it does not matter for a lot of the people here. It would make it
easier for more people to collaborate in that medium given the shift to
GitHub recently.

- Marcus



On Fri, Aug 25, 2023 at 7:03 PM Marcus Eagan <marcuseagan@gmail.com> wrote:

> Hi Thomas,
>
> Thank you for the hard work thus far. I'm excited to see if the community
> can benefit from the work. The best way to use the lucene bench is to run
> the baseline and candidate branches as described here
> <https://github.com/mikemccand/luceneutil#preparing-the-benchmark-candidates>
> .
>
> I can help you with it and even submit an update to the benchmark repo as
> needed if we find that we can improve some of the steps there to make it
> easier for onlookers. Have you already tried setting up lucene_util?
>
> - Marcus
>
>
>
> On Fri, Aug 25, 2023 at 6:34?PM Thomas Dullien
> <thomas.dullien@elastic.co.invalid> wrote:
>
>> Hey all,
>>
>> apologies if the chart is incorrect. Anyhow, I think the more important
>> questions are:
>>
>> 1) Which benchmarks does the Lucene community have that y'all would like
>> to see an improvement on before accepting this (or any other future)
>> performance patches?
>>
>> I'm guessing the reason why the patch improves http_log performance is
>> because that benchmark indexes many IP addresses, which tend to be 9-15
>> bytes in length. That does not strike me as an atypical workload.
>>
>> I've also done some quick experiments to estimate the average UTF-8 word
>> size of languages in non-ASCII scripts (for example Hindi), and it seems to
>> me the average word size is 2-3 times larger than english because most
>> indic characters will encode to 2-3 bytes. The following excerpt from Hindi
>> Wikipedia is 2242 bytes, but just 146 words
>> ==== begin hindi-example.txt ====
>> ?? ??? ??????? ?? ??? ??????? ??????? ????? ?? ??????? ??? ???? ??????
>> ?????? ??? ?????? ?? ????? ???? ???? ?? ?????? ?????-??????? ??????? ?? ???
>> ??? ?????? ????, ????? ???? [[?????? ??????|???????? ?????? ??????]] ??
>> ??????? ????{{sfn|Ludden|2002|p = 68}} ?????? ??? ???? ??
>> ??? ????????? ???? ?? ??? ?????? ???? ?? ?????? ???? ??? ?????? ?????
>> ???? ?? ????????? ?????? ?? ???? ?????? ?? ???-????????? ?? ????? ??
>> ??????????? ?? ???? ?????{{sfn|Asher|Talbot|2008|p =
>> 47}}{{sfn|Metcalf|Metcalf|2006|p = 6}} ?? ??? ??????? ??? [[????? ????????
>> ?|???????]] ?????? ???? ?? ????????? ?????? ?? ???? ?? ????? ??? ??????
>> ?? ??? ?? ???? ???????? [[??????? ?????????|??????? ?????????]] ?? ?????
>> ??????? ????{{sfn|Asher|Talbot|2008|p = 53}} ?? ????? [[???|??? ??????]] ??
>> ?????? ?? ????? ????? ?? ??????? ???? ??? ??????
>> ??? ?? ???? ?? ????? ??? ?? ???? ???? ?? ???? ??? ???? ??? ?? ??????
>> ?????? ???? ?? ???????? ?????{{sfn|Metcalf|Metcalf|2006|p =
>> 12}}{{sfn|Asher|Talbot|2008|p = 53}}
>> ==== end hindi-example.txt ====
>>
>> cat hindi-example.txt | wc -w
>> 146
>> 2242 divided by 146 yields a word length of ~15 bytes, so I'd be
>> surprised if average word length of Hindi wikipedia was below 12 bytes.
>>
>> Do y'all wish for me to create another benchmark for indexing indic
>> scripts and large corpora of IPv4 and IPv6 addresses (both things that seem
>> not very niche), and if the patch shows improvement there, will y'all
>> accept it?
>>
>> 2) It'd be great if there was some public documentation about "what we
>> want to see from a performance improvement before we accept it". I
>> personally find the discussion around this patch to be bewilderingly long,
>> and I am happy to help work on such a guideline with others -- IMO having a
>> clear set of hurdles is preferable to the back-and-forth we've had so far?
>>
>> Cheers,
>> Thomas
>>
>>
>>
>>
>>
>> On Fri, Aug 25, 2023 at 3:38?PM Robert Muir <rcmuir@gmail.com> wrote:
>>
>>> chart is wrong, average word length for english is like 5.
>>>
>>> On Fri, Aug 25, 2023 at 9:35?AM Thomas Dullien
>>> <thomas.dullien@elastic.co.invalid> wrote:
>>> >
>>> > Hey all,
>>> >
>>> > another data point: There's a diagram with the relevant distributions
>>> of word lengths in various languages here:
>>> >
>>> >
>>> https://www.reddit.com/r/languagelearning/comments/h9eao2/average_word_length_of_languages_in_europe_except/
>>> >
>>> > While English is close to the 8-byte limit, average word length in
>>> German is 11+ bytes, and Mongolian and Finnish will likewise be 11+ bytes.
>>> I'll gather some averages over the various Wikipedia indices.
>>> >
>>> > Cheers,
>>> > Thomas
>>> >
>>> > On Thu, Aug 24, 2023 at 2:09?PM Thomas Dullien <
>>> thomas.dullien@elastic.co> wrote:
>>> >>
>>> >> Hey there,
>>> >>
>>> >> reviving this thread. To clarify: In order to show this patch is
>>> worth doing, I should index a bunch of natural-language documents
>>> (whichever language that is) and show that the patch brings a performance
>>> benefit?
>>> >>
>>> >> (Just clarifying, because at least inside ElasticSearch for the logs
>>> use-case, it turns out that it does provide a performance benefit -- but I
>>> want to make sure I understand what the Lucene community wishes to see as
>>> "evidence" this is worth pursuing :-)
>>> >>
>>> >> Cheers,
>>> >> Thomas
>>> >>
>>> >> On Tue, Apr 25, 2023 at 8:14?PM Walter Underwood <
>>> wunder@wunderwood.org> wrote:
>>> >>>
>>> >>> I would recommend some non-English tests. Non-Latin scripts (CJK,
>>> Arabic, Hebrew) will have longer byte strings because of UTF8. German has
>>> large compound words.
>>> >>>
>>> >>> wunder
>>> >>> Walter Underwood
>>> >>> wunder@wunderwood.org
>>> >>> http://observer.wunderwood.org/ (my blog)
>>> >>>
>>> >>> On Apr 25, 2023, at 10:57 AM, Thomas Dullien <
>>> thomas.dullien@elastic.co.INVALID> wrote:
>>> >>>
>>> >>> Hey all,
>>> >>>
>>> >>> ok, attached is a second patch that adds some unit tests; I am happy
>>> to add more.
>>> >>>
>>> >>> This brings me back to my original question: I'd like to run some
>>> pretty thorough benchmarking on Lucene, both for this change and for
>>> possible other future changes, largely focused on indexing performance.
>>> What are good command lines to do so? What are good corpora?
>>> >>>
>>> >>> Cheers,
>>> >>> Thomas
>>> >>>
>>> >>> On Tue, Apr 25, 2023 at 6:04?PM Thomas Dullien <
>>> thomas.dullien@elastic.co> wrote:
>>> >>>>
>>> >>>> Hey,
>>> >>>>
>>> >>>> ok, I've done some digging: Unfortunately, MurmurHash3 does not
>>> publish official test vectors, see the following URLs:
>>> >>>> https://github.com/aappleby/smhasher/issues/6
>>> >>>>
>>> https://github.com/multiformats/go-multihash/issues/135#issuecomment-791178958
>>> >>>> There is a link to a pastebin entry in the first issue, which leads
>>> to https://pastebin.com/kkggV9Vx
>>> >>>>
>>> >>>> Now, the test vectors in that pastebin do not match either the
>>> output of pre-change Lucene's murmur3, nor the output of the Python mmh3
>>> package. That said, the pre-change Lucene and the mmh3 package agree, just
>>> not with the published list.
>>> >>>>
>>> >>>> There *are* test vectors in the source code for the mmh3 python
>>> package, which I could use, or cook up a set of bespoke ones, or both (I
>>> share the concern about 8-byte boundaries and signedness).
>>> >>>>
>>> https://github.com/hajimes/mmh3/blob/3bf1e5aef777d701305c1be7ad0550e093038902/test_mmh3.py#L75
>>> >>>>
>>> >>>> Cheers,
>>> >>>> Thomas
>>> >>>>
>>> >>>> On Tue, Apr 25, 2023 at 5:15?PM Robert Muir <rcmuir@gmail.com>
>>> wrote:
>>> >>>>>
>>> >>>>> i dont think we need a ton of random strings. But if you want to
>>> >>>>> optimize for strings of length 8, at a minimum there should be very
>>> >>>>> simple tests ensuring correctness for some boundary conditions
>>> (e.g.
>>> >>>>> string of length exactly 8). i would also strongly recommend
>>> testing
>>> >>>>> non-ascii since java is a language with signed integer types so it
>>> may
>>> >>>>> be susceptible to bugs where the input bytes have the "sign bit"
>>> set.
>>> >>>>>
>>> >>>>> IMO this could be 2 simple unit tests.
>>> >>>>>
>>> >>>>> usually at least with these kinds of algorithms you can also find
>>> >>>>> published "test vectors" that intend to seek out the corner cases.
>>> if
>>> >>>>> these exist for murmurhash, we should fold them in too.
>>> >>>>>
>>> >>>>> On Tue, Apr 25, 2023 at 11:08?AM Thomas Dullien
>>> >>>>> <thomas.dullien@elastic.co> wrote:
>>> >>>>> >
>>> >>>>> > Hey,
>>> >>>>> >
>>> >>>>> > I offered to run a large number of random-string-hashes to
>>> ensure that the output is the same pre- and post-change. I can add an
>>> arbitrary number of such tests to TestStringHelper.java, just specify the
>>> number you wish.
>>> >>>>> >
>>> >>>>> > If your worry is that my change breaches the inlining bytecode
>>> limit: Did you check whether the old version was inlineable or not? The new
>>> version is 263 bytecode instructions, the old version was 110. The default
>>> inlining limit appears to be 35 bytecode instructions on cursory checking
>>> (I may be wrong on this, though), so I don't think it was ever inlineable
>>> in default configs.
>>> >>>>> >
>>> >>>>> > On your statement "we haven't seen performance gains" -- the
>>> starting point of this thread was a friendly request to please point me to
>>> instructions for running a broad range of Lucene indexing benchmarks, so I
>>> can gather data for further discussion; from my perspective, we haven't
>>> even gathered any data, so obviously we haven't seen any gains.
>>> >>>>> >
>>> >>>>> > Cheers,
>>> >>>>> > Thomas
>>> >>>>> >
>>> >>>>> > On Tue, Apr 25, 2023 at 4:27?PM Robert Muir <rcmuir@gmail.com>
>>> wrote:
>>> >>>>> >>
>>> >>>>> >> There is literally one string, all-ascii. This won't fail if
>>> all the
>>> >>>>> >> shifts and masks are wrong.
>>> >>>>> >>
>>> >>>>> >> About the inlining, i'm not talking about cpu stuff, i'm
>>> talking about
>>> >>>>> >> java. There are limits to the size of methods that get inlined
>>> (e.g.
>>> >>>>> >> -XX:MaxInlineSize). If we make this method enormous like this,
>>> it may
>>> >>>>> >> have performance consequences.
>>> >>>>> >>
>>> >>>>> >> We still haven't seen any performance gain from this.
>>> Elasticsearch
>>> >>>>> >> putting huge unique IDs into indexed terms doesnt count.
>>> >>>>> >>
>>> >>>>> >> On Tue, Apr 25, 2023 at 10:25?AM Thomas Dullien
>>> >>>>> >> <thomas.dullien@elastic.co> wrote:
>>> >>>>> >> >
>>> >>>>> >> > Hey,
>>> >>>>> >> >
>>> >>>>> >> > so there are unit tests in TestStringHelper.java that test
>>> strings of length greater than 8, and my change passes them. Could you
>>> explain what you want tested?
>>> >>>>> >> >
>>> >>>>> >> > Cheers,
>>> >>>>> >> > Thomas
>>> >>>>> >> >
>>> >>>>> >> > On Tue, Apr 25, 2023 at 4:21?PM Robert Muir <rcmuir@gmail.com>
>>> wrote:
>>> >>>>> >> >>
>>> >>>>> >> >> sure, but "if length > 8 return 1" might pass these same
>>> tests too,
>>> >>>>> >> >> yet cause a ton of hash collisions.
>>> >>>>> >> >>
>>> >>>>> >> >> I just think if you want to optimize for super-long strings,
>>> there
>>> >>>>> >> >> should be a unit test.
>>> >>>>> >> >>
>>> >>>>> >> >> On Tue, Apr 25, 2023 at 10:20?AM Thomas Dullien
>>> >>>>> >> >> <thomas.dullien@elastic.co> wrote:
>>> >>>>> >> >> >
>>> >>>>> >> >> > Hey,
>>> >>>>> >> >> >
>>> >>>>> >> >> > I am pretty confident about correctness. The change passes
>>> both Lucene and ES regression tests and my careful reading of the code is
>>> pretty certain that the output is the same. If you want me to randomly test
>>> the result for a few hundred million random strings, I'm happy to do that,
>>> too, if you have other suggestions for correctness testing, let me know.
>>> >>>>> >> >> >
>>> >>>>> >> >> > The change does increase the method size and may impact
>>> inlining - but so does literally any code change, particularly in a JIT'ed
>>> environment where placement of code (and hence things like instruction
>>> cache conflicts) depend on the precise history of execution. The way I
>>> understand it, one deals with this by benchmarking and measuring.
>>> >>>>> >> >> >
>>> >>>>> >> >> > FWIW, several indexing-heavy ES benchmarks show a
>>> noticeable improvement in indexing speed - this is why I was asking about a
>>> broad range of Lucene benchmarks; to verify that this is indeed the case
>>> for Lucene-only, too.
>>> >>>>> >> >> >
>>> >>>>> >> >> > Let me know what data you'd like to see to decide whether
>>> this patch is a good idea, and if there is consensus among the Lucene
>>> committers that those are reasonable criteria, I'll work on producing that
>>> data.
>>> >>>>> >> >> >
>>> >>>>> >> >> > Cheers,
>>> >>>>> >> >> > Thomas
>>> >>>>> >> >> >
>>> >>>>> >> >> >
>>> >>>>> >> >> >
>>> >>>>> >> >> > On Tue, Apr 25, 2023 at 4:02?PM Robert Muir <
>>> rcmuir@gmail.com> wrote:
>>> >>>>> >> >> >>
>>> >>>>> >> >> >> well there is some cost, as it must add additional checks
>>> to see if
>>> >>>>> >> >> >> its longer than 8. in your patch, additional loops. it
>>> increases the
>>> >>>>> >> >> >> method size and may impact inlining and other things.
>>> also we can't
>>> >>>>> >> >> >> forget about correctness, if the hash function does the
>>> wrong thing it
>>> >>>>> >> >> >> could slow everything to a crawl.
>>> >>>>> >> >> >>
>>> >>>>> >> >> >> On Tue, Apr 25, 2023 at 9:56?AM Thomas Dullien
>>> >>>>> >> >> >> <thomas.dullien@elastic.co> wrote:
>>> >>>>> >> >> >> >
>>> >>>>> >> >> >> > Ah, I see what you mean.
>>> >>>>> >> >> >> >
>>> >>>>> >> >> >> > You are correct -- the change will not speed up a
>>> 5-byte word, but it *will* speed up all 8+-byte words, at no cost to the
>>> shorter words.
>>> >>>>> >> >> >> >
>>> >>>>> >> >> >> > On Tue, Apr 25, 2023 at 3:20 PM Robert Muir <
>>> rcmuir@gmail.com> wrote:
>>> >>>>> >> >> >> >>
>>> >>>>> >> >> >> >> if a word is of length 5, processing 8 bytes at a time
>>> isn't going to
>>> >>>>> >> >> >> >> speed anything up. there aren't 8 bytes to process.
>>> >>>>> >> >> >> >>
>>> >>>>> >> >> >> >> On Tue, Apr 25, 2023 at 9:17 AM Thomas Dullien
>>> >>>>> >> >> >> >> <thomas.dullien@elastic.co.invalid> wrote:
>>> >>>>> >> >> >> >> >
>>> >>>>> >> >> >> >> > Is average word length <= 4 realistic though? I
>>> mean, even the english wiki corpus has ~5, which would require two calls to
>>> the lucene layer instead of one; e.g. multiple layers of virtual dispatch
>>> that are unnecessary?
>>> >>>>> >> >> >> >> >
>>> >>>>> >> >> >> >> > You're not going to pay any cycles for reading 8
>>> bytes instead of 4 bytes, so the cost of doing so will be the same - while
>>> speeding up in cases where 4 isn't quite enough?
>>> >>>>> >> >> >> >> >
>>> >>>>> >> >> >> >> > Cheers,
>>> >>>>> >> >> >> >> > Thomas
>>> >>>>> >> >> >> >> >
>>> >>>>> >> >> >> >> > On Tue, Apr 25, 2023 at 3:07 PM Robert Muir <
>>> rcmuir@gmail.com> wrote:
>>> >>>>> >> >> >> >> >>
>>> >>>>> >> >> >> >> >> i think from my perspective it has nothing to do
>>> with cpus being
>>> >>>>> >> >> >> >> >> 32-bit or 64-bit and more to do with the average
>>> length of terms in
>>> >>>>> >> >> >> >> >> most languages being smaller than 8. for the
>>> languages with longer
>>> >>>>> >> >> >> >> >> word length, its usually because of complex
>>> morphology that most users
>>> >>>>> >> >> >> >> >> would stem away. so doing 4 bytes at a time seems
>>> optimal IMO.
>>> >>>>> >> >> >> >> >> languages from nature don't care about your cpu.
>>> >>>>> >> >> >> >> >>
>>> >>>>> >> >> >> >> >> On Tue, Apr 25, 2023 at 8:52 AM Michael McCandless
>>> >>>>> >> >> >> >> >> <lucene@mikemccandless.com> wrote:
>>> >>>>> >> >> >> >> >> >
>>> >>>>> >> >> >> >> >> > For a truly "pure" indexing test I usually use a
>>> single thread for indexing, and SerialMergeScheduler (using that single
>>> thread to also do single-threaded merging). It makes the indexing take
>>> forever lol but it produces "comparable" results.
>>> >>>>> >> >> >> >> >> >
>>> >>>>> >> >> >> >> >> > But ... this sounds like a great change anyway?
>>> Do we really need to gate it on benchmark results? Do we think there could
>>> be a downside e.g. slower indexing on (the dwindling) 32 bit CPUs?
>>> >>>>> >> >> >> >> >> >
>>> >>>>> >> >> >> >> >> > Mike McCandless
>>> >>>>> >> >> >> >> >> >
>>> >>>>> >> >> >> >> >> > http://blog.mikemccandless.com
>>> >>>>> >> >> >> >> >> >
>>> >>>>> >> >> >> >> >> >
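[Editor's note] The setup suggested above (a single indexing thread with SerialMergeScheduler, and a LogMergePolicy instead of the default TieredMergePolicy with ConcurrentMergeScheduler) can be sketched against the Lucene API roughly as follows. The class and method names are real Lucene API, but the path, analyzer, and wrapper class here are illustrative placeholders; luceneutil wires its own IndexWriterConfig:

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LogByteSizeMergePolicy;
import org.apache.lucene.index.SerialMergeScheduler;
import org.apache.lucene.store.FSDirectory;

public class BenchmarkWriterSetup {
    public static IndexWriter openWriter(String indexPath) throws Exception {
        IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
        // Run merges on the indexing thread instead of ConcurrentMergeScheduler,
        // so merge cost is charged to the run deterministically.
        iwc.setMergeScheduler(new SerialMergeScheduler());
        // A log merge policy makes merge decisions depend on segment order
        // rather than sizes, which makes back-to-back runs more comparable.
        iwc.setMergePolicy(new LogByteSizeMergePolicy());
        return new IndexWriter(FSDirectory.open(Paths.get(indexPath)), iwc);
    }
}
```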
>>> >>>>> >> >> >> >> >> > On Tue, Apr 25, 2023 at 7:39 AM Robert Muir <
>>> rcmuir@gmail.com> wrote:
>>> >>>>> >> >> >> >> >> >>
>>> >>>>> >> >> >> >> >> >> I think the results of the benchmark will depend
>>> on the properties of
>>> >>>>> >> >> >> >> >> >> the indexed terms. For english wikipedia
>>> (luceneutil) the average word
>>> >>>>> >> >> >> >> >> >> length is around 5 bytes so this optimization
>>> may not do much.
>>> >>>>> >> >> >> >> >> >>
>>> >>>>> >> >> >> >> >> >> On Tue, Apr 25, 2023 at 1:58 AM Patrick Zhai <
>>> zhai7631@gmail.com> wrote:
>>> >>>>> >> >> >> >> >> >> >
>>> >>>>> >> >> >> >> >> >> > I did a quick run with your patch, but since I
>>> turned on the CMS as well as TieredMergePolicy I'm not sure how fair the
>>> comparison is. Here's the result:
>>> >>>>> >> >> >> >> >> >> > Candidate:
>>> >>>>> >> >> >> >> >> >> > Indexer: indexing done (890209 msec); total
>>> 33332620 docs
>>> >>>>> >> >> >> >> >> >> > Indexer: waitForMerges done (71622 msec)
>>> >>>>> >> >> >> >> >> >> > Indexer: finished (961877 msec)
>>> >>>>> >> >> >> >> >> >> > Baseline:
>>> >>>>> >> >> >> >> >> >> > Indexer: indexing done (909706 msec); total
>>> 33332620 docs
>>> >>>>> >> >> >> >> >> >> > Indexer: waitForMerges done (54775 msec)
>>> >>>>> >> >> >> >> >> >> > Indexer: finished (964528 msec)
>>> >>>>> >> >> >> >> >> >> >
>>> >>>>> >> >> >> >> >> >> > For more accurate comparison I guess it's
>>> better to use LogxxMergePolicy and turn off CMS? If you want to run it
>>> yourself you can find the lines I quoted from the log file.
>>> >>>>> >> >> >> >> >> >> >
>>> >>>>> >> >> >> >> >> >> > Patrick
>>> >>>>> >> >> >> >> >> >> >
>>> >>>>> >> >> >> >> >> >> > On Mon, Apr 24, 2023 at 12:34 PM Thomas
>>> Dullien <thomas.dullien@elastic.co.invalid> wrote:
>>> >>>>> >> >> >> >> >> >> >>
>>> >>>>> >> >> >> >> >> >> >> Hey all,
>>> >>>>> >> >> >> >> >> >> >>
>>> >>>>> >> >> >> >> >> >> >> I've been experimenting with fixing some
>>> low-hanging performance fruit in the ElasticSearch codebase, and came
>>> across the fact that the MurmurHash implementation that is used by
>>> BytesRef.hashCode() is reading 4 bytes per loop iteration (which is likely
>>> an artifact from 32-bit architectures, which are ever-less-important). I
>>> made a small fix to change the implementation to read 8 bytes per loop
>>> iteration; I expected a very small impact (2-3% CPU or so over an indexing
>>> run in ElasticSearch), but got a pretty nontrivial throughput improvement
>>> over a few indexing benchmarks.
>>> >>>>> >> >> >> >> >> >> >>
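[Editor's note] For concreteness, here is a hedged sketch of the idea being discussed, not the attached patch itself: a MurmurHash3 x86_32-style loop (standard public constants) where the wide variant reads one 8-byte long per iteration and feeds it through the same 4-byte mix twice, so it produces bit-identical output to the 4-byte loop while halving the number of reads for longer terms:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class Murmur8ByteSketch {
    // Standard MurmurHash3 x86_32 mix for one 4-byte block.
    private static int mixK1(int k1) {
        k1 *= 0xcc9e2d51;
        k1 = Integer.rotateLeft(k1, 15);
        return k1 * 0x1b873593;
    }

    private static int mixH1(int h1, int k1) {
        h1 ^= k1;
        h1 = Integer.rotateLeft(h1, 13);
        return h1 * 5 + 0xe6546b64;
    }

    private static int fmix(int h1, int len) {
        h1 ^= len;
        h1 ^= h1 >>> 16;
        h1 *= 0x85ebca6b;
        h1 ^= h1 >>> 13;
        h1 *= 0xc2b2ae35;
        return h1 ^ (h1 >>> 16);
    }

    // Tail: 1-3 leftover bytes, canonical fall-through handling.
    private static int tail(int h1, byte[] data, int start) {
        int k1 = 0;
        switch (data.length - start) {
            case 3: k1 ^= (data[start + 2] & 0xff) << 16;
            case 2: k1 ^= (data[start + 1] & 0xff) << 8;
            case 1: k1 ^= data[start] & 0xff;
                    h1 ^= mixK1(k1);
        }
        return h1;
    }

    // Baseline: 4 bytes per loop iteration.
    static int hash4(byte[] data, int seed) {
        ByteBuffer bb = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        int h1 = seed;
        int nblocks = data.length >>> 2;
        for (int i = 0; i < nblocks; i++) {
            h1 = mixH1(h1, mixK1(bb.getInt(i << 2)));
        }
        return fmix(tail(h1, data, nblocks << 2), data.length);
    }

    // Variant: one 8-byte read per iteration, split into two 4-byte lanes.
    static int hash8(byte[] data, int seed) {
        ByteBuffer bb = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        int h1 = seed;
        int i = 0;
        for (; i + 8 <= data.length; i += 8) {
            long k = bb.getLong(i);                  // one 8-byte read
            h1 = mixH1(h1, mixK1((int) k));          // low 4 bytes
            h1 = mixH1(h1, mixK1((int) (k >>> 32))); // high 4 bytes
        }
        if (i + 4 <= data.length) {                  // at most one 4-byte block left
            h1 = mixH1(h1, mixK1(bb.getInt(i)));
            i += 4;
        }
        return fmix(tail(h1, data, i), data.length);
    }

    public static void main(String[] args) {
        java.util.Random r = new java.util.Random(42);
        for (int len = 0; len < 64; len++) {
            byte[] b = new byte[len];
            r.nextBytes(b);
            if (hash4(b, 0) != hash8(b, 0)) throw new AssertionError("mismatch at len " + len);
        }
        System.out.println("ok");
    }
}
```

The two variants are equal by construction: an 8-byte little-endian read split into its low and high ints yields the same two lanes as two consecutive 4-byte reads, which is also why exhaustive random testing (as offered above) would be expected to pass.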
>>> >>>>> >> >> >> >> >> >> >> I tried running Lucene-only benchmarks, and
>>> succeeded in running the example from
>>> https://github.com/mikemccand/luceneutil - but I couldn't figure out
>>> how to run indexing benchmarks and how to interpret the results.
>>> >>>>> >> >> >> >> >> >> >>
>>> >>>>> >> >> >> >> >> >> >> Could someone help me in running the
>>> benchmarks for the attached patch?
>>> >>>>> >> >> >> >> >> >> >>
>>> >>>>> >> >> >> >> >> >> >> Cheers,
>>> >>>>> >> >> >> >> >> >> >> Thomas
>>> >>>>> >> >> >> >> >> >> >>
>>> >>>>> >> >> >> >> >> >> >>
>>> ---------------------------------------------------------------------
>>> >>>>> >> >> >> >> >> >> >> To unsubscribe, e-mail:
>>> dev-unsubscribe@lucene.apache.org
>>> >>>>> >> >> >> >> >> >> >> For additional commands, e-mail:
>>> dev-help@lucene.apache.org
>>> >>>>> >> >> >> >> >> >>
>>> >>>>> >> >> >> >> >> >>
>>> >>>>> >> >> >> >> >> >>
>>> >>>>> >> >> >> >> >>
>>> >>>>> >> >> >> >> >>
>>> >>>>> >> >> >> >> >>
>>> >>>
>>> >>> <murmur-tests.patch>
>>> >>>
>>> >>>
>>>
>>>
>>>

--
Marcus Eagan
