Re: [jira] [Commented] (LUCENE-10062) Explore using SORTED_NUMERIC doc values to encode taxonomy ordinals for faceting
Hard to read on the phone, but is that a 482% speed up I saw??!

On Thu, Sep 23, 2021, 1:28 PM Greg Miller (Jira) <jira@apache.org> wrote:

>
> [
> https://issues.apache.org/jira/browse/LUCENE-10062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17419349#comment-17419349
> ]
>
> Greg Miller commented on LUCENE-10062:
> --------------------------------------
>
> I re-ran {{luceneutil}} benchmarks {{wikimedium10m}} since [~mikemccand]
> added new faceting tasks (thanks Mike!). Looks like there's a nice
> improvement on these new faceting tasks as well with this change (and no
> regressions anywhere else that I see).
>
> I was waiting to iterate on my PR until I was able to run these new
> benchmarking tasks, but it seems like there's enough benefit to this change
> to pick it back up.
>
>
> {noformat}
> Task                      QPS baseline   StdDev QPS candidate   StdDev Pct diff                p-value
> HighTermDayOfYearSort            70.02  (13.7%)         68.45   (9.7%)    -2.2% ( -22% - 24%)  0.551
> MedTerm                        1300.90   (5.5%)       1275.97   (6.7%)    -1.9% ( -13% - 10%)  0.324
> HighTerm                       1953.46   (5.8%)       1925.79   (7.9%)    -1.4% ( -14% - 13%)  0.518
> HighTermTitleBDVSort            122.35  (15.6%)        120.86  (14.9%)    -1.2% ( -27% - 34%)  0.801
> TermDTSort                      133.47   (8.7%)        131.86   (7.4%)    -1.2% ( -15% - 16%)  0.637
> LowTerm                        1636.13   (5.5%)       1622.34   (7.4%)    -0.8% ( -12% - 12%)  0.682
> Prefix3                          25.69   (6.0%)         25.48   (6.3%)    -0.8% ( -12% - 12%)  0.676
> LowSpanNear                     118.02   (2.1%)        117.31   (1.8%)    -0.6% ( -4% - 3%)    0.326
> HighTermMonthSort               140.17   (9.8%)        139.47   (9.9%)    -0.5% ( -18% - 21%)  0.872
> AndHighHigh                      49.17   (3.1%)         48.92   (2.7%)    -0.5% ( -6% - 5%)    0.584
> HighSpanNear                     25.54   (2.7%)         25.41   (2.2%)    -0.5% ( -5% - 4%)    0.529
> AndHighLow                      556.68   (5.8%)        554.80   (5.4%)    -0.3% ( -10% - 11%)  0.848
> BrowseDayOfYearSSDVFacets        16.53   (2.5%)         16.47   (2.4%)    -0.3% ( -5% - 4%)    0.674
> IntNRQ                           87.76   (2.0%)         87.49   (2.1%)    -0.3% ( -4% - 3%)    0.634
> MedSpanNear                      31.11   (2.2%)         31.04   (1.6%)    -0.2% ( -3% - 3%)    0.714
> OrNotHighLow                    765.10   (4.5%)        763.60   (5.4%)    -0.2% ( -9% - 10%)   0.901
> MedPhrase                       160.05   (3.1%)        159.83   (2.9%)    -0.1% ( -5% - 6%)    0.885
> HighSloppyPhrase                 27.67   (3.1%)         27.64   (3.0%)    -0.1% ( -6% - 6%)    0.915
> LowPhrase                        61.12   (3.2%)         61.05   (3.2%)    -0.1% ( -6% - 6%)    0.921
> OrHighMed                        71.85   (2.9%)         71.82   (2.1%)    -0.0% ( -4% - 5%)    0.963
> HighPhrase                       29.40   (2.3%)         29.39   (2.8%)    -0.0% ( -5% - 5%)    0.971
> Fuzzy2                           32.58   (4.3%)         32.57   (6.1%)    -0.0% ( -9% - 10%)   0.992
> LowIntervalsOrdered             150.30   (1.9%)        150.28   (1.9%)    -0.0% ( -3% - 3%)    0.986
> AndHighMed                      151.32   (3.9%)        151.31   (4.1%)    -0.0% ( -7% - 8%)    0.993
> OrHighHigh                       23.90   (2.3%)         23.91   (1.9%)     0.0% ( -4% - 4%)    0.970
> OrHighNotLow                    579.17   (5.1%)        579.35   (6.4%)     0.0% ( -10% - 12%)  0.986
> MedIntervalsOrdered              86.93   (1.7%)         86.98   (1.9%)     0.1% ( -3% - 3%)    0.913
> OrHighNotHigh                   536.17   (5.6%)        536.57   (6.6%)     0.1% ( -11% - 12%)  0.969
> OrNotHighHigh                   787.07   (6.5%)        787.96   (8.1%)     0.1% ( -13% - 15%)  0.961
> OrNotHighMed                    687.97   (4.7%)        688.77   (6.9%)     0.1% ( -10% - 12%)  0.950
> MedSloppyPhrase                  68.62   (2.8%)         68.74   (2.7%)     0.2% ( -5% - 5%)    0.838
> LowSloppyPhrase                 130.37   (2.6%)        130.62   (2.2%)     0.2% ( -4% - 5%)    0.797
> OrHighLow                       440.44   (4.1%)        441.33   (4.1%)     0.2% ( -7% - 8%)    0.877
> Wildcard                        122.01   (5.2%)        122.35   (5.3%)     0.3% ( -9% - 11%)   0.867
> HighIntervalsOrdered             14.24   (2.2%)         14.34   (2.1%)     0.6% ( -3% - 5%)    0.350
> Respell                          52.04   (2.2%)         52.48   (2.0%)     0.8% ( -3% - 5%)    0.209
> OrHighNotMed                    674.76   (4.8%)        680.97   (8.0%)     0.9% ( -11% - 14%)  0.659
> PKLookup                        153.45   (4.3%)        155.13   (3.8%)     1.1% ( -6% - 9%)    0.394
> Fuzzy1                           56.57   (9.1%)         57.76   (6.7%)     2.1% ( -12% - 19%)  0.406
> BrowseMonthSSDVFacets            19.59  (10.4%)         20.03   (6.7%)     2.3% ( -13% - 21%)  0.413
> AndHighHighDayTaxoFacets         19.22   (1.6%)         22.13   (2.2%)    15.1% ( 11% - 19%)   0.000
> AndHighMedDayTaxoFacets          25.62   (1.5%)         29.93   (2.2%)    16.8% ( 12% - 20%)   0.000
> MedTermDayTaxoFacets             12.96   (2.2%)         18.99   (3.4%)    46.5% ( 39% - 53%)   0.000
> OrHighMedDayTaxoFacets            3.97   (2.0%)          5.81   (4.3%)    46.5% ( 39% - 53%)   0.000
> BrowseMonthTaxoFacets             2.59  (10.9%)         11.16  (35.8%)   330.4% ( 255% - 423%) 0.000
> BrowseDateTaxoFacets              2.44   (9.7%)         13.12  (51.8%)   438.1% ( 343% - 553%) 0.000
> BrowseDayOfYearTaxoFacets         2.44   (9.7%)         13.13  (51.7%)   438.2% ( 343% - 552%) 0.000
> {noformat}
>
>
> > Explore using SORTED_NUMERIC doc values to encode taxonomy ordinals for faceting
> > --------------------------------------------------------------------------------
> >
> > Key: LUCENE-10062
> > URL: https://issues.apache.org/jira/browse/LUCENE-10062
> > Project: Lucene - Core
> > Issue Type: Improvement
> > Components: modules/facet
> > Reporter: Greg Miller
> > Assignee: Greg Miller
> > Priority: Minor
> > Time Spent: 1h 40m
> > Remaining Estimate: 0h
> >
> > We currently encode taxonomy ordinals using varint style packing in a
> > binary doc values field. I suspect there have been a number of improvements
> > to SortedNumericDocValues since taxonomy faceting was first introduced, and
> > I plan to explore replacing the custom binary format we have today with a
> > SORTED_NUMERIC type dv field instead.
> > I'll report benchmark results and index size impact here.
>
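For readers unfamiliar with the "varint style packing" mentioned in the issue description, here is a minimal Python sketch of the general idea: sort a document's ordinals, store the gaps between them, and write each gap in 7-bit groups. The function names `encode_ordinals` and `decode_ordinals` are hypothetical, and the byte layout is a generic little-endian varint chosen for illustration; it is not claimed to match Lucene's actual on-disk format.

```python
def encode_ordinals(ordinals):
    """Delta-encode sorted ordinals, writing each delta as a varint."""
    out = bytearray()
    prev = 0
    for ordinal in sorted(ordinals):
        delta = ordinal - prev
        prev = ordinal
        # Emit the low 7 bits first; a set high bit means more bytes follow.
        while delta >= 0x80:
            out.append((delta & 0x7F) | 0x80)
            delta >>= 7
        out.append(delta)
    return bytes(out)


def decode_ordinals(data):
    """Reverse encode_ordinals: rebuild each delta, then add it to a running sum."""
    ordinals = []
    prev = 0
    value = 0
    shift = 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7  # continuation byte: keep accumulating this delta
        else:
            prev += value  # last byte of the delta: recover the ordinal
            ordinals.append(prev)
            value = 0
            shift = 0
    return ordinals
```

In this sketch the decoder has to walk the bytes of each document's value sequentially to recover its ordinals. The issue proposes storing the ordinals in a SORTED_NUMERIC doc values field instead of a custom binary format like this one.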