Hey folks-
I've been experimenting with a different approach to implementing skip
lists in Lucene, and would be curious if anyone else has tried
something similar, or has early thoughts/feedback on what I'm doing.
The idea is generally a simplification of what Lucene currently does,
targeted at improving QPS at search-time. Instead of treating skip
lists as forward-iterating structures, I'm indexing a single, flat
structure that I binary search when advancing. I'm indexing the same
data present in the lowest level of the current structures (i.e., skip
pointers to each block of 128 docs), and then binary searching those
pointers to find the candidate block for the target docID (instead of
only seeking forward).
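To make that concrete, here's roughly the shape of the advance path I
have in mind. This is just an illustrative sketch, not code from a
patch, and all the names (FlatSkipReader, lastDocs, blockFPs) are made
up:

    final class FlatSkipReader {
      // One entry per 128-doc block: lastDocs[i] is the last docID in
      // block i (ascending), blockFPs[i] is the file pointer to that
      // block's postings.
      private final int[] lastDocs;
      private final long[] blockFPs;
      private int curBlock; // blocks before this are already exhausted

      FlatSkipReader(int[] lastDocs, long[] blockFPs) {
        this.lastDocs = lastDocs;
        this.blockFPs = blockFPs;
      }

      // Binary search for the first block whose lastDoc >= target,
      // looking only at blocks at or after the cursor since advance()
      // never moves backwards. (If target exceeds the final block's
      // lastDoc, the caller would detect exhaustion.)
      long advance(int target) {
        int lo = curBlock, hi = lastDocs.length - 1;
        while (lo < hi) {
          int mid = (lo + hi) >>> 1;
          if (lastDocs[mid] < target) {
            lo = mid + 1;
          } else {
            hi = mid;
          }
        }
        curBlock = lo;
        return blockFPs[lo];
      }
    }

The point is that advance() becomes a single binary search over a flat
array, rather than a walk down and forward through multiple skip
levels.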
Before describing this idea in more detail (or opening a Jira), I'd be
curious how this community thinks about disk access operations and
what platforms/use-cases we primarily optimize for these days. This
approach I've been experimenting with is essentially a non-starter if
we assume that index accesses generally go all the way to disk,
especially if those disks spin. Random seeks all over the place
are probably a terrible idea if those are actually hitting disk. In
the cases I've been experimenting with, indexes are generally hot, so
random seeks aren't much of a concern. Do we tend to optimize with
this assumption in mind, or are we still pretty careful about
use-cases that are actually doing a lot of disk IO?
There are some other tricky implications with the approach I'm
experimenting with (some edge cases around Impacts, and index size
growth from not being able to delta-encode as aggressively), but those
aren't worth going into until addressing the main question of whether
or not random seeks even make sense.
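(To give a flavor of the delta-encoding point without going too deep:
each binary-search probe has to decode entry i without first reading
entries 0..i-1, which more or less forces fixed-width, absolute values
instead of delta-encoded ones. A hypothetical fixed-width layout,
purely for illustration:

    import java.nio.ByteBuffer;

    class SkipLayoutSketch {
      // Illustrative layout: 4 bytes lastDoc + 8 bytes file pointer
      // per entry, so a probe at any block index is O(1).
      static final int ENTRY_BYTES = Integer.BYTES + Long.BYTES; // 12

      static int lastDocAt(ByteBuffer skipData, int block) {
        return skipData.getInt(block * ENTRY_BYTES);
      }

      static long filePointerAt(ByteBuffer skipData, int block) {
        return skipData.getLong(block * ENTRY_BYTES + Integer.BYTES);
      }
    }

With vInt-style deltas you'd have to decode sequentially from entry 0,
which defeats the binary search; hence the size growth.)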
I've done some early benchmarking with luceneutil (wikimedium10m
specifically) and the idea looks like it might have some promise. I
don't really see any regressions in QPS*, while a few tasks seem to
show significant QPS improvements (AndHighLow: 9.8%, OrNotHighLow:
11.1%, OrHighLow: 12.3%).
* (note: I've disabled Impacts in both baseline and candidate for
these early experiments because of some challenges related to them...
so these results come with a major asterisk right now and more work
would certainly need to be done)
Thanks in advance for any feedback! I'm completely open to shooting
down this idea early if there are some fundamental flaws, or
alternatively opening up a Jira to investigate this further if folks
think it's reasonable to explore more :)
Cheers,
-Greg