Hi folks-

I've run into a bit of an interesting situation when attempting to
enforce a query evaluation time budget in our Lucene application
(Amazon Product Search), and I'm curious if it's something others have
run into or have thoughts on. There's a reasonable chance this
use-case is fairly specific to our application, but if others have
seen similar use-cases, then maybe there's a general solution worth
pursuing here in Lucene itself?

We'd like to enforce a strict time budget for query evaluation, even
at the cost of potentially missing some matches that have yet to be
seen. One tempting solution is to enforce this in our leaf collectors
each time they see a hit by throwing a CollectionTerminatedException,
which is handled nicely by IndexSearcher (we use concurrent search so
this more-or-less enforces an overall time budget). Our queries follow
a two-phase matching approach, and we've run into some interesting
edge-cases where the "approximation" phase may produce a very large
set of match candidates but the "confirmation" phase only confirms
matches on a very small fraction of them. In extreme cases, the entire
index could match in the "approximation" phase and none of the hits
could be "confirmed" in the second phase check.

This creates an interesting issue where the query may evaluate for a
long time before the leaf collectors see hits (or they may never see
hits). This boils down to the BulkScorer looping over all
"approximate" candidates and attempting to "confirm" each one before
the leaf collector "sees" anything (the same thing can happen when
there are many first-phase matches and many of those documents have
been deleted). In these cases, we can run significantly over our time
budget.
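
To make the failure mode concrete, here is a minimal, Lucene-free sketch of the loop described above. Everything in it is a hypothetical stand-in: `Terminated` plays the role of CollectionTerminatedException, `scoreLeaf` plays the role of the BulkScorer loop, and the "clock" is just a tick counter so the behavior is deterministic. It is a model of the problem, not how Lucene is actually implemented.

```java
import java.util.function.IntPredicate;

public class CollectorBudgetSketch {
    /** Hypothetical stand-in for Lucene's CollectionTerminatedException. */
    static final class Terminated extends RuntimeException {}

    /** Fake clock: one tick per confirmation attempt (assumption, keeps the sketch deterministic). */
    static long ticks = 0;

    /** Models one leaf and returns how many candidates were examined before the loop ended. */
    static int scoreLeaf(int maxDoc, IntPredicate confirm, long budgetTicks) {
        ticks = 0;
        int examined = 0;
        try {
            for (int doc = 0; doc < maxDoc; doc++) { // "approximation": every doc is a candidate
                examined++;
                ticks++;                              // each confirmation attempt costs time
                if (!confirm.test(doc)) {
                    continue;                         // rejected: the collector never sees it
                }
                collect(doc, budgetTicks);            // only confirmed hits reach the collector
            }
        } catch (Terminated e) {
            // Treated as "stop collecting this leaf", mirroring how the searcher handles it.
        }
        return examined;
    }

    /** Collector-side budget check: it only runs when a hit is actually delivered. */
    static void collect(int doc, long budgetTicks) {
        if (ticks > budgetTicks) {
            throw new Terminated();
        }
    }

    public static void main(String[] args) {
        // Every candidate is confirmed: the check fires soon after the budget is spent.
        System.out.println(scoreLeaf(1_000, doc -> true, 10));
        // No candidate is confirmed: the check never runs.
        System.out.println(scoreLeaf(1_000, doc -> false, 10));
    }
}
```

With a budget of 10 ticks, the first call stops after 11 candidates, while the second examines all 1,000 because the collector is never invoked.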

One solution I've come up with is to create a top-level Query
implementation that enforces the time budget each time it produces
"approximate" matches. This more-or-less works for our use-case, but
has some "rough edges" as a general solution. What I've observed is
that Lucene really only supports collectors / leaf collectors throwing
CollectionTerminatedException and doesn't necessarily support Query
implementations doing this. One of the most glaring issues is that the
LRU query caching (if enabled) doesn't handle the exception, so if a
Query were to throw when pre-populating the cache bitsets, it would
terminate the entire search (in a pretty ungraceful way).
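
As a rough illustration of that approach (again Lucene-free, with hypothetical names throughout), the budget check can be moved into the candidate iterator itself, so it runs on every "approximate" match whether or not anything is confirmed. Here `withBudget` stands in for wrapping the approximation, `Terminated` for CollectionTerminatedException, and the clock is a simple counter. It says nothing about the query cache problem, which would still need separate handling.

```java
import java.util.PrimitiveIterator;
import java.util.function.LongSupplier;
import java.util.stream.IntStream;

public class ApproximationBudgetSketch {
    /** Hypothetical stand-in for Lucene's CollectionTerminatedException. */
    static final class Terminated extends RuntimeException {}

    /** Wraps a candidate iterator so the deadline is checked on every advance. */
    static PrimitiveIterator.OfInt withBudget(
            PrimitiveIterator.OfInt in, LongSupplier clock, long deadline) {
        return new PrimitiveIterator.OfInt() {
            @Override
            public boolean hasNext() {
                return in.hasNext();
            }

            @Override
            public int nextInt() {
                if (clock.getAsLong() > deadline) {
                    throw new Terminated(); // budget spent: stop producing candidates
                }
                return in.nextInt();
            }
        };
    }

    /** Worst case from above: every doc is a candidate and none is ever confirmed. */
    static int countExamined(int maxDoc, long deadline) {
        long[] now = {0}; // fake clock, advanced once per confirmation attempt
        PrimitiveIterator.OfInt candidates =
                withBudget(IntStream.range(0, maxDoc).iterator(), () -> now[0], deadline);
        int examined = 0;
        try {
            while (candidates.hasNext()) {
                candidates.nextInt();
                examined++;
                now[0]++; // confirmation rejects the candidate but still costs a tick
            }
        } catch (Terminated e) {
            // Treated as "stop this leaf"; no collector was ever involved.
        }
        return examined;
    }

    public static void main(String[] args) {
        System.out.println(countExamined(1_000, 10));
    }
}
```

In this model the loop stops after 11 candidates even though no hit is ever confirmed, which is the behavior the collector-only check cannot provide.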

I'm also aware of ExitableDirectoryReader but it's trickier to manage
for our use case since we read from the index outside of the main
query evaluation phase for other purposes. I'm sure there's a solution
where we maintain multiple Readers, etc.

So... I'm interested if anyone else has run into a similar use-case.
Does anyone have thoughts on alternative solutions? Is there any
appetite to augment Lucene to allow for queries to signal early
termination by throwing CollectionTerminatedException? I suspect
ExitableDirectoryReader probably provides a good enough solution for
others in this situation, but I wanted to raise the topic and see what
other folks here think.

Cheers,
-Greg
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org