Mailing List Archive

Slow disjunction query on a large dataset
Hello everyone,
I am running a simple boolean query with three `should` Term query clauses
on the same field, against a dataset of ~500M docs. The query takes around
20 seconds.

Explain shows that each individual Term query is relatively fast (up to 250
ms), but the Boolean query spends 17 seconds in the match phase and 1.5
seconds in scoring. It also reports more than 5M advance/twoPhaseIterator
invocations. There are no obvious resource limitations, except that the
index does not fit in memory (many other fields are indexed as well). A
faster disk did not provide a meaningful improvement. TopScoreDocCollector
is initialized with numHits=100 and totalHitsThreshold=1000.

Wrapping the query in a ConstantScoreQuery makes it very fast. I am also
not sure why the Boolean query requires a twoPhaseIterator.
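
For reference, here is a minimal sketch of the setup described above. It
assumes the Lucene 8.x/9.x API (TopScoreDocCollector.create(int, int) and
IndexSearcher.search(Query, Collector)). The field name `body`, the term
values, and the class and method names are placeholders for illustration,
not the real ones.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.TopScoreDocCollector;

import java.io.IOException;

public class DisjunctionSketch {

    // Builds a disjunction: one SHOULD TermQuery clause per term, all on the same field.
    // The field and term values passed in are placeholders.
    static Query buildDisjunction(String field, String... terms) {
        BooleanQuery.Builder builder = new BooleanQuery.Builder();
        for (String term : terms) {
            builder.add(new TermQuery(new Term(field, term)), BooleanClause.Occur.SHOULD);
        }
        return builder.build();
    }

    // Runs a query with the collector settings from the post:
    // numHits=100 and totalHitsThreshold=1000 (assumes the Lucene 8.x/9.x API).
    static TopDocs search(IndexSearcher searcher, Query query) throws IOException {
        TopScoreDocCollector collector = TopScoreDocCollector.create(100, 1000);
        searcher.search(query, collector);
        return collector.topDocs();
    }

    public static void main(String[] args) {
        // Scored disjunction: the slow variant.
        Query scored = buildDisjunction("body", "alpha", "beta", "gamma");
        // Same disjunction wrapped so that every hit gets a constant score: the fast variant.
        Query constant = new ConstantScoreQuery(scored);
        System.out.println(scored);
        System.out.println(constant);
    }
}
```

Both variants go through the same `search` helper, so the only difference
between the slow and the fast run is the ConstantScoreQuery wrapper.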

Is a disjunction of 3 high-cardinality terms expected to be this slow, or
is there something obvious I could check and improve?

Thank you,
Alex