Binning/Grouping large result sets efficiently
Hi,

I am still learning about the performance implications of Lucene's APIs when aggregating large
result sets. It seems that some cases require a deeper understanding of Lucene's internals and the
use of not-so-front-facing APIs.

For some time I have been struggling with poor grouping/aggregation performance on the following dataset:

* A sample of 600k locations (points) worldwide, pretty random distribution --> LatLon / Long Term
* A location type (restaurant, cinema, ...) --> String Term
* A few more properties for each location, mostly used for filter queries --> various terms

Producing frequencies of location types ([restaurant: 23451], [cinema: 853], ...) is pretty fast
when using GroupingSearch and TopDocs (around 200 ms).

Frequencies of aggregated locations are trickier. To produce the grids, I have tried
GroupingSearch with a custom ValueSource that translates the location field into a GeoTile / GeoHash
ID, so that the grouping can aggregate the locations at the desired grid level.

[cell=6/8/47, frequency=66], [cell=6/8/48, frequency=114], [cell=6/8/49, frequency=120],
[cell=6/8/50, frequency=120], ...
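
For illustration, here is a minimal sketch of how a point could be mapped to a cell ID of the form shown above. It assumes the level/x/y numbering used by Web Mercator map tiles; whether that matches the grid used in my ValueSource is an assumption, and the class and method names (GeoTileSketch, cellId) are hypothetical, not Lucene APIs.

```java
final class GeoTileSketch {
    // Web Mercator is only defined up to roughly +/-85.05 degrees latitude.
    private static final double MAX_LAT = 85.05112878;

    // Returns "zoom/x/y" for the tile that contains the given point.
    static String cellId(double lat, double lon, int zoom) {
        int n = 1 << zoom; // number of tiles per axis at this zoom level
        double clampedLat = Math.max(-MAX_LAT, Math.min(MAX_LAT, lat));
        double latRad = Math.toRadians(clampedLat);

        int x = (int) Math.floor((lon + 180.0) / 360.0 * n);
        int y = (int) Math.floor(
            (1.0 - Math.log(Math.tan(latRad) + 1.0 / Math.cos(latRad)) / Math.PI) / 2.0 * n);

        // Points exactly on the eastern or southern edge would land on tile n; clamp them.
        x = Math.max(0, Math.min(n - 1, x));
        y = Math.max(0, Math.min(n - 1, y));
        return zoom + "/" + x + "/" + y;
    }
}
```

Computing this per matching document at query time is the expensive part; precomputing the ID at index time for each grid level of interest would be one way to avoid it, at the cost of extra fields.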

Unfortunately, this aggregation is pretty slow (it takes 4 seconds with 3.8k bins). When profiling, I
can see that Lucene spends most of the time in lucene.util.PriorityQueue.
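
Time spent in a priority queue suggests that something is being ranked, although which part of the grouping does the ranking is an assumption here. For the grid, only a count per cell is needed and no ordering of documents within a cell. The sketch below shows that reduced task in plain Java, outside of Lucene: the cell IDs are assumed to be available as long values (for example from a precomputed numeric field), and CellCounter and countByCell are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

final class CellCounter {
    // Counts how often each cell ID occurs; no sorting or ranking is involved.
    static Map<Long, Integer> countByCell(long[] cellIds) {
        Map<Long, Integer> counts = new HashMap<>();
        for (long id : cellIds) {
            counts.merge(id, 1, Integer::sum); // insert 1 or add 1 to the existing count
        }
        return counts;
    }
}
```

With 600k points and 3.8k bins this is a single pass with one hash lookup per point. How to feed such a loop from inside a Lucene search (for instance from a custom collector) is not shown here.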

So I am looking for ways to speed this up. From what I have seen in the tests and examples, Lucene's
spatial indices (i.e. implementations of SpatialPrefixTree) already use GeoHash and Quadtree
encoding / prefix codes. Is there a way to leverage those for my task?
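
To make the prefix property concrete, the following sketch encodes a point as a standard geohash in plain Java. It is not Lucene's implementation, and the names GeohashSketch and encode are hypothetical. The point is that a shorter hash of the same location is always a prefix of the longer one, so a coarser grid level corresponds to truncating the code.

```java
final class GeohashSketch {
    private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    // Encodes a point as a geohash with the given number of characters.
    static String encode(double lat, double lon, int length) {
        double latMin = -90.0, latMax = 90.0, lonMin = -180.0, lonMax = 180.0;
        StringBuilder hash = new StringBuilder();
        boolean useLon = true; // bits alternate between longitude and latitude
        int bits = 0;          // number of bits collected for the current character
        int value = 0;         // 5-bit value of the current character

        while (hash.length() < length) {
            if (useLon) {
                double mid = (lonMin + lonMax) / 2.0;
                if (lon >= mid) { value = (value << 1) | 1; lonMin = mid; }
                else            { value = value << 1;       lonMax = mid; }
            } else {
                double mid = (latMin + latMax) / 2.0;
                if (lat >= mid) { value = (value << 1) | 1; latMin = mid; }
                else            { value = value << 1;       latMax = mid; }
            }
            useLon = !useLon;
            if (++bits == 5) { // every 5 bits yield one base-32 character
                hash.append(BASE32.charAt(value));
                bits = 0;
                value = 0;
            }
        }
        return hash.toString();
    }
}
```

For example, encode(57.64911, 10.40744, 2) yields "u4", and every longer hash of the same point starts with "u4". If the indexed terms are such prefix codes, counting the terms at one fixed length would in principle give the grid directly.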

Is there related documentation in the Lucene ecosystem that I can study?


I am also interested in learning how to efficiently produce combined aggregations on cell and
location type, e.g.: [cell=6/8/47, type=restaurant, frequency=12],[cell=6/8/47, type=cinema,
frequency=2], ...
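
The desired output can be described as a two-level count. The sketch below shows the shape of that result in plain Java, assuming the cell ID and the type of every matching location are already at hand as strings; it does not use any Lucene API, and CellTypeCounter and count are hypothetical names.

```java
import java.util.Map;
import java.util.TreeMap;

final class CellTypeCounter {
    // Returns cell -> (type -> frequency). Both levels are sorted by key.
    static Map<String, Map<String, Integer>> count(String[] cells, String[] types) {
        if (cells.length != types.length) {
            throw new IllegalArgumentException("cells and types must have the same length");
        }
        Map<String, Map<String, Integer>> result = new TreeMap<>();
        for (int i = 0; i < cells.length; i++) {
            result.computeIfAbsent(cells[i], k -> new TreeMap<>())
                  .merge(types[i], 1, Integer::sum);
        }
        return result;
    }
}
```

TreeMap is used only so that the output comes back ordered by cell and then by type; a HashMap would do if the order does not matter.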

Since sorting by two or more dimensions is possible, it should be possible to stream this
efficiently out of the indices, provided Lucene offers APIs to do this. Right now, I am resorting
to LeafReader, but that is probably the slowest of all options.



- Matthias
