Richer Aggregations in Lucene
Hi Lucene devs,

I work on product search at Amazon, where we use Lucene faceting
to compute aggregations. There are a few capabilities I'm missing in
faceting. For example, faceting always aggregates all the way up to the
dimension, and it can't compute multiple aggregations in one pass over the
match set.
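
To make that concrete, here is roughly what a faceting pass looks like for us
today. This is just an illustration: the "category" dimension, the
MatchAllDocsQuery, and the class/method names are all made up.

import org.apache.lucene.facet.FacetResult;
import org.apache.lucene.facet.Facets;
import org.apache.lucene.facet.FacetsCollector;
import org.apache.lucene.facet.FacetsConfig;
import org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts;
import org.apache.lucene.facet.taxonomy.TaxonomyReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;

public class FacetingToday {
  static FacetResult topCategories(IndexSearcher searcher, TaxonomyReader taxoReader,
                                   FacetsConfig config) throws Exception {
    // Pass 1: run the query and record the matching docs per segment.
    FacetsCollector fc = new FacetsCollector();
    FacetsCollector.search(searcher, new MatchAllDocsQuery(), 10, fc);

    // Pass 2: counts are always aggregated all the way up to the dimension,
    // and computing a second aggregation (say a sum over another field) means
    // creating another Facets instance that re-walks fc.getMatchingDocs().
    Facets counts = new FastTaxonomyFacetCounts(taxoReader, config, fc);
    return counts.getTopChildren(10, "category"); // "category" is a made-up dimension
  }
}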

Lucene-based search engines (like Elasticsearch or OpenSearch) have
feature-rich aggregation engines that allow different collection modes and
give the user more control over the granularity of the scopes for which
aggregations are computed.

Are there historical reasons not to have this type of aggregation engine
directly in Lucene? If it seems like a worthwhile idea to pursue, I've
experimented a bit with how we could fulfill these needs in Lucene and I can
open an issue/PR.

Thanks,
Shradha
Re: Richer Aggregations in Lucene
Hey Shradha,

Such a contribution would be welcome. There is no good reason not to
support richer aggregations in Lucene. One thing that I have found
interesting with faceting/aggregations is that every implementation seems
to make different trade-offs, e.g.
- Lucene's faceting historically required adding side-car data, but we
seem to want to make it work more and more with regular doc values instead
of the side-car index?
- Both Lucene's faceting module and Solr (I think) load the set of matches
into a bitset first and then compute facets against this bitset, while
Elasticsearch computes aggregations inside the collector (see the sketch
after this list).
- Both Elasticsearch and Solr have composable aggregations, e.g. break
down by category and then, within each category, by brand, but Lucene's
facets don't support this.
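
For anyone who hasn't looked at the faceting module, the bitset-first model in
the second point is roughly this two-pass shape. It's only a sketch, not the
exact internals, and the numeric field passed in is whatever you want to
aggregate on:

import java.io.IOException;
import org.apache.lucene.facet.FacetsCollector;
import org.apache.lucene.facet.FacetsCollector.MatchingDocs;
import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.search.DocIdSetIterator;

public class MatchSetSum {
  // Second pass: re-walk the per-segment doc-id sets that FacetsCollector
  // recorded during the query and fold a numeric doc-values field into a sum.
  static long sum(FacetsCollector fc, String field) throws IOException {
    long total = 0;
    for (MatchingDocs md : fc.getMatchingDocs()) {
      NumericDocValues values = DocValues.getNumeric(md.context.reader(), field);
      DocIdSetIterator matches = md.bits.iterator();
      if (matches == null) {
        continue; // no matches in this segment
      }
      for (int doc = matches.nextDoc();
           doc != DocIdSetIterator.NO_MORE_DOCS;
           doc = matches.nextDoc()) {
        if (values.advanceExact(doc)) {
          total += values.longValue();
        }
      }
    }
    return total;
  }
}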

If you're going to build a new one, I have some suggestions:
- Let's avoid dependencies on side-car indexes?
- I don't think we should load matches into an int[] or BitSet: it takes
too much memory. However, it's also true that collecting docs one by one
makes some things slower. Maybe we should look into doing something
in between, like batching the computation of aggregations (see the sketch
below)? This could still allow taking advantage of e.g. vectorization when
computing, say, the average of a field.
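
Very roughly, the in-between idea could look like a collector that buffers doc
IDs and folds them into the aggregation a small batch at a time. All names
here are made up and this is only a sketch of the shape, not a proposal for
the actual API:

import java.io.IOException;
import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.search.ScoreMode;
import org.apache.lucene.search.SimpleCollector;

// Buffers matching docs per segment and folds them into the aggregation one
// small batch at a time, instead of one doc at a time or one big bitset at
// the end.
public class BatchingSumCollector extends SimpleCollector {
  private static final int BATCH_SIZE = 1024;

  private final String field;       // numeric doc-values field to sum
  private final int[] buffer = new int[BATCH_SIZE];
  private int buffered;
  private NumericDocValues values;
  private long sum;

  public BatchingSumCollector(String field) {
    this.field = field;
  }

  @Override
  protected void doSetNextReader(LeafReaderContext context) throws IOException {
    flush(); // finish the previous segment's batch before switching readers
    values = DocValues.getNumeric(context.reader(), field);
  }

  @Override
  public void collect(int doc) throws IOException {
    buffer[buffered++] = doc;
    if (buffered == BATCH_SIZE) {
      flush();
    }
  }

  private void flush() throws IOException {
    // Buffered docs are in increasing order, so a forward-only doc-values
    // iterator can consume them in one pass. A loop like this is where
    // batched/vectorized evaluation could kick in.
    for (int i = 0; i < buffered; i++) {
      if (values.advanceExact(buffer[i])) {
        sum += values.longValue();
      }
    }
    buffered = 0;
  }

  public long getSum() throws IOException {
    flush(); // fold in whatever is left from the last segment
    return sum;
  }

  @Override
  public ScoreMode scoreMode() {
    return ScoreMode.COMPLETE_NO_SCORES;
  }
}

You'd drive it through a normal IndexSearcher search with this collector, and
the flush loop is the spot where vectorization could help for things like sums
or averages.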


--
Adrien