Mailing List Archive: Computing multiple different aggregations over a match-set in one pass

Computing multiple different aggregations over a match-set in one pass

stefan.vodita at gmail

Feb 10, 2023, 11:32 AM

Post #1 of 15 (1066 views)

Hi all,

Let’s say I have an index of books, similar to the example in the facet demo [1]
with a hierarchical facet field encapsulating `Genre / Author’s nationality /
Author’s name`.

I might like to find the latest publish date of a book written by Mark Twain,
the sum of the prices of books written by American authors, and the number of
sci-fi novels.

As far as I understand, this would require faceting 3 times over the match-set,
one iteration for each aggregation of a different type (max(date), sum(price),
count). That seems inefficient if we could instead compute all aggregations in
one pass.

Is there a way to do that?
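
To make the question concrete, here is a minimal sketch of the single-pass
idea in plain Java. It does not use Lucene: `Book`, `Result`, `aggregate` and
`sampleMatchSet` are hypothetical names, the facet path is modeled as a
three-element array (genre, nationality, author), and the dates and prices are
made-up sample values. It only shows that max(date), sum(price) and count can
share one iteration over the match-set.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: plain Java stand-ins, not Lucene classes. */
final class OnePassBookAggregation {

  /** Hypothetical matched book; facet path = {genre, nationality, author}. */
  static final class Book {
    final String[] path;
    final long publishDate; // assumed encoding: yyyyMMdd as a number
    final double price;

    Book(String genre, String nationality, String author, long publishDate, double price) {
      this.path = new String[] {genre, nationality, author};
      this.publishDate = publishDate;
      this.price = price;
    }
  }

  /** The three results, all filled in during the same pass. */
  static final class Result {
    long latestTwainDate = Long.MIN_VALUE; // stays MIN_VALUE if no Twain book matched
    double americanPriceSum = 0.0;
    int sciFiCount = 0;
  }

  static Result aggregate(List<Book> matchSet) {
    Result r = new Result();
    for (Book b : matchSet) { // single iteration over the hits
      if (b.path[2].equals("Mark Twain")) {
        r.latestTwainDate = Math.max(r.latestTwainDate, b.publishDate);
      }
      if (b.path[1].equals("American")) {
        r.americanPriceSum += b.price;
      }
      if (b.path[0].equals("Sci-Fi")) {
        r.sciFiCount++;
      }
    }
    return r;
  }

  /** Made-up sample data standing in for the hits of a keyword search. */
  static List<Book> sampleMatchSet() {
    List<Book> books = new ArrayList<>();
    books.add(new Book("Fiction", "American", "Mark Twain", 18841210L, 10.0));
    books.add(new Book("Fiction", "American", "Mark Twain", 18760601L, 8.0));
    books.add(new Book("Sci-Fi", "American", "Isaac Asimov", 19510101L, 12.5));
    books.add(new Book("Sci-Fi", "British", "H. G. Wells", 18980101L, 7.0));
    return books;
  }

  public static void main(String[] args) {
    Result r = aggregate(sampleMatchSet());
    System.out.println(r.latestTwainDate + " " + r.americanPriceSum + " " + r.sciFiCount);
  }
}
```

Each hit is visited once and updates whichever accumulators it belongs to; the
open question in this thread is whether the faceting module can do the same.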

Stefan

[1] https://javadoc.io/doc/org.apache.lucene/lucene-demo/latest/org/apache/lucene/demo/facet/package-summary.html

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Computing multiple different aggregations over a match-set in one pass [ In reply to ]

gsmiller at gmail

Feb 10, 2023, 4:14 PM

Post #2 of 15 (1066 views)

Hi Stefan-

Can you clarify your example a little bit? It sounds like you want to facet
over three different match sets (one constrained by "Mark Twain" as the
author, one constrained by "American authors" and one constrained by the
"sci-fi" genre). Is that correct?

Cheers,
-Greg


Re: Computing multiple different aggregations over a match-set in one pass [ In reply to ]

stefan.vodita at gmail

Feb 11, 2023, 2:27 AM

Post #3 of 15 (1066 views)

Hi Greg,

I’m assuming we have one match-set which was not constrained by any
of the categories we want to aggregate over, so it may have books by
Mark Twain, books by American authors, and sci-fi books.

Maybe we can imagine we obtained it by searching for a keyword, say
“Washington”, which is present in Mark Twain’s writing, and those of other
American authors, and in sci-fi novels too.

Does that make the example clearer?

Stefan


Re: Computing multiple different aggregations over a match-set in one pass [ In reply to ]

gsmiller at gmail

Feb 13, 2023, 2:45 PM

Post #4 of 15 (1062 views)

Hi Stefan-

That helps, thanks. I'm a bit confused about where you're concerned with
iterating over the match set multiple times. Is this a situation where the
ordinals you want to facet over are stored in different index fields, so
you have to create multiple Facets instances (one per field) to compute the
aggregations? If that's the case, then yes—you have to iterate over the
match set multiple times (once per field). I'm not sure that's such a big
issue given that you're doing novel work during each iteration, so the only
repetitive cost is actually iterating the hits. If the ordinals are
"packed" into the same field though (which is the default in Lucene if
you're using taxonomy faceting), then you should only need to do a single
iteration over that field.

Cheers,
-Greg


Re: Computing multiple different aggregations over a match-set in one pass [ In reply to ]

stefan.vodita at gmail

Feb 14, 2023, 3:19 AM

Post #5 of 15 (1062 views)

Hi Greg,

I see now where my example didn’t give enough info. In my mind, `Genre /
Author nationality / Author name` is stored in one hierarchical facet field.
The data we’re aggregating over, like publish date or price, are stored in
DocValues.

The demo package shows something similar [1], where the aggregation
is computed across a facet field using data from a `popularity` DocValue.

In the demo, we compute `sum(_score * sqrt(popularity))`, but what if we
want several other different aggregations with respect to the same facet
field? Maybe we want `max(popularity)`. In that case, iterating twice
duplicates most of the work, correct?

Stefan

[1] https://github.com/apache/lucene/blob/7f8b7ffbcad2265b047a5e2195f76cc924028063/lucene/demo/src/java/org/apache/lucene/demo/facet/ExpressionAggregationFacetsExample.java#L91


Re: Computing multiple different aggregations over a match-set in one pass [ In reply to ]

gsmiller at gmail

Feb 15, 2023, 7:47 AM

Post #6 of 15 (1062 views)

Hi Stefan-

> In that case, iterating twice duplicates most of the work, correct?

I'm not sure I'd agree that it duplicates "most" of the work. This is an
association faceting example, which is a little bit of a special case in
some ways. But, to your question, there is duplicated work here of
re-loading the ordinals across the two aggregations, but I would suspect
the more expensive work is actually computing the different aggregations,
which is not duplicated. You're right that it would likely be more
efficient to iterate the hits once, loading the ordinals once and computing
multiple aggregations in one pass. There's no facility for doing that
currently in Lucene's faceting module, but you could always propose it! :)
That said, I'm not sure how common of a case this really is for the
majority of users? But that's just a guess/assumption.
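
As a rough illustration of "loading the ordinals once and computing multiple
aggregations in one pass", the sketch below keeps one result array per
aggregation, indexed by facet ordinal. It is plain Java under stated
assumptions, not an existing Lucene facility: `MultiAggregationByOrdinal`,
`Results` and the in-memory `ordinalsByDoc` / `popularityByDoc` arrays are
hypothetical stand-ins for the ordinals field and the `popularity` DocValues.

```java
import java.util.Arrays;

/** Illustration only: plain Java arrays stand in for index data. */
final class MultiAggregationByOrdinal {

  /** Per-ordinal results for two aggregations computed in the same pass. */
  static final class Results {
    final float[] sum;
    final float[] max;

    Results(int taxonomySize) {
      sum = new float[taxonomySize]; // zero-initialized
      max = new float[taxonomySize];
      Arrays.fill(max, Float.NEGATIVE_INFINITY); // marks ordinals with no hits
    }
  }

  /**
   * @param hits            doc ids in the match-set
   * @param ordinalsByDoc   facet ordinals for each doc id (hypothetical layout)
   * @param popularityByDoc value to aggregate for each doc id
   * @param taxonomySize    number of ordinals in the taxonomy
   */
  static Results aggregate(
      int[] hits, int[][] ordinalsByDoc, float[] popularityByDoc, int taxonomySize) {
    Results r = new Results(taxonomySize);
    for (int doc : hits) {
      float value = popularityByDoc[doc]; // value read once per hit
      for (int ord : ordinalsByDoc[doc]) { // ordinals read once per hit
        r.sum[ord] += value;
        r.max[ord] = Math.max(r.max[ord], value);
      }
    }
    return r;
  }

  public static void main(String[] args) {
    int[][] ordinalsByDoc = {{1, 2}, {1, 3}, {2}};
    float[] popularityByDoc = {2f, 5f, 3f};
    Results r = aggregate(new int[] {0, 1}, ordinalsByDoc, popularityByDoc, 4);
    System.out.println(Arrays.toString(r.sum) + " " + Arrays.toString(r.max));
  }
}
```

The only work shared between the two aggregations is the hit iteration and the
ordinal lookup, which matches the point above that the aggregation arithmetic
itself is not duplicated by running two passes.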

Cheers,
-Greg


Re: Computing multiple different aggregations over a match-set in one pass [ In reply to ]

stefan.vodita at gmail

Feb 16, 2023, 1:31 PM

Post #7 of 15 (1062 views)

Hi Greg,

To better understand how much work gets duplicated, I went ahead
and modified FloatTaxonomyFacets as an example [1]. It doesn't look
too pretty, but it illustrates how I think multiple aggregations in one
iteration could work.

Overall, you're right, there's not as much wasted work as I had
expected. I'll try to do a performance comparison to quantify precisely
how much time we could save, just in case.

Thank you for the help!
Stefan

[1] https://github.com/stefanvodita/lucene/commit/3227dabe746858fc81b9f6e4d2ac9b66e8c32684


Re: Computing multiple different aggregations over a match-set in one pass [ In reply to ]

gsmiller at gmail

Feb 17, 2023, 10:44 AM

Post #8 of 15 (1061 views)

Thanks for the follow-up, Stefan. If you find significant overhead
associated with the multiple iterations, please keep challenging the
current approach and suggest improvements. It's always good to revisit
these things!

Cheers,
-Greg


Re: Computing multiple different aggregations over a match-set in one pass [ In reply to ]

stefan.vodita at gmail

Feb 17, 2023, 3:14 PM

Post #9 of 15 (1061 views)

After benchmarking my implementation against the existing one, I think there is
some meaningful overhead. I built a small driver [1] that runs the two solutions
over a geo data [2] index (thank you Greg for providing the indexing code!).

The table below lists faceting times in milliseconds. I’ve named the current
implementation serial and my proposal parallel, for lack of better names. The
aggregation function is a no-op, so we’re only measuring the time spent outside
aggregation. The measurements are over a match-set of 100k docs, but the number
of docs does not have a large impact on the results because the aggregation
function isn’t doing any work.

| Number of Aggregations | Serial Faceting Time (ms) | Parallel Faceting Time (ms) |
|------------------------|---------------------------|-----------------------------|
| 2                      | 510                       | 328                         |
| 5                      | 1211                      | 775                         |
| 10                     | 2366                      | 1301                        |
If we use a MAX aggregation over a DocValue instead, the results tell a similar
story. In this case, the number of docs matters. I've attached results for
10 docs and 100k docs.

| Number of Aggregations | Serial Faceting Time (ms) | Parallel Faceting Time (ms) |
|------------------------|---------------------------|-----------------------------|
| 2                      | 706                       | 505                         |
| 5                      | 1618                      | 1119                        |
| 10                     | 3152                      | 2018                        |

| Number of Aggregations | Serial Faceting Time (ms) | Parallel Faceting Time (ms) |
|------------------------|---------------------------|-----------------------------|
| 2                      | 904                       | 655                         |
| 5                      | 2122                      | 1491                        |
| 10                     | 5062                      | 3317                        |

With 10 aggregations, we're saving a second or more. That is significant for my
use-case.

I'd like to know if the test and results seem reasonable. If so, maybe we can
think about providing this functionality.

Thanks,
Stefan

[1] https://github.com/stefanvodita/lucene/commit/3536546cd9f833150db001e8eede093723cf7663
[2] https://download.geonames.org/export/dump/allCountries.zip


Re: Computing multiple different aggregations over a match-set in one pass [ In reply to ]

gsmiller at gmail

Feb 23, 2023, 8:52 AM

Post #10 of 15 (1046 views)

Thanks for the detailed benchmarking Stefan! I think you've demonstrated
here that looping over the collected hits multiple times does in fact add
meaningful overhead. That's interesting to see!

As for whether or not to add functionality to the facets module that
supports this, I'm not convinced at this point. I think what you're
suggesting here—but please correct me if I'm wrong—is supporting
association faceting where the user wants to compute multiple association
aggregations for the same dimensions in a single pass. Where I'm struggling
to connect a real-world use-case though is what the user is going to
actually do with those multiple association values. The Facets API today (
https://github.com/apache/lucene/blob/main/lucene/facet/src/java/org/apache/lucene/facet/Facets.java)
has a pretty firm assumption built in that dimensions/paths have a single
value associated with them. So building some sort of association faceting
implementation that exposes more than one value associated with a given
dimension/path is a significant change to the current model, and I'm not
sure it supports enough real-world use to warrant the complexity.

OK, now disclaimer: Stefan and I work together so I think I have an idea of
what he's doing here...

What I think you're actually after here—and the one use-case I could
imagine some other users being interested in—is computing a single value
for each dimension/path that is actually an expression over _other_
aggregated values. For example, if we're using the geonames data you have
in your example, maybe the value you want to associate with a given path is
something like `max(population) + sum(elevation)`, where `max(population)`
and `sum(elevation)` are the result of two independent facet associations.
Then, you could combine those results through some expression to derive a
single value for a given path. That end result still fits the Facets API
well, but supporting something like this in Lucene requires a few other
primitives beyond just the ability to compute multiple associations at the
same time. Primarily, it needs some version of Expression + Bindings that
works for dimensions/paths. So I don't think the ability to compute
multiple associations at once is really the key missing feature here, and I
don't think it adds significant value on its own to warrant the complexity
of trying to expose it through the existing Facets API. Of course, there's
nothing preventing users from building this "multiple association"
functionality themselves.
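
To make the `max(population) + sum(elevation)` example concrete, here is a minimal sketch under stated assumptions. It uses plain arrays rather than Lucene classes: `CombinedAssociation`, `ordinals`, `population` and `elevation` are hypothetical names standing in for the collected hits and their per-document values. The point is only that the two intermediate aggregations stay internal, and a single value per ordinal comes out at the end.

```java
import java.util.Arrays;

class CombinedAssociation {
  /**
   * Computes max(population) + sum(elevation) for each ordinal.
   * ordinals[i] is the (hypothetical) facet ordinal of the i-th hit.
   * Assumes non-negative populations, so 0 is a safe starting value for max.
   */
  static long[] maxPopulationPlusSumElevation(int[] ordinals, long[] population,
                                              long[] elevation, int numOrdinals) {
    long[] maxPop = new long[numOrdinals];
    long[] sumElev = new long[numOrdinals];
    // One pass over the hits updates both intermediate aggregations.
    for (int i = 0; i < ordinals.length; i++) {
      int ord = ordinals[i];
      maxPop[ord] = Math.max(maxPop[ord], population[i]);
      sumElev[ord] += elevation[i];
    }
    // The expression is applied per ordinal, yielding the single value
    // that the one-value-per-path model expects.
    long[] combined = new long[numOrdinals];
    for (int ord = 0; ord < numOrdinals; ord++) {
      combined[ord] = maxPop[ord] + sumElev[ord];
    }
    return combined;
  }

  public static void main(String[] args) {
    int[] ordinals = {0, 1, 0};
    long[] population = {100, 50, 300};
    long[] elevation = {10, 20, 5};
    System.out.println(Arrays.toString(
        maxPopulationPlusSumElevation(ordinals, population, elevation, 2)));
  }
}
```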

That's my take on this, but maybe I'm missing some other use-cases that
could justify adding this capability in a general way? What do you think?

Cheers,
-Greg


Re: Computing multiple different aggregations over a match-set in one pass [ In reply to ]

stefan.vodita at gmail

Feb 24, 2023, 1:55 PM

Post #11 of 15 (1040 views)

Hi everyone,

Greg and I discussed a bit offline. His assessment was right - I’m not looking
to compute multiple values per ordinal as an end in itself. That is only a means
to compute a single value which depends on other facet results. This section
from the previous email explains it really well:

> For example, if we're using the geonames data you have in your example,
> maybe the value you want to associate with a given path is something like
> `max(population) + sum(elevation)`, where `max(population)` and `sum(elevation)`
> are the result of two independent facet associations. Then, you could combine
> those results through some expression to derive a single value for a given path.

Ideally, I could facet using an expression which binds other aggregations. The
user experience might be as simple as defining the expression and making a
single faceting call. Has anyone worked on something similar?
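
As a rough illustration of the user experience I have in mind, here is a sketch under loud assumptions. Nothing in it is an existing Lucene API: `ExpressionFacetSketch`, `Binding` and `facet` are hypothetical names, and plain arrays stand in for hits, ordinals and DocValues. The caller names each aggregation, supplies an expression over those names, and makes one call.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.DoubleBinaryOperator;
import java.util.function.ToDoubleFunction;

class ExpressionFacetSketch {
  /** Hypothetical binding: a per-hit value source plus how to fold it per ordinal. */
  static final class Binding {
    final double[] source;
    final double identity;
    final DoubleBinaryOperator combine;

    Binding(double[] source, double identity, DoubleBinaryOperator combine) {
      this.source = source;
      this.identity = identity;
      this.combine = combine;
    }
  }

  /** Single pass over the hits, then one expression evaluation per ordinal. */
  static double[] facet(int[] ordinals, int numOrdinals, Map<String, Binding> bindings,
                        ToDoubleFunction<Map<String, Double>> expression) {
    Map<String, double[]> acc = new LinkedHashMap<>();
    for (Map.Entry<String, Binding> e : bindings.entrySet()) {
      double[] perOrdinal = new double[numOrdinals];
      Arrays.fill(perOrdinal, e.getValue().identity);
      acc.put(e.getKey(), perOrdinal);
    }
    for (int i = 0; i < ordinals.length; i++) {
      int ord = ordinals[i]; // read once, shared by every bound aggregation
      for (Map.Entry<String, Binding> e : bindings.entrySet()) {
        Binding b = e.getValue();
        double[] perOrdinal = acc.get(e.getKey());
        perOrdinal[ord] = b.combine.applyAsDouble(perOrdinal[ord], b.source[i]);
      }
    }
    double[] result = new double[numOrdinals];
    Map<String, Double> row = new HashMap<>();
    for (int ord = 0; ord < numOrdinals; ord++) {
      for (Map.Entry<String, double[]> e : acc.entrySet()) {
        row.put(e.getKey(), e.getValue()[ord]);
      }
      // Ordinals with no hits see each binding's identity value here.
      result[ord] = expression.applyAsDouble(row);
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, Binding> bindings = new LinkedHashMap<>();
    bindings.put("max_population",
        new Binding(new double[] {100, 50, 300}, Double.NEGATIVE_INFINITY, Math::max));
    bindings.put("sum_elevation", new Binding(new double[] {10, 20, 5}, 0.0, Double::sum));
    double[] values = facet(new int[] {0, 1, 0}, 2, bindings,
        v -> v.get("max_population") + v.get("sum_elevation"));
    System.out.println(Arrays.toString(values));
  }
}
```

A real implementation would presumably parse the expression and resolve its variables through something like the expressions module's bindings instead of a lambda over a map; the map is only the simplest stand-in.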

Best,
Stefan

On Thu, 23 Feb 2023 at 16:53, Greg Miller <gsmiller@gmail.com> wrote:
>
> Thanks for the detailed benchmarking Stefan! I think you've demonstrated
> here that looping over the collected hits multiple times does in fact add
> meaningful overhead. That's interesting to see!
>
> As for whether-or-not to add functionality to the facets module that
> supports this, I'm not convinced at this point. I think what you're
> suggesting here—but please correct me if I'm wrong—is supporting
> association faceting where the user wants to compute multiple association
> aggregations for the same dimensions in a single pass. Where I'm struggling
> to connect a real-world use-case though is what the user is going to
> actually do with those multiple association values. The Facets API today (
> https://github.com/apache/lucene/blob/main/lucene/facet/src/java/org/apache/lucene/facet/Facets.java)
> has a pretty firm assumption built in that dimensions/paths have a single
> value associated with them. So building some sort of association faceting
> implementation that exposes more than one value associated with a given
> dimension/path is a significant change to the current model, and I'm not
> sure it supports enough real-world use to warrant the complexity.
>
> OK, now disclaimer: Stefan and I work together so I think I have an idea of
> what he's doing here...
>
> What I think you're actually after here—and the one use-case I could
> imagine some other users being interested in—is computing a single value
> for each dimension/path that is actually an expression over _other_
> aggregated values. For example, if we're using the geonames data you have
> in your example, maybe the value you want to associate with a given path is
> something like `max(population) + sum(elevation)`, where `max(population)`
> and `sum(elevation)` are the result of two independent facet associations.
> Then, you could combine those results though some expression to derive a
> single value for a given path. That end result still fits the Facets API
> well, but supporting something like this in Lucene requires a few other
> primitives beyond just the ability to compute multiple associations at the
> same time. Primarily, it needs some version of Expression + Bindings that
> works for dimensions/paths. So I don't think the ability to compute
> multiple associations at once is really the key missing feature here, and I
> don't think it adds significant value on its own to warrant the complexity
> of trying to expose it through the existings Facets API. Of course, there's
> nothing preventing users from building this "multiple association"
> functionality themselves.
>
> That's my take on this, but maybe I'm missing some other use-cases that
> could justify adding this capability in a general way? What do you think?
>
> Cheers,
> -Greg
>
> On Fri, Feb 17, 2023 at 3:14 PM Stefan Vodita <stefan.vodita@gmail.com>
> wrote:
>
> > After benchmarking my implementation against the existing one, I think
> > there is
> > some meaningful overhead. I built a small driver [1] that runs the two
> > solutions over
> > a geo data [2] index (thank you Greg for providing the indexing code!).
> >
> > The table below lists faceting times in milliseconds. I’ve named the
> > current
> > implementation serial and my proposal parallel, for lack of better names.
> > The
> > aggregation function is a no-op, so we’re only measuring the time spent
> > outside
> > aggregation. The measurements are over a match-set of 100k docs, but the
> > number
> > of docs does not have a large impact on the results because the aggregation
> > function isn’t doing any work.
> >
> > | Number of Aggregations | Serial Faceting Time (ms) | Parallel
> > Faceting Time (ms) |
> >
> > |----------------------------------|------------------------------------|--------------------------------------|
> > | 2 |
> > 510 | 328 |
> > | 5 |
> > 1211 | 775 |
> > | 10 |
> > 2366 | 1301 |
> >
> > If we use a MAX aggregation over a DocValue instead, the results tell a
> > similar
> > story. In this case, the number of docs matters. I've attached results
> > for 10 docs and
> > 100k docs.
> >
> > | Number of Aggregations | Serial Faceting Time (ms) | Parallel
> > Faceting Time (ms) |
> >
> > |----------------------------------|------------------------------------|--------------------------------------|
> > | 2 |
> > 706 | 505 |
> > | 5 |
> > 1618 | 1119 |
> > | 10 |
> > 3152 | 2018 |
> >
> > | Number of Aggregations | Serial Faceting Time (ms) | Parallel
> > Faceting Time (ms) |
> >
> > |----------------------------------|------------------------------------|--------------------------------------|
> > | 2 |
> > 904 | 655 |
> > | 5 |
> > 2122 | 1491 |
> > | 10 |
> > 5062 | 3317 |
> >
> > With 10 aggregations, we're saving a second or more. That is significant
> > for my
> > use-case.
> >
> > I'd like to know if the test and results seem reasonable. If so, maybe
> > we can think
> > about providing this functionality.
> >
> > Thanks,
> > Stefan
> >
> > [1]
> > https://github.com/stefanvodita/lucene/commit/3536546cd9f833150db001e8eede093723cf7663
> > [2] https://download.geonames.org/export/dump/allCountries.zip
> >
> >
> > On Fri, 17 Feb 2023 at 18:45, Greg Miller <gsmiller@gmail.com> wrote:
> > >
> > > Thanks for the follow up Stefan. If you find significant overhead
> > > associated with the multiple iterations, please keep challenging the
> > > current approach and suggest improvements. It's always good to revisit
> > > these things!
> > >
> > > Cheers,
> > > -Greg
> > >
> > > On Thu, Feb 16, 2023 at 1:32 PM Stefan Vodita <stefan.vodita@gmail.com>
> > > wrote:
> > >
> > > > Hi Greg,
> > > >
> > > > To better understand how much work gets duplicated, I went ahead
> > > > and modified FloatTaxonomyFacets as an example [1]. It doesn't look
> > > > too pretty, but it illustrates how I think multiple aggregations in one
> > > > iteration could work.
> > > >
> > > > Overall, you're right, there's not as much wasted work as I had
> > > > expected. I'll try to do a performance comparison to quantify precisely
> > > > how much time we could save, just in case.
> > > >
> > > > Thank you for the help!
> > > > Stefan
> > > >
> > > > [1]
> > > >
> > https://github.com/stefanvodita/lucene/commit/3227dabe746858fc81b9f6e4d2ac9b66e8c32684
> > > >
> > > > On Wed, 15 Feb 2023 at 15:48, Greg Miller <gsmiller@gmail.com> wrote:
> > > > >
> > > > > Hi Stefan-
> > > > >
> > > > > > In that case, iterating twice duplicates most of the work, correct?
> > > > >
> > > > > I'm not sure I'd agree that it duplicates "most" of the work. This
> > is an
> > > > > association faceting example, which is a little bit of a special
> > case in
> > > > > some ways. But, to your question, there is duplicated work here of
> > > > > re-loading the ordinals across the two aggregations, but I would
> > suspect
> > > > > the more expensive work is actually computing the different
> > aggregations,
> > > > > which is not duplicated. You're right that it would likely be more
> > > > > efficient to iterate the hits once, loading the ordinals once and
> > > > computing
> > > > > multiple aggregations in one pass. There's no facility for doing that
> > > > > currently in Lucene's faceting module, but you could always propose
> > it!
> > > > :)
> > > > > That said, I'm not sure how common of a case this really is for the
> > > > > majority of users? But that's just a guess/assumption.
> > > > >
> > > > > Cheers,
> > > > > -Greg
> > > > >
> > > > > On Tue, Feb 14, 2023 at 3:19 AM Stefan Vodita <
> > stefan.vodita@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi Greg,
> > > > > >
> > > > > > I see now where my example didn’t give enough info. In my mind,
> > `Genre
> > > > /
> > > > > > Author nationality / Author name` is stored in one hierarchical
> > facet
> > > > > > field.
> > > > > > The data we’re aggregating over, like publish date or price, are
> > > > stored in
> > > > > > DocValues.
> > > > > >
> > > > > > The demo package shows something similar [1], where the aggregation
> > > > > > is computed across a facet field using data from a `popularity`
> > > > DocValue.
> > > > > >
> > > > > > In the demo, we compute `sum(_score * sqrt(popularity))`, but what
> > if
> > > > we
> > > > > > want several other different aggregations with respect to the same
> > > > facet
> > > > > > field? Maybe we want `max(popularity)`. In that case, iterating
> > twice
> > > > > > duplicates most of the work, correct?
> > > > > >
> > > > > >
> > > > > > Stefan
> > > > > >
> > > > > > [1]
> > > > > >
> > > >
> > https://github.com/apache/lucene/blob/7f8b7ffbcad2265b047a5e2195f76cc924028063/lucene/demo/src/java/org/apache/lucene/demo/facet/ExpressionAggregationFacetsExample.java#L91
> > > > > >
> > > > > > On Mon, 13 Feb 2023 at 22:46, Greg Miller <gsmiller@gmail.com>
> > wrote:
> > > > > > >
> > > > > > > Hi Stefan-
> > > > > > >
> > > > > > > That helps, thanks. I'm a bit confused about where you're
> > concerned
> > > > with
> > > > > > > iterating over the match set multiple times. Is this a situation
> > > > where
> > > > > > the
> > > > > > > ordinals you want to facet over are stored in different index
> > > > fields, so
> > > > > > > you have to create multiple Facets instances (one per field) to
> > > > compute
> > > > > > the
> > > > > > > aggregations? If that's the case, then yes—you have to iterate
> > over
> > > > the
> > > > > > > match set multiple times (once per field). I'm not sure that's
> > such
> > > > a big
> > > > > > > issue given that you're doing novel work during each iteration,
> > so
> > > > the
> > > > > > only
> > > > > > > repetitive cost is actually iterating the hits. If the ordinals
> > are
> > > > > > > "packed" into the same field though (which is the default in
> > Lucene
> > > > if
> > > > > > > you're using taxonomy faceting), then you should only need to do
> > a
> > > > single
> > > > > > > iteration over that field.
> > > > > > >
> > > > > > > Cheers,
> > > > > > > -Greg
> > > > > > >
> > > > > > > On Sat, Feb 11, 2023 at 2:27 AM Stefan Vodita <
> > > > stefan.vodita@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Greg,
> > > > > > > >
> > > > > > > > I’m assuming we have one match-set which was not constrained
> > by any
> > > > > > > > of the categories we want to aggregate over, so it may have
> > books
> > > > by
> > > > > > > > Mark Twain, books by American authors, and sci-fi books.
> > > > > > > >
> > > > > > > > Maybe we can imagine we obtained it by searching for a
> > keyword, say
> > > > > > > > “Washington”, which is present in Mark Twain’s writing, and
> > those
> > > > of
> > > > > > other
> > > > > > > > American authors, and in sci-fi novels too.
> > > > > > > >
> > > > > > > > Does that make the example clearer?
> > > > > > > >
> > > > > > > >
> > > > > > > > Stefan
> > > > > > > >
> > > > > > > >
> > > > > > > > On Sat, 11 Feb 2023 at 00:16, Greg Miller <gsmiller@gmail.com>
> > > > wrote:
> > > > > > > > >
> > > > > > > > > Hi Stefan-
> > > > > > > > >
> > > > > > > > > Can you clarify your example a little bit? It sounds like you
> > > > want to
> > > > > > > > facet
> > > > > > > > > over three different match sets (one constrained by "Mark
> > Twain"
> > > > as
> > > > > > the
> > > > > > > > > author, one constrained by "American authors" and one
> > > > constrained by
> > > > > > the
> > > > > > > > > "sci-fi" genre). Is that correct?
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > > -Greg
> > > > > > > > >
> > > > > > > > > On Fri, Feb 10, 2023 at 11:33 AM Stefan Vodita <
> > > > > > stefan.vodita@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi all,
> > > > > > > > > >
> > > > > > > > > > Let’s say I have an index of books, similar to the example
> > in
> > > > the
> > > > > > facet
> > > > > > > > > > demo [1]
> > > > > > > > > > with a hierarchical facet field encapsulating `Genre /
> > Author’s
> > > > > > > > > > nationality /
> > > > > > > > > > Author’s name`.
> > > > > > > > > >
> > > > > > > > > > I might like to find the latest publish date of a book
> > written
> > > > by
> > > > > > Mark
> > > > > > > > > > Twain, the
> > > > > > > > > > sum of the prices of books written by American authors,
> > and the
> > > > > > number
> > > > > > > > of
> > > > > > > > > > sci-fi novels.
> > > > > > > > > >
> > > > > > > > > > As far as I understand, this would require faceting 3 times
> > > > over
> > > > > > the
> > > > > > > > > > match-set,
> > > > > > > > > > one iteration for each aggregation of a different type
> > > > (max(date),
> > > > > > > > > > sum(price),
> > > > > > > > > > count). That seems inefficient if we could instead compute
> > all
> > > > > > > > > > aggregations in
> > > > > > > > > > one pass.
> > > > > > > > > >
> > > > > > > > > > Is there a way to do that?
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Stefan
> > > > > > > > > >
> > > > > > > > > > [1]
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > https://javadoc.io/doc/org.apache.lucene/lucene-demo/latest/org/apache/lucene/demo/facet/package-summary.html
> > > > > > > > > >
> > > > > > > > > >
> > > > > >

Re: Computing multiple different aggregations over a match-set in one pass [ In reply to ]

gsmiller at gmail

Mar 5, 2023, 11:32 AM

Post #12 of 15 (998 views)

Hi Stefan-

I cobbled together a draft PR that I _think_ is what you're looking for so
we can have something to talk about. Please let me know if this misses the
mark, or is what you had in mind. If so, we could open an issue to propose
the idea of adding something like this. I'm not totally convinced I like it
(I think the expression syntax/API is a little wonky), but that's something
we could discuss in an issue.

https://github.com/apache/lucene/pull/12184
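
For readers following the thread, the idea under discussion (accumulate several aggregations per ordinal in one pass over the match-set, then combine them into a single value per path, as in `max(population) + sum(elevation)`) can be sketched in plain Java. This is only an illustration under stated assumptions: it is not the code from the PR and it does not use Lucene's faceting or expressions API. The class `OnePassAggregation`, the method `aggregate`, and the parallel-array layout of the hits are all hypothetical.

```java
public class OnePassAggregation {

    /**
     * Hypothetical helper: hit i belongs to facet ordinal ordinals[i] and
     * carries the numeric values population[i] and elevation[i].
     * Returns, for each ordinal, max(population) + sum(elevation).
     */
    static long[] aggregate(int[] ordinals, long[] population, long[] elevation, int ordinalCount) {
        // Starts at 0, so this sketch assumes populations are non-negative.
        long[] maxPopulation = new long[ordinalCount];
        long[] sumElevation = new long[ordinalCount];

        // Single iteration over the hits: both aggregations are updated
        // together, so the ordinal is looked up once per hit.
        for (int i = 0; i < ordinals.length; i++) {
            int ord = ordinals[i];
            maxPopulation[ord] = Math.max(maxPopulation[ord], population[i]);
            sumElevation[ord] += elevation[i];
        }

        // Combine the two per-ordinal aggregates into one value per ordinal.
        long[] combined = new long[ordinalCount];
        for (int ord = 0; ord < ordinalCount; ord++) {
            combined[ord] = maxPopulation[ord] + sumElevation[ord];
        }
        return combined;
    }

    public static void main(String[] args) {
        int[] ordinals = {0, 0, 1};
        long[] population = {100, 250, 50};
        long[] elevation = {10, 20, 5};
        long[] values = aggregate(ordinals, population, elevation, 2);
        System.out.println(values[0] + " " + values[1]);
    }
}
```

The end result is still one value per ordinal, which is the shape the Facets API expects; the intermediate aggregates stay internal to the pass.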

Cheers,
-Greg

On Fri, Feb 24, 2023 at 1:57 PM Stefan Vodita <stefan.vodita@gmail.com>
wrote:

> Hi everyone,
>
> Greg and I discussed a bit offline. His assessment was right - I’m not
> looking
> to compute multiple values per ordinal as an end in itself. That is only a
> means
> to compute a single value which depends on other facet results. This
> section from
> the previous email explains it really well:
>
> > For example, if we're using the geonames data you have in your example,
> > maybe the value you want to associate with a given path is something like
> > `max(population) + sum(elevation)`, where `max(population)` and
> `sum(elevation)`
> > are the result of two independent facet associations. Then, you could
> combine
> > those results through some expression to derive a single value for a
> given path.
>
> Ideally, I could facet using an expression which binds other
> aggregations. The user
> experience might be as simple as defining the expression and making a
> single
> faceting call. Has anyone worked on something similar?
>
> Best,
> Stefan
>
> On Thu, 23 Feb 2023 at 16:53, Greg Miller <gsmiller@gmail.com> wrote:
> >
> > Thanks for the detailed benchmarking Stefan! I think you've demonstrated
> > here that looping over the collected hits multiple times does in fact add
> > meaningful overhead. That's interesting to see!
> >
> > As for whether-or-not to add functionality to the facets module that
> > supports this, I'm not convinced at this point. I think what you're
> > suggesting here—but please correct me if I'm wrong—is supporting
> > association faceting where the user wants to compute multiple association
> > aggregations for the same dimensions in a single pass. Where I'm
> struggling
> > to connect a real-world use-case though is what the user is going to
> > actually do with those multiple association values. The Facets API today
> (
> >
> https://github.com/apache/lucene/blob/main/lucene/facet/src/java/org/apache/lucene/facet/Facets.java
> )
> > has a pretty firm assumption built in that dimensions/paths have a single
> > value associated with them. So building some sort of association faceting
> > implementation that exposes more than one value associated with a given
> > dimension/path is a significant change to the current model, and I'm not
> > sure it supports enough real-world use to warrant the complexity.
> >
> > OK, now disclaimer: Stefan and I work together so I think I have an idea
> of
> > what he's doing here...
> >
> > What I think you're actually after here—and the one use-case I could
> > imagine some other users being interested in—is computing a single value
> > for each dimension/path that is actually an expression over _other_
> > aggregated values. For example, if we're using the geonames data you have
> > in your example, maybe the value you want to associate with a given path
> is
> > something like `max(population) + sum(elevation)`, where
> `max(population)`
> > and `sum(elevation)` are the result of two independent facet
> associations.
> > Then, you could combine those results through some expression to derive a
> > single value for a given path. That end result still fits the Facets API
> > well, but supporting something like this in Lucene requires a few other
> > primitives beyond just the ability to compute multiple associations at
> the
> > same time. Primarily, it needs some version of Expression + Bindings that
> > works for dimensions/paths. So I don't think the ability to compute
> > multiple associations at once is really the key missing feature here,
> and I
> > don't think it adds significant value on its own to warrant the
> complexity
> > of trying to expose it through the existing Facets API. Of course,
> there's
> > nothing preventing users from building this "multiple association"
> > functionality themselves.
> >
> > That's my take on this, but maybe I'm missing some other use-cases that
> > could justify adding this capability in a general way? What do you think?
> >
> > Cheers,
> > -Greg
> >
> > On Fri, Feb 17, 2023 at 3:14 PM Stefan Vodita <stefan.vodita@gmail.com>
> > wrote:
> >
> > > After benchmarking my implementation against the existing one, I think
> > > there is
> > > some meaningful overhead. I built a small driver [1] that runs the two
> > > solutions over
> > > a geo data [2] index (thank you Greg for providing the indexing code!).
> > >
> > > The table below lists faceting times in milliseconds. I’ve named the
> > > current
> > > implementation serial and my proposal parallel, for lack of better
> names.
> > > The
> > > aggregation function is a no-op, so we’re only measuring the time spent
> > > outside
> > > aggregation. The measurements are over a match-set of 100k docs, but
> the
> > > number
> > > of docs does not have a large impact on the results because the
> aggregation
> > > function isn’t doing any work.
> > >
> > > | Number of Aggregations | Serial Faceting Time (ms) | Parallel Faceting Time (ms) |
> > > |------------------------|---------------------------|-----------------------------|
> > > | 2                      | 510                       | 328                         |
> > > | 5                      | 1211                      | 775                         |
> > > | 10                     | 2366                      | 1301                        |
> > >
> > > If we use a MAX aggregation over a DocValue instead, the results tell a
> > > similar
> > > story. In this case, the number of docs matters. I've attached results
> > > for 10 docs and
> > > 100k docs.
> > >
> > > | Number of Aggregations | Serial Faceting Time (ms) | Parallel Faceting Time (ms) |
> > > |------------------------|---------------------------|-----------------------------|
> > > | 2                      | 706                       | 505                         |
> > > | 5                      | 1618                      | 1119                        |
> > > | 10                     | 3152                      | 2018                        |
> > >
> > > | Number of Aggregations | Serial Faceting Time (ms) | Parallel Faceting Time (ms) |
> > > |------------------------|---------------------------|-----------------------------|
> > > | 2                      | 904                       | 655                         |
> > > | 5                      | 2122                      | 1491                        |
> > > | 10                     | 5062                      | 3317                        |
> > >
> > > With 10 aggregations, we're saving a second or more. That is
> significant
> > > for my
> > > use-case.
> > >
> > > I'd like to know if the test and results seem reasonable. If so, maybe
> > > we can think
> > > about providing this functionality.
> > >
> > > Thanks,
> > > Stefan
> > >
> > > [1]
> > >
> https://github.com/stefanvodita/lucene/commit/3536546cd9f833150db001e8eede093723cf7663
> > > [2] https://download.geonames.org/export/dump/allCountries.zip
> > >
> > >
> > > On Fri, 17 Feb 2023 at 18:45, Greg Miller <gsmiller@gmail.com> wrote:
> > > >
> > > > Thanks for the follow up Stefan. If you find significant overhead
> > > > associated with the multiple iterations, please keep challenging the
> > > > current approach and suggest improvements. It's always good to
> revisit
> > > > these things!
> > > >
> > > > Cheers,
> > > > -Greg
> > > >
> > > > On Thu, Feb 16, 2023 at 1:32 PM Stefan Vodita <
> stefan.vodita@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi Greg,
> > > > >
> > > > > To better understand how much work gets duplicated, I went ahead
> > > > > and modified FloatTaxonomyFacets as an example [1]. It doesn't look
> > > > > too pretty, but it illustrates how I think multiple aggregations
> in one
> > > > > iteration could work.
> > > > >
> > > > > Overall, you're right, there's not as much wasted work as I had
> > > > > expected. I'll try to do a performance comparison to quantify
> precisely
> > > > > how much time we could save, just in case.
> > > > >
> > > > > Thank you for the help!
> > > > > Stefan
> > > > >
> > > > > [1]
> > > > >
> > >
> https://github.com/stefanvodita/lucene/commit/3227dabe746858fc81b9f6e4d2ac9b66e8c32684
> > > > >
> > > > > On Wed, 15 Feb 2023 at 15:48, Greg Miller <gsmiller@gmail.com>
> wrote:
> > > > > >
> > > > > > Hi Stefan-
> > > > > >
> > > > > > > In that case, iterating twice duplicates most of the work,
> correct?
> > > > > >
> > > > > > I'm not sure I'd agree that it duplicates "most" of the work.
> This
> > > is an
> > > > > > association faceting example, which is a little bit of a special
> > > case in
> > > > > > some ways. But, to your question, there is duplicated work here
> of
> > > > > > re-loading the ordinals across the two aggregations, but I would
> > > suspect
> > > > > > the more expensive work is actually computing the different
> > > aggregations,
> > > > > > which is not duplicated. You're right that it would likely be
> more
> > > > > > efficient to iterate the hits once, loading the ordinals once and
> > > > > computing
> > > > > > multiple aggregations in one pass. There's no facility for doing
> that
> > > > > > currently in Lucene's faceting module, but you could always
> propose
> > > it!
> > > > > :)
> > > > > > That said, I'm not sure how common of a case this really is for
> the
> > > > > > majority of users? But that's just a guess/assumption.
> > > > > >
> > > > > > Cheers,
> > > > > > -Greg
> > > > > >
> > > > > > On Tue, Feb 14, 2023 at 3:19 AM Stefan Vodita <
> > > stefan.vodita@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Greg,
> > > > > > >
> > > > > > > I see now where my example didn’t give enough info. In my mind,
> > > `Genre
> > > > > /
> > > > > > > Author nationality / Author name` is stored in one hierarchical
> > > facet
> > > > > > > field.
> > > > > > > The data we’re aggregating over, like publish date or price,
> are
> > > > > stored in
> > > > > > > DocValues.
> > > > > > >
> > > > > > > The demo package shows something similar [1], where the
> aggregation
> > > > > > > is computed across a facet field using data from a `popularity`
> > > > > DocValue.
> > > > > > >
> > > > > > > In the demo, we compute `sum(_score * sqrt(popularity))`, but
> what
> > > if
> > > > > we
> > > > > > > want several other different aggregations with respect to the
> same
> > > > > facet
> > > > > > > field? Maybe we want `max(popularity)`. In that case, iterating
> > > twice
> > > > > > > duplicates most of the work, correct?
> > > > > > >
> > > > > > >
> > > > > > > Stefan
> > > > > > >
> > > > > > > [1]
> > > > > > >
> > > > >
> > >
> https://github.com/apache/lucene/blob/7f8b7ffbcad2265b047a5e2195f76cc924028063/lucene/demo/src/java/org/apache/lucene/demo/facet/ExpressionAggregationFacetsExample.java#L91
> > > > > > >
> > > > > > > On Mon, 13 Feb 2023 at 22:46, Greg Miller <gsmiller@gmail.com>
> > > wrote:
> > > > > > > >
> > > > > > > > Hi Stefan-
> > > > > > > >
> > > > > > > > That helps, thanks. I'm a bit confused about where you're
> > > concerned
> > > > > with
> > > > > > > > iterating over the match set multiple times. Is this a
> situation
> > > > > where
> > > > > > > the
> > > > > > > > ordinals you want to facet over are stored in different index
> > > > > fields, so
> > > > > > > > you have to create multiple Facets instances (one per field)
> to
> > > > > compute
> > > > > > > the
> > > > > > > > aggregations? If that's the case, then yes—you have to
> iterate
> > > over
> > > > > the
> > > > > > > > match set multiple times (once per field). I'm not sure
> that's
> > > such
> > > > > a big
> > > > > > > > issue given that you're doing novel work during each
> iteration,
> > > so
> > > > > the
> > > > > > > only
> > > > > > > > repetitive cost is actually iterating the hits. If the
> ordinals
> > > are
> > > > > > > > "packed" into the same field though (which is the default in
> > > Lucene
> > > > > if
> > > > > > > > you're using taxonomy faceting), then you should only need
> to do
> > > a
> > > > > single
> > > > > > > > iteration over that field.
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > > -Greg
> > > > > > > >
> > > > > > > > On Sat, Feb 11, 2023 at 2:27 AM Stefan Vodita <
> > > > > stefan.vodita@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi Greg,
> > > > > > > > >
> > > > > > > > > I’m assuming we have one match-set which was not
> constrained
> > > by any
> > > > > > > > > of the categories we want to aggregate over, so it may have
> > > books
> > > > > by
> > > > > > > > > Mark Twain, books by American authors, and sci-fi books.
> > > > > > > > >
> > > > > > > > > Maybe we can imagine we obtained it by searching for a
> > > keyword, say
> > > > > > > > > “Washington”, which is present in Mark Twain’s writing, and
> > > those
> > > > > of
> > > > > > > other
> > > > > > > > > American authors, and in sci-fi novels too.
> > > > > > > > >
> > > > > > > > > Does that make the example clearer?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Stefan
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Sat, 11 Feb 2023 at 00:16, Greg Miller <
> gsmiller@gmail.com>
> > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > Hi Stefan-
> > > > > > > > > >
> > > > > > > > > > Can you clarify your example a little bit? It sounds
> like you
> > > > > want to
> > > > > > > > > facet
> > > > > > > > > > over three different match sets (one constrained by "Mark
> > > Twain"
> > > > > as
> > > > > > > the
> > > > > > > > > > author, one constrained by "American authors" and one
> > > > > constrained by
> > > > > > > the
> > > > > > > > > > "sci-fi" genre). Is that correct?
> > > > > > > > > >
> > > > > > > > > > Cheers,
> > > > > > > > > > -Greg
> > > > > > > > > >
> > > > > > > > > > On Fri, Feb 10, 2023 at 11:33 AM Stefan Vodita <
> > > > > > > stefan.vodita@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi all,
> > > > > > > > > > >
> > > > > > > > > > > Let’s say I have an index of books, similar to the
> example
> > > in
> > > > > the
> > > > > > > facet
> > > > > > > > > > > demo [1]
> > > > > > > > > > > with a hierarchical facet field encapsulating `Genre /
> > > Author’s
> > > > > > > > > > > nationality /
> > > > > > > > > > > Author’s name`.
> > > > > > > > > > >
> > > > > > > > > > > I might like to find the latest publish date of a book
> > > written
> > > > > by
> > > > > > > Mark
> > > > > > > > > > > Twain, the
> > > > > > > > > > > sum of the prices of books written by American authors,
> > > and the
> > > > > > > number
> > > > > > > > > of
> > > > > > > > > > > sci-fi novels.
> > > > > > > > > > >
> > > > > > > > > > > As far as I understand, this would require faceting 3
> times
> > > > > over
> > > > > > > the
> > > > > > > > > > > match-set,
> > > > > > > > > > > one iteration for each aggregation of a different type
> > > > > (max(date),
> > > > > > > > > > > sum(price),
> > > > > > > > > > > count). That seems inefficient if we could instead
> compute
> > > all
> > > > > > > > > > > aggregations in
> > > > > > > > > > > one pass.
> > > > > > > > > > >
> > > > > > > > > > > Is there a way to do that?
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Stefan
> > > > > > > > > > >
> > > > > > > > > > > [1]
> > > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > >
> > >
> https://javadoc.io/doc/org.apache.lucene/lucene-demo/latest/org/apache/lucene/demo/facet/package-summary.html
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > >

Re: Computing multiple different aggregations over a match-set in one pass [ In reply to ]

stefan.vodita at gmail

Mar 6, 2023, 7:21 AM

Post #13 of 15 (994 views)

Hi Greg,

The PR looks great. I think it's a useful feature to have and it helps with the
use-case we were discussing. I left a comment with some other ideas that I'd
like to explore.

Thank you for coding this up,
Stefan

On Sun, 5 Mar 2023 at 19:33, Greg Miller <gsmiller@gmail.com> wrote:
>
> Hi Stefan-
>
> I cobbled together a draft PR that I _think_ is what you're looking for so
> we can have something to talk about. Please let me know if this misses the
> mark, or is what you had in mind. If so, we could open an issue to propose
> the idea of adding something like this. I'm not totally convinced I like it
> (I think the expression syntax/API is a little wonky), but that's something
> we could discuss in an issue.
>
> https://github.com/apache/lucene/pull/12184
>
> Cheers,
> -Greg
>
> On Fri, Feb 24, 2023 at 1:57 PM Stefan Vodita <stefan.vodita@gmail.com>
> wrote:
>
> > Hi everyone,
> >
> > Greg and I discussed a bit offline. His assessment was right - I’m not
> > looking
> > to compute multiple values per ordinal as an end in itself. That is only a
> > means
> > to compute a single value which depends on other facet results. This
> > section from
> > the previous email explains it really well:
> >
> > > For example, if we're using the geonames data you have in your example,
> > > maybe the value you want to associate with a given path is something like
> > > `max(population) + sum(elevation)`, where `max(population)` and
> > `sum(elevation)`
> > > are the result of two independent facet associations. Then, you could
> > combine
> > > those results through some expression to derive a single value for a
> > given path.
> >
> > Ideally, I could facet using an expression which binds other
> > aggregations. The user
> > experience might be as simple as defining the expression and making a
> > single
> > faceting call. Has anyone worked on something similar?
> >
> > Best,
> > Stefan
> >
> > On Thu, 23 Feb 2023 at 16:53, Greg Miller <gsmiller@gmail.com> wrote:
> > >
> > > Thanks for the detailed benchmarking Stefan! I think you've demonstrated
> > > here that looping over the collected hits multiple times does in fact add
> > > meaningful overhead. That's interesting to see!
> > >
> > > As for whether-or-not to add functionality to the facets module that
> > > supports this, I'm not convinced at this point. I think what you're
> > > suggesting here—but please correct me if I'm wrong—is supporting
> > > association faceting where the user wants to compute multiple association
> > > aggregations for the same dimensions in a single pass. Where I'm
> > struggling
> > > to connect a real-world use-case though is what the user is going to
> > > actually do with those multiple association values. The Facets API today
> > (
> > >
> > https://github.com/apache/lucene/blob/main/lucene/facet/src/java/org/apache/lucene/facet/Facets.java
> > )
> > > has a pretty firm assumption built in that dimensions/paths have a single
> > > value associated with them. So building some sort of association faceting
> > > implementation that exposes more than one value associated with a given
> > > dimension/path is a significant change to the current model, and I'm not
> > > sure it supports enough real-world use to warrant the complexity.
> > >
> > > OK, now disclaimer: Stefan and I work together so I think I have an idea
> > of
> > > what he's doing here...
> > >
> > > What I think you're actually after here—and the one use-case I could
> > > imagine some other users being interested in—is computing a single value
> > > for each dimension/path that is actually an expression over _other_
> > > aggregated values. For example, if we're using the geonames data you have
> > > in your example, maybe the value you want to associate with a given path
> > is
> > > something like `max(population) + sum(elevation)`, where
> > `max(population)`
> > > and `sum(elevation)` are the result of two independent facet
> > associations.
> > > Then, you could combine those results through some expression to derive a
> > > single value for a given path. That end result still fits the Facets API
> > > well, but supporting something like this in Lucene requires a few other
> > > primitives beyond just the ability to compute multiple associations at
> > the
> > > same time. Primarily, it needs some version of Expression + Bindings that
> > > works for dimensions/paths. So I don't think the ability to compute
> > > multiple associations at once is really the key missing feature here,
> > and I
> > > don't think it adds significant value on its own to warrant the
> > complexity
> > > of trying to expose it through the existing Facets API. Of course,
> > there's
> > > nothing preventing users from building this "multiple association"
> > > functionality themselves.
> > >
> > > That's my take on this, but maybe I'm missing some other use-cases that
> > > could justify adding this capability in a general way? What do you think?
> > >
> > > Cheers,
> > > -Greg
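
As a rough illustration of the `max(population) + sum(elevation)` idea in the message above, here is a minimal sketch in plain Java. It is not Lucene code and not what the facets module does today. The `Hit` class, the ordinal layout, and every method name are hypothetical stand-ins; the only point is that one iteration over the match-set can feed several per-ordinal accumulators, which are then combined into a single value per path.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: several aggregations per facet ordinal in one pass over the hits. */
final class OnePassAggregations {

  /** Stand-in for a matching doc: its facet ordinals plus two numeric doc values. */
  static final class Hit {
    final int[] ordinals;
    final long population;
    final long elevation;

    Hit(int[] ordinals, long population, long elevation) {
      this.ordinals = ordinals;
      this.population = population;
      this.elevation = elevation;
    }
  }

  private final long[] maxPopulation;
  private final long[] sumElevation;

  OnePassAggregations(int ordinalCount) {
    maxPopulation = new long[ordinalCount];
    sumElevation = new long[ordinalCount];
    // Start every max at the smallest possible value so any real hit replaces it.
    Arrays.fill(maxPopulation, Long.MIN_VALUE);
  }

  /** Iterates the match-set once; each hit's ordinals are read a single time. */
  void accumulate(List<Hit> matchSet) {
    for (Hit hit : matchSet) {
      for (int ord : hit.ordinals) {
        maxPopulation[ord] = Math.max(maxPopulation[ord], hit.population);
        sumElevation[ord] += hit.elevation;
      }
    }
  }

  /**
   * Single value per ordinal derived from both aggregations.
   * Only meaningful for ordinals that appeared in at least one hit.
   */
  long combined(int ord) {
    return maxPopulation[ord] + sumElevation[ord];
  }
}
```

The serial alternative would call something like `accumulate` once per aggregation, re-reading each hit's ordinals every time; that repeated read is the overhead the benchmark in this thread was trying to measure.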
> > >
> > > On Fri, Feb 17, 2023 at 3:14 PM Stefan Vodita <stefan.vodita@gmail.com>
> > > wrote:
> > >
> > > > After benchmarking my implementation against the existing one, I think
> > > > there is
> > > > some meaningful overhead. I built a small driver [1] that runs the two
> > > > solutions over
> > > > a geo data [2] index (thank you Greg for providing the indexing code!).
> > > >
> > > > The table below lists faceting times in milliseconds. I’ve named the
> > > > current
> > > > implementation serial and my proposal parallel, for lack of better
> > names.
> > > > The
> > > > aggregation function is a no-op, so we’re only measuring the time spent
> > > > outside
> > > > aggregation. The measurements are over a match-set of 100k docs, but
> > the
> > > > number
> > > > of docs does not have a large impact on the results because the
> > aggregation
> > > > function isn’t doing any work.
> > > >
> > > > | Number of Aggregations | Serial Faceting Time (ms) | Parallel
> > > > Faceting Time (ms) |
> > > >
> > > >
> > |----------------------------------|------------------------------------|--------------------------------------|
> > > > 510 | 328 |
> > > > | 5 |
> > > > 1211 | 775 |
> > > > | 10 |
> > > > 2366 | 1301 |
> > > >
> > > > If we use a MAX aggregation over a DocValue instead, the results tell a
> > > > similar
> > > > story. In this case, the number of docs matters. I've attached results
> > > > for 10 docs and
> > > > 100k docs.
> > > >
> > > > | Number of Aggregations | Serial Faceting Time (ms) | Parallel
> > > > Faceting Time (ms) |
> > > >
> > > >
> > |----------------------------------|------------------------------------|--------------------------------------|
> > > > 706 | 505 |
> > > > | 5 |
> > > > 1618 | 1119 |
> > > > | 10 |
> > > > 3152 | 2018 |
> > > >
> > > > | Number of Aggregations | Serial Faceting Time (ms) | Parallel
> > > > Faceting Time (ms) |
> > > >
> > > >
> > |----------------------------------|------------------------------------|--------------------------------------|
> > > > | 2 |
> > > > 904 | 655 |
> > > > | 5 |
> > > > 2122 | 1491 |
> > > > | 10 |
> > > > 5062 | 3317 |
> > > >
> > > > With 10 aggregations, we're saving a second or more. That is
> > significant
> > > > for my
> > > > use-case.
> > > >
> > > > I'd like to know if the test and results seem reasonable. If so, maybe
> > > > we can think
> > > > about providing this functionality.
> > > >
> > > > Thanks,
> > > > Stefan
> > > >
> > > > [1]
> > > >
> > https://github.com/stefanvodita/lucene/commit/3536546cd9f833150db001e8eede093723cf7663
> > > > [2] https://download.geonames.org/export/dump/allCountries.zip
> > > >
> > > >
> > > > On Fri, 17 Feb 2023 at 18:45, Greg Miller <gsmiller@gmail.com> wrote:
> > > > >
> > > > > Thanks for the follow up Stefan. If you find significant overhead
> > > > > associated with the multiple iterations, please keep challenging the
> > > > > current approach and suggest improvements. It's always good to
> > revisit
> > > > > these things!
> > > > >
> > > > > Cheers,
> > > > > -Greg
> > > > >
> > > > > On Thu, Feb 16, 2023 at 1:32 PM Stefan Vodita <
> > stefan.vodita@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi Greg,
> > > > > >
> > > > > > To better understand how much work gets duplicated, I went ahead
> > > > > > and modified FloatTaxonomyFacets as an example [1]. It doesn't look
> > > > > > too pretty, but it illustrates how I think multiple aggregations
> > in one
> > > > > > iteration could work.
> > > > > >
> > > > > > Overall, you're right, there's not as much wasted work as I had
> > > > > > expected. I'll try to do a performance comparison to quantify
> > precisely
> > > > > > how much time we could save, just in case.
> > > > > >
> > > > > > Thank you the help!
> > > > > > Stefan
> > > > > >
> > > > > > [1]
> > > > > >
> > > >
> > https://github.com/stefanvodita/lucene/commit/3227dabe746858fc81b9f6e4d2ac9b66e8c32684
> > > > > >
> > > > > > On Wed, 15 Feb 2023 at 15:48, Greg Miller <gsmiller@gmail.com>
> > wrote:
> > > > > > >
> > > > > > > Hi Stefan-
> > > > > > >
> > > > > > > > In that case, iterating twice duplicates most of the work,
> > correct?
> > > > > > >
> > > > > > > I'm not sure I'd agree that it duplicates "most" of the work.
> > This
> > > > is an
> > > > > > > association faceting example, which is a little bit of a special
> > > > case in
> > > > > > > some ways. But, to your question, there is duplicated work here
> > of
> > > > > > > re-loading the ordinals across the two aggregations, but I would
> > > > suspect
> > > > > > > the more expensive work is actually computing the different
> > > > aggregations,
> > > > > > > which is not duplicated. You're right that it would likely be
> > more
> > > > > > > efficient to iterate the hits once, loading the ordinals once and
> > > > > > computing
> > > > > > > multiple aggregations in one pass. There's no facility for doing
> > that
> > > > > > > currently in Lucene's faceting module, but you could always
> > propose
> > > > it!
> > > > > > :)
> > > > > > > That said, I'm not sure how common of a case this really is for
> > the
> > > > > > > majority of users? But that's just a guess/assumption.
> > > > > > >
> > > > > > > Cheers,
> > > > > > > -Greg
> > > > > > >
> > > > > > > On Tue, Feb 14, 2023 at 3:19 AM Stefan Vodita <
> > > > stefan.vodita@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Greg,
> > > > > > > >
> > > > > > > > I see now where my example didn’t give enough info. In my mind,
> > > > `Genre
> > > > > > /
> > > > > > > > Author nationality / Author name` is stored in one hierarchical
> > > > facet
> > > > > > > > field.
> > > > > > > > The data we’re aggregating over, like publish date or price,
> > are
> > > > > > stored in
> > > > > > > > DocValues.
> > > > > > > >
> > > > > > > > The demo package shows something similar [1], where the
> > aggregation
> > > > > > > > is computed across a facet field using data from a `popularity`
> > > > > > DocValue.
> > > > > > > >
> > > > > > > > In the demo, we compute `sum(_score * sqrt(popularity))`, but
> > what
> > > > if
> > > > > > we
> > > > > > > > want several other different aggregations with respect to the
> > same
> > > > > > facet
> > > > > > > > field? Maybe we want `max(popularity)`. In that case, iterating
> > > > twice
> > > > > > > > duplicates most of the work, correct?
> > > > > > > >
> > > > > > > >
> > > > > > > > Stefan
> > > > > > > >
> > > > > > > > [1]
> > > > > > > >
> > > > > >
> > > >
> > https://github.com/apache/lucene/blob/7f8b7ffbcad2265b047a5e2195f76cc924028063/lucene/demo/src/java/org/apache/lucene/demo/facet/ExpressionAggregationFacetsExample.java#L91
> > > > > > > >
> > > > > > > > On Mon, 13 Feb 2023 at 22:46, Greg Miller <gsmiller@gmail.com>
> > > > wrote:
> > > > > > > > >
> > > > > > > > > Hi Stefan-
> > > > > > > > >
> > > > > > > > > That helps, thanks. I'm a bit confused about where you're
> > > > concerned
> > > > > > with
> > > > > > > > > iterating over the match set multiple times. Is this a
> > situation
> > > > > > where
> > > > > > > > the
> > > > > > > > > ordinals you want to facet over are stored in different index
> > > > > > fields, so
> > > > > > > > > you have to create multiple Facets instances (one per field)
> > to
> > > > > > compute
> > > > > > > > the
> > > > > > > > > aggregations? If that's the case, then yes—you have to
> > iterate
> > > > over
> > > > > > the
> > > > > > > > > match set multiple times (once per field). I'm not sure
> > that's
> > > > such
> > > > > > a big
> > > > > > > > > issue given that you're doing novel work during each
> > iteration,
> > > > so
> > > > > > the
> > > > > > > > only
> > > > > > > > > repetitive cost is actually iterating the hits. If the
> > ordinals
> > > > are
> > > > > > > > > "packed" into the same field though (which is the default in
> > > > Lucene
> > > > > > if
> > > > > > > > > you're using taxonomy faceting), then you should only need
> > to do
> > > > a
> > > > > > single
> > > > > > > > > iteration over that field.
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > > -Greg
> > > > > > > > >
> > > > > > > > > On Sat, Feb 11, 2023 at 2:27 AM Stefan Vodita <
> > > > > > stefan.vodita@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi Greg,
> > > > > > > > > >
> > > > > > > > > > I’m assuming we have one match-set which was not
> > constrained
> > > > by any
> > > > > > > > > > of the categories we want to aggregate over, so it may have
> > > > books
> > > > > > by
> > > > > > > > > > Mark Twain, books by American authors, and sci-fi books.
> > > > > > > > > >
> > > > > > > > > > Maybe we can imagine we obtained it by searching for a
> > > > keyword, say
> > > > > > > > > > “Washington”, which is present in Mark Twain’s writing, and
> > > > those
> > > > > > of
> > > > > > > > other
> > > > > > > > > > American authors, and in sci-fi novels too.
> > > > > > > > > >
> > > > > > > > > > Does that make the example clearer?
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Stefan
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Sat, 11 Feb 2023 at 00:16, Greg Miller <
> > gsmiller@gmail.com>
> > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > Hi Stefan-
> > > > > > > > > > >
> > > > > > > > > > > Can you clarify your example a little bit? It sounds
> > like you
> > > > > > want to
> > > > > > > > > > facet
> > > > > > > > > > > over three different match sets (one constrained by "Mark
> > > > Twain"
> > > > > > as
> > > > > > > > the
> > > > > > > > > > > author, one constrained by "American authors" and one
> > > > > > constrained by
> > > > > > > > the
> > > > > > > > > > > "sci-fi" genre). Is that correct?
> > > > > > > > > > >
> > > > > > > > > > > Cheers,
> > > > > > > > > > > -Greg
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Feb 10, 2023 at 11:33 AM Stefan Vodita <
> > > > > > > > stefan.vodita@gmail.com>
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi all,
> > > > > > > > > > > >
> > > > > > > > > > > > Let’s say I have an index of books, similar to the
> > example
> > > > in
> > > > > > the
> > > > > > > > facet
> > > > > > > > > > > > demo [1]
> > > > > > > > > > > > with a hierarchical facet field encapsulating `Genre /
> > > > Author’s
> > > > > > > > > > > > nationality /
> > > > > > > > > > > > Author’s name`.
> > > > > > > > > > > >
> > > > > > > > > > > > I might like to find the latest publish date of a book
> > > > written
> > > > > > by
> > > > > > > > Mark
> > > > > > > > > > > > Twain, the
> > > > > > > > > > > > sum of the prices of books written by American authors,
> > > > and the
> > > > > > > > number
> > > > > > > > > > of
> > > > > > > > > > > > sci-fi novels.
> > > > > > > > > > > >
> > > > > > > > > > > > As far as I understand, this would require faceting 3
> > times
> > > > > > over
> > > > > > > > the
> > > > > > > > > > > > match-set,
> > > > > > > > > > > > one iteration for each aggregation of a different type
> > > > > > (max(date),
> > > > > > > > > > > > sum(price),
> > > > > > > > > > > > count). That seems inefficient if we could instead
> > compute
> > > > all
> > > > > > > > > > > > aggregations in
> > > > > > > > > > > > one pass.
> > > > > > > > > > > >
> > > > > > > > > > > > Is there a way to do that?
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Stefan
> > > > > > > > > > > >
> > > > > > > > > > > > [1]
> > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > https://javadoc.io/doc/org.apache.lucene/lucene-demo/latest/org/apache/lucene/demo/facet/package-summary.html
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > >
> > > > ---------------------------------------------------------------------
> > > > > > > > > > > > To unsubscribe, e-mail:
> > > > > > java-user-unsubscribe@lucene.apache.org
> > > > > > > > > > > > For additional commands, e-mail:
> > > > > > java-user-help@lucene.apache.org
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > >
> > ---------------------------------------------------------------------
> > > > > > > > > > To unsubscribe, e-mail:
> > > > java-user-unsubscribe@lucene.apache.org
> > > > > > > > > > For additional commands, e-mail:
> > > > java-user-help@lucene.apache.org
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > ---------------------------------------------------------------------
> > > > > > > > To unsubscribe, e-mail:
> > java-user-unsubscribe@lucene.apache.org
> > > > > > > > For additional commands, e-mail:
> > java-user-help@lucene.apache.org
> > > > > > > >
> > > > > > > >
> > > > > >
> > > > > >
> > ---------------------------------------------------------------------
> > > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > > >
> > > > > >
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > >
> > > >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >


Re: Computing multiple different aggregations over a match-set in one pass [ In reply to ]

gsmiller at gmail

Mar 6, 2023, 9:45 AM

Post #14 of 15 (991 views)

Hi Stefan-

I opened https://github.com/apache/lucene/issues/12190 where we can discuss
this further. Thanks for raising the idea!

Cheers,
-Greg

On Mon, Mar 6, 2023 at 7:21 AM Stefan Vodita <stefan.vodita@gmail.com>
wrote:

> Hi Greg,
>
> The PR looks great. I think it's a useful feature to have and it helps with
> the use-case we were discussing. I left a comment with some other ideas that
> I'd like to explore.
>
> Thank you for coding this up,
> Stefan
>
> On Sun, 5 Mar 2023 at 19:33, Greg Miller <gsmiller@gmail.com> wrote:
> >
> > Hi Stefan-
> >
> > I cobbled together a draft PR that I _think_ is what you're looking for so
> > we can have something to talk about. Please let me know if this misses the
> > mark, or is what you had in mind. If so, we could open an issue to propose
> > the idea of adding something like this. I'm not totally convinced I like it
> > (I think the expression syntax/API is a little wonky), but that's something
> > we could discuss in an issue.
> >
> > https://github.com/apache/lucene/pull/12184
> >
> > Cheers,
> > -Greg
> >
> > On Fri, Feb 24, 2023 at 1:57 PM Stefan Vodita <stefan.vodita@gmail.com>
> > wrote:
> >
> > > Hi everyone,
> > >
> > > Greg and I discussed a bit offline. His assessment was right - I’m not
> > > looking to compute multiple values per ordinal as an end in itself. That is
> > > only a means to compute a single value which depends on other facet results.
> > > This section from the previous email explains it really well:
> > >
> > > > For example, if we're using the geonames data you have in your example,
> > > > maybe the value you want to associate with a given path is something like
> > > > `max(population) + sum(elevation)`, where `max(population)` and
> > > > `sum(elevation)` are the result of two independent facet associations.
> > > > Then, you could combine those results through some expression to derive a
> > > > single value for a given path.
> > >
> > > Ideally, I could facet using an expression which binds other aggregations.
> > > The user experience might be as simple as defining the expression and making
> > > a single faceting call. Has anyone worked on something similar?
> > >
> > > Best,
> > > Stefan
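
One way to picture "an expression which binds other aggregations" is the small plain-Java sketch below. It is only an illustration under assumptions: it does not use Lucene's expressions module or the API from the draft PR, and the class name, the binding names, and the choice to treat a missing value as 0.0 are all hypothetical. Each binding maps a facet path to an already-aggregated value, and the expression is evaluated once per path to produce the single value that a Facets-style result expects.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.ToDoubleFunction;

/** Hypothetical sketch of "expression + bindings" evaluated per facet path. */
final class PathExpressionFacets {

  /** Binding name (e.g. "max_population") -> facet path -> aggregated value. */
  private final Map<String, Map<String, Double>> bindings = new HashMap<>();

  /** Registers the per-path results of one aggregation under a variable name. */
  void bind(String name, Map<String, Double> valuesByPath) {
    bindings.put(name, valuesByPath);
  }

  /** Evaluates the expression once per path, yielding a single value per path. */
  Map<String, Double> evaluate(
      Iterable<String> paths, ToDoubleFunction<Map<String, Double>> expression) {
    Map<String, Double> result = new LinkedHashMap<>();
    for (String path : paths) {
      Map<String, Double> variables = new HashMap<>();
      for (Map.Entry<String, Map<String, Double>> binding : bindings.entrySet()) {
        // Assumption: a path with no value for a binding contributes 0.0.
        variables.put(binding.getKey(), binding.getValue().getOrDefault(path, 0.0));
      }
      result.put(path, expression.applyAsDouble(variables));
    }
    return result;
  }
}
```

With this shape, the caller defines the expression as a lambda such as `v -> v.get("max_population") + v.get("sum_elevation")` and makes one `evaluate` call; how the bound aggregations get computed (one pass or several) stays hidden behind `bind`.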
> > >
> > > On Thu, 23 Feb 2023 at 16:53, Greg Miller <gsmiller@gmail.com> wrote:
> > > >
> > > > Thanks for the detailed benchmarking Stefan! I think you've
> demonstrated
> > > > here that looping over the collected hits multiple times does in
> fact add
> > > > meaningful overhead. That's interesting to see!
> > > >
> > > > As for whether-or-not to add functionality to the facets module that
> > > > supports this, I'm not convinced at this point. I think what you're
> > > > suggesting here—but please correct me if I'm wrong—is supporting
> > > > association faceting where the user wants to compute multiple
> association
> > > > aggregations for the same dimensions in a single pass. Where I'm
> > > struggling
> > > > to connect a real-world use-case though is what the user is going to
> > > > actually do with those multiple association values. The Facets API
> today
> > > (
> > > >
> > >
> https://github.com/apache/lucene/blob/main/lucene/facet/src/java/org/apache/lucene/facet/Facets.java
> > > )
> > > > has a pretty firm assumption built in that dimensions/paths have a
> single
> > > > value associated with them. So building some sort of association
> faceting
> > > > implementation that exposes more than one value associated with a
> given
> > > > dimension/path is a significant change to the current model, and I'm
> not
> > > > sure it supports enough real-world use to warrant the complexity.
> > > >
> > > > OK, now disclaimer: Stefan and I work together so I think I have an
> idea
> > > of
> > > > what he's doing here...
> > > >
> > > > What I think you're actually after here—and the one use-case I could
> > > > imagine some other users being interested in—is computing a single
> value
> > > > for each dimension/path that is actually an expression over _other_
> > > > aggregated values. For example, if we're using the geonames data you
> have
> > > > in your example, maybe the value you want to associate with a given
> path
> > > is
> > > > something like `max(population) + sum(elevation)`, where
> > > `max(population)`
> > > > and `sum(elevation)` are the result of two independent facet
> > > associations.
> > > > Then, you could combine those results though some expression to
> derive a
> > > > single value for a given path. That end result still fits the Facets
> API
> > > > well, but supporting something like this in Lucene requires a few
> other
> > > > primitives beyond just the ability to compute multiple associations
> at
> > > the
> > > > same time. Primarily, it needs some version of Expression + Bindings
> that
> > > > works for dimensions/paths. So I don't think the ability to compute
> > > > multiple associations at once is really the key missing feature here,
> > > and I
> > > > don't think it adds significant value on its own to warrant the
> > > complexity
> > > > of trying to expose it through the existings Facets API. Of course,
> > > there's
> > > > nothing preventing users from building this "multiple association"
> > > > functionality themselves.
> > > >
> > > > That's my take on this, but maybe I'm missing some other use-cases
> that
> > > > could justify adding this capability in a general way? What do you
> think?
> > > >
> > > > Cheers,
> > > > -Greg
> > > >
> > > > On Fri, Feb 17, 2023 at 3:14 PM Stefan Vodita <
> stefan.vodita@gmail.com>
> > > > wrote:
> > > >
> > > > > After benchmarking my implementation against the existing one, I
> think
> > > > > there is
> > > > > some meaningful overhead. I built a small driver [1] that runs the
> two
> > > > > solutions over
> > > > > a geo data [2] index (thank you Greg for providing the indexing
> code!).
> > > > >
> > > > > The table below lists faceting times in milliseconds. I’ve named
> the
> > > > > current
> > > > > implementation serial and my proposal parallel, for lack of better
> > > names.
> > > > > The
> > > > > aggregation function is a no-op, so we’re only measuring the time
> spent
> > > > > outside
> > > > > aggregation. The measurements are over a match-set of 100k docs,
> but
> > > the
> > > > > number
> > > > > of docs does not have a large impact on the results because the
> > > aggregation
> > > > > function isn’t doing any work.
> > > > >
> > > > > | Number of Aggregations | Serial Faceting Time (ms) | Parallel
> > > > > Faceting Time (ms) |
> > > > >
> > > > >
> > >
> |----------------------------------|------------------------------------|--------------------------------------|
> > > > > 510 | 328 |
> > > > > | 5 |
> > > > > 1211 | 775 |
> > > > > | 10 |
> > > > > 2366 | 1301 |
> > > > >
> > > > > If we use a MAX aggregation over a DocValue instead, the results
> tell a
> > > > > similar
> > > > > story. In this case, the number of docs matters. I've attached
> results
> > > > > for 10 docs and
> > > > > 100k docs.
> > > > >
> > > > > | Number of Aggregations | Serial Faceting Time (ms) | Parallel
> > > > > Faceting Time (ms) |
> > > > >
> > > > >
> > >
> |----------------------------------|------------------------------------|--------------------------------------|
> > > > > 706 | 505 |
> > > > > | 5 |
> > > > > 1618 | 1119 |
> > > > > | 10 |
> > > > > 3152 | 2018 |
> > > > >
> > > > > | Number of Aggregations | Serial Faceting Time (ms) | Parallel
> > > > > Faceting Time (ms) |
> > > > >
> > > > >
> > >
> |----------------------------------|------------------------------------|--------------------------------------|
> > > > > | 2 |
> > > > > 904 | 655 |
> > > > > | 5 |
> > > > > 2122 | 1491 |
> > > > > | 10 |
> > > > > 5062 | 3317 |
> > > > >
> > > > > With 10 aggregations, we're saving a second or more. That is
> > > significant
> > > > > for my
> > > > > use-case.
> > > > >
> > > > > I'd like to know if the test and results seem reasonable. If so,
> maybe
> > > > > we can think
> > > > > about providing this functionality.
> > > > >
> > > > > Thanks,
> > > > > Stefan
> > > > >
> > > > > [1]
> > > > >
> > >
> https://github.com/stefanvodita/lucene/commit/3536546cd9f833150db001e8eede093723cf7663
> > > > > [2] https://download.geonames.org/export/dump/allCountries.zip
> > > > >
> > > > >
> > > > > On Fri, 17 Feb 2023 at 18:45, Greg Miller <gsmiller@gmail.com>
> wrote:
> > > > > >
> > > > > > Thanks for the follow up Stefan. If you find significant overhead
> > > > > > associated with the multiple iterations, please keep challenging
> the
> > > > > > current approach and suggest improvements. It's always good to
> > > revisit
> > > > > > these things!
> > > > > >
> > > > > > Cheers,
> > > > > > -Greg
> > > > > >
> > > > > > On Thu, Feb 16, 2023 at 1:32 PM Stefan Vodita <
> > > stefan.vodita@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Greg,
> > > > > > >
> > > > > > > To better understand how much work gets duplicated, I went
> ahead
> > > > > > > and modified FloatTaxonomyFacets as an example [1]. It doesn't
> look
> > > > > > > too pretty, but it illustrates how I think multiple
> aggregations
> > > in one
> > > > > > > iteration could work.
> > > > > > >
> > > > > > > Overall, you're right, there's not as much wasted work as I had
> > > > > > > expected. I'll try to do a performance comparison to quantify
> > > precisely
> > > > > > > how much time we could save, just in case.
> > > > > > >
> > > > > > > Thank you the help!
> > > > > > > Stefan
> > > > > > >
> > > > > > > [1]
> > > > > > >
> > > > >
> > >
> https://github.com/stefanvodita/lucene/commit/3227dabe746858fc81b9f6e4d2ac9b66e8c32684
> > > > > > >
> > > > > > > On Wed, 15 Feb 2023 at 15:48, Greg Miller <gsmiller@gmail.com>
> > > wrote:
> > > > > > > >
> > > > > > > > Hi Stefan-
> > > > > > > >
> > > > > > > > > In that case, iterating twice duplicates most of the work,
> > > correct?
> > > > > > > >
> > > > > > > > I'm not sure I'd agree that it duplicates "most" of the work.
> > > This
> > > > > is an
> > > > > > > > association faceting example, which is a little bit of a
> special
> > > > > case in
> > > > > > > > some ways. But, to your question, there is duplicated work
> here
> > > of
> > > > > > > > re-loading the ordinals across the two aggregations, but I
> would
> > > > > suspect
> > > > > > > > the more expensive work is actually computing the different
> > > > > aggregations,
> > > > > > > > which is not duplicated. You're right that it would likely be
> > > more
> > > > > > > > efficient to iterate the hits once, loading the ordinals
> once and
> > > > > > > computing
> > > > > > > > multiple aggregations in one pass. There's no facility for
> doing
> > > that
> > > > > > > > currently in Lucene's faceting module, but you could always
> > > propose
> > > > > it!
> > > > > > > :)
> > > > > > > > That said, I'm not sure how common of a case this really is
> for
> > > the
> > > > > > > > majority of users? But that's just a guess/assumption.
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > > -Greg
> > > > > > > >
> > > > > > > > On Tue, Feb 14, 2023 at 3:19 AM Stefan Vodita <
> > > > > stefan.vodita@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi Greg,
> > > > > > > > >
> > > > > > > > > I see now where my example didn’t give enough info. In my
> mind,
> > > > > `Genre
> > > > > > > /
> > > > > > > > > Author nationality / Author name` is stored in one
> hierarchical
> > > > > facet
> > > > > > > > > field.
> > > > > > > > > The data we’re aggregating over, like publish date or
> price,
> > > are
> > > > > > > stored in
> > > > > > > > > DocValues.
> > > > > > > > >
> > > > > > > > > The demo package shows something similar [1], where the
> > > aggregation
> > > > > > > > > is computed across a facet field using data from a
> `popularity`
> > > > > > > DocValue.
> > > > > > > > >
> > > > > > > > > In the demo, we compute `sum(_score * sqrt(popularity))`,
> but
> > > what
> > > > > if
> > > > > > > we
> > > > > > > > > want several other different aggregations with respect to
> the
> > > same
> > > > > > > facet
> > > > > > > > > field? Maybe we want `max(popularity)`. In that case,
> iterating
> > > > > twice
> > > > > > > > > duplicates most of the work, correct?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Stefan
> > > > > > > > >
> > > > > > > > > [1]
> > > > > > > > >
> > > > > > >
> > > > >
> > >
> https://github.com/apache/lucene/blob/7f8b7ffbcad2265b047a5e2195f76cc924028063/lucene/demo/src/java/org/apache/lucene/demo/facet/ExpressionAggregationFacetsExample.java#L91
> > > > > > > > >
> > > > > > > > > On Mon, 13 Feb 2023 at 22:46, Greg Miller <
> gsmiller@gmail.com>
> > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > Hi Stefan-
> > > > > > > > > >
> > > > > > > > > > That helps, thanks. I'm a bit confused about where you're
> > > > > concerned
> > > > > > > with
> > > > > > > > > > iterating over the match set multiple times. Is this a
> > > situation
> > > > > > > where
> > > > > > > > > the
> > > > > > > > > > ordinals you want to facet over are stored in different
> index
> > > > > > > fields, so
> > > > > > > > > > you have to create multiple Facets instances (one per
> field)
> > > to
> > > > > > > compute
> > > > > > > > > the
> > > > > > > > > > aggregations? If that's the case, then yes—you have to
> > > iterate
> > > > > over
> > > > > > > the
> > > > > > > > > > match set multiple times (once per field). I'm not sure
> > > that's
> > > > > such
> > > > > > > a big
> > > > > > > > > > issue given that you're doing novel work during each
> > > iteration,
> > > > > so
> > > > > > > the
> > > > > > > > > only
> > > > > > > > > > repetitive cost is actually iterating the hits. If the
> > > ordinals
> > > > > are
> > > > > > > > > > "packed" into the same field though (which is the
> default in
> > > > > Lucene
> > > > > > > if
> > > > > > > > > > you're using taxonomy faceting), then you should only
> need
> > > to do
> > > > > a
> > > > > > > single
> > > > > > > > > > iteration over that field.
> > > > > > > > > >
> > > > > > > > > > Cheers,
> > > > > > > > > > -Greg
> > > > > > > > > >
> > > > > > > > > > On Sat, Feb 11, 2023 at 2:27 AM Stefan Vodita <
> > > > > > > stefan.vodita@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi Greg,
> > > > > > > > > > >
> > > > > > > > > > > I’m assuming we have one match-set which was not
> > > constrained
> > > > > by any
> > > > > > > > > > > of the categories we want to aggregate over, so it may
> have
> > > > > books
> > > > > > > by
> > > > > > > > > > > Mark Twain, books by American authors, and sci-fi
> books.
> > > > > > > > > > >
> > > > > > > > > > > Maybe we can imagine we obtained it by searching for a
> > > > > keyword, say
> > > > > > > > > > > “Washington”, which is present in Mark Twain’s
> writing, and
> > > > > those
> > > > > > > of
> > > > > > > > > other
> > > > > > > > > > > American authors, and in sci-fi novels too.
> > > > > > > > > > >
> > > > > > > > > > > Does that make the example clearer?
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Stefan
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Sat, 11 Feb 2023 at 00:16, Greg Miller <
> > > gsmiller@gmail.com>
> > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Hi Stefan-
> > > > > > > > > > > >
> > > > > > > > > > > > Can you clarify your example a little bit? It sounds
> > > like you
> > > > > > > want to
> > > > > > > > > > > facet
> > > > > > > > > > > > over three different match sets (one constrained by
> "Mark
> > > > > Twain"
> > > > > > > as
> > > > > > > > > the
> > > > > > > > > > > > author, one constrained by "American authors" and one
> > > > > > > constrained by
> > > > > > > > > the
> > > > > > > > > > > > "sci-fi" genre). Is that correct?
> > > > > > > > > > > >
> > > > > > > > > > > > Cheers,
> > > > > > > > > > > > -Greg
> > > > > > > > > > > >
> > > > > > > > > > > > > On Fri, Feb 10, 2023 at 11:33 AM Stefan Vodita <stefan.vodita@gmail.com>
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi all,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Let’s say I have an index of books, similar to the example in the
> > > > > > > > > > > > > > facet demo [1] with a hierarchical facet field encapsulating
> > > > > > > > > > > > > > `Genre / Author’s nationality / Author’s name`.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I might like to find the latest publish date of a book written by
> > > > > > > > > > > > > > Mark Twain, the sum of the prices of books written by American
> > > > > > > > > > > > > > authors, and the number of sci-fi novels.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > As far as I understand, this would require faceting 3 times over
> > > > > > > > > > > > > > the match-set, one iteration for each aggregation of a different
> > > > > > > > > > > > > > type (max(date), sum(price), count). That seems inefficient if we
> > > > > > > > > > > > > > could instead compute all aggregations in one pass.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Is there a way to do that?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Stefan
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > [1] https://javadoc.io/doc/org.apache.lucene/lucene-demo/latest/org/apache/lucene/demo/facet/package-summary.html

Re: Computing multiple different aggregations over a match-set in one pass [ In reply to ]

stefan.vodita at gmail

Sep 9, 2023, 2:20 PM

Post #15 of 15 (546 views)

Hi everyone,

I ended up using the idea of doing multiple aggregations in one go and it was
a nice improvement. Maybe we can reconsider introducing this? I've opened an
issue [1] and published a PR [2] based on the code I had previously shared,
with some extra tests and a few improvements.

Stefan

[1] https://github.com/apache/lucene/issues/12546
[2] https://github.com/apache/lucene/pull/12547
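
For readers skimming the archive, here is a minimal sketch of what "multiple
aggregations in one go" means, using the book example from the start of the
thread. It is not the code from the PR and it uses no Lucene API: the class
`OnePassAggregations`, the `Hit` and `Result` types, the flat ordinals and the
sample data are all hypothetical, chosen only to show max, sum and count being
updated together while each hit and its ordinals are visited once.

```java
import java.util.Arrays;

/** Lucene-free sketch: several per-ordinal aggregations in one pass over the hits. */
class OnePassAggregations {
    // Hypothetical flat ordinals standing in for taxonomy labels.
    static final int SCI_FI = 0, AMERICAN = 1, MARK_TWAIN = 2, NUM_ORDINALS = 3;

    /** A matching document: its facet ordinals plus the values being aggregated. */
    static final class Hit {
        final int[] ordinals;
        final long publishYear;
        final double price;

        Hit(long publishYear, double price, int... ordinals) {
            this.publishYear = publishYear;
            this.price = price;
            this.ordinals = ordinals;
        }
    }

    /** Per-ordinal accumulators, one array per aggregation type. */
    static final class Result {
        final long[] maxYear = new long[NUM_ORDINALS];
        final double[] sumPrice = new double[NUM_ORDINALS];
        final int[] count = new int[NUM_ORDINALS];

        Result() {
            Arrays.fill(maxYear, Long.MIN_VALUE); // identity element for max
        }
    }

    /** Visits each hit once and updates every aggregation for each of its ordinals. */
    static Result aggregate(Hit[] hits) {
        Result r = new Result();
        for (Hit hit : hits) {
            for (int ord : hit.ordinals) {
                r.maxYear[ord] = Math.max(r.maxYear[ord], hit.publishYear);
                r.sumPrice[ord] += hit.price;
                r.count[ord]++;
            }
        }
        return r;
    }

    /** Made-up sample data, not taken from the thread. */
    static Hit[] sampleHits() {
        return new Hit[] {
            new Hit(1884, 10.0, AMERICAN, MARK_TWAIN),
            new Hit(1876, 8.0, AMERICAN, MARK_TWAIN),
            new Hit(1965, 12.5, SCI_FI, AMERICAN),
            new Hit(1961, 9.0, SCI_FI),
        };
    }

    public static void main(String[] args) {
        Result r = aggregate(sampleHits());
        System.out.println("max(year) for Mark Twain: " + r.maxYear[MARK_TWAIN]);
        System.out.println("sum(price) for American:  " + r.sumPrice[AMERICAN]);
        System.out.println("count for sci-fi:         " + r.count[SCI_FI]);
    }
}
```

The serial alternative would run the outer loop three times, once per
aggregation, re-reading each hit's ordinals every time. That repeated
iteration is the overhead the benchmarks quoted below try to measure.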

On Mon, 6 Mar 2023 at 19:46, Greg Miller <gsmiller@gmail.com> wrote:

> Hi Stefan-
>
> I opened https://github.com/apache/lucene/issues/12190 where we can discuss
> this further. Thanks for raising the idea!
>
> Cheers,
> -Greg
>
> On Mon, Mar 6, 2023 at 7:21 AM Stefan Vodita <stefan.vodita@gmail.com>
> wrote:
>
> > Hi Greg,
> >
> > The PR looks great. I think it's a useful feature to have and it helps
> > with the use-case we were discussing. I left a comment with some other
> > ideas that I'd like to explore.
> >
> > Thank you for coding this up,
> > Stefan
> >
> > On Sun, 5 Mar 2023 at 19:33, Greg Miller <gsmiller@gmail.com> wrote:
> > >
> > > Hi Stefan-
> > >
> > > > I cobbled together a draft PR that I _think_ is what you're looking for
> > > > so we can have something to talk about. Please let me know if this
> > > > misses the mark, or is what you had in mind. If so, we could open an
> > > > issue to propose the idea of adding something like this. I'm not totally
> > > > convinced I like it (I think the expression syntax/API is a little
> > > > wonky), but that's something we could discuss in an issue.
> > >
> > > https://github.com/apache/lucene/pull/12184
> > >
> > > Cheers,
> > > -Greg
> > >
> > > > On Fri, Feb 24, 2023 at 1:57 PM Stefan Vodita <stefan.vodita@gmail.com>
> > > > wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > > Greg and I discussed a bit offline. His assessment was right - I’m not
> > > > > looking to compute multiple values per ordinal as an end in itself.
> > > > > That is only a means to compute a single value which depends on other
> > > > > facet results. This section from the previous email explains it really
> > > > > well:
> > > > >
> > > > > > For example, if we're using the geonames data you have in your
> > > > > > example, maybe the value you want to associate with a given path is
> > > > > > something like `max(population) + sum(elevation)`, where
> > > > > > `max(population)` and `sum(elevation)` are the result of two
> > > > > > independent facet associations. Then, you could combine those results
> > > > > > through some expression to derive a single value for a given path.
> > > > >
> > > > > Ideally, I could facet using an expression which binds other
> > > > > aggregations. The user experience might be as simple as defining the
> > > > > expression and making a single faceting call. Has anyone worked on
> > > > > something similar?
> > > >
> > > > Best,
> > > > Stefan
> > > >
> > > > > On Thu, 23 Feb 2023 at 16:53, Greg Miller <gsmiller@gmail.com> wrote:
> > > > >
> > > > > > Thanks for the detailed benchmarking Stefan! I think you've
> > > > > > demonstrated here that looping over the collected hits multiple times
> > > > > > does in fact add meaningful overhead. That's interesting to see!
> > > > > >
> > > > > > As for whether-or-not to add functionality to the facets module that
> > > > > > supports this, I'm not convinced at this point. I think what you're
> > > > > > suggesting here (but please correct me if I'm wrong) is supporting
> > > > > > association faceting where the user wants to compute multiple
> > > > > > association aggregations for the same dimensions in a single pass.
> > > > > > Where I'm struggling to connect a real-world use-case though is what
> > > > > > the user is going to actually do with those multiple association
> > > > > > values. The Facets API today (
> > > > > > https://github.com/apache/lucene/blob/main/lucene/facet/src/java/org/apache/lucene/facet/Facets.java
> > > > > > ) has a pretty firm assumption built in that dimensions/paths have a
> > > > > > single value associated with them. So building some sort of
> > > > > > association faceting implementation that exposes more than one value
> > > > > > associated with a given dimension/path is a significant change to the
> > > > > > current model, and I'm not sure it supports enough real-world use to
> > > > > > warrant the complexity.
> > > > > >
> > > > > > OK, now disclaimer: Stefan and I work together so I think I have an
> > > > > > idea of what he's doing here...
> > > > >
> > > > > > What I think you're actually after here (and the one use-case I could
> > > > > > imagine some other users being interested in) is computing a single
> > > > > > value for each dimension/path that is actually an expression over
> > > > > > _other_ aggregated values. For example, if we're using the geonames
> > > > > > data you have in your example, maybe the value you want to associate
> > > > > > with a given path is something like `max(population) +
> > > > > > sum(elevation)`, where `max(population)` and `sum(elevation)` are the
> > > > > > result of two independent facet associations. Then, you could combine
> > > > > > those results through some expression to derive a single value for a
> > > > > > given path. That end result still fits the Facets API well, but
> > > > > > supporting something like this in Lucene requires a few other
> > > > > > primitives beyond just the ability to compute multiple associations at
> > > > > > the same time. Primarily, it needs some version of Expression +
> > > > > > Bindings that works for dimensions/paths. So I don't think the ability
> > > > > > to compute multiple associations at once is really the key missing
> > > > > > feature here, and I don't think it adds significant value on its own
> > > > > > to warrant the complexity of trying to expose it through the existing
> > > > > > Facets API. Of course, there's nothing preventing users from building
> > > > > > this "multiple association" functionality themselves.
> > > > > >
> > > > > > That's my take on this, but maybe I'm missing some other use-cases
> > > > > > that could justify adding this capability in a general way? What do
> > > > > > you think?
> > > > >
> > > > > Cheers,
> > > > > -Greg
> > > > >
> > > > > > On Fri, Feb 17, 2023 at 3:14 PM Stefan Vodita <stefan.vodita@gmail.com>
> > > > > > wrote:
> > > > >
> > > > > > > After benchmarking my implementation against the existing one, I
> > > > > > > think there is some meaningful overhead. I built a small driver [1]
> > > > > > > that runs the two solutions over a geo data [2] index (thank you
> > > > > > > Greg for providing the indexing code!).
> > > > > > >
> > > > > > > The table below lists faceting times in milliseconds. I’ve named the
> > > > > > > current implementation serial and my proposal parallel, for lack of
> > > > > > > better names. The aggregation function is a no-op, so we’re only
> > > > > > > measuring the time spent outside aggregation. The measurements are
> > > > > > > over a match-set of 100k docs, but the number of docs does not have
> > > > > > > a large impact on the results because the aggregation function isn’t
> > > > > > > doing any work.
> > > > > >
> > > > > > > | Number of Aggregations | Serial Faceting Time (ms) | Parallel Faceting Time (ms) |
> > > > > > > |------------------------|---------------------------|-----------------------------|
> > > > > > > | 2                      | 510                       | 328                         |
> > > > > > > | 5                      | 1211                      | 775                         |
> > > > > > > | 10                     | 2366                      | 1301                        |
> > > > > >
> > > > > > > If we use a MAX aggregation over a DocValue instead, the results
> > > > > > > tell a similar story. In this case, the number of docs matters. I've
> > > > > > > attached results for 10 docs and 100k docs.
> > > > > >
> > > > > > > | Number of Aggregations | Serial Faceting Time (ms) | Parallel Faceting Time (ms) |
> > > > > > > |------------------------|---------------------------|-----------------------------|
> > > > > > > | 2                      | 706                       | 505                         |
> > > > > > > | 5                      | 1618                      | 1119                        |
> > > > > > > | 10                     | 3152                      | 2018                        |
> > > > > > >
> > > > > > > | Number of Aggregations | Serial Faceting Time (ms) | Parallel Faceting Time (ms) |
> > > > > > > |------------------------|---------------------------|-----------------------------|
> > > > > > > | 2                      | 904                       | 655                         |
> > > > > > > | 5                      | 2122                      | 1491                        |
> > > > > > > | 10                     | 5062                      | 3317                        |
> > > > > >
> > > > > > > With 10 aggregations, we're saving a second or more. That is
> > > > > > > significant for my use-case.
> > > > > > >
> > > > > > > I'd like to know if the test and results seem reasonable. If so,
> > > > > > > maybe we can think about providing this functionality.
> > > > > >
> > > > > > Thanks,
> > > > > > Stefan
> > > > > >
> > > > > > [1]
> > > > > >
> > > >
> >
> https://github.com/stefanvodita/lucene/commit/3536546cd9f833150db001e8eede093723cf7663
> > > > > > [2] https://download.geonames.org/export/dump/allCountries.zip
> > > > > >
> > > > > >
> > > > > > > On Fri, 17 Feb 2023 at 18:45, Greg Miller <gsmiller@gmail.com> wrote:
> > > > > > > >
> > > > > > > > Thanks for the follow up Stefan. If you find significant overhead
> > > > > > > > associated with the multiple iterations, please keep challenging
> > > > > > > > the current approach and suggest improvements. It's always good to
> > > > > > > > revisit these things!
> > > > > > >
> > > > > > > Cheers,
> > > > > > > -Greg
> > > > > > >
> > > > > > > > On Thu, Feb 16, 2023 at 1:32 PM Stefan Vodita <stefan.vodita@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi Greg,
> > > > > > > > >
> > > > > > > > > To better understand how much work gets duplicated, I went ahead
> > > > > > > > > and modified FloatTaxonomyFacets as an example [1]. It doesn't
> > > > > > > > > look too pretty, but it illustrates how I think multiple
> > > > > > > > > aggregations in one iteration could work.
> > > > > > > > >
> > > > > > > > > Overall, you're right, there's not as much wasted work as I had
> > > > > > > > > expected. I'll try to do a performance comparison to quantify
> > > > > > > > > precisely how much time we could save, just in case.
> > > > > > > > >
> > > > > > > > > Thank you for the help!
> > > > > > > > > Stefan
> > > > > > > > >
> > > > > > > > > [1] https://github.com/stefanvodita/lucene/commit/3227dabe746858fc81b9f6e4d2ac9b66e8c32684
> > > > > > > >
> > > > > > > > > On Wed, 15 Feb 2023 at 15:48, Greg Miller <gsmiller@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > Hi Stefan-
> > > > > > > > > >
> > > > > > > > > > > In that case, iterating twice duplicates most of the work,
> > > > > > > > > > > correct?
> > > > > > > > > >
> > > > > > > > > > I'm not sure I'd agree that it duplicates "most" of the work.
> > > > > > > > > > This is an association faceting example, which is a little bit
> > > > > > > > > > of a special case in some ways. But, to your question, there is
> > > > > > > > > > duplicated work here of re-loading the ordinals across the two
> > > > > > > > > > aggregations, but I would suspect the more expensive work is
> > > > > > > > > > actually computing the different aggregations, which is not
> > > > > > > > > > duplicated. You're right that it would likely be more efficient
> > > > > > > > > > to iterate the hits once, loading the ordinals once and
> > > > > > > > > > computing multiple aggregations in one pass. There's no
> > > > > > > > > > facility for doing that currently in Lucene's faceting module,
> > > > > > > > > > but you could always propose it! :)
> > > > > > > > > > That said, I'm not sure how common of a case this really is for
> > > > > > > > > > the majority of users? But that's just a guess/assumption.
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > > -Greg
> > > > > > > > >
> > > > > > > > > > On Tue, Feb 14, 2023 at 3:19 AM Stefan Vodita <stefan.vodita@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi Greg,
> > > > > > > > > > >
> > > > > > > > > > > I see now where my example didn’t give enough info. In my
> > > > > > > > > > > mind, `Genre / Author nationality / Author name` is stored in
> > > > > > > > > > > one hierarchical facet field. The data we’re aggregating
> > > > > > > > > > > over, like publish date or price, are stored in DocValues.
> > > > > > > > > > >
> > > > > > > > > > > The demo package shows something similar [1], where the
> > > > > > > > > > > aggregation is computed across a facet field using data from
> > > > > > > > > > > a `popularity` DocValue.
> > > > > > > > > > >
> > > > > > > > > > > In the demo, we compute `sum(_score * sqrt(popularity))`, but
> > > > > > > > > > > what if we want several other different aggregations with
> > > > > > > > > > > respect to the same facet field? Maybe we want
> > > > > > > > > > > `max(popularity)`. In that case, iterating twice duplicates
> > > > > > > > > > > most of the work, correct?
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Stefan
> > > > > > > > > > >
> > > > > > > > > > > [1] https://github.com/apache/lucene/blob/7f8b7ffbcad2265b047a5e2195f76cc924028063/lucene/demo/src/java/org/apache/lucene/demo/facet/ExpressionAggregationFacetsExample.java#L91
> > > > > > > > > >
> > > > > > > > > > > On Mon, 13 Feb 2023 at 22:46, Greg Miller <gsmiller@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Hi Stefan-
> > > > > > > > > > > >
> > > > > > > > > > > > That helps, thanks. I'm a bit confused about where you're
> > > > > > > > > > > > concerned with iterating over the match set multiple times.
> > > > > > > > > > > > Is this a situation where the ordinals you want to facet
> > > > > > > > > > > > over are stored in different index fields, so you have to
> > > > > > > > > > > > create multiple Facets instances (one per field) to compute
> > > > > > > > > > > > the aggregations? If that's the case, then yes, you have to
> > > > > > > > > > > > iterate over the match set multiple times (once per field).
> > > > > > > > > > > > I'm not sure that's such a big issue given that you're
> > > > > > > > > > > > doing novel work during each iteration, so the only
> > > > > > > > > > > > repetitive cost is actually iterating the hits. If the
> > > > > > > > > > > > ordinals are "packed" into the same field though (which is
> > > > > > > > > > > > the default in Lucene if you're using taxonomy faceting),
> > > > > > > > > > > > then you should only need to do a single iteration over
> > > > > > > > > > > > that field.
> > > > > > > > > > >
> > > > > > > > > > > Cheers,
> > > > > > > > > > > -Greg
> > > > > > > > > > >
> > > > > > > > > > > > On Sat, Feb 11, 2023 at 2:27 AM Stefan Vodita <stefan.vodita@gmail.com>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi Greg,
> > > > > > > > > > > > >
> > > > > > > > > > > > > I’m assuming we have one match-set which was not
> > > > > > > > > > > > > constrained by any of the categories we want to aggregate
> > > > > > > > > > > > > over, so it may have books by Mark Twain, books by
> > > > > > > > > > > > > American authors, and sci-fi books.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Maybe we can imagine we obtained it by searching for a
> > > > > > > > > > > > > keyword, say “Washington”, which is present in Mark
> > > > > > > > > > > > > Twain’s writing, and those of other American authors, and
> > > > > > > > > > > > > in sci-fi novels too.
> > > > > > > > > > > >
> > > > > > > > > > > > Does that make the example clearer?
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Stefan
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > On Sat, 11 Feb 2023 at 00:16, Greg Miller <gsmiller@gmail.com> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi Stefan-
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Can you clarify your example a little bit? It sounds like
> > > > > > > > > > > > > > you want to facet over three different match sets (one
> > > > > > > > > > > > > > constrained by "Mark Twain" as the author, one
> > > > > > > > > > > > > > constrained by "American authors" and one constrained by
> > > > > > > > > > > > > > the "sci-fi" genre). Is that correct?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > -Greg
> > > > > > > > > > > > >
> > > > > > > > > > > > > > On Fri, Feb 10, 2023 at 11:33 AM Stefan Vodita <stefan.vodita@gmail.com>
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi all,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Let’s say I have an index of books, similar to the
> > > > > > > > > > > > > > > example in the facet demo [1] with a hierarchical facet
> > > > > > > > > > > > > > > field encapsulating `Genre / Author’s nationality /
> > > > > > > > > > > > > > > Author’s name`.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I might like to find the latest publish date of a book
> > > > > > > > > > > > > > > written by Mark Twain, the sum of the prices of books
> > > > > > > > > > > > > > > written by American authors, and the number of sci-fi
> > > > > > > > > > > > > > > novels.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > As far as I understand, this would require faceting 3
> > > > > > > > > > > > > > > times over the match-set, one iteration for each
> > > > > > > > > > > > > > > aggregation of a different type (max(date), sum(price),
> > > > > > > > > > > > > > > count). That seems inefficient if we could instead
> > > > > > > > > > > > > > > compute all aggregations in one pass.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Is there a way to do that?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Stefan
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > [1] https://javadoc.io/doc/org.apache.lucene/lucene-demo/latest/org/apache/lucene/demo/facet/package-summary.html