Thanks for the detailed benchmarking Stefan! I think you've demonstrated
here that looping over the collected hits multiple times does in fact add
meaningful overhead. That's interesting to see!
As for whether-or-not to add functionality to the facets module that
supports this, I'm not convinced at this point. I think what you're
suggesting here—but please correct me if I'm wrong—is supporting
association faceting where the user wants to compute multiple association
aggregations for the same dimensions in a single pass. Where I'm struggling
to connect a real-world use-case though is what the user is going to
actually do with those multiple association values. The Facets API today (
https://github.com/apache/lucene/blob/main/lucene/facet/src/java/org/apache/lucene/facet/Facets.java)
has a pretty firm assumption built in that dimensions/paths have a single
value associated with them. So building some sort of association faceting
implementation that exposes more than one value associated with a given
dimension/path is a significant change to the current model, and I'm not
sure it supports enough real-world use to warrant the complexity.
OK, now disclaimer: Stefan and I work together so I think I have an idea of
what he's doing here...
What I think you're actually after here—and the one use-case I could
imagine some other users being interested in—is computing a single value
for each dimension/path that is actually an expression over _other_
aggregated values. For example, if we're using the geonames data you have
in your example, maybe the value you want to associate with a given path is
something like `max(population) + sum(elevation)`, where `max(population)`
and `sum(elevation)` are the result of two independent facet associations.
Then, you could combine those results though some expression to derive a
single value for a given path. That end result still fits the Facets API
well, but supporting something like this in Lucene requires a few other
primitives beyond just the ability to compute multiple associations at the
same time. Primarily, it needs some version of Expression + Bindings that
works for dimensions/paths. So I don't think the ability to compute
multiple associations at once is really the key missing feature here, and I
don't think it adds significant value on its own to warrant the complexity
of trying to expose it through the existings Facets API. Of course, there's
nothing preventing users from building this "multiple association"
functionality themselves.
That's my take on this, but maybe I'm missing some other use-cases that
could justify adding this capability in a general way? What do you think?
Cheers,
-Greg
On Fri, Feb 17, 2023 at 3:14 PM Stefan Vodita <stefan.vodita@gmail.com>
wrote:
> After benchmarking my implementation against the existing one, I think
> there is
> some meaningful overhead. I built a small driver [1] that runs the two
> solutions over
> a geo data [2] index (thank you Greg for providing the indexing code!).
>
> The table below lists faceting times in milliseconds. I’ve named the
> current
> implementation serial and my proposal parallel, for lack of better names.
> The
> aggregation function is a no-op, so we’re only measuring the time spent
> outside
> aggregation. The measurements are over a match-set of 100k docs, but the
> number
> of docs does not have a large impact on the results because the aggregation
> function isn’t doing any work.
>
> | Number of Aggregations | Serial Faceting Time (ms) | Parallel
> Faceting Time (ms) |
>
> |----------------------------------|------------------------------------|--------------------------------------|
> | 2 |
> 510 | 328 |
> | 5 |
> 1211 | 775 |
> | 10 |
> 2366 | 1301 |
>
> If we use a MAX aggregation over a DocValue instead, the results tell a
> similar
> story. In this case, the number of docs matters. I've attached results
> for 10 docs and
> 100k docs.
>
> | Number of Aggregations | Serial Faceting Time (ms) | Parallel
> Faceting Time (ms) |
>
> |----------------------------------|------------------------------------|--------------------------------------|
> | 2 |
> 706 | 505 |
> | 5 |
> 1618 | 1119 |
> | 10 |
> 3152 | 2018 |
>
> | Number of Aggregations | Serial Faceting Time (ms) | Parallel
> Faceting Time (ms) |
>
> |----------------------------------|------------------------------------|--------------------------------------|
> | 2 |
> 904 | 655 |
> | 5 |
> 2122 | 1491 |
> | 10 |
> 5062 | 3317 |
>
> With 10 aggregations, we're saving a second or more. That is significant
> for my
> use-case.
>
> I'd like to know if the test and results seem reasonable. If so, maybe
> we can think
> about providing this functionality.
>
> Thanks,
> Stefan
>
> [1]
> https://github.com/stefanvodita/lucene/commit/3536546cd9f833150db001e8eede093723cf7663
> [2] https://download.geonames.org/export/dump/allCountries.zip
>
>
> On Fri, 17 Feb 2023 at 18:45, Greg Miller <gsmiller@gmail.com> wrote:
> >
> > Thanks for the follow up Stefan. If you find significant overhead
> > associated with the multiple iterations, please keep challenging the
> > current approach and suggest improvements. It's always good to revisit
> > these things!
> >
> > Cheers,
> > -Greg
> >
> > On Thu, Feb 16, 2023 at 1:32 PM Stefan Vodita <stefan.vodita@gmail.com>
> > wrote:
> >
> > > Hi Greg,
> > >
> > > To better understand how much work gets duplicated, I went ahead
> > > and modified FloatTaxonomyFacets as an example [1]. It doesn't look
> > > too pretty, but it illustrates how I think multiple aggregations in one
> > > iteration could work.
> > >
> > > Overall, you're right, there's not as much wasted work as I had
> > > expected. I'll try to do a performance comparison to quantify precisely
> > > how much time we could save, just in case.
> > >
> > > Thank you the help!
> > > Stefan
> > >
> > > [1]
> > >
> https://github.com/stefanvodita/lucene/commit/3227dabe746858fc81b9f6e4d2ac9b66e8c32684
> > >
> > > On Wed, 15 Feb 2023 at 15:48, Greg Miller <gsmiller@gmail.com> wrote:
> > > >
> > > > Hi Stefan-
> > > >
> > > > > In that case, iterating twice duplicates most of the work, correct?
> > > >
> > > > I'm not sure I'd agree that it duplicates "most" of the work. This
> is an
> > > > association faceting example, which is a little bit of a special
> case in
> > > > some ways. But, to your question, there is duplicated work here of
> > > > re-loading the ordinals across the two aggregations, but I would
> suspect
> > > > the more expensive work is actually computing the different
> aggregations,
> > > > which is not duplicated. You're right that it would likely be more
> > > > efficient to iterate the hits once, loading the ordinals once and
> > > computing
> > > > multiple aggregations in one pass. There's no facility for doing that
> > > > currently in Lucene's faceting module, but you could always propose
> it!
> > > :)
> > > > That said, I'm not sure how common of a case this really is for the
> > > > majority of users? But that's just a guess/assumption.
> > > >
> > > > Cheers,
> > > > -Greg
> > > >
> > > > On Tue, Feb 14, 2023 at 3:19 AM Stefan Vodita <
> stefan.vodita@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi Greg,
> > > > >
> > > > > I see now where my example didn’t give enough info. In my mind,
> `Genre
> > > /
> > > > > Author nationality / Author name` is stored in one hierarchical
> facet
> > > > > field.
> > > > > The data we’re aggregating over, like publish date or price, are
> > > stored in
> > > > > DocValues.
> > > > >
> > > > > The demo package shows something similar [1], where the aggregation
> > > > > is computed across a facet field using data from a `popularity`
> > > DocValue.
> > > > >
> > > > > In the demo, we compute `sum(_score * sqrt(popularity))`, but what
> if
> > > we
> > > > > want several other different aggregations with respect to the same
> > > facet
> > > > > field? Maybe we want `max(popularity)`. In that case, iterating
> twice
> > > > > duplicates most of the work, correct?
> > > > >
> > > > >
> > > > > Stefan
> > > > >
> > > > > [1]
> > > > >
> > >
> https://github.com/apache/lucene/blob/7f8b7ffbcad2265b047a5e2195f76cc924028063/lucene/demo/src/java/org/apache/lucene/demo/facet/ExpressionAggregationFacetsExample.java#L91
> > > > >
> > > > > On Mon, 13 Feb 2023 at 22:46, Greg Miller <gsmiller@gmail.com>
> wrote:
> > > > > >
> > > > > > Hi Stefan-
> > > > > >
> > > > > > That helps, thanks. I'm a bit confused about where you're
> concerned
> > > with
> > > > > > iterating over the match set multiple times. Is this a situation
> > > where
> > > > > the
> > > > > > ordinals you want to facet over are stored in different index
> > > fields, so
> > > > > > you have to create multiple Facets instances (one per field) to
> > > compute
> > > > > the
> > > > > > aggregations? If that's the case, then yes—you have to iterate
> over
> > > the
> > > > > > match set multiple times (once per field). I'm not sure that's
> such
> > > a big
> > > > > > issue given that you're doing novel work during each iteration,
> so
> > > the
> > > > > only
> > > > > > repetitive cost is actually iterating the hits. If the ordinals
> are
> > > > > > "packed" into the same field though (which is the default in
> Lucene
> > > if
> > > > > > you're using taxonomy faceting), then you should only need to do
> a
> > > single
> > > > > > iteration over that field.
> > > > > >
> > > > > > Cheers,
> > > > > > -Greg
> > > > > >
> > > > > > On Sat, Feb 11, 2023 at 2:27 AM Stefan Vodita <
> > > stefan.vodita@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Greg,
> > > > > > >
> > > > > > > I’m assuming we have one match-set which was not constrained
> by any
> > > > > > > of the categories we want to aggregate over, so it may have
> books
> > > by
> > > > > > > Mark Twain, books by American authors, and sci-fi books.
> > > > > > >
> > > > > > > Maybe we can imagine we obtained it by searching for a
> keyword, say
> > > > > > > “Washington”, which is present in Mark Twain’s writing, and
> those
> > > of
> > > > > other
> > > > > > > American authors, and in sci-fi novels too.
> > > > > > >
> > > > > > > Does that make the example clearer?
> > > > > > >
> > > > > > >
> > > > > > > Stefan
> > > > > > >
> > > > > > >
> > > > > > > On Sat, 11 Feb 2023 at 00:16, Greg Miller <gsmiller@gmail.com>
> > > wrote:
> > > > > > > >
> > > > > > > > Hi Stefan-
> > > > > > > >
> > > > > > > > Can you clarify your example a little bit? It sounds like you
> > > want to
> > > > > > > facet
> > > > > > > > over three different match sets (one constrained by "Mark
> Twain"
> > > as
> > > > > the
> > > > > > > > author, one constrained by "American authors" and one
> > > constrained by
> > > > > the
> > > > > > > > "sci-fi" genre). Is that correct?
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > > -Greg
> > > > > > > >
> > > > > > > > On Fri, Feb 10, 2023 at 11:33 AM Stefan Vodita <
> > > > > stefan.vodita@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi all,
> > > > > > > > >
> > > > > > > > > Let’s say I have an index of books, similar to the example
> in
> > > the
> > > > > facet
> > > > > > > > > demo [1]
> > > > > > > > > with a hierarchical facet field encapsulating `Genre /
> Author’s
> > > > > > > > > nationality /
> > > > > > > > > Author’s name`.
> > > > > > > > >
> > > > > > > > > I might like to find the latest publish date of a book
> written
> > > by
> > > > > Mark
> > > > > > > > > Twain, the
> > > > > > > > > sum of the prices of books written by American authors,
> and the
> > > > > number
> > > > > > > of
> > > > > > > > > sci-fi novels.
> > > > > > > > >
> > > > > > > > > As far as I understand, this would require faceting 3 times
> > > over
> > > > > the
> > > > > > > > > match-set,
> > > > > > > > > one iteration for each aggregation of a different type
> > > (max(date),
> > > > > > > > > sum(price),
> > > > > > > > > count). That seems inefficient if we could instead compute
> all
> > > > > > > > > aggregations in
> > > > > > > > > one pass.
> > > > > > > > >
> > > > > > > > > Is there a way to do that?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Stefan
> > > > > > > > >
> > > > > > > > > [1]
> > > > > > > > >
> > > > > > >
> > > > >
> > >
> https://javadoc.io/doc/org.apache.lucene/lucene-demo/latest/org/apache/lucene/demo/facet/package-summary.html
> > > > > > > > >
> > > > > > > > >
> > > > >
> ---------------------------------------------------------------------
> > > > > > > > > To unsubscribe, e-mail:
> > > java-user-unsubscribe@lucene.apache.org
> > > > > > > > > For additional commands, e-mail:
> > > java-user-help@lucene.apache.org
> > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > > >
> > > ---------------------------------------------------------------------
> > > > > > > To unsubscribe, e-mail:
> java-user-unsubscribe@lucene.apache.org
> > > > > > > For additional commands, e-mail:
> java-user-help@lucene.apache.org
> > > > > > >
> > > > > > >
> > > > >
> > > > >
> ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > >
> > > > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > >
> > >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>