Mailing List Archive

Index ordinal data in the taxonomy
Hi everyone,

I work on the Lucene product search team at Amazon. We’ve been considering
indexing scoring signals for ordinals into the taxonomy, which could reduce
index size for some use-cases.

Example

Let's consider a library of research papers, where each paper is represented by
a Lucene document and the paper's author is a facet field in that document. For
each author we store the total number of citations. We want to compute a
measure of each author's impact, the total number of citations divided by
the number of articles published.
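
A minimal sketch of the impact measure described above, in plain Python rather than Lucene's API. The author names and numbers are made up for illustration; `paper_counts` stands in for the per-author counts that faceting would report.

```python
# Hypothetical data: total citations stored once per author, and the
# number of papers per author as a facet count would report it.
citations = {"Ada": 120, "Grace": 45}
paper_counts = {"Ada": 10, "Grace": 3}


def impact(author):
    """Total citations divided by the number of papers published."""
    return citations[author] / paper_counts[author]


for author in citations:
    print(author, impact(author))
```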

Implementation

Each author will be assigned an ordinal in the taxonomy. Lucene doesn't
currently support storing data about an ordinal, but the taxonomy is itself a
Lucene index, where each ordinal is represented by a document. Right now, the
ordinal document has only a few fields allowing it to model the taxonomy
structure, but we could conceivably add arbitrary fields to the ordinal
documents. We would index the total number of citations an author has as a
DocValue in the corresponding ordinal document.
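
The idea can be modeled roughly as follows. This is a toy stand-in written in plain Python, not the Lucene taxonomy API: `ToyTaxonomy`, `add_category`, and `impact_by_author` are hypothetical names, and the sketch ignores details of the real taxonomy such as the root ordinal and the parent/child structure.

```python
class ToyTaxonomy:
    """Toy stand-in for a taxonomy index: one entry per ordinal."""

    def __init__(self):
        self.labels = []       # ordinal -> label
        self.ordinal_of = {}   # label -> ordinal
        self.citations = []    # ordinal -> numeric value kept with the ordinal

    def add_category(self, label, citations=0):
        """Return the ordinal for label, assigning a new one if needed."""
        if label in self.ordinal_of:
            return self.ordinal_of[label]
        ordinal = len(self.labels)
        self.labels.append(label)
        self.ordinal_of[label] = ordinal
        self.citations.append(citations)
        return ordinal


def impact_by_author(taxo, paper_ordinals):
    """Count papers per ordinal, then divide the stored citations by the count."""
    counts = [0] * len(taxo.labels)
    for ordinal in paper_ordinals:
        counts[ordinal] += 1
    return {
        taxo.labels[o]: taxo.citations[o] / counts[o]
        for o in range(len(counts))
        if counts[o] > 0
    }


taxo = ToyTaxonomy()
ada = taxo.add_category("author/Ada", citations=120)
grace = taxo.add_category("author/Grace", citations=45)

# Each paper only records the ordinal of its author.
paper_ordinals = [ada, ada, grace]
impact = impact_by_author(taxo, paper_ordinals)
print(impact)
```

The per-author value lives in a column indexed by ordinal, so the aggregation only needs the ordinals it already collects while counting.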

Advantages

The alternative would be to denormalize data about the authors and have it on
each doc that references that author. This leads to duplication. Since Lucene
already has a document representation of the author (the ordinal doc), it
makes sense conceptually that data about the author should be associated
with the ordinal doc.
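
To make the duplication concrete, here is a small plain-Python comparison of how many values each layout stores. The corpus is hypothetical and the sketch says nothing about actual on-disk index size, which depends on how Lucene encodes the values.

```python
# Hypothetical corpus: one author entry per paper.
paper_authors = ["Ada"] * 1000 + ["Grace"] * 10
citations = {"Ada": 120, "Grace": 45}

# Denormalized: every paper carries its author's citation total.
denormalized = [citations[author] for author in paper_authors]

# Normalized: one value per author, kept alongside the author's ordinal.
normalized = dict(citations)

print(len(denormalized), "values stored per-paper")
print(len(normalized), "values stored per-author")
```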


I'm curious if anyone else has tried something like this and if the approach
seems reasonable. I’ve made an attempt to code it and I can open a PR if this
sounds like a useful feature.

Stefan

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: Index ordinal data in the taxonomy
Hi Stefan,

This sounds interesting and useful. It's like static scores for Lucene
documents, except that we would apply them to ordinals. Since I assume it's
not a very common use case, though, do you know whether this new functionality
affects existing use cases? For example, will it change the API in a
non-backward-compatible way, or impact faceted search performance for the
common case?

Do you intend to support arbitrary signals, or only numeric ones? Numeric
signals will allow you to efficiently update the taxonomy index's ordinal
documents without updating the documents themselves (which will change
their ordinal!!). Other signals don't support this sort of update (yet), so
you might run into the issue of not being able to update them. And at least
for the author-citation-signal, that's definitely something you'll want to
update (unless you rebuild the index from time to time, when the signals
are updated).
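
The difference between the two kinds of update can be sketched with a toy append-only index in plain Python. `ToyIndex` and its methods are hypothetical and only model the behavior described above: an in-place value update keeps the document's position, while replacing the document assigns a new one.

```python
class ToyIndex:
    """Append-only toy index: a document's ID is its position."""

    def __init__(self):
        self.labels = []
        self.values = []
        self.live = []

    def add(self, label, value):
        self.labels.append(label)
        self.values.append(value)
        self.live.append(True)
        return len(self.labels) - 1

    def update_value_in_place(self, doc_id, value):
        """Models an in-place numeric update: the ID is unchanged."""
        self.values[doc_id] = value
        return doc_id

    def replace_document(self, doc_id, value):
        """Models delete-then-re-add: the document gets a new ID."""
        self.live[doc_id] = False
        return self.add(self.labels[doc_id], value)


index = ToyIndex()
ada = index.add("author/Ada", 120)

same_id = index.update_value_in_place(ada, 130)
new_id = index.replace_document(ada, 140)
print(ada, same_id, new_id)
```

If the ID plays the role of the ordinal, anything that stored the old ordinal (the main index's facet fields) would be out of sync after `replace_document`, which is why only the in-place path is safe.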

Have you considered an alternative implementation that pulls that info from
another source during retrieval? I'm just curious what the performance
implications would be, since an alternative source can give you the
flexibility of supporting other signals which are more complicated to update,
and it won't affect the taxonomy index.

Generally though, I don't see a reason not to support it.

Shai

On Thu, May 11, 2023 at 1:03 PM Stefan Vodita <stefan.vodita@gmail.com>
wrote:

Re: Index ordinal data in the taxonomy
Hello Shai,

Thank you for the feedback! I'll try to answer each of the questions.

> will it change the API in non-backward compatible way, or impact faceted search performance for the common case?

The new API could overload FacetsConfig.build or provide a new method in
TaxonomyWriter to plug in ordinal data. It doesn't have to change the
functionality that already exists. A taxonomy index in the common case would be
indistinguishable before and after this change.

> Do you intend to support arbitrary signals, or only numeric ones?

This is a crucial question. I'd like to take one small step forward and leave
room for us to make improvements later. There are two approaches we could take
initially, which I think you've already identified in your email:

1. Allow only updatable DocValues as ordinal data. This could become limiting at
some point, but maybe it's a good first solution.

2. Disallow updating ordinal data. New ordinal data can only come in when a new
taxonomy gets built.

For the Amazon product search use case, option 2 is slightly better. We would
build new indexes more often than we would get ordinal data updates. But I'm
not sure what the better option is in the general case. This is where I'd like
feedback from other users. Maybe there's also some other approach I haven't
thought of.

> Have you considered an alternative implementation of pulling that info from another source during retrieval?

Yes, we've considered things like a local database or a separate index.
I haven't done a performance test, but my guess is that having the ordinal
data in the taxonomy is as fast as it gets for use-cases like the faceting
aggregation example in my previous email. Even if that isn't the case, the
taxonomy solution is more convenient and less burdensome from an operational
standpoint.


I hope that's useful. Thanks again for the feedback,

Stefan

On Thu, 11 May 2023 at 16:53, Shai Erera <serera@gmail.com> wrote:

Re: Index ordinal data in the taxonomy
Hi

> There's two approaches we could take initially,

Both approaches look fine to me, as long as we expose the right API. I
assume that if we use updatable DV, then we'll have a proper API on
TaxonomyWriter to update the fields, but otherwise (if we only allow updating
during a taxonomy rewrite) we won't have any update API. Another option is to
allow these rewrites during taxonomy merges, something we can think about.

> Yes, we've considered things like a local database or a separate index.

Another approach is to treat this like a rescore query: you aggregate the
facets without their signals and then rescore the top-K (100, 1000, 10000)
facets according to external signals. Just another idea to think about
(yes, it's not perfect, but it might work OK-ish?)
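
A minimal sketch of that rescoring idea, in plain Python with made-up data. `rescore_top_k` and the `signals` dictionary are hypothetical; in practice the signals would come from whatever external source is used.

```python
import heapq


def rescore_top_k(facet_counts, signals, k, n):
    """Take the k facets with the highest counts, then rank them by signal / count."""
    # Step 1: plain aggregation result, truncated to the top k by count.
    candidates = heapq.nlargest(k, facet_counts.items(), key=lambda item: item[1])
    # Step 2: rescore only those candidates using the external signal.
    rescored = [(label, signals[label] / count) for label, count in candidates]
    rescored.sort(key=lambda item: item[1], reverse=True)
    return rescored[:n]


facet_counts = {"Ada": 10, "Grace": 3, "Linus": 1}   # papers per author
signals = {"Ada": 120, "Grace": 45, "Linus": 50}     # external citation totals

narrow = rescore_top_k(facet_counts, signals, k=2, n=2)
wide = rescore_top_k(facet_counts, signals, k=3, n=2)
print(narrow)
print(wide)
```

With `k=2`, "Linus" never reaches the rescoring step even though it would rank first, which is the imperfection mentioned above: a facet outside the top K by count cannot be recovered by the rescore.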

Shai

On Sat, May 13, 2023 at 6:45 PM Stefan Vodita <stefan.vodita@gmail.com>
wrote:

Re: Index ordinal data in the taxonomy
Hello,

I’ve opened an issue [1] to continue this discussion and a PR [2] showing an
easy way to add data about the ordinals to the taxonomy. Let me know if you
think it's reasonable.

Thank you,
Stefan

[1] https://github.com/apache/lucene/issues/12336
[2] https://github.com/apache/lucene/pull/12337

On Sun, 14 May 2023 at 06:52, Shai Erera <serera@gmail.com> wrote:
