Mailing List Archive

Payloads for each term
Hi Lucene Devs,
I have a need to store a sparse feature vector on a per-term
basis. The total number of possible dimensions is small (~50) and known at
indexing time. The feature values will be used in scoring along with corpus
statistics. It looks like payloads
<https://cwiki.apache.org/confluence/display/LUCENE/Payload+Planning#> were
created for exactly this purpose, but some workaround is needed to
minimize the performance penalty, as mentioned on the wiki
<https://cwiki.apache.org/confluence/display/LUCENE/Payload+Planning#> .

An alternative is to override *term frequency* to be a *pointer* into a
*Map<pointer, Feature_Vector>* serialized and stored in *BinaryDocValues*.
At query time, the matching *docId* will be used to advance to the starting
offset of this map, and the term frequency will be used to look up the
*Feature_Vector* in the serialized map. That's my current plan, but I
haven't benchmarked it.
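
At query time the lookup would look roughly like this (just a sketch, not
benchmarked; the "_features" doc-values field name and the fixed-width record
layout are placeholders, a real encoding would be sparse):

import java.io.IOException;
import org.apache.lucene.index.*;
import org.apache.lucene.util.BytesRef;

// Sketch only: assumes the per-document byte[] holds fixed-width records of
// NUM_DIMS floats, one record per unique term, addressed by the overridden
// term frequency. Field names are placeholders.
final class TermFeatureLookup {
  private static final int NUM_DIMS = 50;
  private static final int RECORD_BYTES = NUM_DIMS * Float.BYTES;

  static float[] featuresFor(LeafReader reader, String field, BytesRef term, int docId)
      throws IOException {
    Terms terms = reader.terms(field);
    if (terms == null) return null;
    TermsEnum termsEnum = terms.iterator();
    if (!termsEnum.seekExact(term)) return null;
    PostingsEnum postings = termsEnum.postings(null, PostingsEnum.FREQS);
    if (postings.advance(docId) != docId) return null;
    int pointer = postings.freq();          // overridden term frequency = record index

    BinaryDocValues dv = DocValues.getBinary(reader, field + "_features");
    if (!dv.advanceExact(docId)) return null;
    BytesRef blob = dv.binaryValue();

    float[] vector = new float[NUM_DIMS];
    int base = blob.offset + pointer * RECORD_BYTES;
    for (int i = 0; i < NUM_DIMS; i++, base += Float.BYTES) {
      int bits = ((blob.bytes[base] & 0xFF) << 24) | ((blob.bytes[base + 1] & 0xFF) << 16)
          | ((blob.bytes[base + 2] & 0xFF) << 8) | (blob.bytes[base + 3] & 0xFF);
      vector[i] = Float.intBitsToFloat(bits);
    }
    return vector;
  }
}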

The problem that I am trying to solve is to *reduce index bloat* and
*eliminate unnecessary seeks*: currently these ~50 dimensions are
stored as separate fields in the index with very high term overlap, and
Lucene does not share the Terms dictionary across different fields. Sharing
it could itself be a new feature for Lucene, but I imagine it would require
lots of work.

Any ideas are welcome :-)

Thanks
-Ankur
Re: Payloads for each term [ In reply to ]
Hi Ankur,
Indeed, payloads are the standard way to solve this problem. For light
queries with only a few top-N results that should be efficient. For multi-term
queries it could become costly if you need to access the payloads of
too many terms.
Also, there is an experimental PostingsFormat called
SharedTermsUniformSplit (class named STUniformSplitPostingsFormat) that
would allow you to effectively share the overlapping terms in the index
while having 50 fields. This would solve the index bloat issue, but would
not fully solve the seeks issue. You might want to benchmark this approach
too.
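
Wiring it up could look roughly like this (just a sketch: the "feature_"
prefix is only an example, Lucene87Codec should be replaced by the codec
class matching your version, and the lucene-codecs module must be on the
classpath so the format can be found by name):

import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.codecs.lucene87.Lucene87Codec;

// Sketch only: route the ~50 feature fields to the shared-terms postings
// format and keep the default format for everything else.
public class FeatureFieldsCodec extends Lucene87Codec {
  private final PostingsFormat sharedTerms = PostingsFormat.forName("SharedTermsUniformSplit");

  @Override
  public PostingsFormat getPostingsFormatForField(String field) {
    return field.startsWith("feature_") ? sharedTerms : super.getPostingsFormatForField(field);
  }
}

// Usage: indexWriterConfig.setCodec(new FeatureFieldsCodec());

As far as I recall, fields routed to the same format instance are written
together per segment, which is what lets the shared-terms format deduplicate
terms across them.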

Bruno

Re: Payloads for each term [ In reply to ]
In addition to payloads having kind of high-ish overhead (they slow down
indexing, do not compress very well I think, and slow down search since you
must pull positions), they are also sort of a forced fit for your use case,
right? Because a payload in Lucene is per-term-position, whereas you
really need this vector per term (irrespective of the positions where that
term occurs in each document)?
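
For reference, pulling a payload back out looks roughly like this (field and
term are placeholders); you have to walk positions to get at it:

import java.io.IOException;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

// Sketch: payloads hang off individual term positions, so reading one means
// asking for PAYLOADS postings and iterating positions first.
final class PayloadRead {
  static BytesRef firstPayload(LeafReader reader, String field, BytesRef term, int docId)
      throws IOException {
    Terms terms = reader.terms(field);
    if (terms == null) return null;
    TermsEnum te = terms.iterator();
    if (!te.seekExact(term)) return null;
    PostingsEnum postings = te.postings(null, PostingsEnum.PAYLOADS);
    if (postings.advance(docId) != docId) return null;
    postings.nextPosition();       // positions must be consumed to reach the payload
    return postings.getPayload();  // null if this position has no payload
  }
}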

Your second solution is an intriguing one. So you would use Lucene's
custom term frequencies to store indices into that per-document map encoded
into a BinaryDocValues field? During indexing I guess you would need a
TokenFilter that hands out these indices in order (0, 1, 2, ...) based on
the unique terms it sees, and after all tokens are done, it exports a
byte[] serialized map? Hmm, except term frequency 0 is not allowed, so
you'd need to add 1 to all indices.
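
Something along these lines, maybe (a sketch only; it assumes each unique
term is emitted once per document, since IndexWriter sums custom term
frequencies for repeated tokens, and the field has to be indexed with
DOCS_AND_FREQS, no positions):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TermFrequencyAttribute;

// Sketch: hand out 1-based slot indices as custom term frequencies. The
// consumer would serialize the per-document map into a BinaryDocValues field
// after analysis, using the same slot order.
public final class TermIndexAssigningFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final TermFrequencyAttribute freqAtt = addAttribute(TermFrequencyAttribute.class);
  private final Map<String, Integer> slots = new HashMap<>(); // term -> 0-based map slot

  public TermIndexAssigningFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    int slot = slots.computeIfAbsent(termAtt.toString(), t -> slots.size());
    freqAtt.setTermFrequency(slot + 1); // +1 because a term frequency of 0 is not allowed
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    slots.clear();
  }

  /** 0-based slots assigned so far, used to serialize the per-document map. */
  public Map<String, Integer> slots() {
    return slots;
  }
}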

Mike McCandless

http://blog.mikemccandless.com


Re: Payloads for each term [ In reply to ]
Also, be aware that recent Lucene versions enabled compression for
BinaryDocValues fields, which might hurt performance of your second
solution.

This compression is not yet something you can easily turn off, but there
are ongoing discussions/PRs about how to make it more easily configurable
for applications that care more about search CPU cost than index size for
BinaryDocValues fields:
https://issues.apache.org/jira/browse/LUCENE-9378

Mike McCandless

http://blog.mikemccandless.com


Re: Payloads for each term [ In reply to ]
For sparse vectors, we found that Lucene's FeatureField
<https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/document/FeatureField.java>
could also be useful. It stores features as terms and feature values as term
frequencies, and provides several convenient functions to calculate scores
based on feature values.
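
Usage is roughly like this (field and feature names here are only
illustrative):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.FeatureField;
import org.apache.lucene.search.Query;

// One field holds all the features; each feature is a term and its value is
// encoded into the term frequency.
class FeatureFieldExample {
  static Document buildDoc() {
    Document doc = new Document();
    doc.add(new FeatureField("features", "pagerank", 42.5f));
    doc.add(new FeatureField("features", "freshness", 0.9f));
    return doc;
  }

  static Query pagerankBoost() {
    // Score grows with the feature value but saturates.
    return FeatureField.newSaturationQuery("features", "pagerank");
  }

  static Query freshnessBoost() {
    // Score is weight * feature value.
    return FeatureField.newLinearQuery("features", "freshness", 0.3f);
  }
}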

Re: Payloads for each term [ In reply to ]
Thanks everyone for helpful suggestions.

@Mayya
In my use case these features are not term-independent, whereas
term-independent features are the primary use case for FeatureField as per
the documentation.

The FeatureField solution stores features as terms and values as term
frequencies. This means that it relies on the inverted index data structure
with the following logical representation:

[feature_field:feature1] --> [doc1:val1] --> [doc2:val2]

[feature_field:feature2] --> [doc2:val3]

If this is correct then it is not very different from my current solution,
where feature1 and feature2 are term-dependent, although the reason for
using only 16 bits to store the feature value is not clear from the FeatureField
javadoc
<https://github.com/apache/lucene-solr/blob/2d583eaba7ab8eb778bebbc5557bae29ea481830/lucene/core/src/java/org/apache/lucene/document/FeatureField.java#L63-L66>.

The STUniformSplitPostingsFormat-based solution proposed by Bruno looks
like an interesting option, allowing different fields to share overlapping
terms while retaining the efficiency of the inverted index data structure.
However, I am not sure about its production readiness since it is marked
experimental.

-Ankur

Re: Payloads for each term [ In reply to ]
STUniformSplitPostingsFormat is used in production at a massive scale,
helping reduce overall memory needs a ton. I highly recommend it :-)
notwithstanding the main caveat of any non-default format:

"lucene.experimental" can be applied for different reasons -- the main one
is backwards-compatibility. Consequently, if you upgrade from 8.3 (which
added this new format) to 8.6 or whatever, there is no guarantee that 8.6
can read an index written with 8.3 if you use any postingsFormat or other
codec components besides the default implementations. All the others are
marked lucene.experimental. This is not hypothetical; incompatibilities
happen, and they have happened throughout 8.x. When incompatibilities
happen, you could re-index -- best option. Or port the older
implementation forward. You could also do an awkward multi-step conversion
process using the latest backwards-compatible index as an intermediate
bridge.

What frustrates me is not the incompatibilities themselves; maintaining
back-compat with just the default formats is plenty of work, I've
observed. What frustrates me is that I don't think we (project
maintainers) pay enough attention to doing what I think is the bare minimum
-- communicating the incompatibility in CHANGES.txt, deserving its own
section. UniformSplit was released in 8.3. Its index format changed
in 8.4 (LUCENE-9027) and in 8.5 (LUCENE-9116, LUCENE-9135) (this list
might not be exhaustive). One or two of those issues were sweeping changes
that likewise affected and broke the index compatibility of many other
formats. Additionally, I think we should consistently apply and then bump
a version suffix on the internal format names when we break them so that
you get a clear error --
e.g. SharedTermsUniformSplit83 -> SharedTermsUniformSplit84 [2] -- but
surprisingly to me, this has proven contentious.

[1]: https://issues.apache.org/jira/browse/SOLR-14254 for some background
on the fallout from one such breakage.
[2]: https://issues.apache.org/jira/browse/LUCENE-9248 but the feedback was
in some other issue; I can't find it now :-/

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


Re: Payloads for each term [ In reply to ]
Oh interesting! I did not know about this FeatureField (the link was to
the old repo, which is now gone;
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/document/FeatureField.java
worked for me).

