Mailing List Archive

additional term meta data
Hi folks:

We would like to propose a feature to add additional per-term metadata to
the term dictionary.

Currently, the TermsEnum API returns docFreq as its only metadata. We
needed a way to quickly get the first and last doc IDs in the postings
without having to scan through the entire postings list.
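
To make the idea concrete, here is a minimal sketch in plain Java. It is not the code from the PR and does not use any Lucene classes; TermMeta and TermDictionarySketch are hypothetical names invented for illustration. It only shows the general shape: record the first and last doc IDs next to docFreq when a term is written, so a later lookup does not need to walk the postings.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical per-term metadata holder; not a Lucene class.
final class TermMeta {
    final int docFreq;     // number of documents containing the term
    final int firstDocId;  // smallest doc ID in the postings
    final int lastDocId;   // largest doc ID in the postings

    TermMeta(int docFreq, int firstDocId, int lastDocId) {
        this.docFreq = docFreq;
        this.firstDocId = firstDocId;
        this.lastDocId = lastDocId;
    }
}

// Hypothetical in-memory term dictionary that stores the extra metadata.
final class TermDictionarySketch {
    private final Map<String, TermMeta> terms = new LinkedHashMap<>();

    // Assumes sortedDocIds is non-empty and in ascending order,
    // as doc IDs are within a postings list.
    void addTerm(String term, int[] sortedDocIds) {
        if (sortedDocIds.length == 0) {
            throw new IllegalArgumentException("postings must not be empty");
        }
        int first = sortedDocIds[0];
        int last = sortedDocIds[sortedDocIds.length - 1];
        terms.put(term, new TermMeta(sortedDocIds.length, first, last));
    }

    // Constant-time lookup; returns null if the term is unknown.
    TermMeta lookup(String term) {
        return terms.get(term);
    }
}
```

With this layout, reading the first and last doc IDs for a term costs one dictionary lookup, regardless of how long the postings list is.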

We have created a PR on our own fork and we would like to contribute this
back to the community. Please let us know if this is something that's
useful and/or fits Lucene's roadmap; we would be happy to submit a patch.

https://github.com/dashbase/lucene-solr/pull/1

Thank you

-John
Re: additional term meta data [ In reply to ]
How do we access the first and last doc IDs?
Which Lucene version will you be targeting for your merge?

Request: please submit a test case that shows proper operation.

Thanks John!
martin-

________________________________
From: John Wang <john.wang@gmail.com>
Sent: Tuesday, January 5, 2021 8:19 PM
To: dev@lucene.apache.org <dev@lucene.apache.org>
Subject: additional term meta data

Re: additional term meta data [ In reply to ]
How do we access the first and last doc IDs?
Which version will you be merging into?

Re: additional term meta data [ In reply to ]
Hey Martin:

There is a test case in the PR we created on our own fork:
https://github.com/dashbase/lucene-solr/pull/1. The PR description also
contains some example code showing how to access the new metadata.

Here is the link to the beginning of the tests:
https://github.com/dashbase/lucene-solr/blob/posting-last-docid/lucene/core/src/test/org/apache/lucene/codecs/lucene84/TestLucene84PostingsFormat.java#L142

I am not sure which version this should be applied to; currently, it is
based on master as of a few days ago. We intend to patch 8.7 for our own
environment.

Any advice or feedback is much appreciated.

Thank you!

-John

On Wed, Jan 6, 2021 at 3:28 AM Martin Gainty <mgainty@hotmail.com> wrote:

Re: additional term meta data [ In reply to ]
It appears you are targeting 9.0 for your code:
lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90FieldInfosFormat.java <https://github.com/dashbase/lucene-solr/pull/1/files#diff-224246aa19a54dd91fc495a6bbf7d75b26dbeaa3aceab058214d68fcbb38d24c>
(Lucene90FieldInfosFormat.java is not contained in either the 8.4 or 8.7 distros)

<RANT>
someone had the bright idea to nuke ant 8.x build.xml without consulting anyone
not a fan of ant but the execution model of gradle is woefully inflexible in comparison to maven
</RANT>

I will try with the 9.0 distro to get codecs/lucene90/Lucene90FieldInfosFormat and recompile, and hopefully your TestLucene84PostingsFormat will run without failures or errors.

Thx
martin-

________________________________
From: John Wang <john.wang@gmail.com>
Sent: Wednesday, January 6, 2021 10:15 AM
To: dev@lucene.apache.org <dev@lucene.apache.org>
Subject: Re: additional term meta data

Re: additional term meta data [ In reply to ]
Thank you, Martin!

You can apply the patch to the 8.7 build by just ignoring the changes to
Lucene90xxx. Appreciate the help and guidance!

-John


On Wed, Jan 6, 2021 at 10:36 AM Martin Gainty <mgainty@hotmail.com> wrote:

Re: additional term meta data [ In reply to ]
John, can you explain what the use case for such a new API is? I don't
see a user of the API in your code. Is there a query you can optimize
with this, or what is the reasoning behind this change? I personally
think it's quite invasive to add this information, and there must be a
good reason to add it to the TermsEnum. I also don't think we should
have an option on the field for this if we add it, but if we don't do
that it's quite a heavy change, so I am on the fence about whether we
should even consider this.
I wonder if you can use the TermsEnum#getAttributeSource() API instead
and add this as a dedicated attribute which is present if the info is
stored. That way you can build your own PostingsFormat that does store
this information.
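
As a rough illustration of the "dedicated attribute which is present if the info is stored" pattern, here is a plain-Java sketch. It deliberately does not use Lucene's real attribute classes; AttributeBag, DocIdRangeAttribute and AttributeConsumerSketch are hypothetical stand-ins, and a real implementation would go through a custom PostingsFormat. The point is only that the consumer asks for the attribute and falls back when it is absent.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical attribute carrying the extra per-term data.
final class DocIdRangeAttribute {
    final int firstDocId;
    final int lastDocId;

    DocIdRangeAttribute(int firstDocId, int lastDocId) {
        this.firstDocId = firstDocId;
        this.lastDocId = lastDocId;
    }
}

// Hypothetical stand-in for an attribute container keyed by attribute type.
final class AttributeBag {
    private final Map<Class<?>, Object> attributes = new HashMap<>();

    <T> void put(Class<T> type, T value) {
        attributes.put(type, value);
    }

    // Empty when the producer did not store this attribute.
    <T> Optional<T> get(Class<T> type) {
        return Optional.ofNullable(type.cast(attributes.get(type)));
    }
}

final class AttributeConsumerSketch {
    // Uses the range when it is present, otherwise reports that it is missing.
    static String describe(AttributeBag bag) {
        return bag.get(DocIdRangeAttribute.class)
                .map(a -> "docs " + a.firstDocId + ".." + a.lastDocId)
                .orElse("range not stored");
    }
}
```

The benefit of this shape is that formats which do not store the range need no changes: their attribute container simply never contains the attribute.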

simon

On Wed, Jan 6, 2021 at 8:06 PM John Wang <john.wang@gmail.com> wrote:

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: additional term meta data [ In reply to ]
Hi Simon:

This might be specific to us; it makes sense not to make such core changes
if they are not needed.

Here is our use case anyway:

We first sort the index in time order, so doc IDs can be used as a proxy
for time.

In the VoIP world, we are using Lucene to stitch call flows, which is
similar to the APM/tracing use case. The first and last doc IDs of a term
give us the range of the transaction directly, without the need to
traverse the postings list.
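
A small plain-Java sketch of this use case, under stated assumptions: the index is sorted by time, so a hypothetical array timestampsByDocId is non-decreasing, and the first and last doc IDs for a call's term are already known. CallRangeSketch and its methods are invented for illustration and are not part of Lucene or of our PR.

```java
// Hypothetical sketch; none of these names exist in Lucene.
final class CallRangeSketch {

    // With stored first/last doc IDs, the time range is two array reads.
    static long[] timeRange(long[] timestampsByDocId, int firstDocId, int lastDocId) {
        if (firstDocId < 0 || lastDocId >= timestampsByDocId.length || firstDocId > lastDocId) {
            throw new IllegalArgumentException("doc ID range out of bounds");
        }
        return new long[] {timestampsByDocId[firstDocId], timestampsByDocId[lastDocId]};
    }

    // Baseline without the metadata: visit every posting to find the bounds.
    static long[] timeRangeByScan(long[] timestampsByDocId, int[] postings) {
        if (postings.length == 0) {
            throw new IllegalArgumentException("postings must not be empty");
        }
        int first = postings[0];
        int last = postings[0];
        for (int docId : postings) {
            first = Math.min(first, docId);
            last = Math.max(last, docId);
        }
        return new long[] {timestampsByDocId[first], timestampsByDocId[last]};
    }
}
```

Both methods return the same range; the difference is that the scan touches every posting for the term, while the metadata-based version does constant work.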

It would be ideal for us not to have to modify Lucene, and it would be
great to understand how getting the AttributeSource helps with this case.
Let me spend some time learning about it.

Thank you for the suggestion!

-John




On Fri, Jan 8, 2021 at 11:19 PM Simon Willnauer <simon.willnauer@gmail.com>
wrote:

Re: additional term meta data [ In reply to ]
I am close to finishing testing, but I need help finding this test case:
RamUsageTester

any ideas?

Thanks John!
martin

________________________________
From: Martin Gainty <mgainty@hotmail.com>
Sent: Wednesday, January 6, 2021 6:28 AM
To: dev@lucene.apache.org <dev@lucene.apache.org>
Subject: Re: additional term meta data

Re: additional term meta data [ In reply to ]
Hi Martin:

I don't think my PR has this test.

Thanks

-John

On Fri, Jan 22, 2021 at 7:01 AM Martin Gainty <mgainty@hotmail.com> wrote:
