Mailing List Archive: How to get terms of a particular field of a particular document

How to get terms of a particular field of a particular document

michael.wechner at wyona

Nov 12, 2023, 8:46 AM

Post #1 of 6 (123 views)

Hi

IIUC I can get all terms of a particular field of an index with

// Needs java.nio.file.Paths and org.apache.lucene.store.FSDirectory
IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("index_directory")));
List<LeafReaderContext> list = reader.leaves();
for (LeafReaderContext lrc : list) {
    // terms(...) returns null if this segment has no such field
    Terms terms = lrc.reader().terms("field_name");
    if (terms != null) {
        TermsEnum termsEnum = terms.iterator();
        BytesRef term = null;
        while ((term = termsEnum.next()) != null) {
            System.out.println("Term: " + term.utf8ToString());
        }
    }
}
reader.close();

But how can I get all terms of a particular field of a particular document?
Thanks
Michael

P.S.: Btw, does it make sense to update the Lucene FAQ
https://cwiki.apache.org/confluence/display/lucene/lucenefaq#LuceneFAQ-HowdoIretrieveallthevaluesofaparticularfieldthatexistswithinanindex,acrossalldocuments?
with the code above?
I can do this, but I want to make sure that I don't update it incorrectly.

Re: How to get terms of a particular field of a particular document

Nov 12, 2023, 9:46 AM

Post #2 of 6 (123 views)

Hello,
This is what highlighters do. There are two options:
- index term vectors and read them back at search time.
- load the stored field value, analyze it again, and collect the resulting terms.
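
A minimal sketch of the first option (term vectors), assuming a recent Lucene 9.x release (the thread links to the 9.8 docs) where IndexReader.termVectors() is available. The class name TermVectorTerms, the helper names, the field name "body" and the in-memory ByteBuffersDirectory are illustrative choices, not something from this thread. The field must be indexed with term vectors enabled, otherwise nothing is returned.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.BytesRef;

// Hypothetical demo class: reads the terms of one field of one document from its term vector.
final class TermVectorTerms {

    // Index a single document whose field stores term vectors (illustrative helper).
    static void indexOne(Directory dir, String field, String text) throws IOException {
        FieldType withVectors = new FieldType(TextField.TYPE_NOT_STORED);
        withVectors.setStoreTermVectors(true);
        withVectors.freeze();
        try (Analyzer analyzer = new StandardAnalyzer();
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new Field(field, text, withVectors));
            writer.addDocument(doc);
        }
    }

    // Return the distinct terms of the given field for the given document id.
    static List<String> termsOfDocument(IndexReader reader, int docId, String field) throws IOException {
        List<String> result = new ArrayList<>();
        Terms vector = reader.termVectors().get(docId, field);
        if (vector == null) {
            return result; // no term vector stored for this document/field
        }
        TermsEnum termsEnum = vector.iterator();
        BytesRef term;
        while ((term = termsEnum.next()) != null) {
            result.add(term.utf8ToString());
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        try (Directory dir = new ByteBuffersDirectory()) {
            indexOne(dir, "body", "Apache Lucene is a great search library!");
            try (IndexReader reader = DirectoryReader.open(dir)) {
                for (String term : termsOfDocument(reader, 0, "body")) {
                    System.out.println("Term: " + term);
                }
            }
        }
    }
}
```

In a real application the document id would normally come from a search hit (ScoreDoc.doc) rather than the hard-coded 0 used here. A term vector lists each distinct term once, so repeated words in the text do not show up twice.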
Good Luck

--
Sincerely yours
Mikhail Khludnev

Re: How to get terms of a particular field of a particular document

michael.wechner at wyona

Nov 12, 2023, 12:42 PM

Post #3 of 6 (123 views)

Hi Mikhail

Thank you very much for your feedback!

I have found various examples for the first option when running a query,
e.g.

https://howtodoinjava.com/lucene/lucene-search-highlight-example/

but I don't understand how to implement the second option, i.e. how to
get the extracted terms of a document field independently of a query.

Can you maybe give a code example?

Thanks

Michael

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: How to get terms of a particular field of a particular document

Nov 12, 2023, 1:00 PM

Post #4 of 6 (123 views)

It's done somewhere around here:
https://github.com/apache/lucene/blob/4e2ce76b3e131ba92b7327a52460e6c4d92c5e33/lucene/highlighter/src/java/org/apache/lucene/search/highlight/Highlighter.java#L159
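
A minimal sketch of the second option (load the stored value and analyze it again), assuming a recent Lucene 9.x release where IndexReader.storedFields() is available. The class name ReanalyzeStoredField, the helper names and the field name "body" are made up for illustration, and the sketch assumes the field was indexed as a stored field and that the same Analyzer is used as at index time.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

// Hypothetical demo class: re-analyzes the stored value of one field of one document.
final class ReanalyzeStoredField {

    // Index a single document with a stored text field (illustrative helper).
    static void indexOne(Directory dir, Analyzer analyzer, String field, String text) throws IOException {
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new TextField(field, text, Field.Store.YES));
            writer.addDocument(doc);
        }
    }

    // Load the stored value and run it through the analyzer again.
    static List<String> termsOfStoredField(IndexReader reader, int docId, String field, Analyzer analyzer)
            throws IOException {
        List<String> result = new ArrayList<>();
        String text = reader.storedFields().document(docId).get(field);
        if (text == null) {
            return result; // the field was not stored for this document
        }
        try (TokenStream stream = analyzer.tokenStream(field, text)) {
            CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                result.add(termAtt.toString());
            }
            stream.end();
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        try (Directory dir = new ByteBuffersDirectory(); Analyzer analyzer = new StandardAnalyzer()) {
            indexOne(dir, analyzer, "body", "Apache Lucene is a great search library!");
            try (IndexReader reader = DirectoryReader.open(dir)) {
                for (String term : termsOfStoredField(reader, 0, "body", analyzer)) {
                    System.out.println("Term: " + term);
                }
            }
        }
    }
}
```

Re-analysis yields the tokens in document order, including repeats, and it only matches what is in the index if the analyzer is configured the same way as when the document was written.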

--
Sincerely yours
Mikhail Khludnev

Re: How to get terms of a particular field of a particular document

michael.wechner at wyona

Nov 12, 2023, 2:36 PM

Post #5 of 6 (123 views)

Thanks again. I think I have now found what I wanted (without needing the Highlighter):

// Needs java.nio.file.Paths and org.apache.lucene.store.FSDirectory
IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("index_directory")));
log.info("Get terms of document ...");
// text is the field's content and analyzer an Analyzer, both defined elsewhere
TokenStream stream = TokenSources.getTokenStream("field_name", null, text, analyzer, -1);
stream.reset();
while (stream.incrementToken()) {
    log.info("Term: " + stream.getAttribute(CharTermAttribute.class));
}
stream.close();
reader.close();

Thanks

Michael

Re: How to get terms of a particular field of a particular document

michael.wechner at wyona

Nov 13, 2023, 1:30 AM

Post #6 of 6 (123 views)

I just realized that the code can be even simpler:

String text = "Apache Lucene is a great search library!";
TokenStream stream = TokenSources.getTokenStream(null, null, text, new StandardAnalyzer(), -1);
// INFO: See https://lucene.apache.org/core/9_8_0/core/org/apache/lucene/analysis/TokenStream.html
stream.reset();
while (stream.incrementToken()) {
    log.info("Token: " + stream.getAttribute(CharTermAttribute.class));
}
stream.end();
stream.close();

The code also seems to work without stream.end(), but if I understand the documentation at
https://lucene.apache.org/core/9_8_0/core/org/apache/lucene/analysis/TokenStream.html
correctly, one should call it after the last incrementToken() so that end-of-stream operations, such as setting the final offset, can run.
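
To make the effect of end() visible, here is a small sketch that records the end offset of the last token and then the offset reported after end(). It assumes a recent Lucene 9.x release; the class name TokenStreamContract, the method name and the field name "body" are invented for this illustration. It calls Analyzer.tokenStream directly instead of going through TokenSources, and uses try-with-resources so that close() is always called.

```java
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

// Hypothetical demo class: follows the reset / incrementToken / end / close workflow.
final class TokenStreamContract {

    // Returns { end offset of the last token, final offset reported after end() }.
    static int[] lastAndFinalOffset(Analyzer analyzer, String text) throws IOException {
        int lastTokenEnd = -1;
        int finalOffset;
        try (TokenStream stream = analyzer.tokenStream("body", text)) {
            CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
            OffsetAttribute offsetAtt = stream.addAttribute(OffsetAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                System.out.println("Token: " + termAtt
                        + " [" + offsetAtt.startOffset() + "," + offsetAtt.endOffset() + ")");
                lastTokenEnd = offsetAtt.endOffset();
            }
            stream.end(); // lets the stream set its final offset
            finalOffset = offsetAtt.endOffset();
        }
        return new int[] {lastTokenEnd, finalOffset};
    }

    public static void main(String[] args) throws IOException {
        try (Analyzer analyzer = new StandardAnalyzer()) {
            int[] offsets = lastAndFinalOffset(analyzer, "Apache Lucene is a great search library!");
            System.out.println("Last token ends at " + offsets[0] + ", final offset is " + offsets[1]);
        }
    }
}
```

With this input the final offset should be larger than the end of the last token, because the trailing "!" is not part of any token. Consumers that only collect term text will not notice a missing end(), which may be why the code appeared to work without it.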

Thanks

Michael
