Mailing List Archive: Can an analyzer access other field's data during index time?

Apr 24, 2023, 7:59 AM

Post #1 of 9 (373 views)

Hi,

I understand that a Lucene analyzer works on a per-field basis. But I wonder whether an analyzer on field A can access the data in field B at any stage of the indexing process, say in a CharFilter, Tokenizer or TokenFilter?

I'd like to control the behavior of the indexing process for field A based upon the value in field B.

Mighty Lucene community, please let me know if this is doable...

Many thanks,

Guan
**********************************************************
Electronic Mail is not secure, may not be read every day, and should not be used for urgent or sensitive issues

Re: Can an analyzer access other field's data during index time? [ In reply to ]

mkhl at apache

Apr 24, 2023, 1:20 PM

Post #2 of 9 (373 views)

Hello Guan.
It reminds me of https://youtu.be/EkkzSLstSAE?t=1531 (the link includes the
timecode). I'm afraid it's quite far from the existing codebase, where a Field
has no reference to its enclosing Document. Sigh.

--
Sincerely yours
Mikhail Khludnev
https://t.me/MUST_SEARCH
A caveat: Cyrillic!
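
Because the analysis chain only ever sees one field's name and text, a decision that depends on field B has to be made where the whole document is still visible, that is, in the code that builds it. A minimal sketch of that idea in plain Java, with no Lucene classes involved; the field names `A` and `B`, the `"exact"` flag value, and the `DocumentAwareTokenizer` class are all hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical pre-analysis step: tokenizes field A according to the value of field B. */
final class DocumentAwareTokenizer {

    /** Splits field A on whitespace; lowercases tokens unless field B equals "exact". */
    static List<String> tokensForA(Map<String, String> doc) {
        String text = doc.getOrDefault("A", "");
        boolean keepCase = "exact".equals(doc.get("B"));
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (t.isEmpty()) {
                continue; // split() yields one empty string for blank input
            }
            tokens.add(keepCase ? t : t.toLowerCase(Locale.ROOT));
        }
        return tokens;
    }
}
```

Lucene fields can be constructed from a pre-built `TokenStream`, which is one place such output could be plugged in; that wiring is not shown here.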

RE: Can an analyzer access other field's data during index time? [ In reply to ]

wanggu at med

Apr 24, 2023, 1:39 PM

Post #3 of 9 (373 views)

Hi Mikhail,

Thank you for the definitive answer!

I could "solve" this by adding a header to the document with the information needed to guide the indexing process. The header would be parsed and then ignored by the tokenizer. However, the header would still be stored in that field together with the actual text...

I wonder (again...) whether I can control which part of the text is stored during indexing? In other words, is it possible to strip the header when the text is stored in the field?

Best regards,

Guan
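
One way to sidestep the storage question is to split the header off in application code before the field is built, so the header never reaches the stored value. A minimal sketch, assuming the header is exactly the first line of the input; the `HeaderSplitter` class and its member names are hypothetical, not part of Lucene:

```java
/** Hypothetical helper: separates a one-line header from the text that follows it. */
final class HeaderSplitter {
    final String header; // first line, without the line terminator
    final String body;   // everything after the first line terminator

    private HeaderSplitter(String header, String body) {
        this.header = header;
        this.body = body;
    }

    /** Assumes the header is the first line; input without a newline is treated as header only. */
    static HeaderSplitter split(String raw) {
        int nl = raw.indexOf('\n');
        if (nl < 0) {
            return new HeaderSplitter(raw, "");
        }
        // Drop a trailing '\r' so Windows line endings do not leak into the header.
        String header = raw.substring(0, nl);
        if (header.endsWith("\r")) {
            header = header.substring(0, header.length() - 1);
        }
        return new HeaderSplitter(header, raw.substring(nl + 1));
    }
}
```

Here `HeaderSplitter.split(raw).body` is the value one would hand to the stored field, while `header` stays available to steer the analysis.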


Re: Can an analyzer access other field's data during index time? [ In reply to ]

mkhl at apache

Apr 24, 2023, 1:55 PM

Post #4 of 9 (373 views)

Well.. maybe something like
https://lucene.apache.org/core/8_5_1/analyzers-common/org/apache/lucene/analysis/miscellaneous/ConditionalTokenFilter.html
?


RE: Can an analyzer access other field's data during index time? [ In reply to ]

wanggu at med

Apr 24, 2023, 2:29 PM

Post #5 of 9 (373 views)

Hi Mikhail,

Thank you for introducing the abstract class ConditionalTokenFilter to me! I took a quick look: it wraps the upstream TokenStream and applies a filter conditionally.

So, if I have a document like:

HEADER
TEXT
TEXT

By implementing a ConditionalTokenFilter, I could arrange for only lines 2 and 3 to be tokenized. However, all 3 lines would still be stored in the field if indexed=true and stored=true...

I wonder if I could store only lines 2 and 3 in the field in such a scenario?

Many thanks,

Guan


Re: Can an analyzer access other field's data during index time? [ In reply to ]

mkhl at apache

Apr 25, 2023, 1:40 AM

Post #6 of 9 (373 views)

Guan,
I hardly grasp the particular obstacle. But I don't think that the task is
out of reach overall. Can you share a test case formally describing the
desired behavior?


RE: Can an analyzer access other field's data during index time? [ In reply to ]

wanggu at med

Apr 25, 2023, 6:08 AM

Post #7 of 9 (373 views)

Hi Mikhail,

Again, thank you so much for getting back to me!

Here is the scenario:

Given a document with an added header line:

HEADER_LINE
ORIGINAL_TEXT_LINE
ORIGINAL_TEXT_LINE...

And a field in managed-schema for the document:

<field name="RPT_TEXT" ... indexed="true" stored="true" ... />

I'd like to extract the information in the HEADER_LINE to guide the indexing for this document. When the document is stored in the field RPT_TEXT, I'd like to remove the HEADER_LINE so only the original document text will be saved.

With a custom tokenizer or the ConditionalTokenFilter you mentioned, extracting the HEADER_LINE should be straightforward. The remaining puzzle is how to strip the HEADER_LINE when the text is saved.

I took a look at the IndexingChain class. It looks like I will have to write a custom Field class and override its stringValue() method.

In a nutshell, I will need two parts to make this work:

1. a custom tokenizer/filter;
2. a custom field;

Let me know if there is any caveat...

And thank you so much for guiding me through!

Guan


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Can an analyzer access other field's data during index time? [ In reply to ]

mkhl at apache

Apr 26, 2023, 1:13 PM

Post #8 of 9 (373 views)

Hello,
It sounds like you are talking about Solr (though this is the Lucene core
mailing list).
If you want to manipulate what gets stored, that is certainly not the
analyzer's job.
Overriding org.apache.solr.schema.FieldType#createFields can be used to
yield indexed and stored (Lucene) fields with different content.
If your logic is that involved, you may also consider extracting the
analysis logic completely:
https://solr.apache.org/guide/solr/latest/indexing-guide/external-files-processes.html#the-preanalyzedfield-type
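
The idea of yielding separate indexed and stored fields from one input can be sketched without any Solr classes. The following is a plain-Java model of the concept, not the real `FieldType#createFields` signature; `FieldSpec` and `SplitFieldFactory` are hypothetical names, and the header is assumed to be the first line of the input:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical description of one field to be handed to the index. */
final class FieldSpec {
    final String name;
    final String value;
    final boolean indexed;
    final boolean stored;

    FieldSpec(String name, String value, boolean indexed, boolean stored) {
        this.name = name;
        this.value = value;
        this.indexed = indexed;
        this.stored = stored;
    }
}

/** Hypothetical factory: one raw input becomes two fields with different content. */
final class SplitFieldFactory {

    /** Returns an indexed-only field with the full input and a stored-only field without the header. */
    static List<FieldSpec> createFields(String name, String raw) {
        int nl = raw.indexOf('\n');
        // Input without a newline is treated as header only, so nothing is stored.
        String body = nl < 0 ? "" : raw.substring(nl + 1);
        List<FieldSpec> fields = new ArrayList<>();
        fields.add(new FieldSpec(name, raw, true, false));  // analyzed; header still visible
        fields.add(new FieldSpec(name, body, false, true)); // returned to clients; header stripped
        return fields;
    }
}
```

In Lucene terms this corresponds to adding two fields under the same name, one indexed but not stored and one stored only.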


RE: Can an analyzer access other field's data during index time? [ In reply to ]

wanggu at med

May 3, 2023, 6:07 AM

Post #9 of 9 (363 views)

Hi Mikhail,

I apologize for the late reply!

Yes, my bad for using a Solr field definition here, since this is the Lucene mailing list. And you are right: how the raw text is stored is not the analyzer's responsibility.

Solr's pre-analyzed field, on the other hand, is a great example! Thank you for the tip! I am reading its SimplePreAnalyzedParser as we speak!

Best regards,

Guan

-----Original Message-----
From: Mikhail Khludnev <mkhl@apache.org>
Sent: Wednesday, April 26, 2023 4:14 PM
To: java-user@lucene.apache.org
Subject: Re: Can an analyzer access other field's data during index time?

External Email - Use Caution

Hello,
It sounds like you are talking about Solr (though it's Lucene core mailing list).
If you want to manipulate what's been stored, it's not the analyzer's duty for sure.
Overriding org.apache.solr.schema.FieldType#createFields can be used to yield indexed and stored (Lucene) fields with different content.
If your logic is so comprehensive you may also consider to completely extract analysis logic
https://solr.apache.org/guide/solr/latest/indexing-guide/external-files-processes.html#the-preanalyzedfield-type

On Tue, Apr 25, 2023 at 4:08?PM Wang, Guan <wanggu@med.umich.edu> wrote:

> Hi Mikhail,
>
> Again, thank you so much for getting back to me!
>
> Here is the scenario:
>
> Given a document with an added header line:
>
> HEADER_LINE
> ORIGINAL_TEXT_LINE
> ORIGINAL_TEXT_LINE...
>
> And a field in managed-schema for the document:
>
> <field name="RPT_TEXT" ... indexed="true" stored="true" ... />
>
> I'd like to extract the information in the HEADER_LINE to guide the
> indexing for this document. When the document is stored in the field
> RPT_TEXT, I'd like to remove the HEADER_LINE so only the original
> document text will be saved.
>
> With a custom tokenizer or the ConditionalToeknFilter as you
> mentioned, extracting the HEADER_LINE should be straightforward. The
> remaining puzzle is to strip the HEADER_LINE during saving.
>
> I took a look at the IndexingChain class. It turns out I will have to
> write a custom Field class and override its stringValue() method.
>
> In a nutshell, I will need two parts to make this work:
>
> 1. a custom tokenizer/filter;
> 2. a custom field;
>
> Let me know if there is any caveat...
>
> And thank you so much for guiding me through!
>
> Guan
>
> -----Original Message-----
> From: Mikhail Khludnev <mkhl@apache.org>
> Sent: Tuesday, April 25, 2023 4:40 AM
> To: java-user@lucene.apache.org
> Subject: Re: Can an analyzer access other field's data during index time?
>
> Guan,
> I hardly grasp the particular obstacle. But I don't think that the
> task is out of reach overall. Can you share a test case formally
> describing the desired behavior?
>
> On Tue, Apr 25, 2023 at 12:29 AM Wang, Guan <wanggu@med.umich.edu> wrote:
>
> > Hi Mikhail,
> >
> > Thank you for introducing abstract class ConditionalTokenFilter to me!
> > Took a quick look, it's a wrapper of the upstream TokenStream
> > with conditional rendition.
> >
> > So, if I have a document like:
> >
> > HEADER
> > TEXT
> > TEXT
> >
> > Implementing ConditionalTokenFilter could tokenize only lines 2 and 3.
> > However, all 3 lines would still be stored in the field if
> > indexed=true and stored=true...
> >
> > I wonder if I could only store line 2 and 3 in the field in such a
> > scenario?
> >
> > Many thanks,
> >
> > Guan
> >
> > -----Original Message-----
> > From: Mikhail Khludnev <mkhl@apache.org>
> > Sent: Monday, April 24, 2023 4:56 PM
> > To: java-user@lucene.apache.org
> > Subject: Re: Can an analyzer access other field's data during index time?
> >
> > Well.. maybe something like
> >
> > https://lucene.apache.org/core/8_5_1/analyzers-common/org/apache/lucene/analysis/miscellaneous/ConditionalTokenFilter.html
> > ?
> >
> > On Mon, Apr 24, 2023 at 11:40 PM Wang, Guan <wanggu@med.umich.edu> wrote:
> >
> > > Hi Mikhail,
> > >
> > > Thank you for the definitive answer!
> > >
> > > I could "solve" this by adding a header in the document with
> > > proper information to guide the indexing process. Header will be
> > > parsed then ignored by the tokenizer. However, the header along
> > > with the actual text will be stored together in that field...
> > >
> > > I wonder (again...) if it's possible I may control which part of
> > > the text shall be stored during the index process? In other words,
> > > is it possible to strip the header when storing the text into the field?
> > >
> > > Best regards,
> > >
> > > Guan
> > >
> > > -----Original Message-----
> > > From: Mikhail Khludnev <mkhl@apache.org>
> > > Sent: Monday, April 24, 2023 4:20 PM
> > > To: java-user@lucene.apache.org
> > > Subject: Re: Can an analyzer access other field's data during index time?
> > >
> > > Hello Guan.
> > > It reminds me https://youtu.be/EkkzSLstSAE?t=1531 timecode.
> > > I'm afraid it's quite far from the existing codebase where the
> > > Field has no reference to enclosing Document. sigh.
> > >
> > >
> > > On Mon, Apr 24, 2023 at 6:00 PM Wang, Guan <wanggu@med.umich.edu> wrote:
> > >
> > > > Hi,
> > > >
> > > > I understand Lucene analyzer is per field basis. But I wonder if
> > > > it's even possible for an analyzer on field A to be able to
> > > > access data in field B during the index process on any stage,
> > > > saying CharFilter, Tokenizer or TokenFilter?
> > > >
> > > > I'd like to control the behavior of the indexing process for
> > > > field A based upon the value in field B.
> > > >
> > > > Mighty Lucene community, please let me know if this is doable...
> > > >
> > > > Many thanks,
> > > >
> > > > Guan
> > > >
> > >
> > >
> > > --
> > > Sincerely yours
> > > Mikhail Khludnev
> > > https://t.me/MUST_SEARCH
> > > A caveat: Cyrillic!
> > >
> >
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev
> > https://t.me/MUST_SEARCH
> > A caveat: Cyrillic!
> >
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> https://t.me/MUST_SEARCH
> A caveat: Cyrillic!
>
>

--
Sincerely yours
Mikhail Khludnev
https://t.me/MUST_SEARCH
A caveat: Cyrillic!