Mailing List Archive

Searching number of tokens in text field
Hello,

I was wondering if it is possible to search on the number of tokens in a
text field. For example, find book titles with 3 or more words. I don't
mind adding a field holding the number of tokens to the search index, but I
would like to avoid analyzing the text twice. Can Lucene search for
the number of tokens in a text field? Or can I get the number of tokens
after analysis and add it to the Lucene document before/during indexing?
Or do I need to analyze the text myself and add the field to the document
(analyze the text twice, once myself, once in the IndexWriter)?

Thanks,
Matt Davis
Re: Searching number of tokens in text field
I don't know of any pre-existing thing that does exactly this, but how
about a token filter that counts tokens (or positions maybe), and then
appends some special token encoding the length?
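A minimal sketch of that idea in plain Java, outside of Lucene's actual TokenFilter API (the class name and the `__len_` sentinel are made up for illustration): pass tokens through, count them, and append a special token encoding the length, which a plain term query can later match.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: a pass-through "filter" that counts tokens and appends a
// special length token, e.g. ["war", "and", "peace", "__len_3"].
// A term query for "__len_3" on the field then matches 3-token titles.
public class LengthTokenSketch {
    static List<String> appendLengthToken(List<String> tokens) {
        List<String> out = new ArrayList<>(tokens);
        out.add("__len_" + tokens.size());
        return out;
    }

    public static void main(String[] args) {
        System.out.println(appendLengthToken(Arrays.asList("war", "and", "peace")));
        // -> [war, and, peace, __len_3]
    }
}
```

For a "3 or more words" query you would either expand the query over the length tokens you care about, or index the count as a numeric field and use a range query instead.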

Re: Searching number of tokens in text field
That is a clever idea. I would still prefer something cleaner, but this
could work. Thanks!

Re: Searching number of tokens in text field
This comes up occasionally; it’d be a neat thing to add to Solr if you’re motivated. It gets tricky, though.

- Part of the config would have to be the name of the length field to put the result into; that part’s easy.

- The trickier part is “when should the count be incremented?”. For instance, say you add 15 synonyms for a particular word. Would that add 1 or 16 to the count? What about WordDelimiterGraphFilterFactory, which can output N tokens in place of one? Do stopwords count? What about shingles? CJK languages? The list goes on.

If you tackle this I suggest you open a JIRA for discussion, probably a Lucene JIRA ‘cause the folks who deal with Lucene would have the best feedback. And probably ignore most of the possible interactions with other filters and document that most users should just put it immediately after the tokenizer and leave it at that ;)

I can think of a few other options, but about the only thing that I think makes sense is something like “countTokensInTheSamePosition=true|false” (there’s _GOT_ to be a better name for that!), defaulting to false so you could control whether synonym expansion and WDGFF insertions incremented the count or not. And I suspect that if you put such a filter after WDGFF, you’d also want to document that it should go after FlattenGraphFilterFactory, but trust any feedback on a Lucene JIRA over my suspicion...

Best,
Erick
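The distinction behind the proposed countTokensInTheSamePosition flag can be illustrated with plain (term, position increment) pairs, outside of any real Lucene filter (the class, record, and flag names here are made up): stacked synonyms arrive with a position increment of 0, so counting only tokens with a non-zero increment collapses them into one.

```java
public class PositionCountSketch {
    // A token plus its position increment, as a Lucene analysis chain
    // would report it: 1 for a new position, 0 for a stacked synonym.
    record Tok(String term, int posInc) {}

    // If countStacked is true, every token counts; otherwise tokens
    // stacked at the same position (posInc == 0) count once.
    static int countTokens(Tok[] tokens, boolean countStacked) {
        int n = 0;
        for (Tok t : tokens) {
            if (countStacked || t.posInc() > 0) n++;
        }
        return n;
    }

    public static void main(String[] args) {
        Tok[] toks = {
            new Tok("big", 1),
            new Tok("large", 0),   // synonym stacked on "big"
            new Tok("dog", 1),
        };
        System.out.println(countTokens(toks, true));  // 3: every token counted
        System.out.println(countTokens(toks, false)); // 2: positions counted
    }
}
```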


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Searching number of tokens in text field
Norms encode the number of tokens in the field, but in a lossy manner (1
byte by default), so you could probably create a custom query that filtered
based on that, if you could tolerate the loss in precision? Or maybe
change your norms storage to more precision?

You could use NormsFieldExistsQuery as a starting point for the sources for
your custom query. Or maybe there's already a more similar Query based on
norms?

Mike McCandless

http://blog.mikemccandless.com
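To see what "lossy, 1 byte by default" means in practice: Lucene's real encoding lives in its SmallFloat/Similarity code, but the simplified stand-in below (not Lucene's actual scheme) shows the key property, which is that short lengths such as a 3-word title survive exactly while long lengths collapse into buckets.

```java
public class LossyLengthSketch {
    // Simplified stand-in for a lossy 1-byte length encoding (NOT
    // Lucene's real SmallFloat scheme): keep only the top 3 significant
    // bits, so small lengths stay exact and large ones fall into buckets.
    static int quantize(int length) {
        int bits = 32 - Integer.numberOfLeadingZeros(length);
        int shift = Math.max(0, bits - 3);
        return (length >> shift) << shift;
    }

    public static void main(String[] args) {
        System.out.println(quantize(3));    // 3   - short lengths are exact
        System.out.println(quantize(100));  // 96  - bucketed
        System.out.println(quantize(1000)); // 896 - coarser still
    }
}
```

So a norms-based filter for "titles with 3 or more words" can be exact at the short end even though the encoding is lossy overall.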


Re: Searching number of tokens in text field
Thanks Mike, that is very helpful. Am I reading the code correctly that the
lossy norm encoding is done in the Similarity? How do you set the number
of bytes used for the norms?

Thanks,
Matt
