Mailing List Archive

Beginner Question: Tokenized and full phrase
Hello

We use Lucene to index POJOs which are stored in the database.
The index primarily contains text fields.

After some work with Lucene I came across a strange restriction.
I can only assign string or text fields to the document to be indexed.
A string field indexes only the whole string; a text field indexes only the single words (tokens).
As a result, a query finds either single words or the whole text, depending on the field type used.
But we need both: the search should find the whole text as well as single words.
Even after a long analysis of the documentation and partly of the source code,
I'm not sure how to achieve that in a clean way.
Could someone give me a tip on how to do this?

Thanks

Roland

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Beginner Question: Tokenized and full phrase
In the Lucene context you simply have tokens. In the analyzed case (i.e. a text field), the tokens are whatever pieces the analysis chain you construct splits the incoming stream into. In the string case, the single token is the entire input. That’s just the way it works.
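
To make the token picture concrete, here is a minimal sketch in plain Java. It is not Lucene API: `TokenModel`, `analyzeText`, `analyzeString` and `termMatches` are hypothetical names, and the "analysis chain" is assumed to be nothing more than splitting on non-alphanumeric characters and lowercasing. It only models why a single-term lookup behaves differently against the two field styles.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Plain-Java model of the two field styles; this is NOT Lucene API.
class TokenModel {

    // Models an analyzed ("text") field: split on anything that is not a
    // letter or digit, lowercase each piece, and drop empty pieces.
    static List<String> analyzeText(String input) {
        List<String> tokens = new ArrayList<>();
        for (String piece : input.split("[^\\p{L}\\p{Nd}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Models a non-analyzed ("string") field: the entire input is one token.
    static List<String> analyzeString(String input) {
        return Collections.singletonList(input);
    }

    // A single-term lookup matches only if the term equals an indexed token.
    static boolean termMatches(List<String> indexedTokens, String term) {
        return indexedTokens.contains(term);
    }

    public static void main(String[] args) {
        String value = "My dog has fleas";
        // The text-style field can be found by word, but not by the full value.
        System.out.println(termMatches(analyzeText(value), "dog"));
        System.out.println(termMatches(analyzeText(value), value));
        // The string-style field can be found by the full value only.
        System.out.println(termMatches(analyzeString(value), "dog"));
        System.out.println(termMatches(analyzeString(value), value));
    }
}
```

Note that in this model the string-style token keeps its original case, so a lookup has to supply the value exactly as it was indexed.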

You have two choices:

1> Use two fields, one text-based and one string-based. Your query runs the search text against whichever one is appropriate. I’ll add that if you want limited analysis, say lowercasing the entire input string, use a text-based field with something like KeywordTokenizer + LowerCaseFilter rather than a string field.


2> Use a text field and do phrase searching when you want the whole thing to match. The flaw here is that if the text were “my dog has fleas” and you searched for “my dog” (as a phrase), you’d get a match. You can get around that by adding another field that holds the word count and then searching for something like “my dog” AND word_count:2.
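
The matching logic behind the two choices can be sketched in plain Java. This is a model under stated assumptions, not Lucene code: `FieldOptionsModel` and its methods are hypothetical names, the text analysis is assumed to be lowercasing plus whitespace splitting, and `exactToken` only imitates what KeywordTokenizer + LowerCaseFilter would produce (one lowercased token for the whole input).

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Plain-Java model of the two options; none of this is Lucene API.
class FieldOptionsModel {

    // Stand-in for an analyzed text field: lowercase, split on whitespace.
    static List<String> tokens(String text) {
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    // Stand-in for KeywordTokenizer + LowerCaseFilter: one lowercased token.
    static String exactToken(String text) {
        return text.toLowerCase(Locale.ROOT);
    }

    // Stand-in for a phrase search: the query tokens must appear
    // contiguously and in order somewhere in the document tokens.
    static boolean phraseMatches(String docText, String phrase) {
        List<String> doc = tokens(docText);
        List<String> query = tokens(phrase);
        return !query.isEmpty() && Collections.indexOfSubList(doc, query) >= 0;
    }

    // Option 1: whole-value searches go against the single-token field.
    static boolean wholeMatchTwoFields(String docText, String query) {
        return exactToken(docText).equals(exactToken(query));
    }

    // Option 2: phrase match AND word_count equal to the query's word count.
    static boolean wholeMatchWordCount(String docText, String query) {
        int wordCount = tokens(docText).size(); // stored alongside the text
        return phraseMatches(docText, query) && wordCount == tokens(query).size();
    }

    public static void main(String[] args) {
        String doc = "my dog has fleas";
        System.out.println(phraseMatches(doc, "my dog"));       // the flaw
        System.out.println(wholeMatchWordCount(doc, "my dog")); // the workaround
        System.out.println(wholeMatchTwoFields(doc, "My Dog Has Fleas"));
    }
}
```

One difference shows up even in this toy version: the single-token comparison is sensitive to extra whitespace inside the value, while the phrase-plus-word-count check is not, because it compares tokens rather than the raw string.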


Best,
Erick

> On Sep 2, 2019, at 4:38 AM, Roland Käser <roland.kaeser@ziil.ch> wrote:
>
> Hallo
>
> We use Lucene to index POJO's which are stored in the database.
> The index primarily contains text fields.
>
> After some work with lucene I came across a strange restriction.
> I can only assign string or text fields to the document to be indexed.
> One only indexes the whole string, the other only the single words or tokens.
> This results in the query finding only single words or the whole text, depending on the field type used.
> But we would need both, the search should find the whole text as well as single words.
> Even after a long analysis of the documentation and partly of the source code,
> I'm not sure how to achieve that in a clean way.
> Could someone give me a tip on how to do this?
>
> Thanks
>
> Roland