By the way, I was wondering whether any Analyzer uses the following
constructor:
public Token(String text, int start, int end, String typ)
It might be interesting to build an analyzer that recognizes punctuation
marks and keeps them in the index as Tokens with a given type (say, for
example, "punctuation").
The advantage is that this information could be used by the
SloppyPhraseScorer.phraseFreq() method to keep a PhraseQuery from matching
across a punctuation mark. Since PhraseQueries are used for compound words
(e.g. "personal computer") with a given slop value (say 3), it would be
great not to match things such as "It is not personal. My computer hates
me...".
One solution would be to set a slop value of zero, but that is not possible
in my case: I use a module that generates compound terms with slop values in
order to handle morphological variations. For example, in French, "gestion
de la casse" and "gestion des casses" are represented by "gestion casse"^3
and "gestion casses"^3.
This would involve creating a subclass of PhraseQuery, or modifying it by
adding a boolean to it, and modifying the phraseFreq() method so that it
checks that there is no Token with a punctuation type within the scope of
the slop.
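
To make the idea concrete, here is a minimal sketch in plain Java. It does
not use Lucene at all: TypedToken, tokenize() and matches() are hypothetical
names I made up for illustration, and TypedToken only mirrors the shape of
the Token constructor above. It is also simplified to ordered two-term
phrases, and it assumes that punctuation tokens do not count toward the
slop. The real phraseFreq() logic is more involved.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch, not Lucene code.
class PunctuationAwarePhraseSketch {
    static final String WORD = "word";
    static final String PUNCTUATION = "punctuation";

    // Mirrors the shape of Token(String text, int start, int end, String typ).
    static final class TypedToken {
        final String text;
        final int start;
        final int end;
        final String type;

        TypedToken(String text, int start, int end, String type) {
            this.text = text;
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    // Emits lowercased word tokens and one "punctuation" token per mark.
    static List<TypedToken> tokenize(String input) {
        List<TypedToken> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                int start = i;
                while (i < input.length() && Character.isLetterOrDigit(input.charAt(i))) {
                    i++;
                }
                String word = input.substring(start, i).toLowerCase();
                tokens.add(new TypedToken(word, start, i, WORD));
            } else {
                if (".,;:!?".indexOf(c) >= 0) {
                    tokens.add(new TypedToken(String.valueOf(c), i, i + 1, PUNCTUATION));
                }
                i++; // whitespace and other characters are skipped
            }
        }
        return tokens;
    }

    // True if `second` follows `first` with at most `slop` word tokens in
    // between. With stopAtPunctuation set, a punctuation token between the
    // two terms rejects that candidate match.
    static boolean matches(List<TypedToken> tokens, String first, String second,
                           int slop, boolean stopAtPunctuation) {
        for (int i = 0; i < tokens.size(); i++) {
            TypedToken head = tokens.get(i);
            if (!WORD.equals(head.type) || !head.text.equals(first)) {
                continue;
            }
            int skipped = 0; // word tokens seen between the two terms
            for (int j = i + 1; j < tokens.size(); j++) {
                TypedToken tok = tokens.get(j);
                if (PUNCTUATION.equals(tok.type)) {
                    if (stopAtPunctuation) {
                        break;
                    }
                    continue; // assumption: punctuation does not use up slop
                }
                if (tok.text.equals(second)) {
                    return true;
                }
                skipped++;
                if (skipped > slop) {
                    break;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<TypedToken> tokens = tokenize("It is not personal. My computer hates me...");
        System.out.println(matches(tokens, "personal", "computer", 3, false)); // prints true
        System.out.println(matches(tokens, "personal", "computer", 3, true));  // prints false
    }
}
```

The boolean parameter plays the role of the flag mentioned above: with it
off, the sketch behaves like an ordinary sloppy match; with it on, the
sentence boundary between "personal" and "computer" blocks the match.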
What do you think about it? Has anyone already tried something in that
direction? Would it imply heavy changes?
Hugo: maybe you could store your stopwords as tokens with a different type?
----- Original Message -----
From: "hugo burm" <hugob@xs4all.nl>
To: <lucene-user@jakarta.apache.org>
Sent: Wednesday, February 13, 2002 5:32 PM
Subject: How does Lucene handle phrases containing words that are not indexed?
>
> How does Lucene handle phrases (literals) containing words that are not
> indexed (e.g. stopwords, one-letter words, numbers)? I did some tests
> (lucene demo, my own 120000 xml documents, Cocoon search) and in all cases
> it looks like when you are looking for the phrase "a specification" it
> also finds documents which contain "the specification" (or "D. Washington"
> instead of "G. Washington").
>
> Of course you can change the index behaviour and make sure there are no
> stopwords, and all one-letter words and numbers are indexed. But that seems
> a bad approach. A better approach: 1) find all indexed words in the phrase
> and from these words find all documents containing these words. 2) check the
> occurrence of the phrase by opening the original document. I am wondering:
> does Lucene perform step 2)? Of course this step burns some cpu cycles.
>
> Hugo
>
> hugob@xs4all.nl
>
>
> --
> To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
> For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>
>
>
--
To unsubscribe, e-mail: <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>