Mailing List Archive: Re : How does Lucene handle phrases containing words that are not indexed?

Re : How does Lucene handle phrases containing words that are not indexed?

Feb 13, 2002, 10:08 AM

Post #1 of 4 (878 views)

By the way, is there any Analyzer that uses the following constructor?
public Token(String text, int start, int end, String typ)

Maybe it would be interesting to build an analyzer that recognizes
punctuation marks and keeps them in the index as Tokens with a given
type (say, "punctuation")?

The advantage is that a SloppyPhraseScorer.phraseFreq() method could use
that information to keep a PhraseQuery from matching across a punctuation
mark. Since PhraseQueries are used for compound words (e.g. "personal
computer") with a given slop value (say 3), it would be great not to match
things such as "It is not personal. My computer hates me...".

One solution would be to set a slop value of zero, but that is not possible
in my case: I use a module that generates compound terms with slop values
in order to handle morphological variations. For example, in French,
"gestion de la casse" and "gestion des casses" are represented by
"gestion casse"^3 and "gestion casses"^3.

This would involve creating a subclass of PhraseQuery, or modifying it by
adding a boolean flag, and changing the phraseFreq() method so that it
checks that no Token with a punctuation type lies within the scope of the
slop.
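
To make the idea concrete, here is a minimal sketch in plain Java. It does
not use Lucene at all: Tok, tokenize() and matches() are hypothetical
stand-ins for Token, an analyzer and the phraseFreq() check. The matching is
a simplified, ordered, two-term version of a sloppy phrase, where slop just
means the maximum number of tokens allowed between the two words.

```java
import java.util.ArrayList;
import java.util.List;

public class PunctuationAwarePhrase {

    /** A typed token with a position (hypothetical, loosely modeled on Token). */
    static final class Tok {
        final String text;
        final String type;    // "word" or "punctuation"
        final int position;

        Tok(String text, String type, int position) {
            this.text = text;
            this.type = type;
            this.position = position;
        }
    }

    /** Lowercases words; each '.', '!' or '?' becomes its own "punctuation" token. */
    static List<Tok> tokenize(String input) {
        List<Tok> out = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i <= input.length(); i++) {
            char c = i < input.length() ? input.charAt(i) : ' '; // sentinel flushes last word
            if (Character.isLetterOrDigit(c)) {
                word.append(Character.toLowerCase(c));
                continue;
            }
            if (word.length() > 0) {
                out.add(new Tok(word.toString(), "word", out.size()));
                word.setLength(0);
            }
            if (c == '.' || c == '!' || c == '?') {
                out.add(new Tok(String.valueOf(c), "punctuation", out.size()));
            }
        }
        return out;
    }

    static boolean isWord(Tok t, String text) {
        return t.type.equals("word") && t.text.equals(text);
    }

    /**
     * True if `second` follows `first` with at most `slop` tokens in between.
     * With respectPunctuation set, a punctuation token between them blocks the match.
     */
    static boolean matches(List<Tok> toks, String first, String second,
                           int slop, boolean respectPunctuation) {
        for (int i = 0; i < toks.size(); i++) {
            if (!isWord(toks.get(i), first)) continue;
            for (int j = i + 1; j < toks.size() && j - i - 1 <= slop; j++) {
                Tok t = toks.get(j);
                if (respectPunctuation && t.type.equals("punctuation")) break;
                if (isWord(t, second)) return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Tok> toks = tokenize("It is not personal. My computer hates me...");
        System.out.println(matches(toks, "personal", "computer", 3, false));
        System.out.println(matches(toks, "personal", "computer", 3, true));
    }
}
```

With the flag off, the example sentence matches within a slop of 3; with the
flag on, the full stop after "personal" stops the scan and the match is
rejected.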

What do you think? Has anyone already tried something in that direction?
Would it imply heavy changes?

Hugo: maybe you could store your stopwords as tokens with a different type?
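
As background for Hugo's question quoted below, here is a toy model in plain
Java of why "a specification" can match "the specification". It assumes that
stop words and one-letter words are simply dropped before the phrase is
compared, which is consistent with what he reports but is an assumption, not
a description of Lucene's internals. The class, the method names and the
stop word list are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopwordPhraseDemo {

    // Illustrative stop word list, not Lucene's.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "of", "is"));

    /** Lowercases, splits on non-alphanumerics, drops stop words and one-letter words. */
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String raw : text.toLowerCase().split("[^a-z0-9]+")) {
            if (raw.length() <= 1 || STOP_WORDS.contains(raw)) continue;
            terms.add(raw);
        }
        return terms;
    }

    /** True if the analyzed phrase occurs as a contiguous run in the analyzed document. */
    static boolean phraseMatches(String document, String phrase) {
        List<String> p = analyze(phrase);
        return !p.isEmpty() && Collections.indexOfSubList(analyze(document), p) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(phraseMatches("The specification is final.", "a specification"));
        System.out.println(phraseMatches("D. Washington", "G. Washington"));
    }
}
```

Once "a", "the", "D" and "G" are gone, both the query and the document are
reduced to the same remaining terms, so the two phrases cannot be told apart.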

----- Original Message -----
From: "hugo burm" <hugob@xs4all.nl>
To: <lucene-user@jakarta.apache.org>
Sent: Wednesday, February 13, 2002 5:32 PM
Subject: How does Lucene handle phrases containing words that are not indexed?

>
> How does Lucene handle phrases (literals) containing words that are not
> indexed (e.g. stopwords, one-letter words, numbers)? I did some tests
> (lucene demo, my own 120000 xml documents, Cocoon search) and in all cases
> it looks like a search for the phrase "a specification" also finds
> documents which contain "the specification" (or "D. Washington" instead
> of "G. Washington").
>
> Of course you can change the indexing behaviour and make sure there are no
> stopwords, and that all one-letter words and numbers are indexed. But that
> seems a bad approach. A better approach: 1) find all indexed words in the
> phrase and from these words find all documents containing them; 2) check
> the occurrence of the phrase by opening the original document. I am
> wondering: does Lucene perform step 2)? Of course this step burns some
> CPU cycles.
>
> Hugo
>
> hugob@xs4all.nl
>
>
> --
> To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
> For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>
>
>

--
To unsubscribe, e-mail: <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>

RE: Re : How does Lucene handle phrases containing words that are not indexed?

halacsy.peter at axelero

Feb 14, 2002, 2:33 AM

Post #2 of 4 (862 views)

Hello,
I think my problem is something similar.

> -----Original Message-----
> From: Julien Nioche [mailto:julien.nioche@lingway.com]
> Sent: Wednesday, February 13, 2002 6:09 PM
> To: Lucene Developers List
> Subject: Re : How does Lucene handle phrases containing words
> that are not indexed?
>

> PhraseQueries are used for compound words (e.g. "personal computer")
> with a given slop value (say 3), it would be great not to match things
> such as "It is not personal. My computer hates me...".
>

I'd like to index documents that are described by keywords. One document can have zero or more keywords, and a keyword can be related to one or more documents. Assume two keywords:
"human computer interaction"
"computer science"

If I add these keywords to a document in a single field and someone searches with the query human science, the document will be found, won't it? I could use, say, 16 distinct fields for a maximum of 16 keywords and translate the query keyword:"human science" to keyword1:"human science" or keyword2:"human science" ... keyword16:"human science", but I don't like this solution.

peter

RE: Re : How does Lucene handle phrases containing words that are not indexed?

DCutting at grandcentral

Feb 14, 2002, 10:33 AM

Post #3 of 4 (854 views)

> From: Halácsy Péter [mailto:halacsy.peter@axelero.com]
>
> I'd like to index documents that are described by keywords.
> One document can have zero or more keywords, and a keyword can
> be related to one or more documents. Assume two keywords:
> "human computer interaction"
> "computer science"
>
> If I add these keywords to a document in a single field and someone
> searches with the query human science, the document will be found,
> won't it? I could use, say, 16 distinct fields for a maximum of
> 16 keywords and translate the query keyword:"human science"
> to keyword1:"human science" or keyword2:"human science" ...
> keyword16:"human science", but I don't like this solution.

This sounds like a good case for an untokenized field.

When you index, use something like:

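Document doc = new Document();
// Field.Keyword indexes each value as a single, untokenized term.
doc.add(Field.Keyword("keyword", "computer science"));
doc.add(Field.Keyword("keyword", "human computer interaction"));
...
// indexWriter is an open IndexWriter for your index.
indexWriter.addDocument(doc);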

Then you can either add query keywords "manually":

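// "contents" stands in for whatever default field you search.
BooleanQuery query =
    (BooleanQuery) QueryParser.parse("other terms", "contents", analyzer);
// The two flags are: required = true, prohibited = false.
query.add(new TermQuery(new Term("keyword", "computer science")), true, false);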

or you can integrate this with the query parser by making an analyzer that
constructs terms for the field named "keyword" using exactly the provided
input:

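import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharTokenizer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class MyAnalyzer extends Analyzer {
  private Analyzer standard = new StandardAnalyzer();

  public TokenStream tokenStream(String field, final Reader reader) {
    if ("keyword".equals(field)) {
      // Treat every character as a token character, so the input is not split.
      return new CharTokenizer(reader) {
        protected boolean isTokenChar(char c) { return true; }
      };
    } else {
      // All other fields get normal tokenization.
      return standard.tokenStream(field, reader);
    }
  }
}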

Analyzer analyzer = new MyAnalyzer();
// "contents" stands in for whatever default field you search.
Query query =
    QueryParser.parse("keyword:\"computer science\"", "contents", analyzer);

I haven't tested the above code, but I hope you get the idea.
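
To see why a single untokenized term avoids the cross-keyword match, here is
a toy model in plain Java. No Lucene is involved, the class and method names
are made up for illustration, and "search" is reduced to a set lookup, so it
only sketches the difference between the two kinds of field.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class KeywordFieldDemo {

    /** Tokenized field: every word of every keyword becomes a separate term. */
    static Set<String> tokenizedTerms(List<String> keywords) {
        Set<String> terms = new HashSet<>();
        for (String keyword : keywords) {
            terms.addAll(Arrays.asList(keyword.toLowerCase().split("\\s+")));
        }
        return terms;
    }

    /** Untokenized field: each keyword is kept whole as one term. */
    static Set<String> untokenizedTerms(List<String> keywords) {
        return new HashSet<>(keywords);
    }

    /** True if every word of the query is present among the terms. */
    static boolean containsAllWords(Set<String> terms, String query) {
        return terms.containsAll(Arrays.asList(query.toLowerCase().split("\\s+")));
    }

    public static void main(String[] args) {
        List<String> keywords =
                Arrays.asList("human computer interaction", "computer science");

        // Words from different keywords mix, so "human science" is found.
        System.out.println(containsAllWords(tokenizedTerms(keywords), "human science"));

        // Only the exact keywords exist as terms.
        System.out.println(untokenizedTerms(keywords).contains("human science"));
        System.out.println(untokenizedTerms(keywords).contains("computer science"));
    }
}
```

In the tokenized case the words "human" and "science" both end up in the
same field even though they come from different keywords; in the untokenized
case only the two complete keywords can be matched.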

Doug

RE: Re : How does Lucene handle phrases containing words that are not indexed?

DCutting at grandcentral

Feb 14, 2002, 10:42 AM

Post #4 of 4 (860 views)

> From: Julien Nioche [mailto:julien.nioche@lingway.com]
>
> By the way, is there any Analyzer that uses the following constructor?
> public Token(String text, int start, int end, String typ)

StandardTokenizer uses Token's type field to communicate with
StandardFilter, which does some post-processing.

> Maybe it would be interesting to build an analyzer that recognizes
> punctuation marks and keeps them in the index as Tokens with a given
> type (say, "punctuation")?

Unfortunately token type is not stored in the index. Adding it could have a
big impact on index size and search performance.

> The advantage is that a SloppyPhraseScorer.phraseFreq() method could use
> that information to keep a PhraseQuery from matching across a punctuation
> mark. Since PhraseQueries are used for compound words (e.g. "personal
> computer") with a given slop value (say 3), it would be great not to match
> things such as "It is not personal. My computer hates me...".

On the other hand, you'd miss things like, "He needs a new computer.
Personal computing has advanced since 1970."

Still, constraining matches to be within a sentence can be useful, but
Lucene does not currently support it, and I don't see an easy way to add it.

Doug