Mailing List Archive

Boolean Query Parsing with "IN" keyword

I'm trying to search on a US State field. The Lucene field name is "state"
and so I'm building a query like: +(state:fl state:al state:in) to search
for documents in Florida, Alabama, or Indiana. But whenever I pass "in" or
"IN" to the QueryParser it strips it out. Passing the above query to the
QueryParser yields +(state:fl state:al). Is there a way to escape the "in"
keyword? I've tried enclosing it in double and single quotes, neither of
which worked.

Thanks,
Jonathan Franzone



--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>
Re: Boolean Query Parsing with "IN" keyword
Jonathan,

That's most likely caused by StandardAnalyzer, which you are probably
using. 'in' is listed as one of the stop words:

public static final String[] STOP_WORDS = {
    "a", "and", "are", "as", "at", "be", "but", "by",
    "for", "if", "in", "into", "is", "it",
    "no", "not", "of", "on", "or", "s", "such",
    "t", "that", "the", "their", "then", "there", "these",
    "they", "this", "to", "was", "will", "with"
};

Try searching for state:or
It should yield no matches either, since 'or' is also on that list.
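
To make the effect concrete, here is a minimal sketch in plain Java of what
stop-word removal does to the query terms. It is a simplified stand-in, not
Lucene's actual code: the class name StopWordDemo and the filter method are
made up for illustration, and only the word list is copied from above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Simplified stand-in for stop-word filtering; not Lucene's implementation. */
public class StopWordDemo {
    // Copy of the STOP_WORDS list quoted above.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
        "a", "and", "are", "as", "at", "be", "but", "by",
        "for", "if", "in", "into", "is", "it",
        "no", "not", "of", "on", "or", "s", "such",
        "t", "that", "the", "their", "then", "there", "these",
        "they", "this", "to", "was", "will", "with"));

    /** Lower-cases each term and keeps it only if it is not a stop word. */
    static List<String> filter(String... terms) {
        List<String> kept = new ArrayList<>();
        for (String term : terms) {
            String lower = term.toLowerCase(Locale.ROOT);
            if (!STOP_WORDS.contains(lower)) {
                kept.add(lower);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(filter("fl", "al", "IN")); // prints [fl, al]
        System.out.println(filter("or"));             // prints []
    }
}
```

Quoting the term does not help because the quoted text still goes through
the same filtering step.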

However, StandardAnalyzer is no longer final (get the latest build), so you
can write a class that subclasses it and calls this StandardAnalyzer
constructor:

/** Builds an analyzer with the given stop words. */
public StandardAnalyzer(String[] stopWords) {
    stopTable = StopFilter.makeStopTable(stopWords);
}

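Pass it your own list of stop words and you are done.
If you've already indexed some data, be careful which words you choose as
stop words: if the field was run through the analyzer at index time, the
old stop words were dropped from the index too, so those documents need to
be re-indexed before a term like state:in can match. I suggest sticking
with the above list (minus 'in', 'or', etc.) for now.
Once you have your class, use it instead of StandardAnalyzer.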
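
One way to build the "above list minus a few words" without retyping it is
sketched below in plain Java. The class name StateStopWords and the helper
without() are hypothetical names chosen for this example; the resulting
array is what you would hand to the StandardAnalyzer(String[]) constructor
shown above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that derives a custom stop list from the default one. */
public class StateStopWords {
    // Copy of the default list quoted earlier in this thread.
    static final String[] DEFAULT_STOP_WORDS = {
        "a", "and", "are", "as", "at", "be", "but", "by",
        "for", "if", "in", "into", "is", "it",
        "no", "not", "of", "on", "or", "s", "such",
        "t", "that", "the", "their", "then", "there", "these",
        "they", "this", "to", "was", "will", "with"
    };

    /** Returns stopWords without the words that must stay searchable. */
    static String[] without(String[] stopWords, String... searchable) {
        List<String> exclude = Arrays.asList(searchable);
        List<String> result = new ArrayList<>();
        for (String word : stopWords) {
            if (!exclude.contains(word)) {
                result.add(word);
            }
        }
        return result.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Keep "in" (Indiana) and "or" (Oregon) searchable.
        String[] custom = without(DEFAULT_STOP_WORDS, "in", "or");
        System.out.println(Arrays.toString(custom));
    }
}
```

Matching is exact, so removing "in" leaves "into" on the stop list.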

Otis





RE: Boolean Query Parsing with "IN" keyword
> From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com]
>
> But, StandardAnalyzer is no longer final (get the latest build) and you
> can write a class that subclasses it

Right. To flesh out Otis' example of how to change StandardAnalyzer's stop
list by defining a subclass of it:

public class MyAnalyzer extends StandardAnalyzer {
    private static final String[] MY_STOP_WORDS = {"a", "b", ... };
    public MyAnalyzer() {
        super(MY_STOP_WORDS);
    }
}

Another way to do this is to use a different analyzer for the "state" field
than for your other fields:

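import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class MyAnalyzer2 extends Analyzer {
    // SimpleAnalyzer has no stop list, so "in" and "or" survive.
    private final Analyzer stateAnalyzer = new SimpleAnalyzer();
    private final Analyzer otherAnalyzer = new StandardAnalyzer();

    public TokenStream tokenStream(String field, Reader reader) {
        if ("state".equals(field))
            return stateAnalyzer.tokenStream(field, reader);
        else
            return otherAnalyzer.tokenStream(field, reader);
    }
}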

This technique is handy for fields that aren't normal text. For example,
you could use WhitespaceAnalyzer for a case-sensitive field whose values
contain punctuation.
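
To illustrate why the choice of analyzer matters for such a field, here is
a rough sketch in plain Java that contrasts the two tokenizing styles. It
only approximates the behaviour (split on whitespace versus lower-cased
runs of letters) and does not use the Lucene classes; TokenizeDemo and its
method names are invented for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Approximation of two tokenizing styles; not the Lucene analyzers themselves. */
public class TokenizeDemo {
    /** Whitespace style: split on whitespace, keep case and punctuation. */
    static List<String> whitespaceTokens(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    /** Letter style: tokens are runs of letters, lower-cased. */
    static List<String> letterTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // A non-letter ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String value = "Part-No AB/12x";
        System.out.println(whitespaceTokens(value)); // prints [Part-No, AB/12x]
        System.out.println(letterTokens(value));     // prints [part, no, ab, x]
    }
}
```

The whitespace style keeps each value intact, while the letter style breaks
it apart at punctuation and digits and loses the original case.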

Doug
