Mailing List Archive: QueryParser question

QueryParser question - case-sensitivity

May 9, 2002, 9:52 AM

Post #1 of 8 (1427 views)

I have a QueryParser/Query question. One of these classes (not sure which) is
apparently converting my term values to lowercase, even though Term
values are case-sensitive by default. I've got non-word text (ids) that
is case-sensitive and stored/indexed that way, but the query parser is not
respecting my case-sensitive search criterion.

For example, I create a query string:

id:"templatedata/f2container/data/Course1102043194747042"

and pass this to the QueryParser.parse() method. When I dump the Query with
toString() I get:

+id:templatedata/f2container/data/course1102043194747042

Naturally, this query fails, as I'm expecting a hit on the id with the
uppercase C. If I create and index an id in all lower case, then the query
succeeds. Case-sensitivity is important to maintain for querying this
element, and especially for using the id once the hit occurs.

How do I coerce QueryParser/Query to not 'tolower' my query string? Or is
there an alternative, more direct method that takes my query string
with no modification?

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>

RE: QueryParser question - case-sensitivity [ In reply to ]

cutting at lucene

May 9, 2002, 10:05 AM

Post #2 of 8 (1406 views)

Define an Analyzer that does not lowercase the id field, e.g., something
like:

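import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class MyAnalyzer extends Analyzer {
    private Analyzer standard = new StandardAnalyzer();

    public TokenStream tokenStream(String field, final Reader reader) {
        if ("id".equals(field)) {
            // Split on whitespace only, so the id keeps its case.
            return new WhitespaceTokenizer(reader);
        } else {
            return standard.tokenStream(field, reader);
        }
    }
}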

Then pass this into QueryParser.

Doug


Re: QueryParser question - case-sensitivity [ In reply to ]

otis_gospodnetic at yahoo

May 9, 2002, 10:24 AM

Post #3 of 8 (1412 views)

Wouldn't that be the Analyzer that you are using?
I don't have the source handy to check it for you, but look for
toLowerCase or some such, and you'll find who's messing with your
queries.
Replace that piece, and you'll keep your upper cases.

Otis


RE: QueryParser question - case-sensitivity [ In reply to ]

lcox at interactive-media

May 9, 2002, 11:28 AM

Post #4 of 8 (1404 views)

Hi Otis,

On both the indexing side and the creation of the query parser, I'm using the
StandardAnalyzer class. It seems like it would be symmetrical with respect to
case sensitivity, so either it's not related to the problem or it's a
bug... I suspect the former. I'll start looking at the source next. Thanks,

Landon


Re: QueryParser question - case-sensitivity [ In reply to ]

peixotto at geofolio

May 9, 2002, 12:36 PM

Post #5 of 8 (1406 views)

Looks like StandardAnalyzer uses LowerCaseFilter as one of its
filters. That is the one converting everything to lower case. If
you replace StandardAnalyzer with a different Analyzer, you should be OK.

Dave
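
A minimal sketch of the effect described in this post, in plain Java with no
Lucene dependency. The class FilterChainSketch and its methods are
hypothetical stand-ins, not Lucene's API, and the whitespace split is a
simplifying assumption that does not reproduce StandardAnalyzer's real
tokenization. It only shows that a lower-casing step in a filter chain
changes the id, while an empty chain leaves it intact.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

public class FilterChainSketch {

    // Hypothetical tokenizer: split on whitespace only, keeping original case.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Apply each filter, in order, to every token.
    static List<String> applyFilters(List<String> tokens,
                                     List<UnaryOperator<String>> filters) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            String current = token;
            for (UnaryOperator<String> filter : filters) {
                current = filter.apply(current);
            }
            out.add(current);
        }
        return out;
    }

    public static void main(String[] args) {
        String id = "templatedata/f2container/data/Course1102043194747042";
        UnaryOperator<String> lowerCase = s -> s.toLowerCase(Locale.ROOT);

        // Chain with a lower-casing filter: the capital C is lost.
        System.out.println(applyFilters(whitespaceTokens(id),
                Collections.singletonList(lowerCase)));

        // Chain with no filters: the id comes through unchanged.
        System.out.println(applyFilters(whitespaceTokens(id),
                Collections.<UnaryOperator<String>>emptyList()));
    }
}
```

The first line printed contains "course..." and the second "Course...", which
mirrors the difference between the query dump and the indexed id.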

RE: QueryParser question - case-sensitivity [ In reply to ]

DCutting at Grandcentral

May 9, 2002, 12:57 PM

Post #6 of 8 (1409 views)

[I'm resending this from a different account, since my first attempt is
bogged down somewhere. A second copy will probably show up tomorrow, but in
the interests of solving this problem sooner, I'm resending it. Sorry for
the duplication.]

Define an Analyzer that does not lowercase the id field, e.g., something
like:

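import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class MyAnalyzer extends Analyzer {
    private Analyzer standard = new StandardAnalyzer();

    public TokenStream tokenStream(String field, final Reader reader) {
        if ("id".equals(field)) {
            // Split on whitespace only, so the id keeps its case.
            return new WhitespaceTokenizer(reader);
        } else {
            return standard.tokenStream(field, reader);
        }
    }
}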

Then pass this into QueryParser.

Doug


RE: QueryParser question - case-sensitivity [ In reply to ]

lcox at interactive-media

May 9, 2002, 1:06 PM

Post #7 of 8 (1413 views)

Hi DaveP and Otis,

After looking further, my take:

It would be symmetrical, except that the ID field (term) I'm looking for was
indexed as a Keyword, since it's a file path that I don't want tokenized. I
think what's happening is this: even though I'm using StandardAnalyzer for
both indexing and querying, a Keyword is not tokenized, so at index time the
value never goes through StandardAnalyzer's lower-case filter and is
therefore stored with its case intact.

On the flip side, the query parser doesn't know which term names map to which
field types (in this case a Keyword), so it runs all queries through
StandardAnalyzer without regard to term name and type (as it was designed
to).

So, I think you're right that it comes down to the analyzer, but more
directly, I think it comes down to the fact that the Keyword value is left
untouched when indexed, while the query term value, after going through
QueryParser.parse(), is lower-case due to the LowerCaseFilter that
StandardAnalyzer uses.

For a keyword field, the docs say:

    Keyword
    public static final Field Keyword(String name, String value)

    Constructs a String-valued Field that is not tokenized, but is indexed and
    stored. Useful for non-text fields, e.g. date or url.

If you look at StandardAnalyzer source, the tokenStream method runs it
through LowerCaseFilter as spec'd. But since a Keyword is not tokenized,
it's stored/indexed respecting case.

Does that jibe with your knowledge of the source and behavior of the
classes?

It does look like I need to make a query analyzer that's a little more
"aware" of my field names (and types) for querying purposes...that analyzer
would match the behavior on the indexing side such that it knows what fields
are Keywords and therefore whether to pass them through unchanged or not.

Thanks for the feedback.

Landon

PS. Late break: Just read the mail from Doug after writing this analysis.
Think it confirmed what was going on. Thank you, Doug.
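
The asymmetry described in this post can be sketched in plain Java without
Lucene. Everything here is a hypothetical toy model: KeywordAsymmetrySketch
and its methods are illustrative names, the "index" is just a set of strings,
and lower-casing the whole query value is an assumption standing in for what
the query-side analyzer does.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class KeywordAsymmetrySketch {

    // Toy stand-in for the indexed terms of the id field.
    private final Set<String> indexedIds = new HashSet<>();

    // Models a Keyword field: the value is stored as one term, untouched.
    void indexKeyword(String value) {
        indexedIds.add(value);
    }

    // Models a query whose term is lower-cased before lookup.
    boolean searchLowercased(String queryValue) {
        return indexedIds.contains(queryValue.toLowerCase(Locale.ROOT));
    }

    // Models a query whose term is passed through unchanged for this field.
    boolean searchVerbatim(String queryValue) {
        return indexedIds.contains(queryValue);
    }

    public static void main(String[] args) {
        String id = "templatedata/f2container/data/Course1102043194747042";

        KeywordAsymmetrySketch index = new KeywordAsymmetrySketch();
        index.indexKeyword(id);

        // Index side kept the capital C, query side lower-cased it: no hit.
        System.out.println(index.searchLowercased(id)); // prints false

        // Query side leaves the id alone: the stored term matches.
        System.out.println(index.searchVerbatim(id));   // prints true
    }
}
```

This also matches the observation from the first post: if the id is indexed
in all lower case, the lower-cased query finds it.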


RE: QueryParser question - case-sensitivity [ In reply to ]

lcox at interactive-media

May 9, 2002, 2:23 PM

Post #8 of 8 (1410 views)

OK, this is the solution, and it seems to have worked like a charm. I took
Doug's fragment as a starting point, but enhanced it to be general-purpose.
Instead of the keyword field name being hardwired into the tokenStream
method, the derived Analyzer class, in this case DCRAnalyzer, accepts a
Hashtable of keyword field names. As long as the keyword's value in the hash
is != null, the code below will work, so you can initialize the keyword's
value with any object you care about.

If you create another class derived from Hashtable that wires your app's
keyword field names into it, an instance of that class can be passed into
DCRAnalyzer, so that all the application-specific keyword knowledge
remains contained in one app class while this code remains general.

For my app, the XML 'id' attribute of every tag falls into this category of
keyword fields to pass through unscathed. I'm sure I'll add others over
time, which is why a hash seemed convenient and fast. Anyway, this all tested
out as expected, and now the analyzer has the 'smarts' needed for
case-sensitivity on different field names.

/*
 * DCRAnalyzer.java
 *
 * Created on May 9, 2002, 1:14 PM
 */

package <<<yourpackagenamegoeshere>>>;

import java.io.*;
import java.util.*;

import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class DCRAnalyzer extends Analyzer
{
    /** Creates a new instance of DCRAnalyzer */
    public DCRAnalyzer( Hashtable keywordFieldNames )
    {
        m_keywordNames = keywordFieldNames;
    }

    public TokenStream tokenStream( String field, final Reader reader )
    {
        // see if field is a designated keyword name, if so, don't run it
        // through standard
        if ( m_keywordNames.get(field) != null )
        {
            return new WhitespaceTokenizer(reader);
        } else {
            return m_standard.tokenStream(field, reader);
        }
    }

    private Hashtable m_keywordNames = new Hashtable();
    private Analyzer m_standard = new StandardAnalyzer();
}

Not much code, but it nicely did the trick and you could easily extend it to
support numerous analyzers mapped to fieldnames, not just these two. Thanks
for the various bits of advice from Doug, DaveP, and Otis.

Landon Cox
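
The extension mentioned above (numerous analyzers mapped to field names) might
look roughly like the following. This is only a sketch in plain Java with no
Lucene dependency: PerFieldAnalysisSketch, its methods, and the "body" field
name are hypothetical, and each "analyzer" is modeled as a function from text
to a list of tokens rather than as a real Analyzer subclass.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

public class PerFieldAnalysisSketch {

    // Field name -> analysis strategy (hypothetical stand-in for an Analyzer).
    private final Map<String, Function<String, List<String>>> perField =
            new HashMap<>();

    // Strategy used for any field that has no explicit entry.
    private final Function<String, List<String>> fallback;

    PerFieldAnalysisSketch(Function<String, List<String>> fallback) {
        this.fallback = fallback;
    }

    void register(String field, Function<String, List<String>> analysis) {
        perField.put(field, analysis);
    }

    // Look up the strategy for this field, falling back to the default.
    List<String> analyze(String field, String text) {
        return perField.getOrDefault(field, fallback).apply(text);
    }

    // Keyword-style strategy: the whole value is one token, case preserved.
    static List<String> verbatim(String text) {
        return Collections.singletonList(text);
    }

    // Simplified default: lower-case, then split on whitespace.
    static List<String> lowercasedWords(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        PerFieldAnalysisSketch analysis =
                new PerFieldAnalysisSketch(PerFieldAnalysisSketch::lowercasedWords);
        analysis.register("id", PerFieldAnalysisSketch::verbatim);

        System.out.println(analysis.analyze("id",
                "templatedata/f2container/data/Course1102043194747042"));
        System.out.println(analysis.analyze("body", "Hello World"));
    }
}
```

Keeping the mapping in one object means the application-specific knowledge of
which fields are keywords still lives in a single place, as with the
Hashtable passed to DCRAnalyzer.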
