Mailing List Archive: PLEASE REVIEW: QueryParser syntax documentation

May 8, 2002, 9:55 PM

Post #1 of 3 (372 views)

Hi,

I was trying to validate how the unit test should work for wildcard searches
and I couldn't find a central reference for the query language. Here is a
general reference that I thought might be useful for people trying to
understand the whole QueryParser language (it's based on some instructions I
wrote for a project, so I hope it makes sense).

Please provide comments, then I'll post it.

Thanks

--Peter

Overview
Although Lucene provides the ability to create your own queries through its
API, it also provides a rich query language through the QueryParser.

Terms
A query is broken up into terms and operators. There are two types of terms:
Single Terms and Phrases.
A Single Term is a single word such as "test" or "oracle".
A Phrase is a group of words surrounded by double quotes such as "test
oracle".
Each of these terms can be combined together with Boolean operators to form
a more complex query (see below).

Fields
Lucene supports fielded data. When performing a search you can either
specify a field, or use the default field. The field names and the default
field are implementation specific.

You can search any of these fields by typing the field name followed by a
colon ":" and then the term you are looking for. For example, suppose a
Lucene index contains two fields, title and text, and text is the default
field. To find the document entitled "The Right Way" which contains the
text "right", you can enter:

title:"The Right Way" AND text:right
or, since text is the default field,
title:"The Right Way" AND right

Note: The field is only valid for the term that it directly precedes, so the
query
title:Do it right
will only find "Do" in the title field. It will find "it" and "right" in the
default field (in this case the text field).
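
To make the note above concrete, here is a small Python sketch of how a
field prefix binds only to the term or quoted phrase right after it. It is
not Lucene's QueryParser; assign_fields and its regular expression are
made-up names for illustration, and the default field is assumed to be text.

```python
import re

# Hypothetical helper, not part of Lucene: an optional "field:" prefix
# followed by either a quoted phrase or a run of non-space characters.
TOKEN = re.compile(r'(?:(\w+):)?("[^"]*"|\S+)')
OPERATORS = {"AND", "OR", "NOT"}

def assign_fields(query, default_field="text"):
    """Return (field, term) pairs; unprefixed terms get the default field."""
    pairs = []
    for field, term in TOKEN.findall(query):
        if not field and term in OPERATORS:
            continue  # skip Boolean operators, they are not terms
        pairs.append((field or default_field, term.strip('"')))
    return pairs

print(assign_fields("title:Do it right"))
# [('title', 'Do'), ('text', 'it'), ('text', 'right')]
```

Only "Do" ends up in the title field; quoting the phrase, as in
title:"Do it right", keeps all three words together under title.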

Wildcard Searches
Lucene supports single and multiple character wildcard searches.
To perform a single character wildcard search use the "?" symbol.
To perform a multiple character wildcard search use the "*" symbol.
A single character wildcard search looks for terms that match the pattern
with the "?" replaced by any one character. For example, to search for "text"
or "test" you can use the search:

te?t
Note: searching for "test?" will not find "test", but will find "tests".

A multiple character wildcard search looks for 0 or more characters. For
example, to search for test, tests or tester, you can use the search:

test*
You can also use the wildcard searches in the middle of a term.

te*t
Note: You cannot use a * or ? symbol as the first character of a search.
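
The wildcard rules above can be sketched in Python by translating a pattern
into a regular expression. This only illustrates the matching rules as
described here, not how Lucene evaluates wildcard queries internally; the
function names are invented for the example.

```python
import re

def wildcard_to_regex(pattern):
    """Translate ? (exactly one character) and * (zero or more) to a regex."""
    if pattern[:1] in ("*", "?"):
        raise ValueError("a wildcard may not be the first character")
    parts = []
    for ch in pattern:
        if ch == "?":
            parts.append(".")
        elif ch == "*":
            parts.append(".*")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))

def matches(pattern, term):
    """True if the whole term matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(term) is not None

print(matches("te?t", "text"), matches("te?t", "test"))    # True True
print(matches("test?", "test"), matches("test?", "tests"))  # False True
print(matches("test*", "test"), matches("test*", "tester"))  # True True
```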

Fuzzy Searches
Lucene supports fuzzy searches based on the Levenshtein Distance, or Edit
Distance algorithm. To do a fuzzy search use the tilde, "~", symbol at the
end of a term. For example, to search for a term similar in spelling to
"roam" use the fuzzy search:

roam~
This search will find terms like foam and roams.
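
For readers unfamiliar with edit distance, here is a minimal Python sketch
of the standard Levenshtein computation. It shows why foam and roams count
as close to roam; how Lucene turns the distance into a match decision is
not covered here, and the function name is only illustrative.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]

print(levenshtein("roam", "foam"))   # 1 (substitute r -> f)
print(levenshtein("roam", "roams"))  # 1 (insert s)
```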

Boosting a Term
Lucene computes a relevance level for matching documents based on the terms
found. To boost a term use the caret, "^", symbol with a boost factor (a
number) at the end of the term you are searching. The higher the boost
factor, the more relevant the term will be.
Boosting allows you to control the relevance of a document by boosting the
terms it contains. For example, if you are searching for

IBM Microsoft
and you want the term "IBM" to be more relevant, boost it using the ^ symbol
along with the boost factor next to the term. You would type:

IBM^4 Microsoft
This will make documents with the term IBM appear more relevant. You can
also boost Phrase Terms as in the example:

"Microsoft Word"^4 "Microsoft Excel"
By default, the boost factor is 1.
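
As a rough illustration of what a boost does to ranking, the Python sketch
below uses a toy score: the sum over query terms of boost times occurrence
count. This is an assumption made for the example and not Lucene's actual
scoring formula; the documents and the toy_score name are invented.

```python
def toy_score(doc_words, boosted_terms):
    """Sum of boost * occurrence count; a stand-in, not Lucene's scoring."""
    return sum(boost * doc_words.count(term)
               for term, boost in boosted_terms.items())

docs = {
    "a": "IBM announced a new server".split(),
    "b": "Microsoft announced a new release".split(),
}

plain = {"IBM": 1, "Microsoft": 1}    # IBM Microsoft (default boost of 1)
boosted = {"IBM": 4, "Microsoft": 1}  # IBM^4 Microsoft

for name, words in docs.items():
    print(name, toy_score(words, plain), toy_score(words, boosted))
# a 1 4
# b 1 1
```

With the default boosts both documents tie; with IBM^4 the document that
mentions IBM ranks first.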

Boolean operators
Lucene supports AND, OR and NOT as Boolean operators. (Note: Boolean
operators must be ALL CAPS.)

OR
The OR operator is the default conjunction operator. This means that if
there is no Boolean operator between two terms, the OR operator is used. The
OR operator links two terms and finds a matching document if either of the
terms exists in a document. For example, to search for documents that contain
either "Microsoft Word" or just "Microsoft":

"Microsoft Word" Microsoft

or

"Microsoft Word" OR Microsoft

AND
The AND operator matches documents where both terms exist anywhere in the
text of a single document. For example, to search for documents that contain
"Microsoft Word" and "Microsoft Excel":

"Microsoft Word" AND "Microsoft Excel"

NOT
The NOT operator excludes documents that contain the term after NOT. For
example, to search for documents that contain "Microsoft Word" but not
"Microsoft Excel":

"Microsoft Word" NOT "Microsoft Excel"

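One way to picture the three operators is as set operations on the ids of
matching documents. The Python sketch below uses three made-up documents
and a plain substring test in place of real phrase matching; it is an
illustration of the semantics described above, not of Lucene's
implementation.

```python
# Invented sample documents, lower-cased for a simple substring test.
docs = {
    1: "microsoft word tips",
    2: "microsoft excel tips",
    3: "microsoft word and microsoft excel",
}

def containing(phrase):
    """Ids of documents whose text contains the phrase."""
    return {doc_id for doc_id, text in docs.items() if phrase in text}

word = containing("microsoft word")
excel = containing("microsoft excel")

print(sorted(word | excel))  # OR  -> [1, 2, 3]
print(sorted(word & excel))  # AND -> [3]
print(sorted(word - excel))  # NOT -> [1]
```
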
--
To unsubscribe, e-mail: <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>

Re: PLEASE REVIEW: QueryParser syntax documentation [ In reply to ]

otis_gospodnetic at yahoo

May 9, 2002, 8:03 AM

Post #2 of 3 (361 views)

Looks good to me.
Two minor things. This sentence doesn't make sense to me:

For example, if a Lucene index contains two fields, title and text and
text is the default field.

Also, maybe you can mention that one can use &&, ||, etc. in place of
AND, OR, etc.

Thanks,
Otis
P.S.
How about grouping? Does Lucene's query parser support that?
For instance: (("red snapper" AND Sancere) OR (burger AND Pepsi))

--- Peter Carlson <carlson@bookandhammer.com> wrote:

RE: PLEASE REVIEW: QueryParser syntax documentation [ In reply to ]

cutting at lucene

May 9, 2002, 8:30 AM

Post #3 of 3 (361 views)

Thanks, this is great to have!

A few things you don't mention:

- Grouping: parentheses may be used to group clauses, for example:
apple AND (fruit OR tree)

- Plus and minus: these can be used to require or prohibit clauses:
+apple -"computer company" -model:(macintosh OR lisa)
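
The plus and minus rule can be sketched in Python as follows: a document is
accepted only if it contains every required clause and none of the
prohibited ones. The sketch uses plain substring tests on invented sample
text, ignores the fielded model:(...) clause, and the accept function is a
made-up name, so treat it as an illustration only.

```python
def accept(text, required, prohibited):
    """True if all required clauses occur in text and no prohibited one does."""
    return (all(clause in text for clause in required)
            and not any(clause in text for clause in prohibited))

# +apple -"computer company"
required = ["apple"]
prohibited = ["computer company"]

print(accept("apple pie recipe", required, prohibited))             # True
print(accept("apple is a computer company", required, prohibited))  # False
print(accept("pear tart recipe", required, prohibited))             # False
```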

Doug
