Hi,
I was trying to validate how the unit test should work for wildcard searches
and I couldn't find a central reference for the query language. Here is a
general reference that I thought might be useful for people trying to
understand the full QueryParser language (it's based on some instructions I
wrote for a project, so I hope it makes sense).
Please provide comments, then I'll post it.
Thanks
--Peter
Overview
Although Lucene provides the ability to create your own queries through its
API, it also provides a rich query language through the QueryParser.
Terms
A query is broken up into terms and operators. There are two types of terms:
Single Terms and Phrases.
A Single Term is a single word such as "test" or "oracle".
A Phrase is a group of words surrounded by double quotes such as "test
oracle".
Each of these terms can be combined together with Boolean operators to form
a more complex query (see below).
Fields
Lucene supports fielded data. When performing a search you can either
specify a field, or use the default field. The field names and the default
field are implementation specific.
You can search any field by typing the field name followed by a colon ":"
and then the term you are looking for. For example, suppose a Lucene index
contains two fields, title and text, and text is the default field. If you
want to find the document entitled "The Right Way" which contains the text
"right", you can enter:
title:"The Right Way" AND text:right
or, since text is the default field:
title:"The Right Way" AND right
Note: The field is only valid for the term that it directly precedes, so the
query
title:Do it right
will only find "Do" in the title field. It will find "it" and "right" in the
default field (in this case the text field).
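To make the scoping rule concrete, here is a minimal Python sketch. It is not
Lucene's QueryParser, just a toy tokenizer I made up for illustration; the
names TOKEN and assign_fields are hypothetical, and it ignores operators such
as AND.

```python
import re

# Toy tokenizer (an assumption for illustration, NOT Lucene's parser):
# an optional "field:" prefix followed by a quoted phrase or a single word.
TOKEN = re.compile(r'(?:(\w+):)?("[^"]*"|\S+)')

def assign_fields(query, default_field="text"):
    """Return (field, term) pairs; a field prefix covers only the next term."""
    pairs = []
    for field, term in TOKEN.findall(query):
        # No explicit prefix means the term falls back to the default field.
        pairs.append((field or default_field, term.strip('"')))
    return pairs

print(assign_fields('title:Do it right'))
print(assign_fields('title:"Do it right"'))
```

In the first call only "Do" is tied to the title field; quoting the phrase, as
in the second call, keeps all three words under title.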
Wildcard Searches
Lucene supports single and multiple character wildcard searches.
To perform a single character wildcard search use the "?" symbol.
To perform a multiple character wildcard search use the "*" symbol.
The single character wildcard search looks for terms that match the pattern
with the "?" replaced by any single character. For example, to search for
"text" or "test" you can use the search:
te?t
Note: searching for "test?" will not find "test", but will find "tests".
Multiple character wildcard searches look for 0 or more characters. For
example, to search for test, tests or tester, you can use the search:
test*
You can also use wildcards in the middle of a term:
te*t
Note: You cannot use a * or ? symbol as the first character of a search term.
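The wildcard rules above can be sketched in Python by translating a pattern
into a regular expression. This only illustrates the matching semantics
described here, not how Lucene implements wildcard queries; the function
names are hypothetical.

```python
import re

def wildcard_to_regex(pattern):
    """Translate a wildcard pattern into a compiled regular expression."""
    # Mirror the rule that a pattern may not begin with a wildcard.
    if pattern.startswith(("?", "*")):
        raise ValueError("a wildcard cannot be the first character")
    parts = []
    for ch in pattern:
        if ch == "?":
            parts.append(".")        # exactly one character
        elif ch == "*":
            parts.append(".*")       # zero or more characters
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)

def matches(pattern, term):
    """True if the whole term matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(term) is not None

print(matches("te?t", "text"), matches("te?t", "test"))
print(matches("test?", "test"), matches("test?", "tests"))
print(matches("test*", "tester"), matches("te*t", "tempest"))
```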
Fuzzy Searches
Lucene supports fuzzy searches based on the Levenshtein Distance, or Edit
Distance, algorithm. To do a fuzzy search, use the tilde, "~", symbol at the
end of a term. For example, to search for a term similar in spelling to
"roam", use the fuzzy search:
roam~
This search will find terms like foam and roams.
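For anyone unfamiliar with edit distance, here is a small Python sketch of
the standard Levenshtein computation. It shows why foam and roams count as
close to roam (one edit each). How Lucene turns a distance into a match
decision is not covered here, so the sketch stops at the distance itself.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions or substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))   # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]                   # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]

print(levenshtein("roam", "foam"))   # one substitution
print(levenshtein("roam", "roams"))  # one insertion
```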
Boosting a Term
Lucene ranks matching documents by relevance based on the terms found. To
boost a term, use the caret, "^", symbol with a boost factor (a number) at
the end of the term you are searching for. The higher the boost factor, the
more relevant the term will be.
Boosting allows you to control the relevance of a document by boosting the
terms it matches. For example, if you are searching for
IBM Microsoft
and you want the term "IBM" to be more relevant, boost it using the ^ symbol
along with the boost factor next to the term. You would type:
IBM^4 Microsoft
This will make documents with the term IBM appear more relevant. You can
also boost Phrase Terms as in the example:
"Microsoft Word"^4 "Microsoft Excel"
By default, the boost factor is 1.
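As a rough illustration of the boost syntax, the Python sketch below pulls
the boost factor out of each term and adds up the boosts of the terms a
document contains. Both parse_boosts and toy_score are hypothetical names,
and the scoring is a deliberately simplified assumption of mine. It is not
Lucene's real scoring formula.

```python
import re

# A quoted phrase or a bare word, optionally followed by ^number.
BOOSTED = re.compile(r'("[^"]*"|[^\s^]+)(?:\^(\d+(?:\.\d+)?))?')

def parse_boosts(query):
    """Return (term, boost) pairs; the boost defaults to 1.0."""
    result = []
    for term, boost in BOOSTED.findall(query):
        result.append((term.strip('"'), float(boost) if boost else 1.0))
    return result

def toy_score(query, document_text):
    """Toy relevance: sum the boosts of the terms found in the text.
    An assumption for illustration only, not how Lucene scores."""
    text = document_text.lower()
    return sum(boost for term, boost in parse_boosts(query)
               if term.lower() in text)

print(parse_boosts('IBM^4 Microsoft'))
print(toy_score('IBM^4 Microsoft', 'IBM ships new servers'))
print(toy_score('IBM^4 Microsoft', 'Microsoft ships new software'))
```

With this toy score the IBM document outranks the Microsoft one, which is the
effect the boost is meant to have.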
Boolean Operators
Lucene supports AND, OR and NOT as Boolean operators. (Note: Boolean
operators must be ALL CAPS.)
OR
The OR operator is the default conjunction operator. This means that if
there is no Boolean operator between two terms, the OR operator is used. The
OR operator links two terms and finds a matching document if either of the
terms exists in a document. For example, to search for documents that contain
either "Microsoft Word" or just "Microsoft":
"Microsoft Word" Microsoft
or
"Microsoft Word" OR Microsoft
AND
The AND operator matches documents where both terms exist anywhere in the
text of a single document. For example, to search for documents that contain
"Microsoft Word" and "Microsoft Excel":
"Microsoft Word" AND "Microsoft Excel"
NOT
The NOT operator excludes documents that contain the term after NOT. For
example, to search for documents that contain "Microsoft Word" but not
"Microsoft Excel":
"Microsoft Word" NOT "Microsoft Excel"
--
To unsubscribe, e-mail: <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>