Mailing List Archive

Re: Lucene Query Structure
Actually, Winton's suggestion doesn't work because it's inconsistent with
the syntax of BooleanQuery() (the constructor doesn't take arguments, and
add() takes one Query argument, not two).

After considerable study of the documentation, I am still confused about
the semantics of BooleanQuery. I think I can answer the original
syntactic question posed by sjb, but the overall motivation escapes me.

I believe the correct syntax is (given TermQueries a, b, c, and d):

// create a BooleanQuery for '(a AND b)'
BooleanQuery bq_ab = new BooleanQuery();
bq_ab.add(a, true, false);
bq_ab.add(b, true, false);

// same as above with c and d
BooleanQuery bq_cd = new BooleanQuery();
bq_cd.add(c, true, false);
bq_cd.add(d, true, false);

// join two BooleanQueries together
BooleanQuery bq_abcd = new BooleanQuery();
bq_abcd.add(bq_ab, false, false);
bq_abcd.add(bq_cd, false, false);


Now, as sjb pointed out, "(query, false, false)" doesn't really seem to
have the semantics of a boolean OR. In particular:

(1) It's a unary operator: add() adds a Query (or a BooleanClause) to a
BooleanQuery. OR is a binary operator.

(2) "add.(query, false, false)" adds a query Q whose satisfaction is
*irrelevant* to that of the resultant BooleanQuery BQ: documents are
neither required to satisfy Q, nor required to *not* satisfy Q, in order
for BQ to be satisfied. The semantics of a boolean OR should be that at
least *one* of the components (queries) must be satisfied in order for the
entire expression (composite query) to be satisfied.


I conclude that either
(a) I simply don't understand the proper use of BooleanQuery, or
(b) BooleanQuery cannot be used to express a boolean OR.

If (b) is true, either the semantics of BooleanQuery need revising, or
it needs to get called something else so as not to be confusing.

If the semantics of BooleanQuery are revised, I would suggest changing the
syntax as well to reflect the binary nature of Boolean operators:

BooleanQuery.add(Query q1, Query q2, boolean and)

which would equate to (q1 AND q2) if 'and' is true, otherwise (q1 OR q2).


If there are any papers or references which would explain this better,
that would be great. Otherwise, I would really appreciate it if one of
the developers would take a few moments to clarify this issue. I'm trying
to use Lucene as a platform for research in IR; to do this I need a clear
understanding of the exact definition of the scoring system, especially as
it relates to the semantics of queries involving multiple terms.

Once I get a clear understanding of this issue, I would be happy to write
it up and submit it as an addition to the FAQ/docs.

Thanks in advance for any assistance rendered in getting this sorted out.

Regards,

Joshua

On Mon, 18 Feb 2002, Winton Davies wrote:

> BQ(Term, Term, Include, Exclude)
>
> BQ(
> BQ(a,b,true,false)
> BQ(a,b,true,false)
> false,
> false)
>
> Should work...
> Winton

> >Lets say I have two queries which I want to combine into one:
> >
> >(a and b) OR (c and d)
> >
> >I would use QueryParser.parse to form the subqueries, but how do I/can I
> >combine them with the OR logic?
> >
> >The BooleanQuery can be used to piece queries together but it does not
> >support OR's correct? (It only supports include, exclude, and
> >"preferential".)
> >
> >Thanks for any help,
> >
> >sjb



jmadden@ics.uci.edu...Obscurium Per Obscurius...www.ics.uci.edu/~jmadden
Joshua Madden: Information Scientist, Musician, Philosopher-At-Tall
It's that moment of dawning comprehension that I live for--Bill Watterson
My opinions are too rational and insightful to be those of any organization.



--
To unsubscribe, e-mail: <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>
RE: Lucene Query Structure [ In reply to ]
> From: Joshua O'Madadhain [mailto:jmadden@ics.uci.edu]
>
> After considerable study of the documentation, I am still
> confused about the semantics of BooleanQuery.
>
> Now, as sjb pointed out, "(query, false, false)" doesn't
> really seem to have the semantics of a boolean OR.

In fact, it does.

> In particular:
>
> (1) It's a unary operator: add() adds a Query (or a
> BooleanClause) to a BooleanQuery. OR is a binary operator.

OR is not inherently binary.

> The semantics of a boolean OR should
> be that at
> least *one* of the components (queries) must be satisfied in
> order for the
> entire expression (composite query) to be satisfied.

That is the case here.

> I conclude that either
> (a) I simply don't understand the proper use of BooleanQuery, or
> (b) BooleanQuery cannot be used to express a boolean OR.

(a)

Good analogies for the semantics of BooleanQuery are most internet search
engines (except Google) which permit you to put '+' or '-' in front of a
word to require or prohibit it. (Google requires terms by default.) A term
with no plus or minus is not required for a match, but all of the documents
containing it are included.

This form also facilitates efficient search. One can keep track of which
query terms a document contains as a bitmask, then check the boolean
expression against this mask. This is fast if the boolean expression can
itself be expressed as a few simple bitmasks. Converting an arbitrary
boolean expression to disjunctive normal form is possible, but not trivial.
But with this form mask construction is trivial.

Doug

--
To unsubscribe, e-mail: <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>
RE: Lucene Query Structure [ In reply to ]
> -----Original Message-----
> From: Doug Cutting [mailto:DCutting@grandcentral.com]
> Sent: Tuesday, February 19, 2002 5:05 PM
> To: 'Lucene Developers List'; Lucene Users List
> Subject: RE: Lucene Query Structure
>
>

>
> Good analogies for the semantics of BooleanQuery are most
> internet search
> engines (except Google) which permit you to put '+' or '-' in
> front of a
> word to require or prohibit it. (Google requires terms by
> default.) A term
> with no plus or minus is not required for a match, but all of
> the documents
> containing it are included.
>

"Keyword searching refers to a search type in which you enter terms representing the concepts you wish to retrieve. Boolean operators are not used.

Implied Boolean logic refers to a search in which symbols are used to represent Boolean logical operators. In this type of search on the Internet, the absence of a symbol is also significant, as the space between keywords defaults to either OR logic or AND logic. Many well-known search engines traditionally defaulted to OR logic, but as a rule are moving away from the practice and defaulting to AND. " http://library.albany.edu/internet/boolean.html

The queryParser of Lucene implies OR logic if no operator found in the query, doesn't it? I think the users (on our site) prefer keyword searching and implied AND logic. Unfortunatly to decide whether a query contains any operator or not, needs to read and parse the whole query (not LL(k) language). How could I modify the queryParser to implement default AND logic?

peter


--
To unsubscribe, e-mail: <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>
RE: Lucene Query Structure [ In reply to ]
> From: Halácsy Péter [mailto:halacsy.peter@axelero.com]
> Sent: Tuesday, February 19, 2002 8:49 AM
> To: Lucene Developers List; Lucene Users List
> Subject: RE: Lucene Query Structure
>
> The queryParser of Lucene implies OR logic if no operator
> found in the query, doesn't it?

Yes.

> How could I modify the queryParser to implement
> default AND logic?

I haven't tested this, but it should be as simple as changing line 318 of
QueryParser.jj to:
int ret = MOD_REQ;

Unfortunately, I think this would end up disabling OR, so the proper change
is more complex, requiring some changes to the addClause() method. This
should also be something that folks can turn on and off.

Doug

--
To unsubscribe, e-mail: <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>
RE: Lucene Query Structure [ In reply to ]
> From: Joshua O'Madadhain [mailto:jmadden@ics.uci.edu]
>
> Okay, I think I finally understand how this is working. If we express
> the semantics of (required, prohibited) in terms of their
> impact on the score for a document D and query q, we get:
>
> (true, false): if q is not satisfied by D, score(D) is set to 0;
> otherwise the score is calculated as specified in the FAQ. [AND]
>
> (false, true): if q *is* satisfied by D, score(D) = 0; otherwise,
> the score is calculated as in the FAQ. [NAND]
>
> (false, false): The score is calculated as in the FAQ.
> [(implicit) OR]

That is correct.

Doug

--
To unsubscribe, e-mail: <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>
RE: Lucene Query Structure [ In reply to ]
On Tue, 19 Feb 2002, Doug Cutting wrote:

[in response to my confusion on the semantics of BooleanQuery]

> OR is not inherently binary.

<blink> Well, OK, if you want to define it that way. I usually think of
OR as being k-ary for k >= 2. I guess you are defining unary OR as just
the value of its input (OR(T) = T, OR(F) = F). The reason why this seems
counterintuitive to me is that the most obvious semantics for unary AND
would be identical.

> Good analogies for the semantics of BooleanQuery are most internet
> search engines (except Google) which permit you to put '+' or '-' in
> front of a word to require or prohibit it. (Google requires terms by
> default.) A term with no plus or minus is not required for a match,
> but all of the documents containing it are included.

Okay, I think I finally understand how this is working. If we express
the semantics of (required, prohibited) in terms of their impact on the
score for a document D and query q, we get:

(true, false): if q is not satisfied by D, score(D) is set to 0;
otherwise the score is calculated as specified in the FAQ. [AND]

(false, true): if q *is* satisfied by D, score(D) = 0; otherwise,
the score is calculated as in the FAQ. [NAND]

(false, false): The score is calculated as in the FAQ. [(implicit) OR]

In other words, BooleanQuery does two things:
(a) provides a mechanism for searching on multiple independent terms
(b) allows the user to compose the document score function (for the vector
model of IR) with threshold functions whose behavior is specified by
certain types of Boolean operators.

(If I am still out in left field, corrections would be welcome.)


Doug, thank you for taking the time to clear this up for me. You've been
very helpful.

Regards,

Joshua O'Madadhain


jmadden@ics.uci.edu...Obscurium Per Obscurius...www.ics.uci.edu/~jmadden
Joshua Madden: Information Scientist, Musician, Philosopher-At-Tall
It's that moment of dawning comprehension that I live for--Bill Watterson
My opinions are too rational and insightful to be those of any organization.



--
To unsubscribe, e-mail: <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>