Mailing List Archive

Boolean searching across multiple fields
A question.

OK, so I have these fields:

title
introtext

Both are keyword-searched against. If I search "+foo +bar" though, then
how can I get results that will find foo in one, and bar in another?

Like, I can do something like this:

(title:foo AND (title:bar OR introtext:bar)) OR (title:bar AND (title:foo
OR introtext:foo))

But that gets messy very quickly, and I have a whole bunch of fields, not
just the two. I can combine the fields, but if we want to give one field a
boost later, that becomes a problem.

Right now my code looks like:

for my $field (@fields) {
    my $query_parser = KinoSearch::QueryParser::QueryParser->new(
        analyzer      => $analyzer,
        default_field => $field,
    );

    $query->add_clause(
        query => $query_parser->parse($querystring),
    );
}

I know in the next version I can do, simply:

my $query_parser = KinoSearch::QueryParser::QueryParser->new(
    analyzer => $analyzer,
    fields   => \@fields,
);

Will that search across all the fields as I want? Or is there another way?

--
Chris Nandor pudge@pobox.com http://pudge.net/
Open Source Technology Group pudge@ostg.com http://ostg.com/
Boolean searching across multiple fields
On Oct 11, 2006, at 3:04 PM, Chris Nandor wrote:

> OK, so I have these fields:
>
> title
> introtext
>
> Both are keyword-searched against. If I search "+foo +bar" though,
> then
> how can I get results that will find foo in one, and bar in another?

The QueryParser in KinoSearch 0.13 has this behavior by default.

Note the top hit in this search for '+3 +senator':

http://www.rectangular.com/cgi-bin/uscon_search.cgi?q=%2B3+%2Bsenator

The title contains '3', but the body does not. The body contains
'senator', but the title does not.

> I know in the next version I can do, simply:
>
> my $query_parser = KinoSearch::QueryParser::QueryParser->new(
>     analyzer => $analyzer,
>     fields   => \@fields,
> );

What version are you using? If you're using 0.12, you can get 0.13
either from the KS homepage or off the search.cpan.org website. You
can also do this from the CPAN shell:

install CREAMYG/KinoSearch-0.13.tar.gz

Ordinarily, you could just 'install KinoSearch', but the CPAN indexer
was hosed when I uploaded that file and it's never recovered.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Boolean searching across multiple fields
At 15:29 -0700 2006.10.11, Marvin Humphrey wrote:
>The QueryParser in KinoSearch 0.13 has this behavior by default.

Oh, I had 0.12 initially. I have had 0.12 the last few weeks, and forgot
that it was added in *this* version.


>Note the top hit in this search for '+3 +senator':
>
>http://www.rectangular.com/cgi-bin/uscon_search.cgi?q=%2B3+%2Bsenator
>
>The title contains '3', but the body does not. The body contains
>'senator', but the title does not.

Sweet.


>> I know in the next version I can do, simply:
>>
>> my $query_parser = KinoSearch::QueryParser::QueryParser->new(
>>     analyzer => $analyzer,
>>     fields   => \@fields,
>> );

So this code will allow the above behavior, then?

Curiously, how would I do it in 0.12? Knowing that may help me understand
the whole thing better.

--
Chris Nandor pudge@pobox.com http://pudge.net/
Open Source Technology Group pudge@ostg.com http://ostg.com/
Boolean searching across multiple fields
On Oct 11, 2006, at 4:05 PM, Chris Nandor wrote:

>>> I know in the next version I can do, simply:
>>>
>>> my $query_parser = KinoSearch::QueryParser::QueryParser->new(
>>>     analyzer => $analyzer,
>>>     fields   => \@fields,
>>> );
>
> So this code will allow the above behavior, then?

Yes. QueryParser behaves like this because it's the most intuitive
behavior for the common case.

Most often, people want to search multiple fields -- say, title and
body. A required term such as "+senator" must match against AT LEAST
ONE field out of several. A prohibited term such as "-senator" MUST
NOT MATCH AGAINST ANY of them. It's as if all the fields were
flattened into one and QueryParser was generating a query against
that. However, the scoring algorithm still gets to use multiple
fields, which is important for returning the most relevant document set.
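
To make the required/prohibited rule concrete, here is a minimal pure-Perl sketch of the matching semantics only. It does not use KinoSearch and ignores scoring entirely; the documents, the word-boundary matching, and the helper names (matches_any_field, doc_matches) are all made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical in-memory documents; field names follow the thread.
my @docs = (
    { id => 1, title => 'Section 3', body => 'each senator shall have one vote' },
    { id => 2, title => 'Senator',   body => 'no numbers here' },
    { id => 3, title => 'Article 3', body => 'the senator from bar' },
);

# True if $term occurs as a whole word in at least one of the fields.
sub matches_any_field {
    my ( $doc, $term, @fields ) = @_;
    return scalar grep { $doc->{$_} =~ /\b\Q$term\E\b/i } @fields;
}

# Required terms must each hit at least one field;
# prohibited terms must not hit any field.
sub doc_matches {
    my ( $doc, $required, $prohibited, @fields ) = @_;
    for my $term (@$required) {
        return 0 unless matches_any_field( $doc, $term, @fields );
    }
    for my $term (@$prohibited) {
        return 0 if matches_any_field( $doc, $term, @fields );
    }
    return 1;
}

my @fields = qw( title body );

# Equivalent of the query '+3 +senator -bar' across title and body.
my @hits = map { $_->{id} }
    grep { doc_matches( $_, [ '3', 'senator' ], ['bar'], @fields ) } @docs;
print "@hits\n";
```

Only document 1 survives: its title supplies '3' and its body supplies 'senator', while document 3 is rejected because 'bar' appears in one of its fields.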

The guts that make that happen are kind of complicated (thank dog for
tests!) but the concept is straightforward:
QueryParser processes the input string one chunk at a time.

Consider the following input:

'+foo -bar "okee dokee"'

First chunk is '+foo'. It gets expanded to...

'+(title:foo OR body:foo)'

Next, '-bar' expands to...

'-(title:bar OR body:bar)'

Lastly, the phrase '"okee dokee"' gets treated as a single chunk,
expanding to...

'(title:"okee dokee" OR body:"okee dokee")'

(Note that the internal mechanism isn't literal text expansion --
QueryParser is using Query objects.)
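
The chunk-by-chunk expansion above can be imitated with plain strings. This is only a rough sketch of the idea, not QueryParser's implementation (which builds Query objects, as noted); the chunk-splitting regex is a simplification, and split_chunks and expand_chunk are hypothetical helper names.

```perl
use strict;
use warnings;

# Split a query string into chunks: an optional +/- prefix followed by
# either a quoted phrase or a bare word. A simplification of real syntax.
sub split_chunks {
    my ($query_string) = @_;
    my @chunks;
    while ( $query_string =~ /([+-]?)("[^"]*"|\S+)/g ) {
        push @chunks, { prefix => $1, text => $2 };
    }
    return @chunks;
}

# Expand one chunk into an OR-group spanning every field,
# keeping the chunk's +/- prefix on the group as a whole.
sub expand_chunk {
    my ( $chunk, @fields ) = @_;
    my $group = join ' OR ', map { $_ . ':' . $chunk->{text} } @fields;
    return $chunk->{prefix} . "($group)";
}

my @fields   = qw( title body );
my @expanded = map { expand_chunk( $_, @fields ) }
    split_chunks('+foo -bar "okee dokee"');
print "$_\n" for @expanded;
```

Because the prefix is applied to the whole group, a required chunk needs only one field to match, while a prohibited chunk rules out a match in any field.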

> Curiously, how would I do it in 0.12? Knowing that may help me
> understand
> the whole thing better.

That particular configuration is actually kind of hard to nail with
0.12. The "negate operator bug" that was fixed in 0.13 actually
affected queries in which all clauses are required too, of which your
'+foo +bar' is the perfect reduced example.

QueryParser's clever trick is to handle the string chunk by chunk.
There's no public API for squeezing chunks out of QueryParser one at
a time, though, so you can't duplicate the multi-field functionality
easily.

As a workaround, you can dump all content into one big field.

$doc->set_value( title => $title );
$doc->set_value( body => $body );
$doc->set_value( all_content => "$title $body" );

Then, you create a QueryParser against the all_content field, and
your search for '+foo +bar' returns the correct set of documents.

my $query_parser = KinoSearch::QueryParser::QueryParser->new(
    analyzer      => $analyzer,    # same analyzer used at index time
    default_field => 'all_content',
);
my $query = $query_parser->parse('+foo +bar');

Essentially, you are flattening the fields yourself, rather than
letting the QueryParser from KinoSearch 0.13 do it for you.

This option gets recommended all the time on the Lucene user's list,
and it's OK for small document sets. However, the relevancy from
that search will be inferior to a search performed against multiple
fields, because the title text gets dumped into all_content rather
than staying separate -- where, as a short field, it will
automatically be weighted more heavily. With large document sets,
relevancy becomes a major concern, and I recommend against this
technique.
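
As a rough illustration of why a short field carries more weight, here is the classic Lucene-style length normalization, 1/sqrt(number of tokens). Treat the formula as an assumption for this sketch: the exact factor KinoSearch applies may differ, and the token counts are invented.

```perl
use strict;
use warnings;

# Lucene-style length normalization: a match in a field with fewer
# tokens gets a larger weight. Assumed here for illustration only.
sub length_norm {
    my ($num_tokens) = @_;
    return 1 / sqrt($num_tokens);
}

my $title_norm       = length_norm(4);      # a 4-token title on its own
my $all_content_norm = length_norm(404);    # same title merged with a 400-token body
printf "title: %.3f  all_content: %.3f\n", $title_norm, $all_content_norm;
```

Under that assumption, a title match loses most of its advantage once the title is folded into all_content, which is the relevancy cost described above.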

Another option is to rewrite your requirements. :) Make sure that
'foo' and 'bar' come to you already split up -- say from different
HTML form fields -- so you don't need to rely on QueryParser to break
up the string and determine what's required/prohibited. Then, you
can build up your own compound BooleanQuery piece by piece.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/