Mailing List Archive

Question on the FAQ list with filters
From the FAQ:

***
16. What is filtering and how is it performed ?

Filtering means imposing additional restriction on the hit list to eliminate
hits that otherwise would be included in the search results. There are two
ways to filter hits:

* Search Query - in this approach, provide your custom filter object to the
when you call the search() method. This filter will be called exactly once
to evaluate every document that resulted in non zero score.

* Selective Collection - in this approach you perform the regular search and
when you get back the hit list, collect only those that matches your
filtering criteria. In this approach, your filter is called only for hits
that returned by the search method which may be only a subset of the non
zero matches (useful when evaluating your search filter is expensive).

***

I don't see why the second way is useful. Yes, your filter is called only
for hits that got returned by the search method, but aren't those the same
hits that the search() method would run through the filter? Maybe I'm just
not reading it close enough.

Is my assumption that it is faster to provide a filter to the search()
method, than to do a selective collation correct?





--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>
Re: Question on the FAQ list with filters
On Wed, Mar 27, 2002 at 03:52:21PM -0600, Armbrust, Daniel C. wrote:
> From the FAQ:
> 16. What is filtering and how is it performed ?
> * Search Query - in this approach, provide your custom filter object to the
> when you call the search() method. This filter will be called exactly once
> to evaluate every document that resulted in non zero score.
> * Selective Collection - in this approach you perform the regular search and
> when you get back the hit list, collect only those that matches your
> filtering criteria. In this approach, your filter is called only for hits
> that returned by the search method which may be only a subset of the non
> zero matches (useful when evaluating your search filter is expensive).
>
> ***
>
> I don't see why the second way is useful. Yes, your filter is called only
> for hits that got returned by the search method, but aren't those the same
> hits that the search() method would run through the filter? Maybe I'm just
> not reading it close enough.
>
> Is my assumption that it is faster to provide a filter to the search()
> method, than to do a selective collation correct?

"It Depends." That's more or less the point of the FAQ answer,
though it could be more clearly expressed. The gist of the FAQ seems
to be that you can either do the filtering BEFORE you do the search,
or AFTER you do the search.

Obviously the question is, which is more expensive, filtering out
inappropriate documents, or searching for the possible hits? If
filtering is cheaper, you do the filtering first, then do the search.
If filtering is expensive, you do the search first, then do the
filtering. You should also factor in which is more restrictive - will
either the filter or the search drop out a large number of the
documents? If you can arrange it so one is both cheaper and drops out
the majority of the documents, you win.
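
A back-of-the-envelope sketch of that trade-off, counting only filter evaluations. All figures (index size, hit count, cost per evaluation) are made-up assumptions for illustration, and the class and method names are hypothetical:

```java
// Rough cost model: how many filter evaluations each ordering needs.
// Every number used in main() is an assumption, not a measurement.
class FilterCostSketch {
    // Filtering first: the filter is evaluated for every document in the index.
    static long filterFirstCost(long totalDocs, long costPerEvaluation) {
        return totalDocs * costPerEvaluation;
    }

    // Filtering afterwards: the filter is evaluated only for hits actually examined.
    static long filterAfterCost(long hitsExamined, long costPerEvaluation) {
        return hitsExamined * costPerEvaluation;
    }

    public static void main(String[] args) {
        long totalDocs = 100_000; // assumed index size
        long hits = 500;          // assumed number of hits for the query
        long cost = 2;            // assumed cost units per filter evaluation
        System.out.println("filter first: " + filterFirstCost(totalDocs, cost));
        System.out.println("filter after: " + filterAfterCost(hits, cost));
    }
}
```

With these assumed numbers, filtering afterwards does far fewer evaluations; a cheap filter over a small index could tip the balance the other way.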

In either case, you implement some sort of object which you can
hand an org.apache.lucene.index.TermDocs and get back a yes or no as
to whether it's a valid possible search result.

From looking at the source for:

org.apache.lucene.search.Filter,
org.apache.lucene.search.DateFilter, and
org.apache.lucene.search.IndexSearcher,

...it appears that you instantiate your Filter subclass, then for
filtering BEFORE the search, you pass YourFilter an IndexReader and
get back a BitSet. Or more to the point, when you invoke
IndexSearcher.search(), you pass it YourFilter, and a HitCollector,
and IndexSearcher.search() gets the BitSet from YourFilter.

A BitSet, from the JDK API, is a vector of bit values (i.e. 1 or
0, corresponding to the Java boolean values true and false).
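
For readers who have not used it, a minimal java.util.BitSet example. The class name BitSetDemo is just for illustration:

```java
import java.util.BitSet;

class BitSetDemo {
    // Builds a small BitSet with bits 1 and 4 switched on.
    static BitSet build() {
        BitSet bits = new BitSet(8); // all bits start out false
        bits.set(1);
        bits.set(4);
        return bits;
    }

    public static void main(String[] args) {
        BitSet bits = build();
        System.out.println(bits.get(1));        // true
        System.out.println(bits.get(2));        // false
        System.out.println(bits.cardinality()); // 2 bits are set
        System.out.println(bits);               // {1, 4}
    }
}
```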

It appears, from looking at the source, that each bit in the
BitSet corresponds to the document at the same sequential position
(its document number) in the index. IndexSearcher.search() has an
inner class (this is a bit ambiguous and it's been a year since I've
looked at inner classes, so I'm going to just handwave and move along
:-) with a collect() method that is called for each hit and skips the
ones for which BitSet.get() returns false.
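
The wrapping idea described above can be sketched without Lucene on the classpath. HitSink here is a hypothetical stand-in for a hit-collecting callback, not the Lucene class, and the document numbers are invented:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

class BitSetFilteringSketch {
    // Hypothetical stand-in for a hit-collecting callback (not Lucene's own type).
    interface HitSink {
        void collect(int doc, float score);
    }

    // Wraps a sink so that only documents whose bit is set get passed on.
    static HitSink filtered(BitSet allowed, HitSink target) {
        return (doc, score) -> {
            if (allowed.get(doc)) {
                target.collect(doc, score);
            }
        };
    }

    // Feeds five pretend hits through the wrapper and returns the survivors.
    static List<Integer> demo() {
        BitSet allowed = new BitSet();
        allowed.set(0);
        allowed.set(3);
        List<Integer> kept = new ArrayList<>();
        HitSink sink = filtered(allowed, (doc, score) -> kept.add(doc));
        for (int doc = 0; doc < 5; doc++) {
            sink.collect(doc, 1.0f); // pretend documents 0..4 all scored non-zero
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [0, 3]
    }
}
```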

I'm not sure exactly how you would use an
org.apache.lucene.search.Filter to do the filtering AFTER, but
presumably that would involve just handing it the TermDocs in
question, or maybe IndexReader and Hits both implement a common
interface... uhm, no, that's not it. Well, I guess you use your own
class for the filter. That's what I ended up doing anyway, in my
ignorance of the Filter abstract class. I ended up doing my filtering
AFTER, btw, because it involved some expensive lookups in other
documents.
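
A minimal sketch of filtering AFTER the search with your own class, along the lines described above. The hit list is represented as plain document numbers, and expensiveCheck() is a hypothetical placeholder for whatever costly lookup applies; neither comes from the Lucene API:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.IntPredicate;

class PostSearchFilterSketch {
    // Keeps only the hits the caller's own check accepts,
    // preserving the order in which the search returned them.
    static List<Integer> keepAccepted(List<Integer> hitDocs, IntPredicate accept) {
        List<Integer> kept = new ArrayList<>();
        for (int doc : hitDocs) {
            if (accept.test(doc)) {
                kept.add(doc);
            }
        }
        return kept;
    }

    // Placeholder for an expensive check, e.g. a lookup in other documents.
    static boolean expensiveCheck(int doc) {
        return doc % 3 != 0; // arbitrary rule, for illustration only
    }

    public static void main(String[] args) {
        List<Integer> hitDocs = Arrays.asList(12, 7, 30, 4); // pretend search results
        System.out.println(keepAccepted(hitDocs, PostSearchFilterSketch::expensiveCheck)); // [7, 4]
    }
}
```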

There's actually a third option: figure out a way to implement
your filter as an additional boolean clause on your search. However,
that may or may not be feasible, or the Lucene Filter mechanism may
not have been intended to address such cases.

To be honest, the design of the Filter seems less
well-thought-out than the rest of Lucene, like it's an afterthought.
I really oughta join the developers list, I guess, so I can put my
money where my mouth is, and submit changes to clarify the docs, etc,
when I go roaming through the source.

Steven J. Owens
puff@darksleep.com

Re: Question on the FAQ list with filters
The API provides the Filter mechanism for filtering out hits before they are
searched.

The alternative is to write your own classes to filter out documents
returned after searching.
If it's not efficient to check every single document, or the results
cannot be obtained in batch, then this method is probably better.

I currently run a query through my database to return a list of documents
which a particular user is allowed to access. The Filter method thus makes
a good deal of sense for me, since I'm able to obtain the results in batch.
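
A sketch of how such a batch result might be turned into the BitSet a filter hands back. It assumes the database result has already been mapped to index document numbers, a step the post does not describe; the class and method names are hypothetical:

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.Collection;

class AccessBitsSketch {
    // Turns a batch of permitted document numbers into a BitSet with one
    // bit per document in the index (maxDoc = number of documents).
    static BitSet allowedBits(int maxDoc, Collection<Integer> allowedDocNumbers) {
        BitSet bits = new BitSet(maxDoc);
        for (int doc : allowedDocNumbers) {
            if (doc >= 0 && doc < maxDoc) { // ignore numbers outside the index
                bits.set(doc);
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        // Pretend the database said this user may see documents 2 and 5;
        // 42 is out of range for a 10-document index and is ignored.
        System.out.println(allowedBits(10, Arrays.asList(2, 5, 42))); // {2, 5}
    }
}
```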

From my interpretation of the FAQ, it seems that you're expected to write
your own code to perform post-search filtering and not use/subclass the
Filter class. Of course, this could be made slightly clearer...

HTH

Regards,
Kelvin
----- Original Message -----
From: "Armbrust, Daniel C." <Armbrust.Daniel@mayo.edu>
To: "'Lucene Users List'" <lucene-user@jakarta.apache.org>
Sent: Thursday, March 28, 2002 5:52 AM
Subject: Question on the FAQ list with filters


> [...]


Re: Question on the FAQ list with filters
On Wed, Mar 27, 2002 at 09:26:29PM -0500, I wrote:
> On Wed, Mar 27, 2002 at 03:52:21PM -0600, Armbrust, Daniel C. wrote:
> > From the FAQ:
> > 16. What is filtering and how is it performed ?
> [...]
> > Is my assumption that it is faster to provide a filter to the search()
> > method, than to do a selective collation correct?
>
> "It Depends." That's more or less the point of the FAQ answer,
> though it could be more clearly expressed. The gist of the FAQ seems
> to be that you can either do the filtering BEFORE you do the search,
> or AFTER you do the search.
>
> Obviously the question is, which is more expensive, filtering out
> inappropriate documents, or searching for the possible hits? If
> filtering is cheaper, you do the filtering first, then do the search.
> If filtering is expensive, you do the search first, then do the
> filtering. You should also factor in which is more restrictive - will
> either the filter or the search drop out a large number of the
> documents? If you can arrange it so one is both cheaper and drops out
> the majority of the documents, you win.

I meant to add, here, that many applications that do searching
and filtering will display the hits only a chunk at a time (typical
web search interface). This is another situation where it would make
a lot more sense to filter after the search, since you'd only have to
filter a relatively small portion of the hits for each page of results
the user asks for. On top of that, the user may in fact get what they
were looking for in the first page or two of results.
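
A sketch of that page-at-a-time approach: walk the ranked hits, apply the filter lazily, and stop as soon as the requested page is full. The hit list is represented as plain document numbers and the even-numbers-only filter is an arbitrary stand-in; none of the names come from Lucene:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

class PagedFilterSketch {
    // Returns one page of accepted hits, evaluating the filter only until
    // the requested page has been filled.
    static List<Integer> page(List<Integer> rankedDocs, IntPredicate accept,
                              int pageIndex, int pageSize) {
        List<Integer> page = new ArrayList<>();
        int toSkip = pageIndex * pageSize; // accepted hits belonging to earlier pages
        for (int doc : rankedDocs) {
            if (!accept.test(doc)) continue;
            if (toSkip > 0) { toSkip--; continue; }
            page.add(doc);
            if (page.size() == pageSize) break; // page is full, stop filtering
        }
        return page;
    }

    public static void main(String[] args) {
        List<Integer> ranked = new ArrayList<>();
        for (int doc = 0; doc < 1000; doc++) ranked.add(doc); // pretend 1000 hits
        int[] calls = {0};
        IntPredicate evenOnly = doc -> { calls[0]++; return doc % 2 == 0; };
        System.out.println(page(ranked, evenOnly, 0, 5)); // [0, 2, 4, 6, 8]
        System.out.println(calls[0]);                     // 9 filter calls, not 1000
    }
}
```

Note that in this sketch a later page re-filters the hits of the earlier pages; caching the accepted hits would avoid that.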

Steven J. Owens
puff@darksleep.com

RE: Question on the FAQ list with filters
<paste> I meant to add, here, that many applications that do searching
and filtering will display the hits only a chunk at a time (typical
web search interface). This is another situation where it would make
a lot more sense to filter after the search, since you'd only have to
filter a relatively small portion of the hits for each page of results
the user asks for.
</paste>


How nice it is to have a list like this where there are thoughtful replies
given. Thanks all!

I don't know why I didn't think of this case last night. The various ways
make a lot more sense now.

Dan
