Mailing List Archive

cvs commit: jakarta-lucene/src/java/org/apache/lucene/search QueryFilter.java
cutting 2002/08/05 11:05:56

Modified: . CHANGES.txt
Added: src/java/org/apache/lucene/search QueryFilter.java
Log:
Added QueryFilter class.

Revision Changes Path
1.30 +13 -2 jakarta-lucene/CHANGES.txt

Index: CHANGES.txt
===================================================================
RCS file: /home/cvs/jakarta-lucene/CHANGES.txt,v
retrieving revision 1.29
retrieving revision 1.30
diff -u -r1.29 -r1.30
--- CHANGES.txt 5 Aug 2002 17:39:03 -0000 1.29
+++ CHANGES.txt 5 Aug 2002 18:05:56 -0000 1.30
@@ -71,7 +71,18 @@
have been removed.

Finally, repeating a token with an increment of zero can also be
- used to boost scores of matches on that token.
+ used to boost scores of matches on that token. (cutting)
+
+ 14. Added new Filter class, QueryFilter. This constrains search
+ results to only match those which also match a provided query.
+ Results are cached, so that searches after the first on the same
+ index using this filter are very fast.
+
+ This could be used, for example, with a RangeQuery on a formatted
+ date field to implement date filtering. One could re-use a
+ single QueryFilter that matches, e.g., only documents modified
+ within the last week. The QueryFilter and RangeQuery would only
+ need to be reconstructed once per day. (cutting)


1.2 RC6



1.1 jakarta-lucene/src/java/org/apache/lucene/search/QueryFilter.java

Index: QueryFilter.java
===================================================================
package org.apache.lucene.search;

/* ====================================================================
* The Apache Software License, Version 1.1
*
* Copyright (c) 2001 The Apache Software Foundation. All rights
* reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
*
* 1. Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
*
* 2. Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in
* the documentation and/or other materials provided with the
* distribution.
*
* 3. The end-user documentation included with the redistribution,
* if any, must include the following acknowledgment:
* "This product includes software developed by the
* Apache Software Foundation (http://www.apache.org/)."
* Alternately, this acknowledgment may appear in the software itself,
* if and wherever such third-party acknowledgments normally appear.
*
* 4. The names "Apache" and "Apache Software Foundation" and
* "Apache Lucene" must not be used to endorse or promote products
* derived from this software without prior written permission. For
* written permission, please contact apache@apache.org.
*
* 5. Products derived from this software may not be called "Apache",
* "Apache Lucene", nor may "Apache" appear in their name, without
* prior written permission of the Apache Software Foundation.
*
* THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
* WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
* OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR
* ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
* SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
* LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
* USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
* ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
* OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
* OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
* SUCH DAMAGE.
* ====================================================================
*
* This software consists of voluntary contributions made by many
* individuals on behalf of the Apache Software Foundation. For more
* information on the Apache Software Foundation, please see
* <http://www.apache.org/>.
*/

import java.io.IOException;
import java.util.WeakHashMap;
import java.util.BitSet;
import org.apache.lucene.index.IndexReader;

/** Constrains search results to only match those which also match a provided
 * query. Results are cached, so that searches after the first on the same
 * index using this filter are much faster.
 *
 * <p> This could be used, for example, with a {@link RangeQuery} on a suitably
 * formatted date field to implement date filtering. One could re-use a single
 * QueryFilter that matches, e.g., only documents modified within the last
 * week. The QueryFilter and RangeQuery would only need to be reconstructed
 * once per day.
 */
public class QueryFilter extends Filter {
  private Query query;
  private transient WeakHashMap cache = new WeakHashMap();

  /** Constructs a filter which only matches documents matching
   * <code>query</code>.
   */
  public QueryFilter(Query query) {
    this.query = query;
  }

  public BitSet bits(IndexReader reader) throws IOException {

    synchronized (cache) { // check cache
      BitSet cached = (BitSet)cache.get(reader);
      if (cached != null)
        return cached;
    }

    final BitSet bits = new BitSet(reader.maxDoc());

    new IndexSearcher(reader).search(query, new HitCollector() {
      public final void collect(int doc, float score) {
        bits.set(doc); // set bit for hit
      }
    });


    synchronized (cache) { // update cache
      cache.put(reader, bits);
    }

    return bits;
  }
}




--
To unsubscribe, e-mail: <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>

RE: cvs commit: jakarta-lucene/src/java/org/apache/lucene/search QueryFilter.java
I assume that this means that my suggestion (for making Hits work as a
Filter) was discarded. Was there any particular reason why? Just
curious...

Scott

Re: cvs commit: jakarta-lucene/src/java/org/apache/lucene/search QueryFilter.java
Scott Ganyo wrote:
> I assume that this means that my suggestion (for making Hits work as a
> Filter) was discarded. Was there any particular reason why? Just
> curious...

Sorry I never got back to you.

This filter keeps a BitSet of all the matching documents. Hits does not
do this, and it would cause it to use a lot more memory to make it do
so. All of the queries which would never be used as a filter would thus
pay this penalty.
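
To put a rough number on that memory cost: the committed bits() method allocates a BitSet sized by reader.maxDoc(), so the cost grows with the size of the index rather than with the number of hits. The sketch below only estimates the raw bit storage, ignoring object overhead; the class name, the helper, and the one-million-document index are hypothetical and not part of Lucene.

```java
import java.util.BitSet;

// Rough estimate of the bit storage behind one filter result.
// Hypothetical helper for illustration, not a Lucene class.
class FilterMemoryEstimate {

    // java.util.BitSet keeps its bits in 64-bit words, so a set sized for
    // maxDoc documents needs ceil(maxDoc / 64) words of 8 bytes each.
    static long bytesFor(int maxDoc) {
        long words = ((long) maxDoc + 63) / 64;
        return words * 8;
    }

    public static void main(String[] args) {
        int maxDoc = 1000000; // assumed index size, chosen only for the example
        System.out.println(bytesFor(maxDoc) + " bytes of bits per result");
        // BitSet.size() reports the number of bits actually allocated.
        System.out.println(new BitSet(maxDoc).size() / 8 + " bytes allocated by BitSet");
    }
}
```

Under these assumptions each retained result costs about 125,000 bytes, whether it matched ten documents or a hundred thousand.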

Perhaps we should add a getQuery() method to Hits so that folks can
extract the query and use it to construct a filter or another query.
Would that meet your needs?

Note however that filters do not affect scoring. Using a QueryFilter is
not the same as requiring that same query in a BooleanQuery: the ranking
may be different.
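
A toy model may make the ranking difference concrete. It assumes, purely for illustration, that a required clause adds its own score to the main query's score; real Lucene scoring is more involved. None of the names below are Lucene classes, and the scores are invented.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model only: not Lucene classes, and "scores add" is an assumption.
class FilterVersusClause {

    // Filter semantics: keep the main query's scores, drop documents whose bit is clear.
    static Map<Integer, Float> withFilter(Map<Integer, Float> main, BitSet allowed) {
        Map<Integer, Float> out = new LinkedHashMap<Integer, Float>();
        for (Map.Entry<Integer, Float> e : main.entrySet()) {
            if (allowed.get(e.getKey())) {
                out.put(e.getKey(), e.getValue());
            }
        }
        return out;
    }

    // Required-clause semantics: a document must match both, and the scores combine.
    static Map<Integer, Float> withRequiredClause(Map<Integer, Float> main,
                                                  Map<Integer, Float> clause) {
        Map<Integer, Float> out = new LinkedHashMap<Integer, Float>();
        for (Map.Entry<Integer, Float> e : main.entrySet()) {
            Float other = clause.get(e.getKey());
            if (other != null) {
                out.put(e.getKey(), e.getValue() + other);
            }
        }
        return out;
    }

    // Returns the id of the highest-scoring document, or -1 if there is none.
    static int top(Map<Integer, Float> scores) {
        int best = -1;
        float bestScore = Float.NEGATIVE_INFINITY;
        for (Map.Entry<Integer, Float> e : scores.entrySet()) {
            if (e.getValue() > bestScore) {
                bestScore = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }

    // Invented scores of the main query, keyed by document id.
    static Map<Integer, Float> mainScores() {
        Map<Integer, Float> m = new LinkedHashMap<Integer, Float>();
        m.put(0, 0.9f);
        m.put(1, 0.5f);
        m.put(2, 0.4f);
        return m;
    }

    // Invented scores of the restricting query; it matches documents 1 and 2 only.
    static Map<Integer, Float> restrictionScores() {
        Map<Integer, Float> m = new LinkedHashMap<Integer, Float>();
        m.put(1, 0.1f);
        m.put(2, 0.8f);
        return m;
    }

    static int topWithFilter() {
        BitSet allowed = new BitSet();
        for (Integer doc : restrictionScores().keySet()) {
            allowed.set(doc);
        }
        return top(withFilter(mainScores(), allowed));
    }

    static int topWithRequiredClause() {
        return top(withRequiredClause(mainScores(), restrictionScores()));
    }

    public static void main(String[] args) {
        System.out.println("filter: top document is " + topWithFilter());
        System.out.println("required clause: top document is " + topWithRequiredClause());
    }
}
```

Both variants return the same two documents, but the filter ranks document 1 first while the required clause ranks document 2 first, because only the clause contributes to the score.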

Doug


RE: cvs commit: jakarta-lucene/src/java/org/apache/lucene/search QueryFilter.java
> This filter keeps a BitSet of all the matching documents. Hits does not
> do this, and it would cause it to use a lot more memory to make it do
> so. All of the queries which would never be used as a filter would thus
> pay this penalty.

My thought was that the Filter.bits() method on Hits would only resolve the
BitSet if it was asked for (and probably wouldn't even cache it), so in the
common case Hits wouldn't suffer any ill effect. Would that work? (I feel
like I'm missing something obvious...)

Thanks,
Scott

Re: cvs commit: jakarta-lucene/src/java/org/apache/lucene/search QueryFilter.java
Scott Ganyo wrote:
> My thought was that the Filter.bits() method on Hits would only resolve the
> BitSet if it was asked for (and probably wouldn't even cache it), so in the
> common case Hits wouldn't suffer any ill effect. Would that work? (I feel
> like I'm missing something obvious...)

One could do this, but I'm not sure what the advantage would be.

In your original message on this topic, you wrote:

Scott Ganyo wrote:
> But instead of adding a new class, why not change Hits to
> inherit from Filter and add the bits() method to it?
> Then one could "pipe" the output of one Query into another
> search without modifying the Queries...

If that's the goal, then a bits() method is not a great way to do this,
as it ignores the ranking in the first search when ranking the second.
Since that is a material difference, I prefer to make it explicit.

Filters are not designed for searching within an arbitrary result set.
For that you really should take the ranking for the first query into
account: a new query should be formed by adding clauses to the original
query. Filters are instead designed to search subsets of an index
defined by boolean criteria, criteria that do not affect ranking, like
date, language, postal code, document type, etc. They are particularly
useful when the same criterion is used repeatedly, and the bit vector
can be cached, as the construction and storage of a new bit-vector per
query is expensive. Thus the canonical uses of a filter should be to
implement things like "modified in last week", or "written in English"
or "in Word format", not a general-purpose "search within results".
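
The caching pattern described here can be sketched without Lucene. The model below mirrors the WeakHashMap-per-reader idea from the committed QueryFilter, but everything in it is a hypothetical stand-in: ToyReader plays the role of an IndexReader, and the "language" criterion replaces a real query. Unlike the committed class it synchronizes the whole method, which is a simplification.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.WeakHashMap;

// Simplified model of a cached boolean-criterion filter; not Lucene code.
class CachedCriterionFilter {

    // Hypothetical stand-in for an open index: one language code per document.
    static final class ToyReader {
        final String[] languages;
        ToyReader(String... languages) { this.languages = languages; }
    }

    private final String language;
    // Weak keys let a cached bit vector be collected once its reader is unreachable.
    private final Map<ToyReader, BitSet> cache = new WeakHashMap<ToyReader, BitSet>();
    int computations = 0; // how many times the bit vector was actually built

    CachedCriterionFilter(String language) {
        this.language = language;
    }

    synchronized BitSet bits(ToyReader reader) {
        BitSet cached = cache.get(reader);
        if (cached != null) {
            return cached; // repeated use of the same criterion is cheap
        }
        computations++;
        BitSet bits = new BitSet(reader.languages.length);
        for (int doc = 0; doc < reader.languages.length; doc++) {
            if (language.equals(reader.languages[doc])) {
                bits.set(doc); // document satisfies the criterion
            }
        }
        cache.put(reader, bits);
        return bits;
    }

    public static void main(String[] args) {
        ToyReader reader = new ToyReader("en", "de", "en", "fr");
        CachedCriterionFilter english = new CachedCriterionFilter("en");
        System.out.println(english.bits(reader));   // prints {0, 2}
        english.bits(reader);                       // served from the cache
        System.out.println(english.computations);   // prints 1
    }
}
```

The bit vector is built once per reader and reused for every later search with the same criterion, which is the case where a filter pays off.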

Does that make sense?

Doug

