Mailing List Archive: PhrasePrefixQuery.java and MultipleTermPositions.java

PhrasePrefixQuery.java and MultipleTermPositions.java

May 15, 2002, 5:22 AM

Post #1 of 6 (633 views)

Hello all,

Attached are two classes that handles Phrase queries of the form "microsoft
app*" where app* is supposed to match all words starting with "app".

It's also possible to handle queries where the prefix-term(s) are in the
middle of the phrase like so: "Microsoft* app*"

---

PhrasePrefixQuery

PhrasePrefixQuery.java is a generalized version of PhraseQuery.java, with an
added method add(Term[]).

To use this class, to search for the phrase "Microsoft app*" first use
add(Term) on the term "Microsoft", then find all terms that has "app" as
prefix using IndexReader.terms(Term), and use PhrasePrefixQuery.add(Term[]
terms) to add them to the query.

Known Issues: the method toString() assumes that the first term in a array
of terms is the prefix for the whole array. That might not necessarily be
so.

MultipleTermPositions

MultipleTermPositions.java is a class that implements the TermPositions
interface, and behaves like a single TermPosition would do iterating through
<doc, freq, <pos1, pos2, .. , posn>> tuples, but it handles multiple
TermPositions at once by keeping them in a queue.

Using this class, it was easy to write PhrasePrefixQuery reusing the
ExactPhraseScorer and SloppyPhraseScorer directly.

Known Issues: Doesn't fully implement the TermDocs interface, leaving
read(int[], int[]) and seek(Term) unsupported.

Other Comments

It would possibly be a good idea for IndexReader to have a method
IndexReader.termPositions(Term[]) which could return a
MultipleTermPositions-object. But for now the constructor takes an
IndexReader and an array of Terms.

Also, to fully integrate this into Lucene, code would have to be added to
QueryParser that handles queries of this type.

regards,
Anders Nielsen

RE: PhrasePrefixQuery.java and MultipleTermPositions.java [ In reply to ]

cutting at lucene

May 15, 2002, 10:49 AM

Post #2 of 6 (608 views)

Permalink

This looks great!

The MultipleTermPositions implementation could be made more efficient if it:
1. avoided constructing an Integer for each position
2. avoided using a SortedSet to collect positions

These could be achieved by changing your PriorityQueue to be sorted by not
just document, but also by position within document, as in PhraseQueue.

Another optimization would be to use PriorityQueue adjustTop() instead of
pop() and put() -- it's twice as fast.

After these changes, a typical call to MultipleTermPositions.next() would
consist of a call to PriorityQueue.top(), a call to TermPositions.next(),
and a call to PriorityQueue().adjustTop().

I hope this makes sense.

Doug

> -----Original Message-----
> From: Anders Nielsen
> [mailto:anders.at.visator.com@cutting.at.lucene.com]
> Sent: Wednesday, May 15, 2002 5:23 AM
> To: dcutting@grandcentral.com
> Subject: PhrasePrefixQuery.java and MultipleTermPositions.java
>
>
> Hello all,
>
> Attached are two classes that handles Phrase queries of the
> form "microsoft
> app*" where app* is supposed to match all words starting with "app".
>
> It's also possible to handle queries where the prefix-term(s)
> are in the
> middle of the phrase like so: "Microsoft* app*"
>
> ---
>
> PhrasePrefixQuery
>
> PhrasePrefixQuery.java is a generalized version of
> PhraseQuery.java, with an
> added method add(Term[]).
>
> To use this class, to search for the phrase "Microsoft app*" first use
> add(Term) on the term "Microsoft", then find all terms that
> has "app" as
> prefix using IndexReader.terms(Term), and use
> PhrasePrefixQuery.add(Term[]
> terms) to add them to the query.
>
> Known Issues: the method toString() assumes that the first
> term in a array
> of terms is the prefix for the whole array. That might not
> necessarily be
> so.
>
>
> MultipleTermPositions
>
> MultipleTermPositions.java is a class that implements the
> TermPositions
> interface, and behaves like a single TermPosition would do
> iterating through
> <doc, freq, <pos1, pos2, .. , posn>> tuples, but it handles multiple
> TermPositions at once by keeping them in a queue.
>
> Using this class, it was easy to write PhrasePrefixQuery reusing the
> ExactPhraseScorer and SloppyPhraseScorer directly.
>
> Known Issues: Doesn't fully implement the TermDocs interface, leaving
> read(int[], int[]) and seek(Term) unsupported.
>
>
> Other Comments
>
> It would possibly be a good idea for IndexReader to have a method
> IndexReader.termPositions(Term[]) which could return a
> MultipleTermPositions-object. But for now the constructor takes an
> IndexReader and an array of Terms.
>
> Also, to fully integrate this into Lucene, code would have to
> be added to
> QueryParser that handles queries of this type.
>
>
> regards,
> Anders Nielsen
>
>

--
To unsubscribe, e-mail: <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>

RE: PhrasePrefixQuery.java and MultipleTermPositions.java [ In reply to ]

anders at visator

May 15, 2002, 11:59 AM

Post #3 of 6 (613 views)

Permalink

Regarding sorting by position: I made the MultipleTermPositions.next() pull
out all the positions from each TermPositions because there doesn't seem to
be a way to peek at the current position in a TermPositions object without
moving to the next position (it's like Heisenberg's uncertainty principle:
Observing the object, changes it's state).

Alternatively to putting all the positions into a SortedSet I could either
make a TermPositions-wrapper (ala PhrasePosition) that keeps the next
position as a public variable and make the queue sort on both doc and
position, or the sorted list of positions could be made with an array of
ints instead of a SortedSet.

I think the latter is probably the most efficient, but I can't really tell
for sure unless I implement both and test. Any suggestions?

Regarding using adjustTop() instead of pop()-push(): sounds good, I'll make
that chance asap.

regards,
Anders Nielsen

-----Original Message-----
From: cutting@lucene.com
To: lucene-dev@jakarta.apache.org
Sent: 15-05-2002 19:49
Subject: RE: PhrasePrefixQuery.java and MultipleTermPositions.java

This looks great!

The MultipleTermPositions implementation could be made more efficient if
it:
1. avoided constructing an Integer for each position
2. avoided using a SortedSet to collect positions

These could be achieved by changing your PriorityQueue to be sorted by
not
just document, but also by position within document, as in PhraseQueue.

Another optimization would be to use PriorityQueue adjustTop() instead
of
pop() and put() -- it's twice as fast.

After these changes, a typical call to MultipleTermPositions.next()
would
consist of a call to PriorityQueue.top(), a call to
TermPositions.next(),
and a call to PriorityQueue().adjustTop().

I hope this makes sense.

Doug

> -----Original Message-----
> From: Anders Nielsen
> [mailto:anders.at.visator.com@cutting.at.lucene.com]
> Sent: Wednesday, May 15, 2002 5:23 AM
> To: dcutting@grandcentral.com
> Subject: PhrasePrefixQuery.java and MultipleTermPositions.java
>
>
> Hello all,
>
> Attached are two classes that handles Phrase queries of the
> form "microsoft
> app*" where app* is supposed to match all words starting with "app".
>
> It's also possible to handle queries where the prefix-term(s)
> are in the
> middle of the phrase like so: "Microsoft* app*"
>
> ---
>
> PhrasePrefixQuery
>
> PhrasePrefixQuery.java is a generalized version of
> PhraseQuery.java, with an
> added method add(Term[]).
>
> To use this class, to search for the phrase "Microsoft app*" first use
> add(Term) on the term "Microsoft", then find all terms that
> has "app" as
> prefix using IndexReader.terms(Term), and use
> PhrasePrefixQuery.add(Term[]
> terms) to add them to the query.
>
> Known Issues: the method toString() assumes that the first
> term in a array
> of terms is the prefix for the whole array. That might not
> necessarily be
> so.
>
>
> MultipleTermPositions
>
> MultipleTermPositions.java is a class that implements the
> TermPositions
> interface, and behaves like a single TermPosition would do
> iterating through
> <doc, freq, <pos1, pos2, .. , posn>> tuples, but it handles multiple
> TermPositions at once by keeping them in a queue.
>
> Using this class, it was easy to write PhrasePrefixQuery reusing the
> ExactPhraseScorer and SloppyPhraseScorer directly.
>
> Known Issues: Doesn't fully implement the TermDocs interface, leaving
> read(int[], int[]) and seek(Term) unsupported.
>
>
> Other Comments
>
> It would possibly be a good idea for IndexReader to have a method
> IndexReader.termPositions(Term[]) which could return a
> MultipleTermPositions-object. But for now the constructor takes an
> IndexReader and an array of Terms.
>
> Also, to fully integrate this into Lucene, code would have to
> be added to
> QueryParser that handles queries of this type.
>
>
> regards,
> Anders Nielsen
>
>

--
To unsubscribe, e-mail:
<mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail:
<mailto:lucene-dev-help@jakarta.apache.org>

--
To unsubscribe, e-mail: <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>

RE: PhrasePrefixQuery.java and MultipleTermPositions.java [ In reply to ]

cutting at lucene

May 15, 2002, 1:19 PM

Post #4 of 6 (613 views)

Permalink

> From: Anders Nielsen
>
> Alternatively to putting all the positions into a SortedSet I
> could either
> make a TermPositions-wrapper (ala PhrasePosition) that keeps the next
> position as a public variable and make the queue sort on both doc and
> position, or the sorted list of positions could be made with
> an array of
> ints instead of a SortedSet.
>
> I think the latter is probably the most efficient, but I
> can't really tell
> for sure unless I implement both and test. Any suggestions?

To make the latter efficient you would need to be sure to reuse the same
array, so you'd need a TermPositions wrapper anyway, to keep the associated
array. So I would vote for the former, using a TermPositions wrapper that
holds the current document and position.

Doug

--
To unsubscribe, e-mail: <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>

RE: PhrasePrefixQuery.java and MultipleTermPositions.java [ In reply to ]

anders at visator

May 15, 2002, 2:54 PM

Post #5 of 6 (610 views)

Permalink

Here's an updated version of MultipleTermPositions.java.

In this version next() uses adjustTop() instead of pop()-push() as per
Doug's suggestion.

And instead of a SortedSet for keeping positions, it keeps them in a class
that reuses an int[].

regards,
Anders Nielsen

RE: PhrasePrefixQuery.java and MultipleTermPositions.java [ In reply to ]

otis_gospodnetic at yahoo

Jul 17, 2002, 9:44 PM

Post #6 of 6 (617 views)

Permalink

Is anyone using this (+ PhrasePrefixQuery)?
I am trying to test them, but my unit test is failing, and I'm not sure
if I'm using this wrong or if the submitted code doesn't work as
described.

The test looks like this:

public void testPhrasePrefix()
throws IOException
{
RAMDirectory indexStore = new RAMDirectory();
IndexWriter writer = new IndexWriter(indexStore, new
SimpleAnalyzer(), true);
Document doc1 = new Document();
Document doc2 = new Document();
Document doc3 = new Document();
doc1.add(Field.Text("body", "blueberry pie"));
doc2.add(Field.Text("body", "blueberry pancake"));
doc3.add(Field.Text("body", "blueberry muffin"));
writer.addDocument(doc1);
writer.addDocument(doc2);
writer.addDocument(doc3);
writer.optimize();
writer.close();

IndexSearcher searcher = new IndexSearcher(indexStore);

PhrasePrefixQuery query = new PhrasePrefixQuery();
query.add(new Term("body", "blueberry"));

IndexReader ir = IndexReader.open(indexStore);
TermEnum te = ir.terms(new Term("body", "pi*"));
do {
query.add(te.term());
} while (te.next());

Hits result = searcher.search(query);
assertEquals(2, result.length());
}
}

Result:

Testcase: testPhrasePrefix took 1.047 sec
FAILED
expected:<2> but was:<0>

Thanks,
Otis

--- Anders Nielsen <anders@visator.com> wrote:
>
> Here's an updated version of MultipleTermPositions.java.
>
> In this version next() uses adjustTop() instead of pop()-push() as
> per
> Doug's suggestion.
>
> And instead of a SortedSet for keeping positions, it keeps them in a
> class
> that reuses an int[].
>
>
> regards,
> Anders Nielsen
>
>
>

> ATTACHMENT part 2 application/octet-stream
name=MultipleTermPositions.java
> --
> To unsubscribe, e-mail:
> <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
> For additional commands, e-mail:
<mailto:lucene-dev-help@jakarta.apache.org>

__________________________________________________
Do You Yahoo!?
Yahoo! Autos - Get free new car price quotes
http://autos.yahoo.com

--
To unsubscribe, e-mail: <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>

Mailing List Archive

Attached Files:

Attached Files: