Mailing List Archive: Sorting Options for Query Results

Sorting Options for Query Results

Nov 16, 2001, 3:01 PM

Post #1 of 10 (2408 views)

Hello. Does anyone know of a way to sort search results other than by
score? It seems like it would be very useful to be able to sort by date or
maybe even by any field that has been indexed (which I guess would include a
date). From what I can tell, Lucene does not provide any way to do this
beyond writing your own HitCollector. Is this correct? If so, has anyone
had any luck implementing alternate sorting methods? I have just started
experimenting with Lucene so any help is greatly appreciated.

Thanks,
Jeff

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>

RE: Sorting Options for Query Results [ In reply to ]

DCutting at grandcentral

Nov 16, 2001, 3:47 PM

Post #2 of 10 (2379 views)

Permalink

This is not easy to do efficiently. The efficiency of the search code
depends on not constructing Document objects for every match. Thus it is
hard to efficiently perform calculations which require field values.

Things are easy if you need date order, and you have added documents in date
order. In this case you can use the document number passed to the hit
collector, since document numbers increase linearly as documents are added
to an index. Hits are in fact passed to the hit collector in
document-number order, so you don't even have to sort.

But if you need another ordering, besides by-score or by-addition-time, you
will have to do more work. The most efficient approach is to construct an
array mapping document numbers to the value you wish to sort by. Then use
this array in your hit collector. The array can be re-used by many queries,
but must be re-constructed when documents are added or removed from the
index.

Such an array can be constructed with a TermEnum() and TermDocs(), as
illustrated in some pseudo code I sent out earlier today.

Note that, in a hit collector, it is more efficient to maintain a set of the
top-N hits rather than collecting all hits, sorting, and then taking the
top-N. IndexSearcher.search() illustrates how this should be done when
sorting by score.

Probably this should be generalized into a HitsByField class:

public class HitsByField {
private String[] fieldValues;
private Searcher searcher;

public HitsByField(String field, IndexReader reader) {
... construct fieldValues array per previous message ...
searcher = new IndexSearcher(reader);
}

private class ByFieldCollector implements HitCollector {
String minValue = "";
TreeMap topHits = new TreeMap();
int maxHits;
int totalHits;
public ByFieldCollector(int maxHits) { this.maxHits = maxHits; }
public collect(int doc, float score) {
totalHits++;
String value = fieldValues[doc];
if (minValue.compareTo(value) < 0) {
topHits.put(value, new Integer(doc));
if (topHits.size() > maxHits) {
topHits.remove(topHits.firstKey());
minValue = (String)topHits.firstKey();
}
}
}
public int getHits(Document[] hits) {
... put topHits in hits, using searcher.doc() ...
return totalHits;
}
}

/** Returns the total number of hits. Stores top hits in hits. */
public int search(Query query, Document[] hits) {
ByFieldCollector collector = new ByFieldCollector(hits.length);
searcher.search(query, collector);
return collector.getHits(hits);
}
}

I leave it as an excercise for someone to fill in the blanks and post a
complete, debugged version of this. If it's useful, we can add it to
Lucene.

Doug

> -----Original Message-----
> From: Jeff Kunkle [mailto:Jeff.Kunkle@HERNDON-LAB.com]
> Sent: Friday, November 16, 2001 2:01 PM
> To: lucene-user@jakarta.apache.org
> Subject: Sorting Options for Query Results
>
>
> Hello. Does anyone know of a way to sort search results other than by
> score? It seems like it would be very useful to be able to
> sort by date or
> maybe even by any field that has been indexed (which I guess
> would include a
> date). From what I can tell, Lucene does not provide any way
> to do this
> beyond writing your own HitCollector. Is this correct? If
> so, has anyone
> had any luck implementing alternate sorting methods? I have
> just started
> experimenting with Lucene so any help is greatly appreciated.
>
> Thanks,
> Jeff
>
> --
> To unsubscribe, e-mail:
> <mailto:lucene-user-unsubscribe@jakarta.apache.org>
> For additional commands, e-mail:
> <mailto:lucene-user-help@jakarta.apache.org>
>

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>

Re: Sorting Options for Query Results [ In reply to ]

jmadden at ics

Nov 16, 2001, 3:48 PM

Post #3 of 10 (2387 views)

Permalink

On Fri, 16 Nov 2001, Jeff Kunkle wrote:

> Hello. Does anyone know of a way to sort search results other than by
> score? It seems like it would be very useful to be able to sort by
> date or maybe even by any field that has been indexed (which I guess
> would include a date). From what I can tell, Lucene does not provide
> any way to do this beyond writing your own HitCollector. Is this
> correct? If so, has anyone had any luck implementing alternate
> sorting methods? I have just started experimenting with Lucene so any
> help is greatly appreciated.

If what you want to do is to arrange matters so that the Hits object
returned by the search is sorted in some order other than that given by
the built-in score function, I personally don't know how to do this other
than rewriting the HitCollector. If there *is* a way to do this, I'd also
be interested in knowing what it was.

However, if you're willing to re-sort the search results once you get
them, you should just be able to pull the Documents out of the Hits, copy
them into an array, and sort them according to whatever fields are in the
Document, using your favorite sorting algorithm. (Or at least I don't see
why you should not be able to do this.) This method would have the
advantage that you could allow the user the flexibility of being able to
re-sort the results ad hoc after the search returns, rather than forcing
them to choose a single ordering before the search.

Hope this helps...

Joshua

jmadden@ics.uci.edu...Obscurium Per Obscurius...www.ics.uci.edu/~jmadden
Joshua Madden: Information Scientist, Musician, Philosopher-At-Tall
It's that moment of dawning comprehension that I live for--Bill Watterson
My opinions are too rational and insightful to be those of any organization.

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>

RE: Sorting Options for Query Results [ In reply to ]

wdavies at overture

Nov 16, 2001, 5:00 PM

Post #4 of 10 (2398 views)

Permalink

Hi,

I was wondering why reader.document(i) is relatively slow ? Given
that it is, perhaps could anyone tell me if this would be work?
I'm currently trying to create a fast lookup, and the reader.doc(i)
is quite slow.

Question -- could I create this an array in the insertion order at
Indexing (assuming no adds/deletes?)

int i = 0;

for i from 1 to number of X's
add x to index;
xarray[i] = x;

....

Now, in HitCollector, I could simply refer to xarray[i] ?

Another question -- HitCollector -- Is there anyway for the
HitCollector to say to end the search ? I'm not quite sure how it is
used, but it seems that Hits and HitCollector don't work the same way
?

Cheers,
Winton

Winton Davies
Lead Engineer, Overture (NSDQ: OVER)
1820 Gateway Drive, Suite 360
San Mateo, CA 94404
work: (650) 403-2259
cell: (650) 867-1598
http://www.overture.com/

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>

RE: Sorting Options for Query Results [ In reply to ]

wdavies at overture

Nov 16, 2001, 7:53 PM

Post #5 of 10 (2390 views)

Permalink

Hi again :)

Hits versus HitCollector

I implemented HitCollector, but it didn't seem much faster. I'm
guessing its because Hits doesn't seem to run all the way through the
resultset before finishing whereas HitsCollector does ?

So, What I'd really like to know, is, does a Hits combined with
HitCollector make sense ?

Looking at the code, there seems to be a magical getMoreDocs, which
calls a private method that returns the topN results (without
enumerating the whole list?)

I also think I need to be able to externalise the whole Document
concept -- I don't need the data stored in the Index -- just the
DocID would suffice, then I could reference this from a big array (I
have a couple of Gigs of RAM to play with).

I'm down to about 200msecs response time for 8 million short records.
I need it below 50 (ideally around 5 :)).

Cheers,
Winton

Winton Davies
Lead Engineer, Overture (NSDQ: OVER)
1820 Gateway Drive, Suite 360
San Mateo, CA 94404
work: (650) 403-2259
cell: (650) 867-1598
http://www.overture.com/

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>

RE: Sorting Options for Query Results [ In reply to ]

Jeff.Kunkle at HERNDON-LAB

Nov 19, 2001, 7:53 AM

Post #6 of 10 (2384 views)

Permalink

This sounds like a good solution, but may not be viable in my situation. I
think I might run into problems since my index changes very often; several
times an hour. I don't think it would be very efficient to rebuild the
field mapped array after each new document is incrementally added to the
index. A solution could be to only remap them each day or each hour, but
unfortunately I need the documents available for searching as soon they are
indexed. I plan to experiment with this solution when time permits. I will
let you know how it goes or if I come up with any other ideas.

Regards,
Jeff

-----Original Message-----
From: Doug Cutting [mailto:DCutting@grandcentral.com]
Sent: Friday, November 16, 2001 5:47 PM
To: 'Lucene Users List'
Subject: RE: Sorting Options for Query Results

This is not easy to do efficiently. The efficiency of the search code
depends on not constructing Document objects for every match. Thus it is
hard to efficiently perform calculations which require field values.

Things are easy if you need date order, and you have added documents in date
order. In this case you can use the document number passed to the hit
collector, since document numbers increase linearly as documents are added
to an index. Hits are in fact passed to the hit collector in
document-number order, so you don't even have to sort.

But if you need another ordering, besides by-score or by-addition-time, you
will have to do more work. The most efficient approach is to construct an
array mapping document numbers to the value you wish to sort by. Then use
this array in your hit collector. The array can be re-used by many queries,
but must be re-constructed when documents are added or removed from the
index.

Such an array can be constructed with a TermEnum() and TermDocs(), as
illustrated in some pseudo code I sent out earlier today.

Note that, in a hit collector, it is more efficient to maintain a set of the
top-N hits rather than collecting all hits, sorting, and then taking the
top-N. IndexSearcher.search() illustrates how this should be done when
sorting by score.

Probably this should be generalized into a HitsByField class:

public class HitsByField {
private String[] fieldValues;
private Searcher searcher;

public HitsByField(String field, IndexReader reader) {
... construct fieldValues array per previous message ...
searcher = new IndexSearcher(reader);
}

private class ByFieldCollector implements HitCollector {
String minValue = "";
TreeMap topHits = new TreeMap();
int maxHits;
int totalHits;
public ByFieldCollector(int maxHits) { this.maxHits = maxHits; }
public collect(int doc, float score) {
totalHits++;
String value = fieldValues[doc];
if (minValue.compareTo(value) < 0) {
topHits.put(value, new Integer(doc));
if (topHits.size() > maxHits) {
topHits.remove(topHits.firstKey());
minValue = (String)topHits.firstKey();
}
}
}
public int getHits(Document[] hits) {
... put topHits in hits, using searcher.doc() ...
return totalHits;
}
}

/** Returns the total number of hits. Stores top hits in hits. */
public int search(Query query, Document[] hits) {
ByFieldCollector collector = new ByFieldCollector(hits.length);
searcher.search(query, collector);
return collector.getHits(hits);
}
}

I leave it as an excercise for someone to fill in the blanks and post a
complete, debugged version of this. If it's useful, we can add it to
Lucene.

Doug

> -----Original Message-----
> From: Jeff Kunkle [mailto:Jeff.Kunkle@HERNDON-LAB.com]
> Sent: Friday, November 16, 2001 2:01 PM
> To: lucene-user@jakarta.apache.org
> Subject: Sorting Options for Query Results
>
>
> Hello. Does anyone know of a way to sort search results other than by
> score? It seems like it would be very useful to be able to
> sort by date or
> maybe even by any field that has been indexed (which I guess
> would include a
> date). From what I can tell, Lucene does not provide any way
> to do this
> beyond writing your own HitCollector. Is this correct? If
> so, has anyone
> had any luck implementing alternate sorting methods? I have
> just started
> experimenting with Lucene so any help is greatly appreciated.
>
> Thanks,
> Jeff
>
> --
> To unsubscribe, e-mail:
> <mailto:lucene-user-unsubscribe@jakarta.apache.org>
> For additional commands, e-mail:
> <mailto:lucene-user-help@jakarta.apache.org>
>

--
To unsubscribe, e-mail:
<mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail:
<mailto:lucene-user-help@jakarta.apache.org>

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>

Re: Sorting Options for Query Results [ In reply to ]

carlson at bookandhammer

Nov 19, 2001, 10:23 AM

Post #7 of 10 (2387 views)

Permalink

I think this still works if the the document number continue to increase
by one when documents are added incrementally.
Does anyone know if this is true (I haven't looked at the code yet).

If so, you might increase your fieldValues array when you index the new
documents?
Potentially even serializing the array so it can persist or be added to
off-line.

Does that work for you?

--Peter

On Monday, November 19, 2001, at 06:53 AM, Jeff Kunkle wrote:

> This sounds like a good solution, but may not be viable in my
> situation. I
> think I might run into problems since my index changes very often;
> several
> times an hour. I don't think it would be very efficient to rebuild the
> field mapped array after each new document is incrementally added to the
> index. A solution could be to only remap them each day or each hour,
> but
> unfortunately I need the documents available for searching as soon they
> are
> indexed. I plan to experiment with this solution when time permits. I
> will
> let you know how it goes or if I come up with any other ideas.
>
> Regards,
> Jeff
>
> -----Original Message-----
> From: Doug Cutting [mailto:DCutting@grandcentral.com]
> Sent: Friday, November 16, 2001 5:47 PM
> To: 'Lucene Users List'
> Subject: RE: Sorting Options for Query Results
>
>
> This is not easy to do efficiently. The efficiency of the search code
> depends on not constructing Document objects for every match. Thus it
> is
> hard to efficiently perform calculations which require field values.
>
> Things are easy if you need date order, and you have added documents in
> date
> order. In this case you can use the document number passed to the hit
> collector, since document numbers increase linearly as documents are
> added
> to an index. Hits are in fact passed to the hit collector in
> document-number order, so you don't even have to sort.
>
> But if you need another ordering, besides by-score or by-addition-time,
> you
> will have to do more work. The most efficient approach is to construct
> an
> array mapping document numbers to the value you wish to sort by. Then
> use
> this array in your hit collector. The array can be re-used by many
> queries,
> but must be re-constructed when documents are added or removed from the
> index.
>
> Such an array can be constructed with a TermEnum() and TermDocs(), as
> illustrated in some pseudo code I sent out earlier today.
>
> Note that, in a hit collector, it is more efficient to maintain a set
> of the
> top-N hits rather than collecting all hits, sorting, and then taking the
> top-N. IndexSearcher.search() illustrates how this should be done when
> sorting by score.
>
> Probably this should be generalized into a HitsByField class:
>
> public class HitsByField {
> private String[] fieldValues;
> private Searcher searcher;
>
> public HitsByField(String field, IndexReader reader) {
> ... construct fieldValues array per previous message ...
> searcher = new IndexSearcher(reader);
> }
>
> private class ByFieldCollector implements HitCollector {
> String minValue = "";
> TreeMap topHits = new TreeMap();
> int maxHits;
> int totalHits;
> public ByFieldCollector(int maxHits) { this.maxHits = maxHits; }
> public collect(int doc, float score) {
> totalHits++;
> String value = fieldValues[doc];
> if (minValue.compareTo(value) < 0) {
> topHits.put(value, new Integer(doc));
> if (topHits.size() > maxHits) {
> topHits.remove(topHits.firstKey());
> minValue = (String)topHits.firstKey();
> }
> }
> }
> public int getHits(Document[] hits) {
> ... put topHits in hits, using searcher.doc() ...
> return totalHits;
> }
> }
>
> /** Returns the total number of hits. Stores top hits in hits. */
> public int search(Query query, Document[] hits) {
> ByFieldCollector collector = new ByFieldCollector(hits.length);
> searcher.search(query, collector);
> return collector.getHits(hits);
> }
> }
>
> I leave it as an excercise for someone to fill in the blanks and post a
> complete, debugged version of this. If it's useful, we can add it to
> Lucene.
>
> Doug
>
>> -----Original Message-----
>> From: Jeff Kunkle [mailto:Jeff.Kunkle@HERNDON-LAB.com]
>> Sent: Friday, November 16, 2001 2:01 PM
>> To: lucene-user@jakarta.apache.org
>> Subject: Sorting Options for Query Results
>>
>>
>> Hello. Does anyone know of a way to sort search results other than by
>> score? It seems like it would be very useful to be able to
>> sort by date or
>> maybe even by any field that has been indexed (which I guess
>> would include a
>> date). From what I can tell, Lucene does not provide any way
>> to do this
>> beyond writing your own HitCollector. Is this correct? If
>> so, has anyone
>> had any luck implementing alternate sorting methods? I have
>> just started
>> experimenting with Lucene so any help is greatly appreciated.
>>
>> Thanks,
>> Jeff
>>
>> --
>> To unsubscribe, e-mail:
>> <mailto:lucene-user-unsubscribe@jakarta.apache.org>
>> For additional commands, e-mail:
>> <mailto:lucene-user-help@jakarta.apache.org>
>>
>
> --
> To unsubscribe, e-mail:
> <mailto:lucene-user-unsubscribe@jakarta.apache.org>
> For additional commands, e-mail:
> <mailto:lucene-user-help@jakarta.apache.org>
>
> --
> To unsubscribe, e-mail: <mailto:lucene-user-
> unsubscribe@jakarta.apache.org>
> For additional commands, e-mail: <mailto:lucene-user-
> help@jakarta.apache.org>
>
>

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>

RE: Sorting Options for Query Results [ In reply to ]

DCutting at grandcentral

Nov 19, 2001, 10:24 AM

Post #8 of 10 (2392 views)

Permalink

> From: carlson@bookandhammer.com [mailto:carlson@bookandhammer.com]
>
> I think this still works if the the document number continue
> to increase
> by one when documents are added incrementally.
> Does anyone know if this is true (I haven't looked at the code yet).

Yes, that is true, so long as you do not delete documents. Deletions are
removed the next time the segment containing them is merged. You can force
removal of all deleted documents by optimizing an index.

> If so, you might increase your fieldValues array when you
> index the new
> documents?
> Potentially even serializing the array so it can persist or
> be added to
> off-line.

Those are good suggestions.

Also, I forgot to note earlier: you can drastically reduce the memory
requirements of this approach if a field value can be represented as an int
or long (e.g., a date). Java's String class uses a lot of memory, around 40
bytes for a 10-character string!

Doug

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>

Re: Sorting Options for Query Results [ In reply to ]

carlson at bookandhammer

Nov 19, 2001, 11:02 AM

Post #9 of 10 (2392 views)

Permalink

What happens to the document numbers when documents are deleted and the
segment merged?
Is the document number of all the documents reset to be sequential based
on some offset for each segment?

Is there any pattern that might be followed instead of recreating the
entire array. I know this probably won't be a problem unless creating
the entire array takes a long time, but
I think I might be in the case.

Thanks

--Peter

On Monday, November 19, 2001, at 09:24 AM, Doug Cutting wrote:

> Deletions are
> removed the next time the segment containing them is merged. You can
> force
> removal of all deleted documents by optimizing an index.
>

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>

RE: Sorting Options for Query Results [ In reply to ]

DCutting at grandcentral

Nov 19, 2001, 1:13 PM

Post #10 of 10 (2393 views)

Permalink

> From: carlson@bookandhammer.com [mailto:carlson@bookandhammer.com]
>
> What happens to the document numbers when documents are
> deleted and the
> segment merged?
> Is the document number of all the documents reset to be
> sequential based
> on some offset for each segment?
>
> Is there any pattern that might be followed instead of recreating the
> entire array. I know this probably won't be a problem unless creating
> the entire array takes a long time, but
> I think I might be in the case.

Yes, there is most definitely a pattern. Document order is preserved.
After optimization, numbering is dense. For example, if an index contains
six documents numbered 0 through 5, and documents 1, 3 and 4 are deleted,
then, after optimization, it will contain 0, 2, and 5, but now numbered 0,
1, 2.

Doug

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>