Mailing List Archive

Sorting and seeking interaction
I'm trying to do a paged result set which is sorted by date. For
paging I'm using the ->seek method (or the offset and num_wanted
args to search, which comes to the same thing) and for sorting
I'm using a sort spec.

Unfortunately what it looks like is that the seek is happening
before the sort - which I guess makes sense from an implementation
point of view, but isn't expected behaviour from a user's point of
view.

To show how unexpected, here is some code which retrieves a given
number of results, then (hopefully) displays the earliest hit in
the dataset:

use Glob;
my $sort = KinoSearch::Search::SortSpec->new();
$sort->add( field => "date");

my $hits = Glob->searcher->search(
query => "tag:theology",
offset => 0,
num_wanted => shift(@ARGV),
sort_spec => $sort
);

my $hit = $hits->fetch_hit_hashref;
print $hit->{id}, " ", $hit->{date}, "\n";

Here's the earliest result when we're retrieving 10:
% perl test.pl 10
523 2005-10-05

And the earliest when we're retrieving 30:
% perl test.pl 30
836 2004-11-29

Now it is of course the earliest of the 10 hits, and the
earliest of the 30 hits, and of course these are different,
but what I really wanted was the earliest of the 10 earliest
hits and earliest of the 30 earliest hits and these ought to
be the same!

I don't know if this is a bug. It's a bug if setting a sort_spec
is expected to sort the document collection, but if it is
expected merely to sort the result set, then it's just a major
annoyance; either way it looks like I have to retrieve all the
hits and them sort them myself.

Is there a better way?

Simon
Sorting and seeking interaction [ In reply to ]
On Apr 14, 2007, at 9:26 AM, Simon Cozens wrote:

> use Glob;

Haven't seen 'Glob' before, and it doesn't look like this is it:

http://search.cpan.org/~holt/makepp-1.19/Glob.pm

Can you point me at the right module docs?

> Here's the earliest result when we're retrieving 10:
> % perl test.pl 10
> 523 2005-10-05
>
> And the earliest when we're retrieving 30:
> % perl test.pl 30
> 836 2004-11-29
>
> Now it is of course the earliest of the 10 hits, and the
> earliest of the 30 hits, and of course these are different,
> but what I really wanted was the earliest of the 10 earliest
> hits and earliest of the 30 earliest hits and these ought to
> be the same!
>
> I don't know if this is a bug.

Looks like a bug. Thank you for the report.

If your index has been incrementally updated and consists of more
than one segment, you may wish to try the current subversion trunk,
version 2306. Recent fixes to MultiTermList may affect the results
of sorted search.

svn co -r 2306 http://www.rectangular.com/svn/kinosearch/trunk
kinosearch

If that doesn't solve the problem, we'll need to isolate a failing
test case. The relevant test file, t/511-sort_spec.t, does not
contain a stress test at present. Your current problems suggest how
we might compose one.

> It's a bug if setting a sort_spec
> is expected to sort the document collection, but if it is
> expected merely to sort the result set, then it's just a major
> annoyance; either way it looks like I have to retrieve all the
> hits and them sort them myself.

That would be useless. KinoSearch is expected to work with result
sets reaching into the millions.

With a SortSpec in force, KS uses a different criteria for sorting
items in the priority queue where hits are collected. Instead of
documents with the highest scores winning the most favorable queue
positions, documents with the lowest "term number" win out. (Or
highest, depending on whether reverse sorting has been requested.)

During a search, all hits are passed through the hit collector; those
with the most desirable values are retained by the queue. No matter
how large the priority queue, the "best" should always remain the
best. Ordering of documents which "tie" will not necessarily be
consistent, but ordering of docs whose sort criteria differ -- as in
your case -- ought to be.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Sorting and seeking interaction [ In reply to ]
Marvin Humphrey wrote:
> Haven't seen 'Glob' before, and it doesn't look like this is it:

It's my blogging application, and currently it's all in a bit of flux
as I'm switching between Class::DBI and KinoSearch for the tagging
info...

> If your index has been incrementally updated and consists of more than
> one segment, you may wish to try the current subversion trunk, version
> 2306. Recent fixes to MultiTermList may affect the results of sorted
> search.

Nope, it's one segment. Here's a simple failing test case:

package Unanalyzed;
use base 'KinoSearch::Schema::FieldSpec';
sub analyzed { 0 }

package Schema;
use base 'KinoSearch::Schema';
our %FIELDS = (
id => "Unanalyzed",
tag => "KinoSearch::Schema::FieldSpec",
);
sub analyzer {
require KinoSearch::Analysis::PolyAnalyzer;
return KinoSearch::Analysis::PolyAnalyzer->new(language => 'en');
}

package main;
use File::Temp qw(tempdir);
use Test::More tests => 4;
my $dir = tempdir( CLEANUP=>1 );
use_ok("KinoSearch::InvIndexer");
use_ok("KinoSearch::Searcher");
use_ok("KinoSearch::Search::SortSpec");
my $indexer = KinoSearch::InvIndexer->new(invindex => Schema->create($dir));

$indexer->add_doc({ id => $_, tag => "test" }) for "aa".."bb";
$indexer->finish;

my $sort = KinoSearch::Search::SortSpec->new();
$sort->add( field => "id", reverse => 1);

sub get_a_hit {
my $searcher = KinoSearch::Searcher->new(invindex=> Schema->open($dir));
my $hits = $searcher->search(
query => "tag:test",
sort_spec => $sort,
num_wanted => shift
)->fetch_hit_hashref->{id};
}
is(get_a_hit(10), get_a_hit(30), "Hit results are sorted regardless of number");
Sorting and seeking interaction [ In reply to ]
At 10:19 -0700 2007.04.14, Marvin Humphrey wrote:
>During a search, all hits are passed through the hit collector; those
>with the most desirable values are retained by the queue. No matter
>how large the priority queue, the "best" should always remain the
>best. Ordering of documents which "tie" will not necessarily be
>consistent, but ordering of docs whose sort criteria differ -- as in
>your case -- ought to be.

Does that mean if I am picking the top 10 results, and the 10th and 11th
item are a tie, that when someone selects results 1 through 10, and then 11
through 20, that the 10th and 11th items could switch positions?

--
Chris Nandor pudge@pobox.com http://pudge.net/
Open Source Technology Group pudge@ostg.com http://ostg.com/
Sorting and seeking interaction [ In reply to ]
On Apr 14, 2007, at 11:59 AM, Chris Nandor wrote:

> At 10:19 -0700 2007.04.14, Marvin Humphrey wrote:
>> During a search, all hits are passed through the hit collector; those
>> with the most desirable values are retained by the queue. No matter
>> how large the priority queue, the "best" should always remain the
>> best. Ordering of documents which "tie" will not necessarily be
>> consistent, but ordering of docs whose sort criteria differ -- as in
>> your case -- ought to be.
>
> Does that mean if I am picking the top 10 results, and the 10th and
> 11th
> item are a tie, that when someone selects results 1 through 10, and
> then 11
> through 20, that the 10th and 11th items could switch positions?

Good thought.

Ties sort_field values are broken by score at present, but I think
tied scores could result in non-deterministic behavior, since the
heap sort used by the PriorityQueue is not a stable sort.

Breaking tied scores by document number should settle the issue.
Then, for a given IndexReader, the sort order should always be
consistent. Updates can mess with document numbers, so after an
update things could change, but I think we'll have to live with that.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Sorting and seeking interaction [ In reply to ]
On Apr 14, 2007, at 10:47 AM, Simon Cozens wrote:

> Nope, it's one segment. Here's a simple failing test case:

Excellent, thank you.

I believe the problem is fixed. Please try repository revision 2308.

svn co -r 2308 http://www.rectangular.com/svn/kinosearch/trunk
kinosearch

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Sorting and seeking interaction [ In reply to ]
At 12:20 -0700 2007.04.14, Marvin Humphrey wrote:
>Breaking tied scores by document number should settle the issue.
>Then, for a given IndexReader, the sort order should always be
>consistent. Updates can mess with document numbers, so after an
>update things could change, but I think we'll have to live with that.

Yeah, that sounds perfect.

Thanks,

--
Chris Nandor pudge@pobox.com http://pudge.net/
Open Source Technology Group pudge@ostg.com http://ostg.com/