Mailing List Archive

getting total hits before a seek
Using v0.15 (still)

I have a pretty healthy document collection (around 15 million) that gets
moderate traffic (260k searches a day) and have been working on
improving performance as searches have crept into the >1s range.

My search server required the total number of hits to be returned
before seeking results ... mainly to short circuit some expensive
pre-processing, but we don't need to get into that here :-).

Anyway, I discovered that calling total_hits on a hits object BEFORE
calling seek on the hits object actually triggers the default 0,100
seek:

KinoSearch/Search/Hits.pm, line 67:

sub total_hits {
    my $self = shift;
    $self->seek( 0, 100 )
        unless defined $self->{total_hits};
    return $self->{total_hits};
}

In my case, I reordered the pre-processing to avoid calling total_hits
until after running my 0,10 seek. This cut total search time by more than half (obviously).
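For the record, the reordering boils down to something like this fragment (0.15-era API; $searcher and $query are assumed to already exist, so this is a sketch rather than runnable code):

```perl
# Sketch only -- setup omitted. Assumes an existing
# KinoSearch::Searcher object in $searcher and a query in $query.
my $hits = $searcher->search( query => $query );

# Seek first, sized to what we actually display ...
$hits->seek( 0, 10 );

# ... and total_hits is now populated, so calling it doesn't
# trigger the hidden seek( 0, 100 ).
my $total = $hits->total_hits;
```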

Just out of curiosity, why is a seek required to populate total hits?

--
Brett Paden
paden@multiply.com
getting total hits before a seek [ In reply to ]
On Mar 8, 2007, at 9:16 AM, Brett Paden wrote:

> Just out of curiosity, why is a seek required to populate total hits?

The API for 0.15 is a little sneaky. Calling Searcher->search
doesn't actually run the matching/scoring. Calling Hits->seek does.

During matching/scoring, doc_num/score pairs are not accumulated in
an array or a hash, as most Perl programmers might suppose (including
this one, who did something like that in the original
Search::Kinosearch distribution). They are put into a priority
queue, which is much more efficient in terms of memory -- but also
discards any matches that fall off the end once its capacity is
exceeded.
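The behavior can be illustrated with a toy pure-Perl queue. To be clear, this is not KinoSearch's actual implementation (the real queue is a heap written in C and never re-sorts like this); the class name and approach here are mine, purely for illustration:

```perl
use strict;
use warnings;

# Toy stand-in for a bounded priority queue: once capacity is
# exceeded, the lowest-scoring matches are silently discarded.
package BoundedQueue;

sub new {
    my ( $class, $capacity ) = @_;
    return bless { capacity => $capacity, items => [] }, $class;
}

sub insert {
    my ( $self, $doc_num, $score ) = @_;
    push @{ $self->{items} }, [ $doc_num, $score ];
    # Keep descending by score, then chop anything past capacity.
    @{ $self->{items} } =
        sort { $b->[1] <=> $a->[1] } @{ $self->{items} };
    splice @{ $self->{items} }, $self->{capacity}
        if @{ $self->{items} } > $self->{capacity};
}

sub doc_nums {
    my $self = shift;
    return map { $_->[0] } @{ $self->{items} };
}
```

With capacity 2, inserting docs scored 0.5, 0.9, and 0.7 leaves only the two highest scorers -- the 0.5 match is gone for good, which is why the final hit count has to be tallied during the scoring pass itself.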

Because Searcher->search in 0.15 doesn't know how many documents
you're going to need, it can't know how big the priority queue needs
to be. So KS waits until Hits->seek, when that number can be derived
by adding "offset" and "num_wanted".

KS has to complete the matching/scoring process before it can know
the value that should be returned by Hits->total_hits. In KS version
0.05, total_hits() actually threw an error if you hadn't called
seek() first. In his perl.com review, though, chromatic panned this
behavior as non-intuitive, so I added the internal seek().

In 0.20, things have changed. "offset" and "num_wanted" have been
added to the Searcher->search API so that it can actually run the
search, which is what I think most people would expect.

Also, Hits->seek now reruns the search only if the size of the
priority queue would exceed that of previous runs. So if you call
seek(0, 100) and then seek(0, 10), the search doesn't get rerun -- but
if you call seek(0, 10) and then seek(0, 20) or seek(10, 10), it does.
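The rerun rule amounts to a one-line comparison, sketched here (the function name and the "largest so far" bookkeeping are illustrative; the real logic lives inside KS's Hits class):

```perl
use strict;
use warnings;

# Rerun heuristic: the queue must hold offset + num_wanted docs, so a
# rerun is needed only when that sum exceeds the capacity of every
# previous run.
sub must_rerun_search {
    my ( $offset, $num_wanted, $largest_so_far ) = @_;
    return ( $offset + $num_wanted > $largest_so_far ) ? 1 : 0;
}
```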

The absence of "offset" and "num_wanted" from the Searcher->search
API and the activation of actual matching/scoring by the Hits object
in 0.15 and earlier are traits inherited from Lucene. People don't
much like the behavior of Lucene's Hits class either, I've come to learn.

A number of the changes in 0.20 are the product of insights gleaned
after completing a working Lucene port. When I was originally
porting some of these classes, I didn't fully grok why Lucene did
things a certain way, even though I'd written an entire search engine
library myself earlier. Now I have a better understanding, and it's
possible to discard some of the cargo-cult programming.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
getting total hits before a seek [ In reply to ]
> Using v0.15 (still)
>
> I have a pretty healthy document collection (around 15 million) that gets
> moderate traffic (260k searches a day) and have been working on
> improving performance as searches have crept into the >1s range.

You can also try out Marvin's distributed multi-index search facility (an
extension to 0.15 I think - Marv can provide more details). For an index
of your size (~61% of ours) splitting the index into smaller chunks over
several search nodes might be a good idea.

However, Marvin *did* say that 0.20.x might not require this... :-)
Unless, of course, the search load requires it - 260,000 searches a day is
quite a bit. Spread the load, as we intend to.
getting total hits before a seek [ In reply to ]
On Tuesday, Mar 13, 2007, Henka wrote:
>
>
> > Using v0.15 (still)
> >
> > I have a pretty healthy document collection (around 15 million) that gets
> > moderate traffic (260k searches a day) and have been working on
> > improving performance as searches have crept into the >1s range.
>
> You can also try out Marvin's distributed multi-index search facility (an
> extension to 0.15 I think - Marv can provide more details). For an index
> of your size (~61% of ours) splitting the index into smaller chunks over
> several search nodes might be a good idea.

I was planning on experimenting with this service. Do you have any
hints on the best way to split up an index? Or can I just make some
arbitrary divisions (5M documents per server in a three server setup,
for example)? Finally, can the search server handle multiple
simultaneous connections?

> However, Marvin *did* say that 0.20.x might not require this... :-)
> Unless, of course, the search load requires it - 260,000 searches a day is
> quite a bit. Spread the load, as we intend to.

Our search server is housed on a Dell 1850 with quad Xeon processors and
8GB of RAM. I use Net::Server::PreForkSimple to spawn off 20 searchers,
thus caching a searcher on each child. Each child is responsible for
cleaning up after itself and will respawn if its memory usage grows too
large.

The search index itself is pretty stripped down at 4.5GB, as the text
being searched is neither stored nor vectorized. This allows the OS to
cache the index in memory with plenty left over for the search children
to use.
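A skeleton of that preforking setup might look like the following. This is a hedged sketch, not our production code: the package name, port, index path, and per-request protocol are all placeholders, and the real server does far more per request:

```perl
# Hypothetical skeleton of a preforking search daemon. Assumes
# Net::Server::PreForkSimple and KinoSearch 0.15 are installed.
package SearchDaemon;
use strict;
use warnings;
use base 'Net::Server::PreForkSimple';
use KinoSearch::Searcher;
use KinoSearch::Analysis::PolyAnalyzer;

my $searcher;

# Runs once in each forked child: open (and thereby cache) a Searcher.
sub child_init_hook {
    $searcher = KinoSearch::Searcher->new(
        invindex => '/path/to/invindex',    # placeholder path
        analyzer => KinoSearch::Analysis::PolyAnalyzer->new(
            language => 'en',
        ),
    );
}

# One request per connection: read a query line, answer with a count.
sub process_request {
    chomp( my $query = <STDIN> );
    my $hits = $searcher->search( query => $query );
    $hits->seek( 0, 10 );
    print $hits->total_hits, "\n";
}

SearchDaemon->run( port => 9000, max_servers => 20 );
```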

Our index definition:

id => {
    indexed    => 1,
    analyzed   => 0,
    stored     => 1,
    vectorized => 0,
},
bodytext => {
    indexed    => 1,
    analyzed   => 1,
    stored     => 0,
    vectorized => 0,
},
type => {
    indexed    => 1,
    analyzed   => 0,
    stored     => 0,
    vectorized => 0,
},


On average, we're seeing 0.9s searches with a fair number exceeding 2s
... this is down significantly from the average 4s when I stupidly
looked for total hits before seeking out my ten hits.

I'll let the list know how MultiSearch works out for us.

getting total hits before a seek [ In reply to ]
> on Tuesday, Mar 13, 2007, Henka, wrote:
>>
>>
>> > Using v0.15 (still)
>> >
>> > I have a pretty healthy document collection (around 15 million) that
>> gets
>> > moderate traffic (260k searches a day) and have been working on
>> > improving performance as searches have crept into the >1s range.
>>
>> You can also try out Marvin's distributed multi-index search facility
>> (an
>> extension to 0.15 I think - Marv can provide more details). For an
>> index
>> of your size (~61% of ours) splitting the index into smaller chunks over
>> several search nodes might be a good idea.
>
> I was planning on experimenting with this service. Do you have any
> hints on the best way to split up an index? Or can I just make some
> arbitrary divisions (5M documents per server in a three server setup,
> for example)? Finally, can the search server handle multiple
> simultaneous connections?

We just use a round-robin approach (3 sub-indexes). The target index is
chosen at crawl time, making life simple.
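That crawl-time choice can be as simple as a modulus over a running document count (a minimal sketch; the function name is mine, not code from either of our crawlers):

```perl
use strict;
use warnings;

# Round-robin assignment of documents to sub-indexes at crawl time.
sub target_index {
    my ( $doc_count, $num_indexes ) = @_;
    return $doc_count % $num_indexes;
}
```

Documents 0, 3, 6, ... land in sub-index 0, and so on; each search node then serves its own chunk.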

>> However, Marvin *did* say that 0.20.x might not require this... :-)
>> Unless, of course, the search load requires it - 260,000 searches a day
>> is
>> quite a bit. Spread the load, as we intend to.
>
> Our search server is housed on a Dell 1850 with Quad Xeon processors and
> 8G of ram. I use Net::Server::PreForkSimple to spawn off 20 searchers,
> thus caching a searcher on each child. The child is responsible for
> cleaning up after itself and will respawn if its memory usage grows too
> large.
>
> The search index itself is pretty stripped down at 4.5G, as the text
> being searched is not stored nor vectorized. This allows the OS to
> cache the index in memory with plenty left over for search children to
> use.
>
> Our index definition:

hmm - our approach is a bit different, with LOTS of special fields and
indexing on all of them, so the indexes are huge (several hundred GB and
growing).

You should be OK with the single machine considering your index size -
even better if you migrate to ks 0.20.x when it's a bit more stable.

Cheers
h