Mailing List Archive

sort times
Is there any way to speed up sorting for searches? Here is an output of
a search on a 4.1 GB index with -d:DProf. I need sorted searches to be
much faster. Any suggestions?
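
(For reference, the profile below comes from a run along these lines;
the script name is just a placeholder:)

    perl -d:DProf search_script.pl   # Devel::DProf writes tmon.out
    dprofpp                          # formats the tmon.out report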

[root@localhost SearchProject]# dprofpp
Total Elapsed Time = 11.16279 Seconds
User+System Time = 7.532793 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c Name
 72.2   5.440  5.440      1   5.4400 5.4400 KinoSearch::Index::Lexicon::build_sort_cache
 4.90   0.369  0.369    155   0.0024 0.0024 KinoSearch::Index::DocVector::_extract_tv_cache
 4.30   0.324  0.914    155   0.0021 0.0059 KinoSearch::Highlight::Highlighter::_gen_excerpt
 3.11   0.234  0.234   4673   0.0001 0.0001 KinoSearch::Store::InStream::lu_read
 1.85   0.139  0.491     25   0.0056 0.0196 main::BEGIN
 1.51   0.114  0.114   1041   0.0001 0.0001 KinoSearch::Index::DocVector::add_field_string
 1.02   0.077  0.218     96   0.0008 0.0023 base::import
 0.93   0.070  0.070      5   0.0140 0.0140 utf8::SWASHNEW
 0.85   0.064  1.452    156   0.0004 0.0093 KinoSearch::Search::Hits::fetch_hit_hashref
 0.80   0.060  0.060      1   0.0600 0.0600 KinoSearch::Index::LexReader::new
 0.78   0.059  0.059   2001   0.0000 0.0000 KinoSearch::Util::Obj::DESTROY
 0.77   0.058  0.196    155   0.0004 0.0013 KinoSearch::Index::DocReader::fetch_doc
 0.76   0.057  0.460    155   0.0004 0.0030 KinoSearch::Index::DocVector::term_vector
 0.69   0.052  0.143    349   0.0001 0.0004 KinoSearch::Util::Class::new
 0.56   0.042  0.529    155   0.0003 0.0034 KinoSearch::Highlight::Highlighter::_starts_and_ends
sort times [ In reply to ]
On May 22, 2007, at 8:44 PM, Roger Dooley wrote:

> Is there any way to speed up sorting for searches? Here is an
> output of a search on a 4.1 GB index with -d:DProf. I need sorted
> searches to be much faster. Any suggestions?

Cache and reuse your Searcher/IndexReader.
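
Something like this -- an untested sketch with placeholder paths and a
stock analyzer; the point is to construct the Searcher once per process
(e.g. under mod_perl or in a daemon) rather than once per query:

    use KinoSearch::Searcher;
    use KinoSearch::Analysis::PolyAnalyzer;

    my $searcher;    # lives for the lifetime of the process

    sub get_searcher {
        # build on first use, reuse on every subsequent search
        $searcher ||= KinoSearch::Searcher->new(
            invindex => '/path/to/invindex',
            analyzer =>
                KinoSearch::Analysis::PolyAnalyzer->new( language => 'en' ),
        );
        return $searcher;
    }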

> [root@localhost SearchProject]# dprofpp
> Total Elapsed Time = 11.16279 Seconds
> User+System Time = 7.532793 Seconds
> Exclusive Times
> %Time ExclSec CumulS #Calls sec/call Csec/c Name
> 72.2 5.440 5.440 1 5.4400 5.4400 KinoSearch::Index::Lexicon::build_sort_cache

Building the sort cache is a one-time cost if you reuse the
Searcher/Reader.

> 4.90 0.369 0.369 155 0.0024 0.0024 KinoSearch::Index::DocVector::_extract_tv_cache
> 4.30 0.324 0.914 155 0.0021 0.0059 KinoSearch::Highlight::Highlighter::_gen_excerpt
> 3.11 0.234 0.234 4673 0.0001 0.0001 KinoSearch::Store::InStream::lu_read

Those are all fetch- and highlight-related. How large are your
documents, on average? The numbers seem high. You might consider
storing only a portion of each document as a separate field.

> 1.85 0.139 0.491 25 0.0056 0.0196 main::BEGIN
> 1.51 0.114 0.114 1041 0.0001 0.0001 KinoSearch::Index::DocVector::add_field_string
> 1.02 0.077 0.218 96 0.0008 0.0023 base::import
> 0.93 0.070 0.070 5 0.0140 0.0140 utf8::SWASHNEW
> 0.85 0.064 1.452 156 0.0004 0.0093 KinoSearch::Search::Hits::fetch_hit_hashref
> 0.80 0.060 0.060 1 0.0600 0.0600 KinoSearch::Index::LexReader::new
> 0.78 0.059 0.059 2001 0.0000 0.0000 KinoSearch::Util::Obj::DESTROY
> 0.77 0.058 0.196 155 0.0004 0.0013 KinoSearch::Index::DocReader::fetch_doc
> 0.76 0.057 0.460 155 0.0004 0.0030 KinoSearch::Index::DocVector::term_vector
> 0.69 0.052 0.143 349 0.0001 0.0004 KinoSearch::Util::Class::new
> 0.56 0.042 0.529 155 0.0003 0.0034 KinoSearch::Highlight::Highlighter::_starts_and_ends

Interestingly, the time to score the sorted search doesn't even make
the cut.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
sort times [ In reply to ]
Marvin Humphrey (5/23/2007 12:07 AM) wrote:
>
> On May 22, 2007, at 8:44 PM, Roger Dooley wrote:
>
>> Is there any way to speed up sorting for searches? Here is an output
>> of a search on a 4.1 GB index with -d:DProf. I need sorted searches
>> to be much faster. Any suggestions?
>
> Cache and reuse your Searcher/IndexReader.
>
>> [root@localhost SearchProject]# dprofpp
>> Total Elapsed Time = 11.16279 Seconds
>> User+System Time = 7.532793 Seconds
>> Exclusive Times
>> %Time ExclSec CumulS #Calls sec/call Csec/c Name
>> 72.2 5.440 5.440 1 5.4400 5.4400 KinoSearch::Index::Lexicon::build_sort_cache
>
> Building the sort cache is a one-time cost if you reuse the
> Searcher/Reader.
>

Is that cost incurred per new search? New searches need to come back
quickly as well.

>> 4.90 0.369 0.369 155 0.0024 0.0024 KinoSearch::Index::DocVector::_extract_tv_cache
>> 4.30 0.324 0.914 155 0.0021 0.0059 KinoSearch::Highlight::Highlighter::_gen_excerpt
>> 3.11 0.234 0.234 4673 0.0001 0.0001 KinoSearch::Store::InStream::lu_read
>
> Those are all fetch- and highlight-related. How large are your
> documents, on average? The numbers seem high. You might consider
> storing only a portion of each document as a separate field.
>

The documents are probably around 1k. We have about 1.5 million of them.


>> 1.85 0.139 0.491 25 0.0056 0.0196 main::BEGIN
>> 1.51 0.114 0.114 1041 0.0001 0.0001 KinoSearch::Index::DocVector::add_field_string
>> 1.02 0.077 0.218 96 0.0008 0.0023 base::import
>> 0.93 0.070 0.070 5 0.0140 0.0140 utf8::SWASHNEW
>> 0.85 0.064 1.452 156 0.0004 0.0093 KinoSearch::Search::Hits::fetch_hit_hashref
>> 0.80 0.060 0.060 1 0.0600 0.0600 KinoSearch::Index::LexReader::new
>> 0.78 0.059 0.059 2001 0.0000 0.0000 KinoSearch::Util::Obj::DESTROY
>> 0.77 0.058 0.196 155 0.0004 0.0013 KinoSearch::Index::DocReader::fetch_doc
>> 0.76 0.057 0.460 155 0.0004 0.0030 KinoSearch::Index::DocVector::term_vector
>> 0.69 0.052 0.143 349 0.0001 0.0004 KinoSearch::Util::Class::new
>> 0.56 0.042 0.529 155 0.0003 0.0034 KinoSearch::Highlight::Highlighter::_starts_and_ends
>
> Interestingly, the time to score the sorted search doesn't even make the
> cut.
>
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/
sort times [ In reply to ]
On May 23, 2007, at 3:46 AM, Roger Dooley wrote:

>> Building the sort cache is a one-time cost if you reuse the
>> Searcher/Reader.
>
> Is that per new search? Any new search needs to come back quickly
> as well.

The first search will always be sluggish by comparison. If your
application logic allows it, you can warm up a Searcher with a dummy
query.
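
An untested sketch, building on a cached get_searcher() helper (a
hypothetical wrapper that returns the process-wide Searcher):

    # At startup: run one throwaway search -- passing the same sort
    # options your real searches use -- so build_sort_cache runs
    # before the first user-facing request arrives.
    my $hits = get_searcher()->search( query => 'warmup' );
    $hits->fetch_hit_hashref;    # touch the hits to force the reads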

KS makes the engineering tradeoff of requiring significant up-front
caching to reduce the cost of later searches. It's another way that
its behavior differs from that of the relational database systems
many people are familiar with.

>>> 4.90 0.369 0.369 155 0.0024 0.0024 KinoSearch::Index::DocVector::_extract_tv_cache
>>> 4.30 0.324 0.914 155 0.0021 0.0059 KinoSearch::Highlight::Highlighter::_gen_excerpt
>>> 3.11 0.234 0.234 4673 0.0001 0.0001 KinoSearch::Store::InStream::lu_read
>> Those are all fetch- and highlight-related. How large are your
>> documents, on average? The numbers seem high. You might consider
>> storing only a portion of each document as a separate field.
>
> The documents are probably around 1k. We have about 1.5 million of
> them.

OK, that's not so big. Interesting -- the time to retrieve and
highlight dominates this particular search. It's hard to say whether
this would hold true over most searches, though. We're still in
need of search-time benchmarking.

Theoretically, your one-off benchmark numbers bode well for
scalability. The cost of fetching/highlighting is roughly
proportional to the number of documents retrieved per search rather
than the size of the index, once you allow for slightly more hard
disk seek time when retrieving docs and doc vectors that are more
spaced out. The true limiting factor for most people,
scalability-wise, is the time it takes to score hits, and that's not
even registering here. However, if this were a simple search for one
comparatively rare term, that would mislead us; it's anecdotal
evidence, and we can't draw firm conclusions from it.

In apps where the search *has* to be performed cold, though,
scalability may be limited by cache-loading time. The more fields
you enable sorting against, the higher this cost.

Looking to the future... if you're sorting by date, the addition of
an epoch fixed-length field type to KS could cut down load-time, as
it would be less costly to unpack than a text field. However, I
don't expect to get to that task soon.
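
(In the meantime, a common workaround is to index dates as fixed-width
text, so that lexical order matches chronological order -- roughly:)

    use POSIX qw( strftime );
    # $epoch is the document's timestamp in epoch seconds;
    # "20070523120700" sorts correctly as an ordinary text field
    my $sortable_date = strftime( '%Y%m%d%H%M%S', gmtime($epoch) );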

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/