Mailing List Archive

serializing safely
I'm going to be using KS in a persistent environment (a fastcgi). The
Searcher/IndexReader docs recommend caching the searcher for better
performance.

Because my fastcgi (as so many others) has multiple children, my first instinct
is to store the searcher in the session so that any of the children can get at
it as needed. I know that a bunch of the KS guts are XS or C, though; what
things can be safely Storabled and put into a database, and what things will
either blow up or silently not work?

hdp.
serializing safely [ In reply to ]
On Wed, Jun 13, 2007 at 11:50:32AM -0400, Hans Dieter Pearcey wrote:
> Because my fastcgi (as so many others) has multiple children, my first instinct
> is to store the searcher in the session so that any of the children can get at
> it as needed. I know that a bunch of the KS guts are XS or C, though; what
> things can be safely Storabled and put into a database, and what things will
> either blow up or silently not work?

I answered at least part of my own question, namely that Searcher can't be
stored. How do people usually handle this sort of thing? My first thought is
to write something kind of like SearchServer and do simple RPC to it from my
application.

hdp.
serializing safely [ In reply to ]
On Jun 13, 2007, at 11:57 AM, Hans Dieter Pearcey wrote:

> On Wed, Jun 13, 2007 at 11:50:32AM -0400, Hans Dieter Pearcey wrote:
>> Because my fastcgi (as so many others) has multiple children, my first
>> instinct is to store the searcher in the session so that any of the
>> children can get at it as needed. I know that a bunch of the KS guts
>> are XS or C, though; what things can be safely Storabled and put into
>> a database, and what things will either blow up or silently not work?
>
> I answered at least part of my own question, namely that Searcher
> can't be stored. How do people usually handle this sort of thing? My
> first thought is to write something kind of like SearchServer and do
> simple RPC to it from my application.

This is the usual way to cache a Searcher with FastCGI:

use CGI::Fast;
use KinoSearch::Searcher;

# Load the searcher once, outside the request loop, so the warm-up
# cost is paid only at process startup.
my $searcher = KinoSearch::Searcher->new(
    invindex => Schema->open('/path/to/invindex'),
);

# Each request then reuses the cached $searcher.
while ( my $cgi = CGI::Fast->new ) {
    process_search();
}

If that doesn't work for you, can you please illustrate how your
app differs?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
serializing safely [ In reply to ]
On Wed, Jun 13, 2007 at 09:48:49PM -0700, Marvin Humphrey wrote:
> # load searcher once, outside loop
> my $searcher = KinoSearch::Searcher->new(
> invindex => Schema->open('/path/to/invindex'),
> );
>
> while ( my $cgi = CGI::Fast->new ) {
> process_search();
> }
>
> If that doesn't work for you, can you please illustrate how your
> app differs?

I had been thinking of putting it into an Apache::Session.

Will your suggestion survive a fork usefully? I don't know what's in
Searcher's guts.

My app primarily differs in that I was planning on having many invindexes, two
or three per user, so opening them all at program start would probably be
inefficient (there are several hundred of them).

hdp.
serializing safely [ In reply to ]
On Jun 14, 2007, at 4:50 AM, Hans Dieter Pearcey wrote:

> I had been thinking of putting it into an Apache::Session.

Serializing a Searcher so that the state of the *Searcher* *object*
can be preserved between requests? That wouldn't aid performance,
even if Searcher could be serialized.

What you're describing is analogous to serializing a filehandle --
you can write code to do it, but you probably don't want to. You can
record the filehandle's file position. In theory you can even
serialize the bytes held in the filehandle's read buffer, though
that's a bizarre thing to do, since the buffer is just an in-memory
cache that spares you from having to access the disk with every read op.

But what will you get when you deserialize that filehandle? Does the
file even exist anymore? Is the data from the old read buffer still
valid? Is the file the same length? Why would you ever do something
like serialize and restore a filehandle, rather than just open the
file again?

I think you may have been misled by the phrase, "caching a Searcher",
which appears in the KS documentation. The point is to cache a
Searcher *in* *RAM*, so that you don't pay the startup costs of
reading a bunch of data off disk and into RAM over and over with each
new search.

> Will your suggestion survive a fork usefully?

You won't get memory errors, but you can't use the Searcher in both
processes. Searchers keep several filehandles open. If both parent
and child attempt to read from the shared file descriptors after the
fork, they'll interfere with each other.

Because Searchers have a large RAM footprint due to all the caching,
yet you can't use duped Searchers because of IO sync issues, you
probably want to avoid creating them in parent processes.
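The duped-descriptor problem here isn't KinoSearch-specific; it's plain POSIX fork() behavior. This sketch (in Python, using raw os calls, just to illustrate the OS-level mechanics) opens a file before forking and shows that parent and child share a single file offset:

```python
import os

# Write a small data file to read back.
with open("demo.txt", "w") as f:
    f.write("abcdefgh")

# Open *before* forking: parent and child then share one file offset.
fd = os.open("demo.txt", os.O_RDONLY)

pid = os.fork()
if pid == 0:
    os.read(fd, 4)      # child consumes bytes 0-3, moving the shared offset
    os._exit(0)

os.waitpid(pid, 0)       # wait so the two reads don't race
parent_bytes = os.read(fd, 4)
print(parent_bytes)      # b'efgh' -- the parent lost the first half
```

A Searcher holds several such descriptors, so two processes reading through one Searcher would interleave in exactly this way. Opening the Searcher after the fork, once per child, avoids the shared offset entirely.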

> My app primarily differs in that I was planning on having many
> invindexes, two or three per user, so opening them all at program
> start would probably be inefficient (there are several hundred of
> them).

OK. With that architecture, you'll need to factor in the time it
takes to begin reading from any one of those invindexes.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
serializing safely [ In reply to ]
On Thu, Jun 14, 2007 at 07:09:12AM -0700, Marvin Humphrey wrote:
>
> On Jun 14, 2007, at 4:50 AM, Hans Dieter Pearcey wrote:
>
> > I had been thinking of putting it into an Apache::Session.
>
> Serializing a Searcher so that the state of the *Searcher* *object*
> can be preserved between requests? That wouldn't aid performance,
> even if Searcher could be serialized.
>
> What you're describing is analogous to serializing a filehandle --

That's more or less what I figured.

> > My app primarily differs in that I was planning on having many
> > invindexes, two or three per user, so opening them all at program
> > start would probably be inefficient (there are several hundred of
> > them).
>
> OK. With that architecture, you'll need to factor in the time it
> takes to begin reading from any one of those invindexes.

It may be a stupid architecture; I'm not really very experienced with
invindexes. I want to index about 250G of email, which seems like a lot to me,
so I'm assuming that partitions will be useful (since each user only searches
their own email). Am I prematurely optimizing?

hdp.
serializing safely [ In reply to ]
On 6/14/07, Hans Dieter Pearcey <hdp@pobox.com> wrote:
> On Thu, Jun 14, 2007 at 07:09:12AM -0700, Marvin Humphrey wrote:
> > > My app primarily differs in that I was planning on having many
> > > invindexes, two or three per user, so opening them all at program
> > > start would probably be inefficient (there are several hundred of
> > > them).
> >
> > OK. With that architecture, you'll need to factor in the time it
> > takes to begin reading from any one of those invindexes.
>
> It may be a stupid architecture; I'm not really very experienced with
> invindexes. I want to index about 250G of email, which seems like a lot to me,
> so I'm assuming that partitions will be useful (since each user only searches
> their own email). Am I prematurely optimizing?

Hi Hans ---

I've been thinking about some similar architectural issues, and while
I don't have any experience with corpus sizes as large as the one
you're dealing with, I thought I'd jump in.

First, your architecture sounds reasonable to me: if searches are
never going to cross indexes, keeping them separate for each user
seems like a reasonable idea. Yes, the initialization costs of each
Searcher object will be expensive, but I think the smaller size of
each index is going to offset this. Starting with this architecture
strikes me as good forethought, and not premature.

Worrying about caching hot Searcher objects for those indexes does
strike me as premature, or possibly misguided. The thing that takes
the most time (I'm guessing) is reading the index from disk, so
caching the object to disk isn't going to help you a lot. To get a
real advantage, you are going to need it hanging around in RAM, and
given the size of your corpus this is going to require finesse.

Presuming you are running Linux, most extra RAM on the system will be
used to cache recently read files so that they can be read from
relatively fast memory rather than waiting for the relatively very
slow disk. The more you cache big objects, the less space is
available for the system to cache files. It's a trade: if you know
you are going to reuse the object, it's a win, but if you don't, you
are probably better off letting the system do its thing. I'd wait and
measure.

If disk IO does turn out to be a bottleneck (and it will with heavy
enough usage) the easiest solution may be to partition the search off
to separate machines, each handling only a subset of your users.
Rather than thinking about caching Searcher objects within the
FastCGI, you could prepare for this eventuality by running your search
in an external server process, either on the same machine or another.
This process could then cache Searchers for the indexes of the most
recent users and use the appropriate one for the search.
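The usual shape for that "most recent users" cache is a small LRU map keyed by user. A sketch (in Python for brevity; the daemon itself would be Perl, and `open_searcher()` here is a hypothetical stand-in for the real, expensive Searcher construction):

```python
from collections import OrderedDict

def open_searcher(user_id):
    # Stand-in for the real (expensive) Searcher construction,
    # i.e. KinoSearch::Searcher->new(...) in the Perl daemon.
    return f"searcher-for-{user_id}"

class SearcherCache:
    """Keep warm searchers for the N most recently active users."""

    def __init__(self, max_size=8):
        self.max_size = max_size
        self._cache = OrderedDict()

    def get(self, user_id):
        if user_id in self._cache:
            self._cache.move_to_end(user_id)     # mark as most recent
        else:
            if len(self._cache) >= self.max_size:
                self._cache.popitem(last=False)  # evict least recent
            self._cache[user_id] = open_searcher(user_id)
        return self._cache[user_id]

cache = SearcherCache(max_size=2)
cache.get("alice")
cache.get("bob")
cache.get("alice")            # refreshes alice
cache.get("carol")            # evicts bob, the least recently used
print("bob" in cache._cache)  # False -- bob's searcher was dropped
```

Eviction closes the door on stale Searchers automatically; the `max_size` knob is what trades RAM for warm-up latency.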

Alternatively, you could cache a small number of Searcher objects in
each FastCGI process, and then come up with a way of preferentially
directing users to the same process they used on the previous request.
Historically, there have been some affinity patches for mod_fastcgi
that did this, but I don't know if they have been updated. But in
general, I don't think there is going to be any good way for multiple
processes or threads to share a single Searcher object.

I'd start by sticking with the separate indexes, skipping the caching,
and seeing how it goes.


Hope this helps,

Nathan Kurz
nate@verse.com
serializing safely [ In reply to ]
On Thu, Jun 14, 2007 at 06:42:28PM -0600, Nathan Kurz wrote:
> But in general, I don't think there is going to be any good way for multiple
> processes or threads to share a single Searcher object.

Can Searchers be treated analogously to file handles, i.e. shared between
processes (opened in a parent, shared between children) as long as only one
process uses it at a time, or is there per-process state that will get screwed
up?

This doesn't really help with the question at hand, since I don't plan on
preloading the searchers, but it's an interesting thing to keep in mind.

hdp.
serializing safely [ In reply to ]
On Jun 14, 2007, at 6:07 PM, Hans Dieter Pearcey wrote:

> Can Searchers be treated analogously to file handles, i.e. shared
> between processes (opened in a parent, shared between children) as
> long as only one process uses it at a time, or is there per-process
> state that will get screwed up?

I believe that will work. In general, I wouldn't recommend doing
things that way with large indexes because you'll end up wasting a
lot of RAM... but that may not matter here.

FWIW, the optional read-locking mechanism needed for use with NFS
will break -- since it uses lock files that remember their pids --
but it's off by default.

> This doesn't really help with the question at hand, since I don't
> plan on preloading the searchers, but it's an interesting thing to
> keep in mind.

I don't think you should rule out pre-loading. KinoSearch is heavily
optimized for the use case of running many queries against a single
view of an index.

The costs for warming a Searcher vary linearly with index size,
and get significantly higher if you perform sorting or range
operations. To put things in perspective, for very large indexes
(larger than you're likely to see for any one individual's email), it
can conceivably take several seconds to warm up a Searcher, then a
fraction of a second to process the query.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
serializing safely [ In reply to ]
On Jun 14, 2007, at 5:42 PM, Nathan Kurz wrote:

> First, your architecture sounds reasonable to me: if searches are
> never going to cross indexes, keeping them separate for each user
> seems like a reasonable idea.

I fully agree.

You want to avoid processing hits that you know can't match.
Definitely, break up the indexes if you know you will never have to
multiplex search results across them.

Search costs are dominated by the time that it takes to process the
matches for common terms. If you're looking for 'orpheus', that's
probably cheap; '+black +orpheus' will be more expensive in
comparison, assuming that 'black' is a more common term in the
corpus. Even though the intersection of the set that matches 'black'
and the set that matches 'orpheus' is small, you still have to
iterate over *all* the matches for both terms.
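A toy model of that cost (illustrative Python; real posting-list formats are far more compact, but the access pattern is the same): intersecting sorted doc-id lists for a rare term and a common term still walks every posting of the common term up to the last match, even when the final hit set is tiny.

```python
def intersect(postings_a, postings_b):
    """Merge-intersect two sorted posting lists (lists of doc ids),
    counting how many merge steps are taken along the way."""
    hits, steps = [], 0
    i = j = 0
    while i < len(postings_a) and j < len(postings_b):
        steps += 1
        if postings_a[i] == postings_b[j]:
            hits.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1
        else:
            j += 1
    return hits, steps

# 'black' is common (thousands of postings); 'orpheus' is rare.
black   = list(range(0, 10000, 2))   # 5000 docs
orpheus = [6, 4242, 9000]            # 3 docs

hits, steps = intersect(black, orpheus)
print(len(hits), steps)  # a handful of hits, but thousands of steps
```

The hit set has three documents, yet the merge examines thousands of postings from the common term; that asymmetry is why '+black +orpheus' costs roughly as much as 'black' alone.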

OTOH, if you knew you had to multiplex results from time to time,
searching several indexes is more expensive, particularly in terms of
disk i/o. In a single index, all the information about any given
term will be relatively concentrated. With multiple indexes, the
information is more scattered, so the disk has to seek a lot more.

> the easiest solution may be to partition the search off
> to separate machines, each handling only a subset of your users.
> Rather than thinking about caching Searcher objects within the
> FastCGI, you could prepare for this eventuality by running your search
> in an external server process, either on the same machine or another.
> This process could then cache Searchers for the indexes of the most
> recent users and use the appropriate one for the search.

This is a good plan.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
serializing safely [ In reply to ]
On Thu, Jun 14, 2007 at 07:32:36PM -0700, Marvin Humphrey wrote:
> I don't think you should rule out pre-loading. KinoSearch is heavily
> optimized for the use case of running many queries against a single
> view of an index.
>
> The costs for warming a Searcher vary linearly with index size,
> and get significantly higher if you perform sorting or range
> operations. To put things in perspective, for very large indexes
> (larger than you're likely to see for any one individual's email), it
> can conceivably take several seconds to warm up a Searcher, then a
> fraction of a second to process the query.

That's good to know, since this is for mailing lists (and thus much larger than
an average user's email) and I plan to both sort and use RangeFilter(s).

It may still be worthwhile to preload, depending on RAM usage for
searchers and actual startup times. I'll have to test with real data to
find out, though.

hdp.