Mailing List Archive

KinoSearch::Docs::Cookbook::ReusingSearchers
Greets,

I have a draft version of a ReusingSearchers Cookbook entry up on the
wiki:

http://www.rectangular.com/kinosearch/wiki/ReusingSearchers

It includes recipes for both FastCGI and mod_perl.

This is part of a planned expansion of KinoSearch's introductory
documentation, including a multi-chapter Tutorial.

I've also enabled anonymous editing on the wiki. It's been active
for a day with no spam; we'll see how long that lasts before I have
to put the TypeKey authentication back in place.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
Re: KinoSearch::Docs::Cookbook::ReusingSearchers
> It includes recipes for both FastCGI and mod_perl.
>
> This is part of a planned expansion of KinoSearch's introductory
> documentation, including a multi-chapter Tutorial.

Excellent. This will reduce the learning curve somewhat. It's so friggin
refreshing to see an open source project so well documented.

Two queries, if I may:

1) WRT
http://www.rectangular.com/kinosearch/docs/devel/KinoSearch/Search/MultiSearcher.html
(ie, SortSpec in MultiSearcher); any idea when you'll get around to
filling that little gnawing cavity?

2) At http://www.rectangular.com/kinosearch/wiki/ScalingUp you discuss
possible cluster arrangements, etc. I'm thinking of the best way to apply
this idea without buying more hardware (for now). In our clustering
solution, there is no "master" node; can you foresee any problem with
having *all* nodes act as the "master" (with the search load being spread
around and the results being aggregated on whichever node the client hit)?

This topology is used to spread the HTTP (and DB) load around the cluster
allowing for higher loads and easy scaling (and HA). Having a single
search master breaks this somewhat. We'll migrate searching onto a
separate cluster later, but for now I need to make things fit.
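To make the topology concrete, here's a rough sketch of the fan-out/aggregate flow I have in mind (illustrative Python only, not the KinoSearch API; the shard data, node names, and scores are invented):

```python
# Illustrative sketch: every node can act as "master" for the request
# it receives -- fan the query out to all shards, merge the results
# locally.  The shard contents below are stand-ins for real sub-indexes.
import heapq

# Each shard holds (score, doc_id) pairs for its slice of the index.
SHARDS = {
    "node-a": [(0.9, 101), (0.4, 102)],
    "node-b": [(0.8, 201), (0.7, 202)],
    "node-c": [(0.6, 301)],
}

def search_shard(node, query, k):
    """Stand-in for a remote search call against one node's sub-index."""
    return sorted(SHARDS[node], reverse=True)[:k]

def handle_request(query, k=3):
    """Any node can run this: scatter to all shards, gather, merge."""
    hits = []
    for node in SHARDS:
        hits.extend(search_shard(node, query, k))
    # Aggregate on whichever node the client happened to hit.
    return heapq.nlargest(k, hits)

print(handle_request("foo"))  # top 3 hits across all shards
```

Since every node runs the same code, adding capacity is just adding another entry to the shard map.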

Thanks
Henry


Re: KinoSearch::Docs::Cookbook::ReusingSearchers
On Sep 13, 2007, at 12:12 AM, Henry wrote:

> Dunno if you missed the above post (or you're playing catchup): any
> comments on whether all search nodes can be "masters" for
> aggregating/sorting results?

Yeah, that will work fine.

You had also asked about the MultiSearcher sort. I'd been back-
burnering that one because I was hoping that a new approach would
present itself during the course of fixing other things. Well, I
believe that one has.

What we need to do is break up nodes by task. We'll have a
dedicated lexicon server, a dedicated document server, one or more
master search nodes, and multiple machines dedicated to the task of
crunching through posting lists. The lexicon server and the document
server will be abstractions behind which one or more machines will sit.

These tasks are already more or less broken up inside IndexReader, so
all we'll need to do is build the appropriate subclasses of
DocReader, LexReader, etc.

If you'll recall, the problem with the MultiSearcher sort has to do
with the overhead of loading large fields into memory to cut down on
disk seeks. This solves that problem by loading the whole lexicon
into one shared space for the whole search cluster.
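The gist of why a shared lexicon helps the sort might be sketched like this (illustrative Python only, not KinoSearch internals; the lexicon, shard results, and doc ids are all invented): once the whole cluster shares one lexicon, each doc's sort field collapses to an integer ordinal, and merging shard results no longer requires loading field values.

```python
# Illustrative sketch: a cluster-wide lexicon turns sort-field values
# into global ordinals, so a master merges shards by comparing ints
# instead of fetching big field values.  All data here is invented.
SHARED_LEXICON = sorted(["apple", "banana", "cherry", "date"])
ORD = {term: i for i, term in enumerate(SHARED_LEXICON)}

# Each shard reports (doc_id, ordinal-of-sort-field) pairs.
shard_results = [
    [(101, ORD["cherry"]), (102, ORD["apple"])],
    [(201, ORD["banana"]), (202, ORD["date"])],
]

def merge_by_ordinal(shards, k):
    """Merge shard hits by ordinal; integer compares, no disk seeks."""
    all_hits = [hit for shard in shards for hit in shard]
    all_hits.sort(key=lambda hit: hit[1])
    return [doc for doc, _ in all_hits[:k]]

print(merge_by_ordinal(shard_results, 3))
```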

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



Re: KinoSearch::Docs::Cookbook::ReusingSearchers
On 9/14/07, Marvin Humphrey <marvin@rectangular.com> wrote:
>
> On Sep 13, 2007, at 12:12 AM, Henry wrote:
>
> > Dunno if you missed the above post (or you're playing catchup): any
> > comments on whether all search nodes can be "masters" for
> > aggregating/sorting results?
>
> Yeah, that will work fine.

For what it's worth, that approach appeals to me as well. The
simplicity of having each node identical seems ideal so long as the
resources on each machine are reasonably utilized.

I think you mentioned it somewhere earlier, but how large is your
dataset, Henry? Are you using MultiSearcher because your index is too large
to fit on local disks or is it mid-size and you are trying to keep
everything in RAM?

> You had also asked about the MultiSearcher sort. I'd been back-
> burnering that one because I was hoping that a new approach would
> present itself during the course of fixing other things. Well, I
> believe that one has.
>
> What we need to do is break up nodes by task.

Ouch --- this runs counter to the simplicity I appreciate about the
masterless system Henry proposed. I agree that it would be pretty
easy to do programmatically, but it doesn't sound like much fun to
administer. I see this being of benefit only to really gigantic
loads/indexes with the hardware customized to the role of each node.
What are the cases you are thinking it would benefit?

> If you'll recall, the problem with the MultiSearcher sort has to do
> with the overhead of loading large fields into memory to cut down on
> disk seeks. This solves that problem by loading the whole lexicon
> into one shared space for the whole search cluster.

I think that coming up with a good way of returning the field value to
the requester is going to be a better final solution. The fear of
disk seeks seems like a red herring --- if a block is being read
often, it's going to be cached; if it's not read often, it doesn't matter.
And if for some reason we are thrashing the page cache and forcing a
re-read, let's figure out how to change that!

But perhaps I'm missing part of this equation?

Nathan Kurz
nate@verse.com

Re: KinoSearch::Docs::Cookbook::ReusingSearchers
> What we need to do is break up nodes by task. We'll have a
> dedicated lexicon server, a dedicated document server, one or more
> master search nodes, and multiple machines dedicated to the task of
> crunching through posting lists. The lexicon server and the document
> server will be abstractions behind which one or more machines will sit.

OK - this makes things a bit more involved from an admin perspective - and
also requires a lot more hardware (if I understand your proposed
architecture correctly).

Correct me if I'm wrong: instead of chopping up the index into smaller
and smaller sub-indexes (spread across multiple 'master' search nodes) to
improve performance and handle concurrent search load, you're proposing
separate physical machines with specialized roles/tasks (lexicon, docs,
search)?

By "multiple machines dedicated to the task of crunching through posting
lists" I presume you mean the indexing machines?

Forgive my ignorance; I'm just wondering whether the benefits of this
approach outweigh the elegance of many simple master nodes and the way it
dovetails with classic clustering. ie, if I need more performance, I just
add more master nodes and re-divide the index across all nodes; the same
principle applies when the corpus and index grow - just chop it up into
smaller and smaller pieces across more and more nodes.

Regards
Henry


Re: KinoSearch::Docs::Cookbook::ReusingSearchers
> For what it's worth, that approach appeals to me as well. The
> simplicity of having each node identical seems ideal so long as the
> resources on each machine are reasonably utilized.
>
> I think you mentioned somewhere earlier, but how large is your dataset
> Henry? Are you using MultiSearcher because your index is too large
> to fit on local disks or is it mid-size and you are trying to keep
> everything in RAM?

There are ~31M docs (growing hourly), with a total size of several
hundred gigabytes. So, in order to provide decent search
performance, the idea is to split the index across several machines in a
cluster. If performance is limp, simply add more nodes and re-divide. If
load becomes an issue (it will), apply the same brute force formula.

Nice and simple.
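The brute-force formula above can be sketched in a few lines (illustrative Python; the doc counts are made up): add a node, re-divide the same docs one way more.

```python
# Illustrative sketch of "add more nodes and re-divide": deal docs
# out round-robin across N sub-indexes; scaling is just picking a
# bigger N and re-dealing.  Doc ids here are invented.
def partition(doc_ids, num_nodes):
    """Assign each doc to a node round-robin."""
    shards = [[] for _ in range(num_nodes)]
    for i, doc in enumerate(doc_ids):
        shards[i % num_nodes].append(doc)
    return shards

docs = list(range(10))
three = partition(docs, 3)
# Adding a node just means re-dividing the same docs four ways.
four = partition(docs, 4)
print([len(s) for s in three], [len(s) for s in four])
```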

>> You had also asked about the MultiSearcher sort. I'd been back-
>> burnering that one because I was hoping that a new approach would
>> present itself during the course of fixing other things. Well, I
>> believe that one has.
>>
>> What we need to do is break up nodes by task.
>
> Ouch --- this runs counter to the simplicity I appreciate about the
> masterless system Henry proposed. I agree that it would be pretty
> easy to do programmatically, but it doesn't sound like much fun to
> administer. I see this being of benefit only to really gigantic
> loads/indexes with the hardware customized to the role of each node.
> What are the cases you are thinking it would benefit?
>
>> If you'll recall, the problem with the MultiSearcher sort has to do
>> with the overhead of loading large fields into memory to cut down on
>> disk seeks. This solves that problem by loading the whole lexicon
>> into one shared space for the whole search cluster.
>
> I think that coming up with a good way of returning the field value to
> the requester is going to be a better final solution. The fear of
> disk seeks seems like a red herring --- if a block is being read
> often, it's going to be cached; if it's not read often, it doesn't matter.
> And if for some reason we are thrashing the page cache and forcing a
> re-read, let's figure out how to change that!
>
> But perhaps I'm missing part of this equation?

My primary concern is customer-perceived search times. Ideally, it should
be sub-second, no matter what. Marvin's intimate with the details of why
this could be a problem.

I'd love to perform some search tests with a subset of our index - but
MultiSearcher doesn't have sorting yet (kinda critical in our project).
;-)

Regards
Henry


Re: KinoSearch::Docs::Cookbook::ReusingSearchers
On 9/14/07, Henry <henka@cityweb.co.za> wrote:
> > What we need to do is break up nodes by task. We'll have a
> > dedicated lexicon server, a dedicated document server, one or more
> > master search nodes, and multiple machines dedicated to the task of
> > crunching through posting lists. The lexicon server and the document
> > server will be abstractions behind which one or more machines will sit.
>
> OK - this makes things a bit more involved from an admin perspective - and
> also requires a lot more hardware (if I understand your proposed
> architecture correctly).
>
> Correct me if I'm wrong: instead of chopping up the index into smaller
> and smaller sub-indexes (spread across multiple 'master' search nodes) to
> improve performance and handle concurrent search load, you're proposing
> separate physical machines with specialized roles/tasks (lexicon, docs,
> search)?

His proposal doesn't require separate physical machines ---
everything is virtual. And you still split the index across machines
for performance just as you are doing. The difference is that at least
one machine has a full copy of the portion of the index that allows it
to do the sorting.

Instead of having to ask the search node for this information, it can
look it up locally, and because this is all it does it could keep this
information in memory. This would likely be the only physical machine
you would be adding.
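A minimal sketch of that local lookup (illustrative Python; the data and the `lookup()` helper are invented for the example, not part of KinoSearch):

```python
# Illustrative sketch only: the dedicated node keeps the whole
# doc-id -> sort-key mapping in RAM, so resolving sort keys during a
# merge is a dict lookup instead of a disk seek on a search node.
# All data here is invented.
SORT_KEYS = {101: "cherry", 102: "apple", 201: "banana"}

def lookup(doc_ids):
    """What a merging node would ask the in-memory lookup service for."""
    return {d: SORT_KEYS[d] for d in doc_ids}

hits = [101, 102, 201]             # doc ids gathered from the shards
keys = lookup(hits)
merged = sorted(hits, key=keys.get)
print(merged)                      # docs ordered by their sort key
```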

> By "multiple machines dedicated to the task of crunching through posting
> lists" I presume you mean the indexing machines?

I think he's referring to the core of the search process here, rather
than indexing: "chewing through raw posting lists and doing nothing
but spitting out document numbers and scores". The indexing would be
separate.

Marvin's right that this approach could be more efficient. I'm
worried that the complexity added by requiring one omniscient machine
is large, and think that coming up with an efficient way to return
field values would be of more general use. He may be right, though.

> There are ~31M docs (growing hourly), with a total size of several
> hundred gigabytes and growing.
> ...
> My primary concern is customer-perceived search times. Ideally, it should
> be sub-second, no matter what.

I haven't played with large data sets like this yet. How long does a
straight search take if you were to run it on a single machine with
the full dataset? Conversely, how large can you make the partitioned
index in the search farm while keeping response time where you want
it. I'm interested in what the overhead of the MultiSearcher approach
actually is.

Thanks!

Nathan Kurz
nate@verse.com

Re: KinoSearch::Docs::Cookbook::ReusingSearchers
> I haven't played with large data sets like this yet. How long does a
> straight search take if you were to run it on a single machine with
> the full dataset? Conversely, how large can you make the partitioned
> index in the search farm while keeping response time where you want
> it? I'm interested in what the overhead of the MultiSearcher approach
> actually is.

For the first question, I have no idea - a single node doesn't have enough
disk space for the full index anyway. For the latter question, I'll be
starting with an index size of ~100-150+GB per node. If that yields
search times (under load with hammerhead or similar) which are not
acceptable, then we'll keep adding nodes (and splitting the index) until
results are acceptable.

Busy re-indexing though, so it may be a while before I can do tests on the
full monty. Will perform tests on a small chunk soon and use that to
extrapolate.

Regards
Henry


Re: KinoSearch::Docs::Cookbook::ReusingSearchers
On Sep 14, 2007, at 1:39 AM, Nathan Kurz wrote:

> I see this being of benefit only to really gigantic
> loads/indexes with the hardware customized to the role of each node.

In fact, it was during a conversation with someone about just such a
situation that the divide-by-task idea was initially sketched out.

> What are the cases you are thinking it would benefit?

The sort cache problem is described here:

http://www.rectangular.com/pipermail/kinosearch/2007-June/000993.html

(You may recall one other scaling challenge: the BitVectors used by
the Filter subclasses are too big to pass between nodes in a cluster.)

> I think that coming up with a good way of returning the field value to
> the requester is going to be a better final solution.

I don't think we can get optimum performance until the whole term
dictionary resides in RAM. Certainly to solve the sort cache
problem, we need at least the whole sort field in RAM.

Another avenue of attack might be to load the sort field's .lex files
into RAM, but not decompress them. Then we'd use an InStream with an
inner RAMFileDes (instead of an FSFileDes). No more disk seeks.

This stratagem is a little dicey, because it's hard to predict how
much space a field's terms will occupy. It depends not only on how
many values the sort field has and how long they are, but how well
they compress.

> The fear of
> disk seeks seems like a red herring --- if a block is being read
> often, it's going to be cached; if it's not read often, it doesn't matter.
> And if for some reason we are thrashing the page cache and forcing a
> re-read, let's figure out how to change that!

The .lex files for the sort cache field will tend to be scattered
because of the segmented index format that allows for incremental
indexing. You can consolidate them by optimizing the index, but
that's expensive.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



Re: KinoSearch::Docs::Cookbook::ReusingSearchers
On 9/14/07, Marvin Humphrey <marvin@rectangular.com> wrote:
> > I think that coming up with a good way of returning the field value to
> > the requester is going to be a better final solution.
>
> I don't think we can get optimum performance until the whole term
> dictionary resides in RAM. Certainly to solve the sort cache
> problem, we need at least the whole sort field in RAM.

I think I agree with this, although I think it's true to the same
extent that the PostingLists need to be in RAM most of the time. The
amount of data that one can transfer in from disk per request is very
small, so anything else you want to use had better be cached.

Take a situation like Henry's: he wants an average response time of less
than a second. During this time he can read at most 100 MB from disk.
If he is indexing unstopped English, the vector for the term 'the'
is going to be something like 1/20th the entire index size. Thus if
his index is greater than 2GB and he does not have 'the' cached, he's
not going to be able to read it off disk in time.

So assume he does have it cached, along with as many terms as he
needs to get the largest uncached term below 100 MB. I'm not
quite sure how the curve really goes, but at a glance
(http://www.comp.lancs.ac.uk/ucrel/bncfreq/) it looks like top 20
terms are going to be about 1/4th the index size, and that the 20th
term is about 1/200th on its own. So with 100 GB of index per node,
you could cache 25 GB of top terms and any single term you had to read
would be less than 100 MB.

I don't know how many of his documents will fill 100 GB of index, but
he mentioned his total corpus was 30 million, so let's use that number
as a high guess. 30 million documents * 30 bytes per doc for the
sorted field's value is less than 1 GB. So while I agree with you
that reading that from disk in less than a second is out of the
question, I feel like it's small compared to the size of the other
stuff that already needs to be in cache.
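Spelling out the arithmetic above (every input is an assumption from this thread, not a measurement: ~100 MB readable per second, 'the' at ~1/20th of an unstopped index, ~30 bytes per sort-field value):

```python
# Back-of-envelope check of the numbers in the discussion.
MB = 1024 ** 2
GB = 1024 ** 3

disk_budget = 100 * MB        # bytes readable from disk in ~1 second
the_share = 20                # 'the' is ~1/20th of an unstopped index

# Largest index where an *uncached* posting list for 'the' still
# fits inside the one-second disk budget:
max_index = disk_budget * the_share
print(max_index / GB)         # ~2 GB, matching the estimate above

# Sort-field cache for the whole corpus:
docs = 30_000_000
bytes_per_value = 30
sort_cache = docs * bytes_per_value
print(sort_cache / GB)        # under 1 GB -- small next to the postings
```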

> Another avenue of attack might be to load the sort field's .lex files
> into RAM, but not decompress them. Then we'd use an InStream with an
> inner RAMFileDes (instead of an FSFileDes). No more disk seeks.

I think it's simpler than you are thinking: they are not that big,
and if you read them often the system is going to cache them for you.
Are you familiar with how the virtual file system works in Linux?
http://www.linux-security.cn/ebooks/ulk3-html/0596005652/understandlk-CHP-16.html
All unused memory in the system is used to cache the filesystem, so
the second time a file is read the actual hard disk isn't touched at
all. And since we are limited as to how much disk information we can
read per search, anything in the cache is likely to stay there if we
use it every few searches.

---

So while I think I understand the problem you are trying to solve, I'm
not sure that the straightforward solution (do the sort locally and
send along the value) is really that unwieldy.

Nathan Kurz
nate@verse.com

ps. There's a wonderful description of the VFS that made me think of
your BoilerPlater implementation when I read it:
http://www.spinellis.gr/pubs/inbook/beautiful_code/html/Spi07g.html
