Mailing List Archive

CachingDirectory contribution
A while back I wrote a CachingDirectory implementation for Lucene which
allows for caching an index on a local machine other than the "root"
machine. This can be very useful for handling heavy load (such as David
Snyder's 13 GB index :-))

I'd really love to see it included in the Lucene package (I'm okay with
it being put under the Apache license). If there's some interest I could
provide the sources to the abstract CachingDirectory as well as actual
FSCachingDirectory and RAMCachingDirectory implementations.

All these classes compile fine with Lucene 1.2rc1, but I'd like to
further test them before I provide the sources.
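
The sources below are not the posted CachingDirectory; they are only a
minimal, hypothetical sketch of the idea (all class and method names are
invented): keep a copy of each index file on the local file system and
refresh it from the shared "root" copy whenever the root copy is newer.

import java.io.*;

// Hypothetical sketch, not the actual CachingDirectory contribution:
// cache index files from a shared "root" directory onto local disk.
public class LocalFileCache {
    private final File rootDir;   // e.g. an NFS-mounted master index
    private final File cacheDir;  // fast local disk

    public LocalFileCache(File rootDir, File cacheDir) {
        this.rootDir = rootDir;
        this.cacheDir = cacheDir;
        cacheDir.mkdirs();
    }

    // Returns a local copy of the named index file, copying it first if the
    // cached copy is missing or older than the root copy.
    public File getFile(String name) throws IOException {
        File master = new File(rootDir, name);
        File cached = new File(cacheDir, name);
        if (!cached.exists() || cached.lastModified() < master.lastModified()) {
            copy(master, cached);
            cached.setLastModified(master.lastModified());
        }
        return cached;
    }

    private static void copy(File src, File dst) throws IOException {
        InputStream in = new FileInputStream(src);
        try {
            OutputStream out = new FileOutputStream(dst);
            try {
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
            } finally {
                out.close();
            }
        } finally {
            in.close();
        }
    }
}

An FSCachingDirectory could wrap logic like this behind Lucene's Directory
interface, while a RAMCachingDirectory would keep the copies in memory
instead of on local disk - but that is a guess at the design, not a
description of the actual contribution.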


In addition to that I could also provide my OracleDirectory
implementation which stores all index files into an Oracle database
instead of a file system structure. I haven't done a SQLServerDirectory,
but I'm willing to implement it as well :)

Any comments?

--
Maik Schreiber
IQ Computing - http://www.iq-computing.de
mailto: info@iq-computing.de
Re: CachingDirectory contribution [ In reply to ]
> A while back I wrote a CachingDirectory
> implementation for Lucene which
> allows for caching an index on a local machine other
> than the "root"
> machine. This can be very useful for handling heavy
> load (such as David
> Snyder's 13 GB index :-))

13 GB is considered a light load for Lucene. I am
currently running a Lucene demo on my old but trusty
120 MHz Pentium laptop with a 9 GB index. It takes
Lucene about 20 seconds to handle the very first
query, probably because it is loading the index into
memory. All subsequent queries are instantaneous.

Anyway, I'm very curious as to how your directory
caching code works. What does it cache? Files?
Previously read data? Have you measured the performance
improvement gained with your caching system, and what
index size did you use to measure it?

If it does improve performance for huge indexes, I'll
+1.

This leads me to yet another of my burning questions:
has anyone pushed Lucene to its limits yet? If so,
what are they? What happens when Lucene hits its limit?
Does it throw an exception? Core dump?


> In addition to that I could also provide my
> OracleDirectory
> implementation which stores all index files into an
> Oracle database
> instead of a file system structure. I haven't done a
> SQLServerDirectory,
> but I'm willing to implement it as well :)

I assume you're using BLOBs to store the index files?
What are the advantages of using the Oracle directory
over just using the file system?
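
For reference, one way an index file can be stored and read back as a BLOB
through plain JDBC looks roughly like the sketch below. This is hypothetical,
not Maik's OracleDirectory; the table and column names are invented.

import java.io.*;
import java.sql.*;

// Hypothetical sketch: read one index "file" stored as a BLOB row in a
// table like: CREATE TABLE index_files (name VARCHAR(255) PRIMARY KEY,
// data BLOB). Not the actual OracleDirectory implementation.
public class BlobFileReader {
    public static byte[] readFile(Connection con, String name)
            throws SQLException, IOException {
        PreparedStatement ps = con.prepareStatement(
            "SELECT data FROM index_files WHERE name = ?");
        try {
            ps.setString(1, name);
            ResultSet rs = ps.executeQuery();
            if (!rs.next()) {
                throw new FileNotFoundException(name);
            }
            InputStream in = rs.getBinaryStream(1);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } finally {
            ps.close();
        }
    }
}

The usual argument for a database-backed Directory is operational (backup,
replication, and transactions alongside other data) rather than raw speed.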




Re: CachingDirectory contribution [ In reply to ]
> This leads me to yet another of my burning questions:
> has anyone pushed Lucene to its limits yet? If so,
> what are they? What happens when Lucene hits its limit?
> Does it throw an exception? Core dump?

I haven't done that, but I went to a job interview a few months
ago at which I mentioned Lucene, after hearing that they were using
Verity.
That company had supposedly tried Lucene, but according to them
they hit a wall with it at some point and performance just
dropped.
I do not know how large their indices were, but I suspect they were
quite large. The company indexes news articles, press releases, et
cetera, and keeps them around for a while (a few months or so).

Otis


RE: CachingDirectory contribution [ In reply to ]
> From: Maik Schreiber [mailto:bZ@iq-computing.de]
>
> A while back I wrote a CachingDirectory implementation for
> Lucene which
> allows for caching an index on a local machine other than the "root"
> machine. This can be very useful for handling heavy load
> (such as David
> Snyder's 13 GB index :-))
>
> I'd really love to see it included in the Lucene package (I'm
> okay with
> it being put under the Apache license). If there's some
> interest I could
> provide the sources to the abstract CachingDirectory as well as actual
> FSCachingDirectory and RAMCachingDirectory implementations.

Please post them. If folks find them useful, they can be added to Lucene.

There are lots of places to cache. The operating system usually makes a
valiant effort to cache files on the local FS using available RAM. NFS
mounted files are also usually cached in RAM, and there are even NFS client
implementations that cache entire files on the local FS, so that if they
fall out of the RAM cache then a network read is not required. Explicitly
caching multi-gigabyte files in the Java heap is probably not practical.
But if NFS or the local FS do not cache things adequately, there might be a
place for Java-based caches that use either RAM or the local FS. I'm not
sure what those cases are, but that doesn't mean they don't exist! Benchmarking is
always the best way to determine these things, and if folks find that a
cache helps, let's add it.

Doug
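
As a starting point for that benchmarking, a minimal timing harness might
look like the sketch below. It assumes the Lucene 1.2-era search API
(IndexReader.open, IndexSearcher, the static QueryParser.parse, and Hits);
the "contents" field name is only an example.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.Directory;

// Sketch: time the same queries against the same index opened through
// different Directory implementations (plain FSDirectory, a caching
// Directory, ...) and compare the results.
public class DirectoryBenchmark {
    public static long timeQueries(Directory dir, String[] queries)
            throws Exception {
        IndexReader reader = IndexReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);
        long start = System.currentTimeMillis();
        for (int i = 0; i < queries.length; i++) {
            Query q = QueryParser.parse(queries[i], "contents",
                                        new StandardAnalyzer());
            Hits hits = searcher.search(q);
            hits.length();  // make sure the search is actually executed
        }
        long elapsed = System.currentTimeMillis() - start;
        searcher.close();
        reader.close();
        return elapsed;
    }
}

Run a warm-up pass separately, since (as noted earlier in the thread) the
very first query pays the cost of priming the caches.
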
RE: CachingDirectory contribution [ In reply to ]
> From: Dave Kor [mailto:davekkw@yahoo.com]
>
> This leads me to yet another of my buring questions..
> has anyone pushed Lucene to its limits yet? If so,
> what are they? What happens when Lucene hit its limit?
> Does it throw an exception? coredump?

There are many limits that could be hit. Lucene's design is that hard
limits should be hard to hit. Lucene only caches a few critical data
structures in memory, in order to keep from hitting the JVM's heap size
limit, relying instead on the file system's caches for performance. Lucene
uses 63-bit file pointers, so it will be a long time before raw index size
is a limit; however, filesystems that do not support files larger than, e.g.,
2GB will limit things. Document and term numbers are 31-bit, so two billion
documents or terms is another limit that will probably not be hit too
soon.

Performance for large indices is frequently governed by i/o performance. If
an index is larger than RAM then searches will need to read data from disk.
This can quickly become a bottleneck. A search for a term that occurs in a
million documents can require over 1MB of data, which can take some time to
read. With multiple searching threads, the disk can easily become a
bottleneck. Disk arrays can alleviate this; more RAM helps even more!
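
Back-of-the-envelope arithmetic behind that figure (the bytes-per-posting
and disk-throughput numbers below are assumptions, not measurements):

// Rough arithmetic only; the per-posting size and disk throughput are
// assumed example values, not measured Lucene numbers.
public class PostingReadEstimate {
    public static void main(String[] args) {
        long docsWithTerm = 1000000L;   // term occurs in a million documents
        double bytesPerPosting = 1.5;   // assumed average for the posting data
        double diskMBPerSec = 20.0;     // assumed sustained disk throughput

        double mb = docsWithTerm * bytesPerPosting / (1024.0 * 1024.0);
        double seconds = mb / diskMBPerSec;
        System.out.println(mb + " MB to read, roughly " + seconds
                + " s from disk if nothing is cached");
    }
}

With several such terms per query and several queries in flight, those reads
add up quickly, which is the bottleneck described above.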

For some folks, queries that take over a second are unacceptable; for
others, ten seconds is okay.

Performance should be more-or-less linear: a two-million document index will
be almost twice as slow to search as a one-million document index. There
are lots of factors, including document size, CPU speed, RAM size, and the i/o
subsystem, but a rough rule of thumb for Lucene performance might be that,
in a "typical" configuration, it can search a million documents per second.

So if you need to search 20 million 100kB documents on a 100MHz 386 with 8MB
of RAM with sub-second response time, Lucene will probably fail. But if you
need to search two million 2kB documents on a 500MHz Pentium with 128MB of
RAM in a couple of seconds per query, you're probably okay.

Doug
RE: CachingDirectory contribution [ In reply to ]
This is a great discussion - thanks for explaining some of the limits, Doug.
I always wonder how this sort of email could be turned into useful documentation.

>Performance should be more-or-less linear: a two-million document index will
>be almost twice as slow to search as a one-million document index.

Is performance linear in the number of documents? I would naively
think it would be linear in the number of terms. Also, how does
document size change things? Is searching 1 million 100k documents
twice as fast as 1 million 200k documents?

nelson@monkey.org
. . . . . . . . http://www.media.mit.edu/~nelson/
RE: CachingDirectory contribution [ In reply to ]
> From: nelson@monkey.org [mailto:nelson@monkey.org]
>
> >Performance should be more-or-less linear: a two-million
> document index will
> >be almost twice as slow to search as a one-million document index.
>
> Is performance linear in the number of documents? I would naively
> think it would be linear in the number of terms. Also, how does
> document size change things? Is searching 1 million 100k documents
> twice as fast as 1 million 200k documents?

Some more precise statements: The cost to search for a term is proportional
to the number of documents that contain that term. The cost to search for a
phrase is proportional to the sum of the number of occurrences of its
constituent terms. The cost to execute a boolean query is the sum of the
costs of its sub-queries. Longer documents contain more terms: usually both
more unique terms and more occurrences.
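
Those document frequencies are available from the index, so the relative
cost of a simple term or boolean query can be estimated before running it.
A small sketch, assuming the IndexReader.docFreq(Term) call (the field name
is only an example):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

// Sketch: estimate the relative cost of a boolean query over single terms
// as the sum of the document frequencies of its terms, per the cost model
// described above.
public class QueryCostEstimate {
    public static long estimate(IndexReader reader, String field, String[] terms)
            throws IOException {
        long cost = 0;
        for (int i = 0; i < terms.length; i++) {
            cost += reader.docFreq(new Term(field, terms[i]));
        }
        return cost;
    }
}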

Total vocabulary size is not a big factor in search performance. When you
open an index Lucene does read one out of every 128 unique terms into a
table, so an index with a large number of unique terms will be slower to
open. Searching that table for query terms is also slower for bigger
indexes, but the time to search that table is not significant in overall
performance. Lucene also reads, at index-open time, one byte per document per
indexed field (the normalization factor). So an index with lots of
documents and fields will also be slower to open. But, once opened, the
cost of searching is largely dependent on the frequency characteristics of
query terms. And, since IndexReaders and Searchers are thread safe, you
don't need to open indexes very often.
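
Rough arithmetic for those open-time costs, using made-up example sizes:

// Rough arithmetic only; the counts are invented example numbers.
public class OpenCostEstimate {
    public static void main(String[] args) {
        long uniqueTerms = 10000000L;   // example: 10 million unique terms
        long documents = 2000000L;      // example: 2 million documents
        int indexedFields = 4;          // example: 4 indexed fields

        long termTableEntries = uniqueTerms / 128;   // ~78,000 entries
        long normBytes = documents * indexedFields;  // 8,000,000 bytes of norms
        System.out.println(termTableEntries + " term-table entries, "
                + normBytes + " bytes of norms read at open");
    }
}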

Doug
Re: CachingDirectory contribution [ In reply to ]
Doug, what about scaling with the number of concurrent queries? In other
words, assuming that the queries are similar, how is the search
performance affected by an increase in the number of concurrent users?

RE: CachingDirectory contribution [ In reply to ]
> From: Dmitry Serebrennikov [mailto:dmitrys@earthlink.net]
>
> Doug, what about scaling with the number of concurrent
> queries? In other
> words, assuming that the queries are similar, how is the search
> performance affected by increase in the number of concurrent users?

On a single-CPU, single-disk machine, throughput will likely be best with a
few simultaneous queries. For example, if each query were to spend half of
its time on the CPU and half the time waiting for i/o, and queries take two
seconds total, then the ideal would be to have queries arrive every second,
so that while one was using the CPU the other could use the disk. Things
don't usually work out quite so neatly, but you get the idea.

Multi-CPU and/or multi-disk systems can provide greater parallelism and
hence query throughput. However, Lucene's FSDirectory serializes reads to a
given file (since it only has a single file descriptor per file), which
limits i/o parallelism. Someone with a large disk array would be better
served by a Directory implementation that uses Java 1.4's new i/o classes.
In particular, the FileChannel class supports reads that do not move the
file pointer, so that multiple reads on the same file can be in progress at
the same time.
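
For reference, the positional read mentioned above looks roughly like this
with the Java 1.4 NIO classes. This is a minimal sketch of the technique,
not Lucene's FSDirectory:

import java.io.FileInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

// Sketch: FileChannel.read(buffer, position) does not move the channel's
// file pointer, so several threads can read different offsets of the same
// file through one shared channel without synchronizing on a seek.
public class PositionalReads {
    private final FileChannel channel;

    public PositionalReads(String path) throws IOException {
        this.channel = new FileInputStream(path).getChannel();
    }

    // Fills buf with bytes starting at the given file offset.
    public void read(byte[] buf, long offset) throws IOException {
        ByteBuffer bb = ByteBuffer.wrap(buf);
        while (bb.hasRemaining()) {
            int n = channel.read(bb, offset + bb.position());
            if (n < 0) {
                throw new IOException("read past end of file");
            }
        }
    }
}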

Instead of building big boxes, another approach to scaling performance is to
use lots of little boxes. Here you split the index into sub-indexes,
broadcast the search to all of the sub-indexes, and merge the results. To
minimize the number of boxes, you want to give each box the largest index it
can search quickly, typically one that mostly fits in RAM. It would be
fairly easy to implement such a distributed search system with Lucene,
writing a multi-threaded version of MultiSearcher that uses RPC to send
queries to remote systems.
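
A minimal sketch of that broadcast-and-merge idea is below. The RemoteIndex
interface and ScoredDoc class are invented stand-ins for whatever RPC layer
is used; nothing here is an existing Lucene class.

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: fan a query out to several sub-index boxes in
// parallel and merge their top-N results into one global top-N list.
public class BroadcastSearcher {

    // Invented stand-in: an RPC client for one box holding one sub-index.
    public interface RemoteIndex {
        ScoredDoc[] search(String query, int topN) throws Exception;
    }

    // Invented stand-in: a globally unique document id plus its score.
    public static class ScoredDoc {
        public final String id;
        public final float score;
        public ScoredDoc(String id, float score) { this.id = id; this.score = score; }
    }

    public static List search(final RemoteIndex[] boxes, final String query,
                              final int topN) throws InterruptedException {
        final ScoredDoc[][] partial = new ScoredDoc[boxes.length][];
        Thread[] threads = new Thread[boxes.length];
        for (int i = 0; i < boxes.length; i++) {
            final int box = i;
            threads[i] = new Thread() {
                public void run() {
                    try {
                        partial[box] = boxes[box].search(query, topN);
                    } catch (Exception e) {
                        partial[box] = new ScoredDoc[0];  // treat a failed box as empty
                    }
                }
            };
            threads[i].start();
        }
        for (int i = 0; i < threads.length; i++) {
            threads[i].join();
        }
        // Merge the per-box lists and keep the topN best scores overall.
        List all = new ArrayList();
        for (int i = 0; i < partial.length; i++) {
            for (int j = 0; j < partial[i].length; j++) {
                all.add(partial[i][j]);
            }
        }
        Collections.sort(all, new Comparator() {
            public int compare(Object a, Object b) {
                float diff = ((ScoredDoc) b).score - ((ScoredDoc) a).score;
                return diff < 0 ? -1 : (diff > 0 ? 1 : 0);
            }
        });
        return all.size() > topN ? all.subList(0, topN) : all;
    }
}

One practical wrinkle worth noting: scores from different boxes are only
directly comparable if each box ranks with consistent global statistics, or
if the application can tolerate per-box scoring.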

Deciding whether adding CPUs, disks, or RAM is most cost-effective is
complex. In short, the best approach is to benchmark various
configurations.

The big internet search engines do all of the above, plus other tricks. For
example, they can keep two indexes: one containing one million "popular"
pages, and another containing 100 million less popular pages. If they don't
let folks see past the 1000th hit, and they find 1500 hits in the popular
index, then they can return just the hits from the popular index and report
that there are 150,000 hits total (the 1500 hits scaled by the roughly 100-to-1
ratio between the two indexes). They don't know how many there really
are, but that's a good estimate and no one can prove them wrong. If they
can thus answer the vast majority of queries from a small, "popular" index,
then they don't have to make the big index as scalable, and can save lots of
money.

Doug