Mailing List Archive: using lucene with a very large index

using lucene with a very large index

Jan 16, 2002, 4:09 AM

Post #1 of 6 (1047 views)

Hi,
I'm building a very large index that contains several categories.
I have several questions I hope you can answer.

1) Is there a way to use Lucene with several indexes without merging them?
2) Do document ids change after merging indexes, or after adding or deleting documents?
3) Has anyone implemented a GUI for the Lucene index that supports deletion by id or SQL-like queries?
4) Assuming I have a term query with a large number of hits, say 10 million, is there a way to get, say, the top 10 results without going through all the hits?

thx tal.

Re: using lucene with a very large index [ In reply to ]

otis_gospodnetic at yahoo

Feb 13, 2002, 6:29 PM

Post #3 of 6 (1034 views)

--- tal blum <thetalthe@hotmail.com> wrote:
> Hi, I'm building a very large index that contains several
> categories.
> I have several questions I hope you can answer.
> 1) Is there a way to use Lucene with several indexes without merging
> them?

Look at MultiSearcher class.
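
The general idea behind searching several indexes without merging them can be sketched roughly as follows: query each index on its own, then combine the per-index hits by score. This is plain Java with no Lucene dependency, and it is not how MultiSearcher is implemented. `MultiIndexSketch`, `Hit`, and `searchAll` are hypothetical names, and each "index" is assumed to have already returned a list of scored hits.

```java
import java.util.ArrayList;
import java.util.List;

class MultiIndexSketch {
    /** A scored hit; indexName plus localId identifies the document (hypothetical type). */
    static final class Hit {
        final String indexName;
        final int localId;
        final double score;

        Hit(String indexName, int localId, double score) {
            this.indexName = indexName;
            this.localId = localId;
            this.score = score;
        }
    }

    /** Combines the hits each index returned and keeps the n best by score. */
    static List<Hit> searchAll(List<List<Hit>> perIndexHits, int n) {
        List<Hit> merged = new ArrayList<>();
        for (List<Hit> hits : perIndexHits) {
            merged.addAll(hits);
        }
        // Highest score first; only the result lists are combined, never the indexes.
        merged.sort((a, b) -> Double.compare(b.score, a.score));
        int keep = Math.max(0, Math.min(n, merged.size()));
        return new ArrayList<>(merged.subList(0, keep));
    }

    public static void main(String[] args) {
        List<Hit> news = List.of(new Hit("news", 0, 0.9), new Hit("news", 1, 0.2));
        List<Hit> sports = List.of(new Hit("sports", 0, 0.5));
        for (Hit h : searchAll(List.of(news, sports), 2)) {
            System.out.println(h.indexName + "#" + h.localId + " score=" + h.score);
        }
    }
}
```

Note that a hit is identified by the index it came from plus its id within that index, since ids from different indexes can collide.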

> 2) Do document ids change after merging indexes, or after adding or
> deleting documents?

Not sure.

> 3) Has anyone implemented a GUI for the Lucene index that supports
> deletion by id or SQL-like queries?

I haven't seen anything like it.

> 4) Assuming I have a term query with a large number of hits, say
> 10 million, is there a way to get, say, the top 10 results
> without going through all the hits?

See the Javadocs for Searcher and IndexSearcher, I think you'll find
the answer there.

Otis

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>

RE: using lucene with a very large index [ In reply to ]

Mhayes at verisign

Feb 13, 2002, 9:26 PM

Post #4 of 6 (1030 views)

> -----Original Message-----
> From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com]
> --- tal blum <thetalthe@hotmail.com> wrote:
[...]
> > 4) Assuming I have a term query with a large number of hits, say
> > 10 million, is there a way to get, say, the top 10 results
> > without going through all the hits?
>
> See the Javadocs for Searcher and IndexSearcher, I think you'll find
> the answer there.

I have the same question, but I can't see the answer in the Javadocs. Do you
mean this statement?

"The high-level search API (search(Query)) is usually more efficient, as it
skips non-high-scoring hits."

It is not clear to me what non-high-scoring hits means -- do you know?

thanks,
mark

Re: using lucene with a very large index [ In reply to ]

thetalthe at hotmail

Feb 14, 2002, 5:37 AM

Post #5 of 6 (1033 views)

> > 4) Assuming I have a term query with a large number of hits, say
> > 10 million, is there a way to get, say, the top 10 results
> > without going through all the hits?
>
> See the Javadocs for Searcher and IndexSearcher, I think you'll find
> the answer there.

Thanks Otis,
but I still don't understand. Because the documents for each Term are
stored sorted by docId, getting the top-ranking documents for a TermQuery
means going over all of the TermDocs for that Term. This causes problems
for search engines that contain many Documents.
One solution to that is to change the implementation and store the docs
sorted by their term score.
What do you think?

tal.

RE: using lucene with a very large index [ In reply to ]

DCutting at grandcentral

Feb 14, 2002, 10:12 AM

Post #6 of 6 (1039 views)

> From: tal blum [mailto:thetalthe@hotmail.com]
>
> 2) Do document ids change after merging indexes, or after adding
> or deleting documents?

Yes.
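
A toy model of why this happens, assuming a document id is simply a position in the index and that documents marked as deleted are squeezed out when the index is merged. `ToyIndex` and its methods are hypothetical names for illustration, not Lucene APIs.

```java
import java.util.ArrayList;
import java.util.List;

class ToyIndex {
    private final List<String> keys = new ArrayList<>();      // position = document id
    private final List<Boolean> deleted = new ArrayList<>();  // deletion marks

    /** Appends a document and returns the id it has right now. */
    int add(String key) {
        keys.add(key);
        deleted.add(false);
        return keys.size() - 1;
    }

    /** Marks a document as deleted; ids of other documents are unchanged so far. */
    void delete(int id) {
        deleted.set(id, true);
    }

    /** Drops deleted documents; the survivors shift down to fill the gaps. */
    void merge() {
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < keys.size(); i++) {
            if (!deleted.get(i)) {
                kept.add(keys.get(i));
            }
        }
        keys.clear();
        deleted.clear();
        for (String key : kept) {
            add(key);
        }
    }

    /** Current id of the live document with this key, or -1 if there is none. */
    int idOf(String key) {
        for (int i = 0; i < keys.size(); i++) {
            if (!deleted.get(i) && keys.get(i).equals(key)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add("a");
        index.add("b");
        index.add("c");
        System.out.println("id of c before: " + index.idOf("c"));
        index.delete(0);
        index.merge();
        System.out.println("id of c after delete + merge: " + index.idOf("c"));
    }
}
```

If you need an identifier that survives merges, one common approach is to store your own key as a field on each document and look documents up by that key rather than by id.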

> 4) Assuming I have a term query with a large number of
> hits, say 10 million, is there a way to get, say, the top
> 10 results without going through all the hits?

Your best bet is to use the normal search API.
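
One plausible reading of "skips non-high-scoring hits", offered as an assumption rather than a description of Lucene's internals: every matching document is still scored, but only the best few are retained, so memory and sorting cost depend on the number of results wanted rather than on the number of hits. A minimal sketch with a bounded min-heap follows; `TopHits` and `topN` are hypothetical names.

```java
import java.util.Arrays;
import java.util.PriorityQueue;

class TopHits {
    /**
     * Returns the ids of the n best-scoring docs, best first.
     * scores[i] is the score of doc i; every score is examined exactly once.
     */
    static int[] topN(double[] scores, int n) {
        if (n <= 0) {
            return new int[0];
        }
        // Min-heap ordered by score: the weakest hit kept so far sits on top.
        PriorityQueue<Integer> heap =
                new PriorityQueue<>((a, b) -> Double.compare(scores[a], scores[b]));
        for (int doc = 0; doc < scores.length; doc++) {
            if (heap.size() < n) {
                heap.add(doc);
            } else if (scores[doc] > scores[heap.peek()]) {
                heap.poll();       // evict the weakest kept hit
                heap.add(doc);
            }
        }
        int[] best = new int[heap.size()];
        for (int i = best.length - 1; i >= 0; i--) {
            best[i] = heap.poll(); // weakest comes out first, so fill from the back
        }
        return best;
    }

    public static void main(String[] args) {
        double[] scores = {0.1, 0.9, 0.4, 0.7, 0.2};
        System.out.println(Arrays.toString(topN(scores, 2)));
    }
}
```

The loop still visits all hits, which is the cost the original question is about; the heap only keeps the bookkeeping small.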

> From: tal blum [mailto:thetalthe@hotmail.com]
>
> one solution to that is to change the implementation and
> store the docs
> sorted by their term score.

That would make incremental index updates much slower, since every time a
document is added, the list of documents containing each term in that
document would need to be re-sorted. Currently we only need to append new
entries, which is much simpler. You could optimize this in various ways
(e.g., instead take the hit at search time) but it would still make things
slower for rapidly changing indexes.

Also, while this would make single term queries faster, multi-term queries
are more complex to accelerate. The highest-scoring match for a two-term
query may be in a document where one term has a very high weight and the
other has a very low weight. There have been papers written (I don't have
the references handy) exploring this issue, and, in general, there isn't an
algorithm that is guaranteed to return the highest scoring documents for
multi-term queries that does not in most cases have to process nearly all of
the documents containing those terms. That said, it is possible to use such
an index to vastly accelerate searches that *usually* return the highest
scoring documents.
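
A small worked example of the multi-term problem, using made-up weights. It assumes each term's document list is sorted by weight, that a document must contain both terms, and that its score is the sum of the two weights. `ImpactSketch`, `Posting`, and `best` are hypothetical names. Reading only the first two entries of each list misses the true best document, whose second term sits at the tail of its list.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ImpactSketch {
    /** One entry in a term's document list (hypothetical type). */
    static final class Posting {
        final String doc;
        final double weight;

        Posting(String doc, double weight) {
            this.doc = doc;
            this.weight = weight;
        }
    }

    // Both lists are sorted by descending weight. Doc X: 1.0 + 0.05 = 1.05 (true best).
    static final List<Posting> TERM_A = Arrays.asList(
            new Posting("X", 1.0), new Posting("Z", 0.5), new Posting("W", 0.1));
    static final List<Posting> TERM_B = Arrays.asList(
            new Posting("W", 0.9), new Posting("Z", 0.5), new Posting("X", 0.05));

    /** Best doc containing both terms, reading only the first `depth` entries of each list. */
    static String best(List<Posting> a, List<Posting> b, int depth) {
        Map<String, Double> seenInA = new HashMap<>();
        for (Posting p : a.subList(0, Math.min(depth, a.size()))) {
            seenInA.put(p.doc, p.weight);
        }
        String bestDoc = null;
        double bestScore = -1.0;
        for (Posting p : b.subList(0, Math.min(depth, b.size()))) {
            Double weightInA = seenInA.get(p.doc);
            if (weightInA != null && weightInA + p.weight > bestScore) {
                bestScore = weightInA + p.weight;
                bestDoc = p.doc;
            }
        }
        return bestDoc; // null if no doc was seen in both truncated lists
    }

    public static void main(String[] args) {
        System.out.println("depth 2: " + best(TERM_A, TERM_B, 2));
        System.out.println("full:    " + best(TERM_A, TERM_B, Integer.MAX_VALUE));
    }
}
```

With depth 2 the sketch returns Z (score 1.0); reading both lists in full returns X (score 1.05).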

Such a heuristic search technique is among the things required to scale
Lucene to extremely large collections (e.g., hundreds of millions of
documents). There are also lower-tech optimizations. For example, one can
simply keep a small index containing the highest-quality documents that is
always searched first. If enough hits are found there, you're done. A real
internet search engine combines lots of tricks in order to scale: segmenting
indexes by quality; heuristic search methods; and distributed searching.
Deploying something like Google is not a small task.
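
The "small high-quality index first" trick can be sketched roughly like this, assuming each tier is just a list of document texts ordered from best quality to worst and that matching is a plain substring test. `TieredSearch` and `search` are hypothetical names; a real deployment would query separate indexes instead of scanning lists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TieredSearch {
    /**
     * Searches the tiers in order and stops after the first tier that
     * brings the total number of hits to at least `wanted`.
     */
    static List<String> search(List<List<String>> tiers, String term, int wanted) {
        List<String> hits = new ArrayList<>();
        for (List<String> tier : tiers) {
            for (String doc : tier) {
                if (doc.contains(term)) {
                    hits.add(doc);
                }
            }
            if (hits.size() >= wanted) {
                break; // enough hits: the larger, lower-quality tiers are never touched
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<List<String>> tiers = Arrays.asList(
                Arrays.asList("lucene index", "lucene search"), // small, high quality
                Arrays.asList("lucene spam", "other page"));    // large, lower quality
        System.out.println(search(tiers, "lucene", 2));
        System.out.println(search(tiers, "lucene", 3));
    }
}
```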

I would someday like to add a heuristic search component to Lucene, that
uses a special index format (possibly with term document lists sorted by
normalized frequency, as you suggest). I have some experience doing this at
Excite, and it pays off big time. But it would take me several weeks
full-time to implement this, and I don't currently have that time. Perhaps
(with the support of an interested sponsor) I could make time this summer to
implement this.

In the meantime, if you encounter performance problems with a very large
index, you might try segmenting your index by document quality and/or
distributed search.

Doug
