Mailing List Archive

strange search problems(cannot query for more than the first 10000 words!?!)
I have created a test class for working with Analyzers and ran into a strange
problem: I cannot search for text in fields with more than 10000 words!?

I have tested for various bugs in my test class, but I cannot find anything
there (please have a look, files are attached).


The class "AnalyzerTest" can be used like this:

"java -cp lucene-1.2-rc3-dev.jar org.apache.lucene.analysis.AnalyzerTest
voc.txt voc_out.txt"

where "voc.txt" and "voc_out.txt" are also included in the zip file.


The approach is simple: voc.txt contains 20628 Norwegian words. To test the
Analyzer I try to do this:

- create a string containing all 20628 words separated with " ".
- create a Lucene document and index this string as a text field.
- add this one document to an index.
- loop through the words again and query the index for each of the same words
in the list.
- if everything works, every word should yield a hit in the single document
that exists in the index.


To be sure nothing is filtered, I have used the WhitespaceAnalyzer
(or NullAnalyzer...).


But here come the problems:
----------------------------

If I try to run all 20628 words, the last 10628 words cannot be found
by the IndexSearcher. If I flip the words around (reverse alphabetical
order), I cannot find the first 10628 words!

If I limit the word list to 10000, I get a perfect match for either the first
or last 10000 words. If I set the limit to 10005, 5 words are not found,
at the beginning or end of the list depending on the order.


Does anyone know what's going on here? I would be very happy if someone
could point to a place in my code where I have done something really stupid,
because I have tried to track this down for a whole day.


mvh karl øie/gan media
RE: strange search problems(cannot query for more than the first 10000 words!?!)
> From: Karl Øie [mailto:karl@gan.no]
>
> I have created a testclass for working with Analyzers and ran into a strange
> problem; I cannot search for text in fields with more than 10000 words!?!?

Lucene by default stops indexing after the 10,000th token.
See
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html#maxFieldLength

You can disable this with:
indexWriter.maxFieldLength = Integer.MAX_VALUE;

Doug
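
To see why a per-field token limit would produce exactly these numbers, here
is a minimal sketch in plain Java. It is a toy model, not Lucene's code: the
class TokenLimitDemo and its methods indexField, countMisses and makeWords are
hypothetical names invented for illustration, and the generated words stand in
for voc.txt. It assumes the limit counts tokens in the order they are fed to
the indexer.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: this is NOT Lucene code. It mimics an indexer that
// stops adding tokens to a field once a per-field limit is reached.
class TokenLimitDemo {

    /** Returns the set of terms that would be searchable for one field. */
    static Set<String> indexField(List<String> tokens, int maxFieldLength) {
        Set<String> terms = new HashSet<>();
        int count = 0;
        for (String token : tokens) {
            if (count >= maxFieldLength) {
                break; // tokens past the limit are silently dropped
            }
            terms.add(token);
            count++;
        }
        return terms;
    }

    /** Counts how many of the words cannot be found among the indexed terms. */
    static int countMisses(List<String> words, int maxFieldLength) {
        Set<String> terms = indexField(words, maxFieldLength);
        int misses = 0;
        for (String word : words) {
            if (!terms.contains(word)) {
                misses++;
            }
        }
        return misses;
    }

    /** Builds n distinct placeholder words (hypothetical stand-ins for voc.txt). */
    static List<String> makeWords(int n) {
        List<String> words = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            words.add("word" + i);
        }
        return words;
    }

    public static void main(String[] args) {
        List<String> all = makeWords(20628);
        System.out.println(countMisses(all, 10000));              // 10628
        System.out.println(countMisses(makeWords(10005), 10000)); // 5
        System.out.println(countMisses(all, Integer.MAX_VALUE));  // 0
    }
}
```

With a limit of 10000 the model reports 10628 misses for 20628 words and 5
misses for 10005 words, which matches the symptoms in the original report.
With Integer.MAX_VALUE as the limit nothing is dropped.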

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>
RE: strange search problems(SOLVED!)
....er, after I sent the previous mail I grep'ed through the source for
"10000" and found this:


jakarta-lucene/src/java/org/apache/lucene/index/IndexWriter.java


/** The maximum number of terms that will be indexed for a single field in a
    document. This limits the amount of memory required for indexing, so that
    collections with very large files will not crash the indexing process by
    running out of memory.

    <p>By default, no more than 10,000 terms will be indexed for a field. */
public int maxFieldLength = 10000;


gentlemen, you may throw your tomatoes now!... sorry to bother you!


mvh karl øie



-----Original Message-----
From: Karl Øie [mailto:karl@gan.no]
Sent: 28 January 2002 18:16
To: lucene-user@jakarta.apache.org
Subject: strange search problems(cannot query for more than the first
10000 words!?!)


Re: strange search problems(cannot query for more than the first 10000 words!?!)
I believe there is a hard-coded limit in the Lucene code that ensures
that only the first 10000 terms are indexed. I can't remember which
class that is in, but you can do a find.....|xargs grep 10000 under
Unix to find it.

Otis
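
For anyone without a Unix shell at hand, the same kind of search can be
sketched in plain Java. This is only an illustration: SourceGrep and grep are
hypothetical names, and restricting the search to .java files and reading them
as ISO-8859-1 are assumptions, not something stated in this thread.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: walks a source tree and reports every line of a
// .java file that contains the given text.
class SourceGrep {

    /** Returns one "path:lineNumber: line" entry per matching line. */
    static List<String> grep(Path root, String needle) throws IOException {
        List<Path> files;
        try (Stream<Path> paths = Files.walk(root)) {
            files = paths
                .filter(p -> Files.isRegularFile(p) && p.toString().endsWith(".java"))
                .sorted()
                .collect(Collectors.toList());
        }
        List<String> matches = new ArrayList<>();
        for (Path file : files) {
            // ISO-8859-1 maps every byte to a character, so decoding never fails.
            List<String> lines = Files.readAllLines(file, StandardCharsets.ISO_8859_1);
            for (int i = 0; i < lines.size(); i++) {
                if (lines.get(i).contains(needle)) {
                    matches.add(file + ":" + (i + 1) + ": " + lines.get(i).trim());
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) throws IOException {
        // First argument: root of the source tree (defaults to the current directory).
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (String match : grep(root, "10000")) {
            System.out.println(match);
        }
    }
}
```

Run against a checkout of the Lucene sources, a search like this should turn
up the maxFieldLength declaration in IndexWriter.java among its matches.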
