I have created a test class for working with Analyzers and ran into a strange
problem: I cannot search for text in fields with more than 10000 words.
I have tested for various bugs in my test class, but I cannot find anything
there (please have a look, files are attached).
The class "AnalyzerTest" can be used like this:
"java -cp lucene-1.2-rc3-dev.jar org.apache.lucene.analysis.AnalyzerTest
voc.txt voc_out.txt"
where "voc.txt" and "voc_out.txt" are also included in the zip file.
The approach is simple: voc.txt contains 20628 Norwegian words. To test the
Analyzer I do this:
- create a string containing all 20628 words separated by " ".
- create a Lucene document and index this string as a text field.
- add this one document to an index.
- loop through the words again and query the index for each word
in the list.
- if everything works, every word should yield a hit in the single document
that exists in the index.
To be sure nothing is filtered out I have used the WhitespaceAnalyzer
(or NullAnalyzer...).
But here come the problems:
---------------------------
If I run all 20628 words, the last 10628 words cannot be found
by the IndexSearcher. If I flip the word list around (reverse alphabetical
order), I cannot find the words that were originally the first 10628.
If I limit the wordlist to 10000, I get a perfect match for either the first
or last 10000 words. If I set the limit to 10005 I will get 5 words not
found at the beginning or end of the list according to order.
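
The procedure and the numbers above can be modelled without Lucene at all.
The following is only a sketch in plain Java, not the attached AnalyzerTest:
the class FieldLimitSketch, its methods and the generated "wordN" vocabulary
are hypothetical stand-ins, and the maxTokens cap is an assumption introduced
to show that the symptoms are what one would see if only the first 10000
tokens of a field ended up in the index. Whether that is what happens here is
not confirmed by anything in this post; Lucene's IndexWriter does have a
maxFieldLength setting that may be worth checking.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class FieldLimitSketch {

    // Hypothetical stand-in for voc.txt: n distinct words without whitespace.
    static List<String> makeWords(int n) {
        List<String> words = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            words.add("word" + i);
        }
        return words;
    }

    // Stand-in for indexing one text field: split on whitespace and keep
    // only the first maxTokens tokens (the assumed cap).
    static Set<String> indexField(String text, int maxTokens) {
        Set<String> terms = new HashSet<>();
        String[] tokens = text.trim().split("\\s+");
        for (int i = 0; i < tokens.length && i < maxTokens; i++) {
            terms.add(tokens[i]);
        }
        return terms;
    }

    // Join the words into one field, "index" it, then look every word up
    // again and count how many are not found.
    static int countMisses(List<String> words, int maxTokens) {
        String text = String.join(" ", words);
        Set<String> terms = indexField(text, maxTokens);
        int misses = 0;
        for (String word : words) {
            if (!terms.contains(word)) {
                misses++;
            }
        }
        return misses;
    }

    public static void main(String[] args) {
        List<String> words = makeWords(20628);
        System.out.println(countMisses(words, 10000));                    // 10628
        System.out.println(countMisses(words.subList(0, 10000), 10000));  // 0
        System.out.println(countMisses(words.subList(0, 10005), 10000));  // 5
    }
}
```

With a cap of 10000 the model loses exactly the words after position 10000,
whichever order the list is in, which matches the 10628, 0 and 5 misses
reported above.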
Does anyone know what's going on here? I would be very happy if someone
could point to a place in my code where I have done something really stupid,
because I have spent a whole day trying to track this down.
Best regards, Karl Øie / gan media