Question from a Lucene newbie... I'm trying to index a file structure which
happens to include a relatively large file (310 KB with 55,700 words), and
for some reason it appears to be hanging the whole indexing process. Here's a
quick run-down:
1) I'm using a webcrawler to retrieve files and copy them to my local disk.
2) For files like PDFs, I'm copying an .html equivalent of the file to
my disk (but leaving the .pdf extension).
3) Then later, in a separate batch process, I run pretty much the standard
out-of-the-box "org.apache.lucene.IndexHTML" demo class (except I've added
.pdf as a possible indexing type).
That's about it. No big deal. The transformation from pdf to html is not
perfected yet either... so file size will definitely drop in the future...
as nonsense terms are being included in these files. But for now... what
should I be looking at or altering to find out what is causing the hang?
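
One way to narrow down a hang like this is to run each file's indexing step
on its own thread with a time limit, and print that thread's stack if the
limit is exceeded. The following is only a rough sketch using plain JDK
classes: the class name HangWatch, the method runWithTimeout, and the file
labels are made up for illustration, and the Runnable bodies are stand-ins
for whatever call actually indexes one file.

```java
public class HangWatch {

    /**
     * Runs the task on a worker thread and waits up to timeoutMillis
     * (must be greater than 0). Returns true if the task finished in time.
     * On timeout, prints the worker's current stack and returns false.
     */
    public static boolean runWithTimeout(String label, Runnable task, long timeoutMillis) {
        Thread worker = new Thread(task, "indexer-" + label);
        worker.setDaemon(true); // a stuck worker should not keep the JVM alive
        long start = System.currentTimeMillis();
        worker.start();
        try {
            worker.join(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
        if (!worker.isAlive()) {
            long elapsed = System.currentTimeMillis() - start;
            System.out.println(label + " finished in " + elapsed + " ms");
            return true;
        }
        // Still running: show where it is stuck (parser, tokenizer, I/O, ...).
        System.out.println(label + " still running after " + timeoutMillis + " ms; stack:");
        for (StackTraceElement frame : worker.getStackTrace()) {
            System.out.println("    at " + frame);
        }
        worker.interrupt(); // only helps if the task responds to interrupts
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical stand-ins for the real per-file indexing call.
        runWithTimeout("small.html", () -> { }, 1000);
        runWithTimeout("large.pdf", () -> {
            try {
                Thread.sleep(60_000); // simulates a step that never returns
            } catch (InterruptedException e) {
                // interrupted by the watchdog; just exit
            }
        }, 200);
    }
}
```

The printed frames should show whether the time is going into the HTML
parsing step or into adding the document to the index. Without changing any
code, sending the running JVM a QUIT signal (kill -QUIT <pid> on Unix,
Ctrl-Break on Windows) prints a similar dump for all threads.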
Thanks!
Jon Wasson
--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>