Mailing List Archive: indexing big files

indexing big files

Jan 8, 2002, 3:08 PM

Post #1 of 4 (1127 views)

Question from a Lucene newbie... I'm trying to index a file structure which
happens to include a relatively large file (310 KB with 55,700 words), and
for some reason it appears to be hanging the whole indexing process. Here's a
quick rundown:

1) I'm using a webcrawler to retrieve files and copy them to my local disk.
2) For files like PDFs, I'm copying an .html equivalent of the file to
my disk (but leaving the .pdf extension).
3) Then later, in a separate batch process, I run pretty much the standard
out-of-the-box "org.apache.lucene.IndexHTML" demo class (except I've added
.pdf as a possible indexing type).

That's about it. No big deal. The transformation from pdf to html is not
perfected yet either... so file size will definitely drop in the future...
as nonsense terms are being included in these files. But for now... what
should I be looking at or altering to find out what is causing the hang?
Thanks!

Jon Wasson

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>

Re: indexing big files

wdavies at overture

Jan 8, 2002, 3:30 PM

Post #2 of 4 (1108 views)

My guess is garbage collection. Try allocating twice as much heap as before,
or more. Try running with -verbose:gc (or whatever).

Cheers,
Winton

Winton Davies
Lead Engineer, Overture (NSDQ: OVER)
1820 Gateway Drive, Suite 360
San Mateo, CA 94404
work: (650) 403-2259
cell: (650) 867-1598
http://www.overture.com/
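
The heap and GC suggestion above can be tried roughly as follows. This is a
minimal sketch, not part of the Lucene demo: HeapReport is a hypothetical
helper class that prints how much heap the JVM is using and how much it is
allowed to use, so you can confirm that a larger heap setting actually took
effect. -Xmx and -verbose:gc are standard JVM options; the 256m value in the
usage line is only an example.

```java
// Hypothetical helper (not part of Lucene): reports current JVM heap usage.
class HeapReport {
    /** Returns a one-line summary of used and maximum heap, in megabytes. */
    static String describe() {
        Runtime rt = Runtime.getRuntime();
        long mb = 1024L * 1024L;
        // totalMemory() is the heap currently reserved; freeMemory() is the unused part of it.
        long usedMb = (rt.totalMemory() - rt.freeMemory()) / mb;
        // maxMemory() is the ceiling the heap may grow to (controlled by -Xmx).
        long maxMb = rt.maxMemory() / mb;
        return "heap used=" + usedMb + " MB, max=" + maxMb + " MB";
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Run it with, for example, java -Xmx256m -verbose:gc HeapReport. If the used
figure stays far below the maximum while the indexer is stuck, memory is
probably not the cause.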

Re: indexing big files

Jonathan_Wasson at dom

Jan 14, 2002, 4:14 PM

Post #3 of 4 (1109 views)

Have tried that... even going so far as to push it to a Solaris server with
plenty more RAM than my NT box... still hanging, so I assume it is something
other than memory. So far I have stepped into it, and it appears to be
hanging on HTMLParser parser = new HTMLParser(f); in HTMLDocument.class...
I think this may have something to do with JavaCC.zip & HTMLParser.jj?
Similarly, org.apache.lucene.HTMLParser.Test appears to be hanging.
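
One way to narrow down a hang like this is to run each file's parse on its
own thread with a time limit, and print where that thread is stuck if the
limit passes. The sketch below assumes nothing about Lucene: HangWatchdog and
finishesWithin are hypothetical names, and the Runnable stands in for
whatever code constructs and runs the parser on one file.

```java
// Hypothetical watchdog (not Lucene API): detects a task that does not finish in time.
class HangWatchdog {
    /**
     * Runs task on a worker thread and waits up to timeoutMillis (must be > 0).
     * Returns true if the task finished; otherwise prints the worker's current
     * stack trace to stderr, interrupts it, and returns false.
     */
    static boolean finishesWithin(Runnable task, long timeoutMillis) {
        Thread worker = new Thread(task, "parse-worker");
        worker.setDaemon(true); // a stuck worker must not keep the JVM alive
        worker.start();
        try {
            worker.join(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            return false;
        }
        if (worker.isAlive()) {
            // The stack trace shows which method the task is stuck in.
            for (StackTraceElement frame : worker.getStackTrace()) {
                System.err.println("  at " + frame);
            }
            worker.interrupt();
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // Stand-in for a parse that never returns.
        Runnable stuck = () -> {
            try {
                Thread.sleep(60_000);
            } catch (InterruptedException e) {
                // interrupted by the watchdog; just exit
            }
        };
        System.out.println("finished in time: " + finishesWithin(stuck, 200));
    }
}
```

Wrapping the per-file parse this way lets the batch log the offending file
and move on instead of stalling. Note that interrupt() only stops code that
checks for interruption; a parser spinning in a tight loop would keep running
on the daemon thread until the JVM exits.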

Re: indexing big files

Jonathan_Wasson at dom

Jan 22, 2002, 3:14 PM

Post #4 of 4 (1113 views)

Found the problem. Not related to memory. It may actually be a bug in the
initial distribution of HTMLParser.jj. If you have a title with a huge
number of characters (in my case 652 characters, though most are
erroneous), the parser hangs the whole indexing process. Can anyone else
verify this? I was able to fix it by simply limiting the title length when my
webcrawler is parsing my PDFs, but this should probably be addressed in
the parser as well to be safe.
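
The workaround of limiting the title before the file reaches the parser
might look something like the sketch below. It is an illustration under
stated assumptions, not the crawler code used here: TitleLimiter, limitTitle
and the maxChars cut-off are hypothetical, and the regular expression only
handles a single, well-formed <title> element.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical pre-processing step (not part of Lucene): caps the <title> text length.
class TitleLimiter {
    // Matches the first <title ...>text</title>, case-insensitively, across line breaks.
    private static final Pattern TITLE = Pattern.compile(
            "(<title[^>]*>)(.*?)(</title>)",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns html with its title text cut to at most maxChars characters. */
    static String limitTitle(String html, int maxChars) {
        Matcher m = TITLE.matcher(html);
        if (!m.find()) {
            return html; // no title element: nothing to shorten
        }
        String title = m.group(2).trim();
        if (title.length() <= maxChars) {
            return html; // already short enough: leave the document untouched
        }
        String shortened = title.substring(0, maxChars);
        // Splice the shortened title back between the original tags.
        return html.substring(0, m.start(2)) + shortened + html.substring(m.end(2));
    }

    public static void main(String[] args) {
        String html = "<html><head><title>abcdef</title></head></html>";
        System.out.println(limitTitle(html, 3));
    }
}
```

A cut-off well below the 652 characters that triggered the hang would be the
natural starting point, though the exact length at which the parser fails is
not established in this thread.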
