Mailing List Archive

Fw: PDF / Word document parsers
Kelvin,
Thanks for your quick reply. I think it is built for Linux/Unix platform..
I am working on Windows platform.

Anita
----- Original Message -----
From: "Kelvin Tan" <kelvin@relevanz.com>
To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Sent: Friday, April 19, 2002 10:25 AM
Subject: Re: PDF / Word document parsers


> Anita,
>
> I've experienced a moderate amount of success using Etymon for PDF
parsing.
> It does consume quite alot of memory for larger PDF documents, but
otherwise
> it's ok. What difficulties are you facing?
>
> For MS Word parsing, The Jakarta POI project is working something out, but
> in the meanwhile I've managed to search MS Word documents by reading the
> file and stripping out nonsense characters. It's a hack I think, but if I
> increase the indexWriter's maxFieldLength to about a million, I can search
> like 13-15MB word documents with ease.
>
> Kelvin
> ----- Original Message -----
> From: "Anita Srinivas" <srinivasa@tecin.mu>
> To: "Lucene Users List" <lucene-user@jakarta.apache.org>
> Sent: Friday, April 19, 2002 2:13 PM
> Subject: PDF / Word document parsers
>
>
> Hi...
>
> I have been looking for PDF and Word document parsers. I have tried the
> contributions page on the Lucene site as suggested by a Lucene User. The
> PJEtymon does not have a Windows version. The XPDF does not do the
parsing
> very well.
>
> Can someone suggest some better Word document or PDF parsers other than
the
> ones I mentioned here, .
>
> Thanks
>
> Anita Srinivas
>
>
>
> --
> To unsubscribe, e-mail:
<mailto:lucene-user-unsubscribe@jakarta.apache.org>
> For additional commands, e-mail:
<mailto:lucene-user-help@jakarta.apache.org>
>
>


--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>