Mailing List Archive

PDF / Word document parsers
Hi...

I have been looking for PDF and Word document parsers. I have tried the contributions page on the Lucene site, as suggested by a Lucene user. PJEtymon does not have a Windows version, and XPDF does not do the parsing very well.

Can someone suggest better Word document or PDF parsers than the ones I mentioned here?

Thanks

Anita Srinivas
Re: PDF / Word document parsers
Anita,

I've had moderate success using Etymon for PDF parsing. It does consume
quite a lot of memory for larger PDF documents, but otherwise it's OK.
What difficulties are you facing?

For MS Word parsing, the Jakarta POI project is working something out, but
in the meantime I've managed to search MS Word documents by reading the
file and stripping out nonsense characters. It's a hack, I think, but if I
increase the IndexWriter's maxFieldLength to about a million, I can search
13-15MB Word documents with ease.
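
A minimal sketch of the "strip out nonsense characters" hack described
above, not the actual code used. It assumes the text of interest is stored
as runs of printable ASCII bytes; the class name WordTextStripper and the
MIN_RUN threshold are hypothetical. Text that Word stores as 16-bit
characters would be missed by this version.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class WordTextStripper {

    /** Shortest run of printable bytes worth keeping (hypothetical threshold). */
    static final int MIN_RUN = 4;

    /** Keeps runs of printable ASCII and drops everything else. */
    static String extractText(byte[] data) {
        StringBuilder out = new StringBuilder();
        StringBuilder run = new StringBuilder();
        for (byte b : data) {
            int c = b & 0xFF;
            if (c >= 0x20 && c <= 0x7E) {
                run.append((char) c);   // printable ASCII: extend the current run
            } else {
                flush(run, out);        // binary byte: the current run ends here
            }
        }
        flush(run, out);
        return out.toString();
    }

    /** Appends the run to the output if it is long enough, then clears it. */
    private static void flush(StringBuilder run, StringBuilder out) {
        if (run.length() >= MIN_RUN) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(run);
        }
        run.setLength(0);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java WordTextStripper <file.doc>");
            return;
        }
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(extractText(data));
    }
}
```

The resulting string is what would be handed to the indexer. For large
documents the field length limit mentioned above still has to be raised,
otherwise the tail of the text is not indexed.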

Kelvin
----- Original Message -----
From: "Anita Srinivas" <srinivasa@tecin.mu>
To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Sent: Friday, April 19, 2002 2:13 PM
Subject: PDF / Word document parsers





--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>
Re: PDF / Word document parsers
> I have been looking for PDF and Word document parsers. I have tried the
> contributions page on the Lucene site as suggested by a Lucene User. The
> PJEtymon does not have a Windows version. The XPDF does not do the parsing
> very well.

I've run Etymon with some degree of success on Windows boxes. To parse Word
documents you can have a look at OpenOffice. You can start OpenOffice so
that it accepts socket connections. From your Java app, you open a
connection to OpenOffice (using the OpenOffice SDK), send the Word document,
and it will convert it to text.
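
A rough sketch of the connection side of this setup, using only the JDK. It
assumes OpenOffice was started with an "-accept" argument of the form built
below (newer releases spell it "--accept"), and that port 8100 is free; the
class and method names are hypothetical. The actual load-and-convert step
needs the OpenOffice SDK jars and is not shown here.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class OfficeConnection {

    /** Command-line argument that makes soffice listen on a socket. */
    static String acceptArgument(String host, int port) {
        return "-accept=socket,host=" + host + ",port=" + port + ";urp;";
    }

    /** UNO URL that a Java client would pass to the SDK's URL resolver. */
    static String unoUrl(String host, int port) {
        return "uno:socket,host=" + host + ",port=" + port
                + ";urp;StarOffice.ServiceManager";
    }

    /** Returns true if something accepts TCP connections on host:port. */
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String host = "localhost";  // assumption: OpenOffice runs on this machine
        int port = 8100;            // assumption: any free port will do
        System.out.println("start with: soffice " + acceptArgument(host, port));
        System.out.println("connect to: " + unoUrl(host, port));
        System.out.println("listening:  " + isListening(host, port, 1000));
    }
}
```

Checking the port first gives a clearer error than a failed SDK call when
OpenOffice is not running yet.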

You can also use OpenOffice for various other kinds of parsing. The URL:
www.openoffice.org

Note: I've never tried OpenOffice under Windows, so I'm not sure how it will
work there, but we are using it here to index our Word documents.

Regards,

--
Victor Hadianto
---------------
More are taken in by hope than by cunning. -- Vauvenargues

Re: PDF / Word document parsers
> To parse word
> document you can have a look for OpenOffice. You can start OpenOffice to
> receive a socket connection. From your Java app, you open a connection to
> OpenOffice (using OpenOffice SDK), send the word document and it will
> convert it to text.

That's actually quite a novel idea. I haven't tried it. Is it complicated to
communicate with OpenOffice?

Kelvin


Re: PDF / Word document parsers
> > To parse word
> > document you can have a look for OpenOffice. You can start OpenOffice to
> > receive a socket connection. From your Java app, you open a connection to
> > OpenOffice (using OpenOffice SDK), send the word document and it will
> > convert it to text.
>
> That's actually quite a novel idea. I haven't tried it, is it complicated
> to communicate with OpenOffice?

It's a bit finicky, but fortunately there are examples of how to do this.

Java-OpenOffice page: http://udk.openoffice.org/java/man/index.html
OpenOffice API: http://api.openoffice.org/
Samples:
http://api.openoffice.org/unbranded-source/browse/~checkout~/api/odk/examples/examples.html


--
Victor Hadianto
---------------
God, I ask for patience -- and I want it right now!

Re: PDF / Word document parsers
Anita,

For Word, it is possible, if you're willing to hack, to use Ryan's
prototype code that is being refactored into HDF. It converts DOC->FOP,
and, well, we have XML parsers.
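
To illustrate the "we have XML parsers" step: a small sketch that pulls the
character data out of an XML document with the JDK's SAX parser, so it can
be fed to an indexer. It assumes the converter's output is well-formed XML
(the XSL-FO snippet in the usage note is only an illustration, not the
prototype's real output), and the class name FoTextExtractor is
hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class FoTextExtractor {

    /** Returns the text content of the XML with all markup dropped. */
    static String extractText(String xml) {
        final StringBuilder text = new StringBuilder();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            SAXParser parser = factory.newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    text.append(ch, start, length);
                }

                @Override
                public void endElement(String uri, String localName, String qName) {
                    // Separate the text of adjacent elements. This also splits
                    // words that are interrupted by inline markup.
                    text.append(' ');
                }
            });
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        // Collapse whitespace runs so the indexer sees clean tokens.
        return text.toString().trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String xml = "<fo:root xmlns:fo=\"http://www.w3.org/1999/XSL/Format\">"
                + "<fo:block>Hello</fo:block><fo:block>Word text</fo:block>"
                + "</fo:root>";
        System.out.println(extractText(xml));
    }
}
```

For real documents the same handler can be given a file or stream instead
of a string, which avoids holding the whole XML in memory.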

POI's HSSF (Excel) is obviously pretty robust at this stage.

Document summary information (HPSF) is read-only at this stage, but that
should be fine for your needs.

So essentially you can grab what you need via POI. HDF is going to be the
most work.

Check out jakarta.apache.org/poi for more details.

On Fri, 2002-04-19 at 02:25, Kelvin Tan wrote:
> Anita,
>
> I've experienced a moderate amount of success using Etymon for PDF parsing.
> It does consume quite a lot of memory for larger PDF documents, but otherwise
> it's ok. What difficulties are you facing?
>
> For MS Word parsing, The Jakarta POI project is working something out, but
> in the meanwhile I've managed to search MS Word documents by reading the
> file and stripping out nonsense characters. It's a hack I think, but if I
> increase the indexWriter's maxFieldLength to about a million, I can search
> like 13-15MB word documents with ease.
>
> Kelvin
--
http://www.superlinksoftware.com
http://jakarta.apache.org/poi - port of Excel/Word/OLE 2 Compound Document format to java
http://developer.java.sun.com/developer/bugParade/bugs/4487555.html - fix java generics!
The avalanche has already started. It is too late for the pebbles to vote.
-Ambassador Kosh

