Mailing List Archive

Parsing PDF documents
If you want to parse PDF documents, the best approach would be to use
the Adobe IFilter for PDF, which is a COM component. You will need to
write a Java client that interacts with that COM component.
I believe it is easily doable, but I have never done anything like
this. It's a very interesting project, though.
Also, you will have to perform the PDF-to-text conversion on a Windows
machine, since the IFilter is a Windows-only component.

http://www.adobe.com/support/downloads/detail.jsp?ftpID=1276

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/indexsrv/ixrefint_9sfm.asp
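A rough sketch of the out-of-process approach described above: run an external converter and capture its text output from Java. The converter command name below is hypothetical, and an in-memory stream stands in for the process output so the sketch is self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;

public class PdfTextCapture {

    // Reads all character data from a converter's output stream.
    static String readAll(InputStream in) throws IOException {
        Reader reader = new InputStreamReader(in);
        StringBuffer buf = new StringBuffer();
        char[] chunk = new char[4096];
        int n;
        while ((n = reader.read(chunk)) != -1) {
            buf.append(chunk, 0, n);
        }
        return buf.toString();
    }

    public static void main(String[] args) throws IOException {
        // In a real setup the stream would come from the converter process,
        // e.g. (hypothetical command name):
        //   Process p = Runtime.getRuntime().exec(new String[] {"pdf2text.exe", args[0]});
        //   String text = readAll(p.getInputStream());
        // Here an in-memory stream stands in so the sketch runs anywhere.
        InputStream fake = new ByteArrayInputStream("Extracted PDF text".getBytes());
        System.out.println(readAll(fake));
    }
}
```

The same capture loop works whether the text comes from an IFilter wrapper, a command-line converter, or a pure-Java extractor.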


Regards,
Ivaylo Zlatev



-----Original Message-----
From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com]
Sent: Saturday, February 16, 2002 7:15 AM
To: Lucene Developers List
Subject: RE: HTMLParser


Hm, I thought this place would have a PDF parser, but it does not.
It does seem to have an RTF parser:
http://cobase-www.cs.ucla.edu/pub/javacc/

Perhaps some of these things could be adopted by Lucene: people could
contribute Java classes for interacting with specific parsers, and all
of that could then be included in Lucene to work together with those
DocumentHandlers mentioned a few days ago.
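The per-format handler idea might be sketched like this; the interface name and the field-map shape are assumptions for illustration, not an existing Lucene API.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

// Hypothetical plug-in contract: one handler per file format, all producing
// the same field map that an indexing loop can turn into a Lucene Document.
interface DocumentHandler {
    Map parse(Reader input) throws Exception;
}

// Trivial handler for plain text, standing in for PDF/RTF/HTML handlers.
class PlainTextHandler implements DocumentHandler {
    public Map parse(Reader input) throws Exception {
        StringBuffer buf = new StringBuffer();
        int c;
        while ((c = input.read()) != -1) {
            buf.append((char) c);
        }
        Map fields = new HashMap();
        fields.put("contents", buf.toString());
        return fields;
    }
}

public class HandlerDemo {
    public static void main(String[] args) throws Exception {
        DocumentHandler h = new PlainTextHandler();
        Map fields = h.parse(new StringReader("hello world"));
        System.out.println(fields.get("contents")); // prints "hello world"
    }
}
```

Each contributed parser (Websearch, PJ, an IFilter wrapper) would only need to implement the one `parse` method.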

Otis

--- Daniel Calvo <dcalvo@task.com.br> wrote:
> Maybe...I'll have to give it a try first
>
> Anyway, I was playing with Lucene's HTMLParser in order to understand
> a little better how JavaCC works. My real interest is in PDF
> and RTF parsers. I've tried the Websearch PDF parser, but it only worked
> well with the examples provided. I wasn't able to correctly parse
> even PDF files distributed by Adobe. I've also had a lot of
> trouble with files converted to PDF (probably via dvi2pdf or
> something like that). Recently I read on this list (or maybe it was
> on the users list) that someone else was having trouble with
> both Websearch and PJ library parsers.
>
> I've just downloaded Adobe's PDF Specification and later I'll try to
> see if there's any room for improvement in Websearch code. I
> know PDF has various features (compression, cryptography, etc.) that
> complicate the parsing and I'm not willing to spend much time
> doing this but I'll probably try something.
>
> --Daniel
>
> > -----Original Message-----
> > From: Paulo Gaspar [mailto:paulo.gaspar@krankikom.de]
> > Sent: Friday, February 15, 2002 11:14 PM
> > To: Lucene Developers List
> > Subject: RE: HTMLParser
> >
> >
> > Could the following Xerces-based HTML parser be of interest for
> > your work?
> >
> > This is just the initial ANNOUNCE but there are further
> > developments.
> >
> >
> > Have fun,
> > Paulo Gaspar
> >
> > > -----Original Message-----
> > > From: Andy Clark [mailto:andyc@apache.org]
> > > Sent: Saturday, February 09, 2002 4:16 AM
> > > To: general@xml.apache.org
> > > Cc: xerces-j-dev@xml.apache.org
> > > Subject: [ANNOUNCE] Xerces HTML Parser
> > >
> > >
> > > For a long time users have asked if Xerces can parse HTML files.
> > > But since most HTML documents are not well-formed XML documents,
> > > it is generally not possible to use a conforming XML parser to
> > > read HTML documents.
> > >
> > > However, the Xerces Native Interface (XNI) that is the foundation
> > > of the Xerces2 implementation defines a framework that allows
> > > different kinds of parsers to be constructed by connecting a
> > > pipeline of parser components. Therefore, as long as a component
> > > can be written that generates the appropriate XNI "events", then
> > > it can be used to emit SAX events, build DOM trees, or anything
> > > else that you can think of.
> > >
> > > So, as a fun little exercise, I have written a basic HTML parser
> > > using XNI. It consists of an HTML scanner component that can scan
> > > HTML files and generate XNI events and a tag balancing component.
> > > The tag balancer cleans up the events produced by the scanner,
> > > balancing mismatched tags and adding tags where necessary. And
> > > it does all of this in a streaming manner to minimize the amount
> > > of memory required.
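The tag balancing described above can be illustrated with a stack of open elements. This is not the actual NekoHTML code, just a sketch of the idea: end tags close any intervening unclosed elements, and anything still open at the end is closed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Stack;

// Illustrative tag balancer: given a stream of start/end tag names, emits a
// balanced sequence by closing any elements left open when a mismatched or
// final end tag arrives. (Not the actual NekoHTML implementation.)
public class TagBalancer {
    static List balance(String[] events) {
        Stack open = new Stack();
        List out = new ArrayList();
        for (int i = 0; i < events.length; i++) {
            String e = events[i];
            if (e.startsWith("/")) {
                String name = e.substring(1);
                // Close intervening unclosed elements first.
                while (!open.isEmpty() && !open.peek().equals(name)) {
                    out.add("/" + open.pop());
                }
                if (!open.isEmpty()) {
                    open.pop();
                    out.add(e);
                } // else: stray end tag, drop it
            } else {
                open.push(e);
                out.add(e);
            }
        }
        while (!open.isEmpty()) {
            out.add("/" + open.pop()); // close anything still open
        }
        return out;
    }

    public static void main(String[] args) {
        // <html><body><b>text</body> -- the unclosed <b> gets closed for us
        String[] events = {"html", "body", "b", "/body", "/html"};
        System.out.println(balance(events));
    }
}
```

Because it only keeps a stack of currently open elements, this works in a streaming fashion, matching the low-memory design described above.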
> > >
> > > Since I wrote the HTML parser as an example of using XNI and
> > > because the code is considered alpha quality (but it seems to
> > > work quite well, actually!), I am posting the code with a very
> > > limited license. Even though it contains the complete source
> > > code for the HTML parser, the license only allows the user to
> > > experiment but gives no right to actually use the code in a
> > > product.
> > >
> > > If the source isn't "free" or "open", why release it at all?
> > > I want to get an idea of what people think of the code first.
> > > Then, if there's enough interest, I would like to either donate
> > > the code to the Xerces-J project or make it available elsewhere
> > > under a true open source license.
> > >
> > > So, if you've been looking for a way to parse HTML documents
> > > please try out the HTML parser and let me know what you think.
> > > There should be enough information in the documentation to get
> > > you started. Check out the "NekoHTML" project listed on my
> > > Apache web site: http://www.apache.org/~andyc/
> > >
> > > Have fun!
> > >
> > > --
> > > Andy Clark * andyc@apache.org
> > >
> > >
> ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: xerces-j-dev-unsubscribe@xml.apache.org
> > > For additional commands, e-mail: xerces-j-dev-help@xml.apache.org
> > >
> >



--
To unsubscribe, e-mail:
<mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail:
<mailto:lucene-dev-help@jakarta.apache.org>


Re: Parsing PDF documents [ In reply to ]
I found that you can use the Etymon PJ classes
(http://www.etymon.com/pj/) and extract the text from PDF documents with
very little effort. The advantage with the Etymon classes is there is no
need for COM objects. It worked extremely well for the majority of
documents; at the very worst, some documents would extract all the text with
some of it out of order. (That has more to do with the layout of the document
than anything else.)
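The out-of-order text mentioned above follows from how PDF stores text: content streams emit positioned text-showing operators (such as `Tj`), so extraction order follows the stream rather than the visual layout. Below is a toy sketch of pulling literal strings out of such a stream; it is grossly simplified, since real PDFs add `TJ` arrays, escape sequences, font encodings, compression, and encryption.

```java
import java.util.ArrayList;
import java.util.List;

// Toy extractor: pulls the literal strings that precede "Tj" operators out of
// a (decompressed, unencrypted) PDF-like content stream. Real PDF text
// extraction must also handle TJ arrays, escapes, and font encodings.
public class TjExtractor {
    static List extractStrings(String stream) {
        List out = new ArrayList();
        int i = 0;
        while ((i = stream.indexOf('(', i)) != -1) {
            int j = stream.indexOf(')', i + 1);
            if (j == -1) break;
            String rest = stream.substring(j + 1).trim();
            if (rest.startsWith("Tj")) {        // only keep shown text
                out.add(stream.substring(i + 1, j));
            }
            i = j + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        String stream = "BT /F1 12 Tf 72 720 Td (Hello) Tj (World) Tj ET";
        System.out.println(extractStrings(stream)); // prints "[Hello, World]"
    }
}
```

Reordering the extracted strings to match the visual layout would require tracking the `Td`/`Tm` positioning operators as well, which is where most extractors fall down.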

On that note, I have started working on a more-effective (and efficient)
set of classes to extract text from PDF docs. The plan was to contribute the
classes to this community and build on the functionality over time. The
process seems to be pretty straightforward and I hope to complete the first
version in the near future.

In the interim, if anyone would like my Etymon "implementation" I'll be
happy to send off the code, provided whoever requests it is aware it was
slapped together quickly as a proof of concept and could/should be tightened up
a LOT. The set of classes I'm currently working on addresses a lot of the
limitations that are visible in that implementation. (It would probably
suffice to say it's an example of how to use the PJ classes to extract the
text from a PDF doc.)

Cheers

Robert MacMillan

On 2/16/02 9:59 PM, "Ivaylo Zlatev" <IZlatev@entigen.com> wrote:

>
> If you want to parse PDF documents, the best approach would be to use
> the Adobe IFilter for PDF, which is a COM component. You will need to
> write a java client, which interacts with that COM component.
> I believe it is easilly doable, but I have never done anything like
> this.
> It's a very interesting project, though.
> Also, you will have to perform the pdf-text conversion on a windows
> machine.
>
> http://www.adobe.com/support/downloads/detail.jsp?ftpID=1276
>
> http://msdn.microsoft.com/library/default.asp?url=/library/en-us/indexsr
> v/ixrefint_9sfm.asp
>
>
> Regards,
> Ivaylo Zlatev
>
>
>
> -----Original Message-----
> From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com]
> Sent: Saturday, February 16, 2002 7:15 AM
> To: Lucene Developers List
> Subject: RE: HTMLParser
>
>
> Hm, I thought this place would have a PDF parser, but it does not.
> It does seem to have a RTF parser:
> http://cobase-www.cs.ucla.edu/pub/javacc/
>
> Perhaps some of these things can be adopted by Lucene, people could
> contribute Java classes for interacting with specific parsers, and all
> that could then be included in Lucene to work together with those
> DocumentHandlers mentioned a few days ago.
>
> Otis


Re: Parsing PDF documents [ In reply to ]
Robert,
If you supply your code, I'll add it to the contributions area.
It would be great to have some code that converts the PDF
directly to a Lucene Document.

--Peter
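The PDF-to-Lucene-Document glue requested above might look like the sketch below. The `Document` and `Field` classes here are minimal stand-ins so the example is self-contained; against Lucene itself one would use `org.apache.lucene.document.Document` and the `Field.Text(name, value)` factory instead.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal stand-ins for org.apache.lucene.document.Document / Field so the
// sketch compiles on its own; with Lucene on the classpath, the real classes
// (and Field.Text(name, value)) would be used instead.
class Field {
    final String name, value;
    Field(String name, String value) { this.name = name; this.value = value; }
}

class Document {
    private final List fields = new ArrayList();
    void add(Field f) { fields.add(f); }
    String get(String name) {
        for (int i = 0; i < fields.size(); i++) {
            Field f = (Field) fields.get(i);
            if (f.name.equals(name)) return f.value;
        }
        return null;
    }
}

public class PdfToDocument {
    // Wraps extracted text and its source path into an indexable Document.
    static Document toDocument(String path, String extractedText) {
        Document doc = new Document();
        doc.add(new Field("path", path));
        doc.add(new Field("contents", extractedText));
        return doc;
    }

    public static void main(String[] args) {
        Document doc = toDocument("manual.pdf", "extracted text here");
        System.out.println(doc.get("path")); // prints "manual.pdf"
    }
}
```

Any extractor (PJ-based or otherwise) that yields a text string could be dropped in front of `toDocument`.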


On 2/16/02 8:36 PM, "Robert MacMillan" <macmillanr@rogers.com> wrote:

>
> I found that you can use the Etymon PJ classes
> (http://www.etymon.com/pj/) and extract the text from PDF documents with
> very little effort. The advantage with the Etymon classes is there is no
> need for COM objects.
> [...]
>
> Robert MacMillan


Re: Parsing PDF documents [ In reply to ]
Peter,

I put up a link to the PJ-example source code at
http://www.omniInvestments.ca/gnu_pdf/ . I wasn't sure which you were
referring to: the PDF classes I'm currently working on or the PJ example.

Time permitting, I'll have a first tested round of my classes completed
by next weekend. That said, I wasn't planning on taking it to the stage of
converting a PDF directly into a Lucene Document, but it's an interesting
thought. The immediate problem I see is that most people don't properly
title their documents, for example. For that reason alone, I figured it might
be best to provide just an interface for extracting the wanted data from the
PDF document and let the developer decide what to do with it.
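The extract-and-let-the-developer-decide interface sketched below is one way this could look; every name here is hypothetical, and the fallback addresses the point above that authors rarely set a proper document title.

```java
// Hypothetical extraction contract: the parser hands back raw pieces and the
// caller decides how to index them. All names here are illustrative.
interface PdfExtractor {
    String getText();
    String getTitle(); // may return null: PDF authors rarely set titles
}

public class TitleFallback {
    // Since titles are often missing, fall back to the file name.
    static String resolveTitle(PdfExtractor e, String fileName) {
        String t = e.getTitle();
        return (t != null && t.length() > 0) ? t : fileName;
    }

    public static void main(String[] args) {
        PdfExtractor untitled = new PdfExtractor() {
            public String getText() { return "body text"; }
            public String getTitle() { return null; }
        };
        System.out.println(resolveTitle(untitled, "report.pdf")); // prints "report.pdf"
    }
}
```

Keeping policy decisions like this out of the extractor itself is exactly what the interface approach buys.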

Any ideas?

Cheers

Robert MacMillan

On 2/17/02 10:56 PM, "Peter Carlson" <carlson@bookandhammer.com> wrote:

> Robert,
> If you supply your code I'll add it the contributions area.
> It would be great to have some code that already already converts the PDF
> directly to a Lucene Document.
>
> --Peter

