Mailing List Archive: Re: indexing PDF files

Re: indexing PDF files

petite_abeille at mac

May 1, 2002, 12:15 AM

Post #1 of 10 (1986 views)

On Tuesday, April 30, 2002, at 10:46 PM, Otis Gospodnetic wrote:

> Hm, this should be a FAQ.

Maybe it should... ;-)

> Check Lucene contributions page, there are some starting points there,

Well, this seems to be a very popular request... In fact I need
something like that also. Unfortunately, there seems to be no
authoritative answer as far as converting pdf files to text in a pure
Java environment... Maybe I'm missing something here as usual?

Also, on a related note, what would be a good approach to convert any
random document into pdf? I was thinking to have a two steps process for
document indexing in Lucene:

- First, convert everything to pdf (with Acrobat or something)
- Second, convert pdf to text and index it.

Any practical suggestions about how to do that in a pure Java
environment very welcome.
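
As a rough illustration of the two-step idea above, here is a minimal sketch in plain Java. Everything in it is an assumption made for illustration: `TwoStepIndexer`, the `toPdf` and `pdfToText` converters, and the toy in-memory index are hypothetical stand-ins, not Lucene or Acrobat APIs. It only shows how the two conversion steps would chain before indexing.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.function.Function;

public class TwoStepIndexer {
    // Step 1 stand-in (hypothetical): converts any source document into PDF bytes.
    private final Function<byte[], byte[]> toPdf;
    // Step 2 stand-in (hypothetical): converts PDF bytes into plain text.
    private final Function<byte[], String> pdfToText;
    // Toy inverted index standing in for a real index: term -> document names.
    private final Map<String, Set<String>> index = new TreeMap<>();

    public TwoStepIndexer(Function<byte[], byte[]> toPdf,
                          Function<byte[], String> pdfToText) {
        this.toPdf = toPdf;
        this.pdfToText = pdfToText;
    }

    // Runs both conversion steps, then records each lower-cased term.
    public void add(String name, byte[] original) {
        String text = pdfToText.apply(toPdf.apply(original));
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(name);
            }
        }
    }

    // Returns the names of all documents containing the given term.
    public Set<String> search(String term) {
        return index.getOrDefault(term.toLowerCase(Locale.ROOT),
                                  Collections.emptySet());
    }
}
```

The hard part, of course, is what goes inside the two converters; the sketch leaves both pluggable so that any library could be swapped in.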

Thanks :-)

PA.

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>

Re: indexing PDF files

carlson at bookandhammer

May 1, 2002, 6:50 AM

Post #2 of 10 (1959 views)

I don't know what they have to offer, but I think Adobe has something.

Here is something I just found on the topic on Adobe's site.

How can I license Acrobat Viewer to distribute with my own products or to
use in my custom Java development? How much will it cost to license?
Adobe Acrobat Viewer can be licensed for free. Refer to the End User License
Agreement for more information.

This is just the viewer, but you can search for words in the Reader product (I
don't know what the Viewer is).

--Peter


Re: indexing PDF files

otis_gospodnetic at yahoo

May 1, 2002, 8:41 AM

Post #3 of 10 (1955 views)

> > Hm, this should be a FAQ.
>
> Maybe it should... ;-)

It is now.

> > Check Lucene contributions page, there are some starting points there,
>
> Well, this seems to be a very popular request... In fact I need
> something like that also. Unfortunately, there seems to be no
> authoritative answer as far as converting pdf files to text in a pure
> Java environment... Maybe I'm missing something here as usual?
>
> Also, on a related note, what would be a good approach to convert any
> random document into pdf? I was thinking to have a two steps process for
> document indexing in Lucene:
>
> - First, convert everything to pdf (with Acrobat or something)
> - Second, convert pdf to text and index it.
>
> Any practical suggestions about how to do that in a pure Java
> environment very welcome.

Wouldn't you want to convert to XML instead and use XSLT to transform
the XML representation to any desired format by just applying a style
sheet?
Sounds like less work with bigger document type coverage.
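
To make the XML-plus-stylesheet suggestion concrete, here is a minimal sketch using the standard `javax.xml.transform` API. The element names `title` and `para` and the class name `XmlToText` are assumptions chosen for illustration; a real document type would need its own stylesheet. The stylesheet emits only the text of those elements, one per line, which is roughly what an indexer wants.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XmlToText {
    // Hypothetical stylesheet: prints the text of <title> and <para> elements,
    // one per line, and drops every other text node.
    static final String TEXT_ONLY_XSLT =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='title|para'>"
      + "<xsl:value-of select='normalize-space(.)'/>"
      + "<xsl:text>&#10;</xsl:text>"
      + "</xsl:template>"
      + "<xsl:template match='text()'/>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet to an XML string and returns the plain-text result.
    public static String toText(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(TEXT_ONLY_XSLT)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalArgumentException("XSLT transformation failed", e);
        }
    }
}
```

Swapping in a different stylesheet would produce HTML or another output format from the same XML source, which is the flexibility being argued for here.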

Otis


Re: indexing PDF files

petite_abeille at mac

May 3, 2002, 2:35 AM

Post #4 of 10 (1961 views)

On Wednesday, May 1, 2002, at 05:41 PM, Otis Gospodnetic wrote:

> Wouldn't you want to convert to XML instead and use XSLT to transform
> the XML representation to any desired format by just applying a style
> sheet?
> Sounds like less work with bigger document type coverage.

Sounds good... But what does it mean? I'm not that familiar with any of
the XML, XSLT hype so I don't really understand what you are getting
at... I just want to convert any type of document to text for indexing
purpose... I'm not planning to do anything else with it... However,
converting everything to PDF as a first step allows you to provide a
"preview" of any documents even if you happen not to understand the
original format (eg MS Office)...

PA


Re: indexing PDF files

Praveen.Moturu at cnalife

May 3, 2002, 6:16 AM

Post #5 of 10 (1953 views)

Good morning to you all. Can I assume that none of the people on the Lucene user
group has implemented indexing of PDF documents using Lucene? If someone
has, please help me by providing the solution.

Thanks

Praveen Moturu


Re: indexing PDF files

eliot at isogen

May 3, 2002, 6:59 AM

Post #6 of 10 (1961 views)

"Moturu,Praveen" wrote:
>
> Good Morning to you all. Can I assume none of the people on the lucene user
> group had implemented indexing a pdf document using lucene. If some one
> has.. Please help me by providing the solution.

You can try using Etymon's PJ library (www.etymon.com). But be aware
that the code as provided does not support some features of PDF and has
some bugs that prevent it from reading some PDFs.

Note also that there are some inherent problems with full-text indexing
of PDFs, namely that the word order in the PDF does not necessarily
reflect its reading order (for example, in two-column layouts), so if
your tokenizer is doing phrase analysis it may produce incorrect
results. You can see this by doing a multi-word search in Acrobat Reader
on a two-column document. It can also be difficult to accurately
determine word boundaries because of the way that PDF can represent text
strings as sequences of characters and placement instructions. The
Adobe-provided C libraries have largely solved this problem but the PJ
library does not--you will have to write your own algorithms to reduce
text sequences with explicit kerning instructions into meaningful
tokens. Not impossible but takes a little doing.
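
To illustrate the word-boundary problem, here is a minimal sketch of one possible heuristic. In a PDF content stream, a text-showing array mixes string fragments with numeric position adjustments given in thousandths of a text-space unit, and a negative number pushes the next glyph to the right. The sketch assumes the array has already been parsed into Java `String` and `Number` objects; the class name `KerningJoiner` and the `WORD_GAP` threshold are hypothetical, and a real threshold would depend on the font and its size.

```java
import java.util.ArrayList;
import java.util.List;

public class KerningJoiner {
    // Hypothetical threshold in thousandths of a text-space unit: a rightward
    // shift at least this large is treated as a word gap, not as kerning.
    static final double WORD_GAP = 200.0;

    // Joins the string fragments of one text-showing array and splits the
    // result into tokens, inserting a space wherever the gap is wide enough.
    public static List<String> tokens(List<Object> textArray) {
        StringBuilder line = new StringBuilder();
        for (Object item : textArray) {
            if (item instanceof String) {
                line.append((String) item);
            } else if (item instanceof Number) {
                // Negative adjustments move the next glyph to the right.
                if (((Number) item).doubleValue() <= -WORD_GAP) {
                    line.append(' ');
                }
            }
        }
        List<String> out = new ArrayList<>();
        for (String token : line.toString().trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(token);
            }
        }
        return out;
    }
}
```

This handles only gaps inside a single array. Fragments positioned by separate operators, and the reading-order problem in multi-column layouts, need coordinate-based logic on top.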

If you have money to spend you could license the Adobe PDF libraries and
create a Java binding for them. It does not appear that Adobe has any
plans to provide a Java library for accessing PDFs, free or otherwise.

However, implementing a Java PDF reader would not be too hard--I started
trying to implement one just to see how hard it would be and got as far
as being able to get page objects by page number after an intense
weekend's work [unfortunately my employment contract prevents me from
creating open-source software without explicit approval and I didn't
want to create a PDF library that wasn't open source, so I haven't done
any more work on it yet]. The PDF spec (www.pdfzone.com) is pretty
clear, although the PDF format is pretty convoluted (lots of byte
offsets and such). But once you get the basic infrastructure in place
for parsing out specific objects, the rest of it is just tedious parser
implementation--there are scads of different field types once you get
down to text streams.
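
As an example of the byte-offset bookkeeping involved, here is a minimal sketch of the first step a reader usually takes: a PDF file ends with the keyword `startxref`, followed by the byte offset of the cross-reference table, followed by `%%EOF`. The class name `StartXref` and the 1024-byte search window are assumptions for illustration; this is not a full parser and ignores complications such as incremental updates.

```java
import java.nio.charset.StandardCharsets;

public class StartXref {
    // Returns the byte offset that follows the last "startxref" keyword
    // near the end of the file.
    public static long xrefOffset(byte[] pdf) {
        // Only the tail is inspected; the 1024-byte window is an assumption.
        int tailStart = Math.max(0, pdf.length - 1024);
        // ISO-8859-1 maps every byte to exactly one char, so no bytes are lost.
        String tail = new String(pdf, tailStart, pdf.length - tailStart,
                                 StandardCharsets.ISO_8859_1);
        int keyword = tail.lastIndexOf("startxref");
        if (keyword < 0) {
            throw new IllegalArgumentException("no startxref keyword found");
        }
        int start = keyword + "startxref".length();
        while (start < tail.length() && Character.isWhitespace(tail.charAt(start))) {
            start++;
        }
        int end = start;
        while (end < tail.length() && Character.isDigit(tail.charAt(end))) {
            end++;
        }
        if (end == start) {
            throw new IllegalArgumentException("no offset after startxref");
        }
        return Long.parseLong(tail.substring(start, end));
    }
}
```

From that offset a reader can load the cross-reference table, which in turn gives the byte offset of every object in the file, including the page objects.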

Adding the business logic to figure out where things are on the page
would be more involved--you'd have to implement Adobe's layout logic.
However, you need this functionality in order to correlate PDF
annotations (links, bookmarks, notes) to the page objects they relate
to--it's all done with bounding boxes.
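
The bounding-box correlation itself can be sketched simply, assuming the layout step has already produced a rectangle for each run of text. The names `AnnotationMatcher`, `Box` and `TextRun` are hypothetical; rectangles are given as lower-left and upper-right corners, the way PDF expresses them.

```java
import java.util.ArrayList;
import java.util.List;

public class AnnotationMatcher {
    /** Axis-aligned rectangle: lower-left (llx, lly), upper-right (urx, ury). */
    public static final class Box {
        final double llx, lly, urx, ury;

        public Box(double llx, double lly, double urx, double ury) {
            this.llx = llx;
            this.lly = lly;
            this.urx = urx;
            this.ury = ury;
        }

        // True when the two rectangles share some interior area.
        boolean intersects(Box other) {
            return llx < other.urx && other.llx < urx
                && lly < other.ury && other.lly < ury;
        }
    }

    /** A piece of page text together with the rectangle it occupies. */
    public static final class TextRun {
        final String text;
        final Box box;

        public TextRun(String text, Box box) {
            this.text = text;
            this.box = box;
        }
    }

    // Returns the text of every run whose box overlaps the annotation's box.
    public static List<String> textUnder(Box annotation, List<TextRun> runs) {
        List<String> hits = new ArrayList<>();
        for (TextRun run : runs) {
            if (annotation.intersects(run.box)) {
                hits.add(run.text);
            }
        }
        return hits;
    }
}
```

Computing the text rectangles in the first place is the involved part, since it means tracking the text matrix and font metrics through the content stream.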

Cheers,

Eliot
--
W. Eliot Kimber, eliot@isogen.com
Consultant, ISOGEN International

1016 La Posada Dr., Suite 240
Austin, TX 78752 Phone: 512.656.4139


Re: indexing PDF files

petite_abeille at mac

May 3, 2002, 7:57 AM

Post #7 of 10 (1959 views)

On Friday, May 3, 2002, at 03:16 PM, Moturu,Praveen wrote:

> Can I assume none of the people on the lucene user group had
> implemented indexing a pdf document using lucene.

Who knows...?!? In any case, it's not public knowledge...

> If some one has.. Please help me by providing the solution.

I used to believe in Santa Claus also... ;-)

All that said, there seems to be a real demand to do something about pdf
to text conversion (in Java preferably). I'm willing to invest some time
and brain cells to nail it down, but I'm not sure where to start...

I'm aware of the PJ library, but it's really a pig as far as resources
go. Anything else?

Any (concrete) pointer appreciated.

Thanks.

PA.


Re: indexing PDF files

kelvin at relevanz

May 4, 2002, 1:28 AM

Post #8 of 10 (1950 views)

You might want to take a look at WebSearch http://www.i2a.com/websearch/. It
has an _ok_ system going with respect to PDFs. PDFGo supports viewing of PDF
but a guy I contacted there says there's no current support for text
extraction but that he's "planning to do it".

Definitely agreed on the PJ resources bit. Doesn't really scale well in
terms of PDF file size.

If you haven't already seen the post, I once did a cursory examination of
the options for extracting text from PDF files via Java and the limitations
of the approaches.
http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg00280.html

The Etymon lib is GPL'ed, so I guess that's a nice place to start. As far as
the libs I've seen so far, most of them are really concerned with the
display and manipulation of PDF pages. Since we're looking for something
less complex (i.e. text extraction), maybe it's not so bad. I've spent a bit
of time in this area before so feel free to email me offline about this. Not
sure how much help I can be though.


Re: indexing PDF files

May 5, 2002, 5:59 PM

Post #9 of 10 (1964 views)

I think most of the PDF creation knowledge using Java resides in the iText
and FOP projects, both open source.

It would seem that Java PDF-writing code would be a good place to start on
Java PDF-reading code.

Just a thought.


Re: indexing PDF files

puffmail at darksleep

May 7, 2002, 6:52 PM

Post #10 of 10 (1960 views)

> On Wednesday, May 1, 2002, at 05:41 PM, Otis Gospodnetic wrote:
> >Wouldn't you want to convert to XML instead and use XSLT to transform
> >the XML representation to any desired format by just applying a style
> >sheet?
> >Sounds like less work with bigger document type coverage.

And then, On Fri, May 03, 2002 at 11:35:10AM +0200, petite_abeille wrote:
> Sounds good... But what does it mean? I'm not that familiar with any of
> the XML, XSLT hype so I don't really understand what you are getting
> at... I just want to convert any type of document to text for indexing
> purpose... I'm not planning to do anything else with it... However,
> converting everything to PDF as a first step allow you to provide a
> "preview" of any documents even if you happen not to understand the
> original format (eg MS Office)...

What Otis is getting at is that, while, yes, normalizing all
docs to one format before indexing them is probably a good idea, it
may also be a good idea to choose a target format other than PDF.
XML is probably a good format for two simple
reasons:

- it's becoming the de facto standard for data exchange, including
  numerous document development, delivery and management systems,

- there are lots and lots of tools out there, particularly in Java
  and in open source, and more coming every day, for working with XML.

PDF is a format designed for presentation in general and
particularly for presenting print documents on screen. Most of the
use I've seen of PDF in the years since it was introduced is as a
portable printable file format. No need for PostScript printers or a
copy of Microsoft Word to print the file, just get the small, free,
easily downloaded (and already installed in most browsers) Acrobat
Reader. XML is a format designed for conversion, manipulation and
transformation and is in general much more heavily supported in the
programming world.

A good example in this case might be the Apache FOP project
(http://xml.apache.org/fop/), which can generate PDF from XML. This
is in general a straightforward task; searching Google for "convert pdf
xml" turns up tons of links on how to convert from XML to PDF, but
none on how to convert from PDF to XML.

Steven J. Owens
puff@darksleep.com
