Mailing List Archive

reading PDF using Python [Q]
Hi!

I need to read some PDF files and look through them for some keywords I
have in a list. Is there a python module to read inside a PDF file? If there
isn't one, is there one in C that you may be aware of?

TIA

/B

Bruno Mattarollo <bruno@gaiasur.com.ar>
... proud to be a PSA member <http://www.python.org/psa>
reading PDF using Python [Q] [ In reply to ]
On Tue, May 04, 1999 at 07:05:01PM +0000, Bruno Mattarollo wrote:
> Hi!
>
> I need to read some PDF files and look through them for some keywords I
> have in a list. Is there a python module to read inside a PDF file? If there
> isn't one, is there one in C that you may be aware of?
>
> TIA
>
> /B
>
> Bruno Mattarollo <bruno@gaiasur.com.ar>
> ... proud to be a PSA member <http://www.python.org/psa>
>
>
> --
> http://www.python.org/mailman/listinfo/python-list

Have you looked at the Ghostscript manual? I'm sure a .pdf to .txt
converter is available somewhere, related to Ghostscript or the TeX tools.
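
A rough sketch of that route, for what it's worth: convert the PDF to plain
text with an external tool, then scan the text for the keyword list. The
function names are made up for illustration, and the converter command is an
assumption: it relies on a current Ghostscript build that includes the
"txtwrite" output device. Any other PDF-to-text converter could be swapped in.

```python
import subprocess

# Assumption: a Ghostscript build with the "txtwrite" device is on the PATH.
GS_COMMAND = ["gs", "-q", "-dNOPAUSE", "-dBATCH",
              "-sDEVICE=txtwrite", "-sOutputFile=-"]


def pdf_to_text(pdf_path):
    """Run the external converter and return whatever text it prints."""
    result = subprocess.run(GS_COMMAND + [pdf_path],
                            capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")


def find_keywords(text, keywords):
    """Return the keywords that occur in text, ignoring case."""
    lowered = text.lower()
    return [word for word in keywords if word.lower() in lowered]
```

Typical use would be find_keywords(pdf_to_text("some.pdf"), my_keywords).
This is a plain substring match, so "cat" also matches "catalog".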

William
reading PDF using Python [Q] [ In reply to ]
This site has a C PDF lib.

http://www.ifconnection.de/~tm/


>Hi!
>
> I need to read some PDF files and look through them for some keywords I
>have in a list. Is there a python module to read inside a PDF file? If there
>isn't one, is there one in C that you may be aware of?
>
reading PDF using Python [Q] [ In reply to ]
I have been playing with parsing pdf files in python. The format of .pdf
is documented on Adobe's web site. If it weren't for the encryption and
compression options you could simply work on the files directly.


Cheers,



Nick.
reading PDF using Python [Q] [ In reply to ]

Nick Moon wrote:
>
> I have been playing with parsing pdf files in python. The format
> of .pdf is documented on Adobe's web site.

Any useful URL?

> If it weren't for the encryption and compression options you could
> simply work on the files directly.

Do you know more about PDF encryption and compression?

Thanks,


Laurent.
reading PDF using Python [Q] [ In reply to ]
> > I have been playing with parsing pdf files in python. The format
> > of .pdf is documented on Adobe's web site.
>
> Any useful URL?

Try the Adobe site, www.adobe.com, but you knew that. The document you want
is called 'Portable Document Format Reference Manual - Version 1.2',
though I think Acrobat v4 means there is now a version 1.3. It is,
surprisingly, in .pdf format and it's big - about 400 pages when printed.

It is pretty unreadable, but it does describe the file format in
mind-numbingly boring detail. The PDF format itself looks like the work of
several different people over several different years. Different bits of
the format seem to use rather different styles of data structures.


> Do you know more about PDF encryption and compression?

PDF files have a general structure, something like: a header, a list of
objects, a lookup table, and an end section. The lookup table is a list of
offsets to each object. It allows a program to open the file from the end
and then jump directly to each object as required. Updates can be appended
to a file without changing any of its existing contents. The updates
consist of some objects and a new lookup table and end section.
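
To make the "open the file from the end" idea concrete, here is a minimal
sketch in Python. It assumes the classic plain-text lookup table (the
"xref" section) whose offset is stored after the "startxref" keyword near
the end of the file. The function names and the tiny hand-built sample file
are made up for illustration. It reads only the newest table, so it does not
follow appended updates back to older tables, and it ignores the compressed
cross-reference streams that later PDF versions allow.

```python
def find_startxref(data):
    """Return the byte offset of the newest lookup (xref) table."""
    # Appended updates add further markers, so take the last one.
    idx = data.rfind(b"startxref")
    if idx < 0:
        raise ValueError("no startxref marker found")
    return int(data[idx + len(b"startxref"):].split()[0])


def read_xref_offsets(data):
    """Map object number -> byte offset for the in-use table entries."""
    lines = data[find_startxref(data):].splitlines()
    if lines[0].strip() != b"xref":
        raise ValueError("expected a plain-text xref table")
    offsets = {}
    i = 1
    while i < len(lines) and lines[i].strip() != b"trailer":
        # Subsection header: first object number, number of entries.
        first, count = map(int, lines[i].split())
        for n in range(count):
            offset, _generation, kind = lines[i + 1 + n].split()
            if kind == b"n":  # 'n' = in use, 'f' = free
                offsets[first + n] = int(offset)
        i += 1 + count
    return offsets


def build_sample():
    """Build a tiny PDF-like byte string with a correct lookup table."""
    header = b"%PDF-1.2\n"
    obj1 = b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    obj2 = b"2 0 obj\n(hello)\nendobj\n"
    body = header + obj1 + obj2
    xref = (b"xref\n0 3\n"
            b"0000000000 65535 f \n"
            + b"%010d 00000 n \n" % len(header)
            + b"%010d 00000 n \n" % (len(header) + len(obj1)))
    trailer = (b"trailer\n<< /Size 3 /Root 1 0 R >>\n"
               b"startxref\n%d\n%%%%EOF\n" % len(body))
    return body + xref + trailer
```

With sample = build_sample(), each offset returned by
read_xref_offsets(sample) points at the "N 0 obj" line of the matching
object, which is the jump the lookup table is there to provide.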

Actual page descriptions, which are probably what you want to look at, are
stored in a stream; the stream is then inside an object. Streams may be
written/read using various filters. A typical filter set would be:

ASCII85Decode / LZWDecode

which means the data has been compressed using LZW, and then the binary
output of LZW has been turned into ASCII (base 85).
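
The ASCII (base 85) half of that can be undone with Python's standard
library; a small sketch follows. The function name is made up, and it
assumes the stream data ends with the "~>" end-of-data marker. The LZW half
is not covered here, since the standard library has no LZW decoder.

```python
import base64


def decode_ascii85_stream(raw):
    """Turn ASCII85 stream data back into the bytes it encodes."""
    raw = raw.strip()
    if raw.endswith(b"~>"):
        raw = raw[:-2]  # drop the end-of-data marker
    # Whitespace and line breaks inside the data are ignored by default.
    return base64.a85decode(raw)
```

The result is still LZW-compressed when the filter set is the one shown
above, so a second decoding step is needed before any text can be searched.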


Cheers,



Nick.
reading PDF using Python [Q] [ In reply to ]

Thanks a lot for all that valuable info ;-)


Laurent
reading PDF using Python [Q] [ In reply to ]
In article <FBKAs7.y4@cix.compulink.co.uk>,
ncmoon@cix.compulink.co.uk ("Nick Moon") wrote:
> > > I have been playing with parsing pdf files in python. The format
> > > of .pdf is documented on Adobe's web site.
> >
> > Any useful URL?

I guess this is what you're looking for (including PS):

http://partners.adobe.com/supportservice/devrelations/technotes.html


> Try the adobe site. www.adobe.com but you knew that. The document you want
> is called 'Portable Document Format Reference Manual - Version 1.2'.
> Though I think Acrobat v4 means there is now a version 1.3. It's in
> surprisingly .pdf format and it's big - about 400 pages when printed.
>
> It is pretty unreadable, but it does describe the file format in mind
> numbingly boring detail. The pdf format itself, looks like the work of
> several different people over several different years. Different bits of
> the format seem to use rather different styles of data structures.

In fact, I don't think it's unreadable at all! I've seen much
more boring standards specifications already, like those of W3C.
The PDF specification explains quite nicely the general architecture
of a PDF document, the file format, etc. Give it a try!

It even inspired me to start yet-another-rainy-sunday-or-boring-
work-day project to create what might become the world's slowest
but most portable PDF parser... ;-)

Nothing-to-be-released-yet,

Dinu



--== Sent via Deja.com http://www.deja.com/ ==--
---Share what you know. Learn what you don't.---
reading PDF using Python [Q] [ In reply to ]
In article <3736A340.A056EA0C@ibm.net>,
l.szyster@ibm.net wrote:
>
> Nick Moon wrote:
> >
> > If it weren't for the encryption and compression options you could
> > simply work on the files directly.
>
> Do you know more about PDF encryption and compression?

You may want to check the paper below, which also discusses
PDF, if only briefly and mainly with regard to image
compression. I'd assume, though, that algorithms like LZW
are also used for text compression.

Regards,

Dinu



Information Architecture White Paper
IA-6801: Electronic Image Formats and
Compression Algorithms

http://www.lanl.gov/projects/ia/stds/ia680120.html

Abstract

This white paper discusses relative strengths and weaknesses
of image formats and corresponding compression algorithms
well suited to the sharing of simple graphical images across
multiple platforms. It does not address formats for specialized
applications such as GIS, CAD, three-dimensional modeling,
scientific visualization, atmospheric/environmental modeling,
or tools used to create/process graphics primarily intended
for printing.

The specific formats discussed are TIFF, JPEG/JFIF, GIF, PNG,
and PDF; the compression algorithms are CCITT 3 and 4, JPEG,
LZW, and PNG 0.



reading PDF using Python [Q] [ In reply to ]
> You may want to check the following paper below, also
> discussing PDF, if only briefly and basically about
> image compression. I'd assume, though, that algorithms
> like LZW are also used for textual compression.
>
[MCepl] Please do not use LZW in any application! It causes
trouble and (IMHO, and I am not a specialist) it has a worse compression
factor than PNG 0 compression. See
http://www.gnu.org/philosophy/gif.html and
http://lpf.ai.mit.edu/Patents/Gif/Gif.html for more information.
Moreover, it seems to me that even Adobe doesn't use LZW in its products
anymore, now that it supports zlib compression (called Flate in Adobe
documentation), even though it is still able to use LZW as well (for
compatibility with lower versions of PDF).
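
For streams that use the zlib-based filter (it appears in PDF files under
the name FlateDecode), Python's zlib module can do the decompression. A
rough sketch, with a made-up function name: it simply looks for everything
between "stream" and "endstream" and tries to inflate it. A real parser
would instead read the /Length and /Filter entries of each stream's
dictionary; the pattern match here is a shortcut and can be fooled by
binary data that happens to contain the word "endstream".

```python
import re
import zlib

# Everything between the "stream" keyword line and "endstream".
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(data):
    """Return the decompressed contents of every zlib-compressed stream."""
    results = []
    for match in STREAM_RE.finditer(data):
        inflater = zlib.decompressobj()
        try:
            # The line break before "endstream" ends up in unused_data.
            results.append(inflater.decompress(match.group(1)))
        except zlib.error:
            continue  # not zlib data: some other filter, or none
    return results
```

Each returned item is the raw page description (or other stream content),
which can then be searched for keywords.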

Matthew
reading PDF using Python [Q] [ In reply to ]
> In fact, I don't think it's unreadable at all! I've seen much
> more boring standards specifications already, like those of W3C.

Perhaps not unreadable, just an awful lot to plough through. I can't help
but feel that the pdf standard was developed by a variety of people at
rather different times. Different bits of the standard seem to incorporate
rather different styles of programming and use rather different data
structures.

Cheers,


Nick
reading PDF using Python [Q] [ In reply to ]
> > the format seem to use rather different styles of data structures.
>
> In fact, I don't think it's unreadable at all! I've seen much
> more boring standards specifications already, like those of W3C.
> The PDF specification explains quite nicely the general architecture
> of a PDF document, the file format, etc. Give it a try!

I agree (although I like the more established w3c stuff as well ;-)

I wrote a PDF parser in Perl for a contract a while back; manipulating the
basic structure of the PDF is remarkably easy, and I plan to play with it in
Python some day. I never got down to the level of working with the streams
themselves, which is what it sounds like is needed here.

Good luck,