Mailing List Archive: simple text file 'parsing' question

terocr at mysolution

Jun 18, 1999, 9:21 PM

Post #1 of 6 (714 views)

Here's my dilemma: a directory filled (200+) with small emails. My goal
is to strip all the headers and combine them into one file. I can read
all the files just fine and write them all to one file, but I cannot
discern how to strip the headers. The answer must be very simple, yet
I cannot see it. Can anyone give a few pointers on how to do it, or
what module might be best? Thank you.
Ken

simple text file 'parsing' question [ In reply to ]

nospam at bitbucket

Jun 18, 1999, 11:44 PM

Post #2 of 6 (699 views)

KP wrote in message <376B1AAC.19FE8BCE@mysolution.com>...
>Here's my dilemma: a directory filled (200+) with small emails. My goal
>is to strip all the headers and combine them into one file. I can read
>all the files just fine and write them all to one file, but I cannot
>discern how to strip the headers. The answer must be very simple, yet
>I cannot see it. Can anyone give a few pointers on how to do it, or
>what module might be best? Thank you.
>Ken

A raw email always has a blank line between the header and the body.
(To be pedantic, it should also have all its lines ending in CRLF.)
So you can read it in and find the gap by looking for 2 EOLs:

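f = open('c:\\apps\\eudora\\in.mbx', 'r')
text = f.read()        # renamed from 'all' so the builtin is not shadowed
f.close()
x = text.find('\n\n')  # the first blank line ends the header
if x == -1:
    body = ''          # no blank line found: headers only, no body
else:
    body = text[x+2:]
# append body to output file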
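
A minimal sketch of the same blank-line idea in Python 3 that tolerates both LF and CRLF line endings. The function name split_header_body is made up for illustration and is not from the original post. Note that a file opened in text mode in Python 3 usually has CRLF translated to LF already, so the CRLF branch mostly matters for data read in some other way.

```python
import re

def split_header_body(raw):
    """Return (header, body), split at the first empty line.

    Hypothetical helper: accepts LF or CRLF line endings.
    """
    # An empty line is two consecutive line endings.
    match = re.search(r"\r?\n\r?\n", raw)
    if match is None:
        # No blank line: treat everything as header, with an empty body.
        return raw, ""
    return raw[:match.start()], raw[match.end():]
```

Returning an empty body when no blank line exists avoids the silent off-by-one slice that a bare find() result of -1 would give.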
--
Phil Mayes pmayes AT olivebr DOT com

simple text file 'parsing' question [ In reply to ]

digitome at iol

Jun 19, 1999, 2:05 AM

Post #3 of 6 (711 views)

Unfortunately, Eudora's .mbx files are not consistent
in how a message starts. The sentinel string is always
the same. From memory it is something like
"From ???@@???"

The problem is that it sometimes occurs in the middle of a line.
As long as you allow the sentinel to occur anywhere on a line
and keep the bit to the left of the sentinel, you can skip from
there to the blank line -- it will be all headers.
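
The approach above could be sketched roughly as follows in Python 3. This is only an illustration: the sentinel default "From ???@???" is an assumption (the post quotes it from memory, so check a real .mbx file before relying on it), and split_messages is a made-up name.

```python
def split_messages(text, sentinel="From ???@???"):
    """Return the message bodies found in mailbox text.

    The sentinel value is an assumption; it may occur anywhere on a line.
    """
    chunks = text.split(sentinel)
    bodies = []
    # chunks[0] is whatever precedes the first sentinel, so skip it.
    for chunk in chunks[1:]:
        # From the sentinel to the first blank line is all headers.
        head, sep, body = chunk.partition("\n\n")
        bodies.append(body if sep else "")
    return bodies
```

Because the split happens on the sentinel itself, any text to the left of a mid-line sentinel stays with the previous message's body.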

BTW, two weeks ago a new programmer with two years
college joined my company. No experience in Python. No
experience in XML. Two weeks later he has:-

- A Python parser for Eudora .mbx mail archives that
  uses rfc822.py to tease out the headers
- An XML transformation script in Python
- Used Python reporting scripts to help create
  a DTD for rfc822 e-mail
- The beginnings of a down-translate to Folio Views
  in Python.

Does this language make programmers productive or what!!!!!

On Fri, 18 Jun 1999 23:44:11 -0700, "Phil Mayes"
<nospam@bitbucket.com> wrote:

>KP wrote in message <376B1AAC.19FE8BCE@mysolution.com>...
>>Here's my dilemma: a directory filled (200+) with small emails. My goal
>>is to strip all the headers and combine them into one file. I can read
>>all the files just fine and write them all to one file, but I cannot
>>discern how to strip the headers. The answer must be very simple, yet
>>I cannot see it. Can anyone give a few pointers on how to do it, or
>>what module might be best? Thank you.
>>Ken
>
>
>A raw email always has a blank line between the header and the body.
>(To be pedantic, it should also have all its lines ending in CRLF.)
>So you can read it in and find the gap by looking for 2 EOLs:
>
>import string
>f = open('c:\\apps\\eudora\\in.mbx', 'r')
>all = f.read()
>x = string.find(all, '\n\n')
>body = all[x+2:]
># append body to output file
>--
>Phil Mayes pmayes AT olivebr DOT com

simple text file 'parsing' question [ In reply to ]

mgushee at havenrock

Jun 20, 1999, 8:56 AM

Post #4 of 6 (709 views)

KP <terocr@mysolution.com> writes:

> Here's my dilemma: a directory filled (200+) with small emails. My goal
> is to strip all the headers and combine them into one file. I can read
> all the files just fine and write them all to one file, but I cannot
> discern how to strip the headers.

I have no expertise in this area, but I've been reading the "Internet
Data Handling" section of the Library Reference (Ch. 12 of the 1.5.2
edition), and it seems like there are several modules that might help
you. In particular, check out 'rfc822.'
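
For readers on Python 3: the rfc822 module was removed from the standard library, and the email package covers the same ground. The sketch below swaps in email.message_from_string to separate headers from body; the sample message and variable names are invented for illustration.

```python
import email

# Invented sample message: two headers, a blank line, then the body.
raw = "From: ken@example.com\nSubject: test\n\nHello there.\n"

msg = email.message_from_string(raw)

subject = msg["Subject"]   # header lookup is case-insensitive
body = msg.get_payload()   # for a non-multipart message, the body text
```

For a directory of small single-part emails, writing out get_payload() for each file would give the header-free text the original question asks for.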

Hope this helps.

Matt Gushee
Portland, Maine, USA
mgushee@havenrock.com

simple text file 'parsing' question [ In reply to ]

jam at newimage

Jun 20, 1999, 1:12 PM

Post #5 of 6 (702 views)

On Mon, Jun 21, 1999 at 12:56:07AM +0900, Matt Gushee wrote:
> KP <terocr@mysolution.com> writes:
>
> > Here's my dilemma: a directory filled (200+) with small emails. My goal
> > is to strip all the headers and combine them into one file. I can read
> > all the files just fine and write them all to one file, but I cannot
> > discern how to strip the headers.
>
> I have no expertise in this area, but I've been reading the "Internet
> Data Handling" section of the Library Reference (Ch. 12 of the 1.5.2
> edition), and it seems like there are several modules that might help
> you. In particular, check out 'rfc822.'
>
> Hope this helps.
>
> Matt Gushee
> Portland, Maine, USA
> mgushee@havenrock.com
>

I wrote a small piece of code that does *exactly* what you are describing.
It doesn't exactly strip the headers, but it parses the message using rfc822
and deals with it. You'll find it attached to this message. If for some
reason it doesn't come through, let me know, and I'll resend it.

regards,
Jeff
--
|| visit gfd <http://quark.newimage.com/>
|| psa member #293 <http://www.python.org/>
|| New Image Systems & Services, Inc. <http://www.newimage.com/>

[Attachment: importcola.py]

#!/usr/bin/env python

import os
import dircache
import mimetools

import colacanister

import getdate

from rfc822 import Message

_COLAROOT="/home/jam/projects/cola/cola.archive"
_COLABASEHREF="http://www.cs.helsinki.fi/%7Emjrauhal/linux/cola.archive/"

if __name__ == "__main__":
    l = dircache.listdir(_COLAROOT)
    print len(l)
    for item in l:
        p = os.path.join(_COLAROOT, item)
        if os.path.isdir(p):
            articles = dircache.listdir(p)
            for a in articles:
                if a[:5] != "cola." and a[:4] != "mjr.":
                    continue

                fp = open(os.path.join(p, a), "r")
                m = Message(fp, seekable=0)
                fp.close()

                if not m.has_key("subject"):
                    print "** message does not have subject line. skipped."
                    continue

                url = os.path.join(item, a)
                print "processing '%s'" % (url),
                if colacanister.get_cola_by_archiveurl(url) is None:
                    c = colacanister.colacanister()
                    c["cola_from"] = m["from"]

                    if m.has_key("date"):
                        c["cola_dateposted"] = getdate.getdate(m["date"])

                    c["cola_subject"] = m["subject"]
                    c["cola_archiveurl"] = url
                    c.insert()
                    print "added."
                else:
                    print "already archived."
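
The header handling in the attachment (skip messages without a subject, record the sender, parse the date if present) might look roughly like this in Python 3. This is a sketch, not the author's code: it uses the standard email package in place of rfc822, email.utils.parsedate_to_datetime in place of the author's own getdate module, and a plain dict in place of colacanister. The name summarize is hypothetical.

```python
import io
from email.parser import Parser
from email.utils import parsedate_to_datetime

def summarize(fp):
    """Return a dict of selected headers, or None if there is no subject."""
    # headersonly=True stops parsing at the blank line that ends the header.
    msg = Parser().parse(fp, headersonly=True)
    if msg["subject"] is None:
        return None
    record = {"from": msg["from"], "subject": msg["subject"]}
    if msg["date"] is not None:
        record["date"] = parsedate_to_datetime(msg["date"])
    return record

# Invented sample message for illustration.
sample = io.StringIO(
    "From: a@example.com\n"
    "Date: Fri, 18 Jun 1999 21:21:00 -0700\n"
    "Subject: hi\n"
    "\n"
    "body\n"
)
record = summarize(sample)
```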


simple text file 'parsing' question [ In reply to ]

Jun 21, 1999, 5:29 AM

Post #6 of 6 (703 views)

Matt Gushee <mgushee@havenrock.com> writes:

> KP <terocr@mysolution.com> writes:
>
> > Here's my dilemma: a directory filled (200+) with small emails. My goal
> > is to strip all the headers and combine them into one file. I can read
> > all the files just fine and write them all to one file, but I cannot
> > discern how to strip the headers.

Just remove everything up to and including the first blank line. That
represents the end of the header. For instance:

file = open(resultFileName,'w')

# ...

lines = open(name).readlines()
# readlines() keeps line endings, so a blank line is "\n", not ""
lines = lines[lines.index("\n")+1:]

file.writelines(lines)

# ...

file.close()
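
Putting the pieces together for the original question, a Python 3 sketch that strips the header from every file in a directory and writes the bodies to one output file. The names strip_header and combine are made up for illustration, and it assumes every regular file in the directory is one email and that the output file lives outside that directory.

```python
import os

def strip_header(text):
    """Drop everything up to and including the first blank line."""
    head, sep, body = text.partition("\n\n")
    # No blank line means the file is all header, so there is no body.
    return body if sep else ""

def combine(src_dir, out_path):
    """Write the bodies of all files in src_dir to out_path.

    Assumes out_path is not inside src_dir. Returns the number of
    files processed.
    """
    count = 0
    with open(out_path, "w") as out:
        for name in sorted(os.listdir(src_dir)):
            path = os.path.join(src_dir, name)
            if not os.path.isfile(path):
                continue  # skip subdirectories
            with open(path) as f:
                out.write(strip_header(f.read()))
            count += 1
    return count
```

Something like combine("mail", "all_bodies.txt") would then produce the single combined file; both paths here are placeholders.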

--

Magnus     Making no sound / Yet smouldering with passion
Lie        The firefly is still sadder / Than the moaning insect
Hetland                                  : Minamoto Shigeyuki