Mailing List Archive

HTML "sanitizer" in Python
Hi,

I am new to Python. I have an idea of a work-related project I want to do, and I was hoping some folks on this list might be able to help me realize it. I have Mark Lutz' _Programming Python_ book, and that has been a helpful orientation. I like his basic packer and unpacker scripts, but what I want to do is something in between that basic program and its later, more complex manifestations.

I am on a Y2K project with 14 manufacturing plants, each of which has an inventory of plant process components that need to be tested and/or replaced. I want to put each plant's current inventory on the corporate intranet on a weekly or biweekly basis. All the plant data is in an Access database. We are querying the data we need and importing into 14 MS Excel 97 spreadsheets. Then we are saving the Excel sheets as HTML. The HTML files bloat out with a near 100% increase in file size over the original Excel files. This is because the HTML converter in Excel adds all kinds of unnecessary HTML code, such as <FONT FACE="Times New Roman"> for every single cell in the table. Many of these tables have over 1000 cells, and this code, along with its accompanying closing FONT tag, adds up quickly. The other main unnecessary code is the ALIGN="left" attribute in <TD> tags (the default alignment _is_ left). The unnecessary tags are consistent and easy to identify, so it should be possible to write a routine that automates their removal.

I created a macro in Visual SlickEdit that automatically opens all these HTML files, finds and deletes all the tags that can be deleted, saves the changes and closes them. I originally wanted to do this in Python, and I would still like to know how, but time constraints prevented it at the time. Now I want to work on how to create a Python program that will do this. Can anyone help? Has anyone written anything like this in Python already that they can point me to? I would really appreciate it.

Again, the main flow of the program is:

>> Open 14 HTML files, all in the same folder and all with the .html extension.
>> Find certain character strings and delete them from the files. In one case (the <TD> tags) it is easier to find the whole tag with attributes and then _replace_ the original tag with a plain <TD>.
>> Save the files.
>> Close the files.
>> Exit the program.

More advanced options would be the ability for the user to set parameters for the program upon running it, to keep from hard-coding the find and replace parms.

OK, thanks to any help you can provide. I partly was turned on to Python by Eric Raymond's article, "How to Become a Hacker" (featured on /.). I use Linux at home, but this program would be for use on a Windows 95 platform at work, if that makes any difference. I do have the latest Python interpreter and editor for Windows here at work.

Yours truly,
Scott

Scott M. Stirling
Visit the HOLNAM Year 2000 Web Site: http://web/y2k
Keane - Holnam Year 2000 Project
Office: 734/529-2411 ext. 2327 fax: 734/529-5066 email: sstirlin@holnam.com
HTML "sanitizer" in Python [ In reply to ]
On Wed, Apr 28, 1999 at 12:49:55PM -0400, Scott Stirling wrote:
> Hi,
>
> I am new to Python. I have an idea of a work-related project I want
> to do, and I was hoping some folks on this list might be able to
> help me realize it. I have Mark Lutz' _Programming Python_ book,
> and that has been a helpful orientation. I like his basic packer
> and unpacker scripts, but what I want to do is something in between
> that basic program and its later, more complex manifestations.
>
> I am on a Y2K project with 14 manufacturing plants, each of which
> has an inventory of plant process components that need to be tested
> and/or replaced. I want to put each plant's current inventory on
> the corporate intranet on a weekly or biweekly basis. All the plant
> data is in an Access database. We are querying the data we need and
> importing into 14 MS Excel 97 spreadsheets. Then we are saving the
> Excel sheets as HTML. The HTML files bloat out with a near 100%
> increase in file size over the original Excel files. This is
> because the HTML converter in Excel adds all kinds of unnecessary
> HTML code, such as <FONT FACE="Times New Roman"> for every single
> cell in the table. Many of these tables have over 1000 cells, and
> this code, along with its accompanying closing FONT tag, add up
> quick. The other main, unnecessary code is the ALIGN="left"
> attribute in <TD> tags (the default alignment _is_ left). The
> unnecessary tags are consistent and easy to identify, and a routine
> should be writable that will automate the removal of them.
>
> I created a Macro in Visual SlickEdit that automatically opens all
> these HTML files, finds and deletes all the tags that can be
> deleted, saves the changes and closes them. I originally wanted to
> do this in Python, and I would still like to know how, but time
> constraints prevented it at the time. Now I want to work on how to
> create a Python program that will do this. Can anyone help? Has
> anyone written anything like this in Python already that they can
> point me too? I would really appreciate it.
>
> Again, the main flow of the program is:
>
> >> Open 14 HTML files, all in the same folder and all with the .html
> >> extension. Find certain character strings and delete them from
> >> the files. In one case (the <TD> tags) it is easier to find the
> >> whole tag with attributes and then _replace_ the original tag
> >> with a plain <TD>. Save the files. Close the files. Exit the
> >> program.

Hi Scott,

I shall assume that a <TD ...> tag occurs in one line. Try 'sed',
    for i in *.html
    do sed -e 's/<TD ALIGN="left">/<TD>/g' $i > /tmp/$i && mv /tmp/$i $i
    done
or, in Python,
    import string
    for s in open('...', 'r').readlines():
        s = string.replace(s, '<TD ALIGN="left">', '<TD>')
        print string.strip(s)

If a <TD ...> tag spans more than one line, then read the file in
whole, like
    s = open('...', 'r').read()

If the tag is not consistent, then you may have to use a regular
expression with the 're' module.
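
For the inconsistent case, here is a minimal sketch with the 're' module,
written for a current Python 3 interpreter rather than the one discussed in
this thread. The function name and the sample markup are made up for
illustration, and the pattern assumes that no attribute value contains a
literal '>'. It also throws away every attribute on the tag, not just ALIGN.

```python
import re

# Matches an opening <TD ...> tag with any attributes, in any letter case,
# even when the tag is split across lines. Assumes no attribute value
# contains a literal '>'.
TD_WITH_ATTRS = re.compile(r"<td\b[^>]*>", re.IGNORECASE)


def plain_td(html):
    """Replace every opening TD tag, whatever its attributes, with <TD>."""
    return TD_WITH_ATTRS.sub("<TD>", html)


sample = '<TR><TD ALIGN="left">pump</TD><td\n  align="LEFT" >valve</td></TR>'
print(plain_td(sample))
```

Closing tags are left alone because the pattern only matches text that
starts with "<td", not "</td".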

Hope this helps.
William


>
> More advanced options would be the ability for the user to set
> parameters for the program upon running it, to keep from hard-coding
> the find and replace parms.

To use command line parameters, like
    $ cleantd 'ALIGN="left"'
change to (with 'import sys' at the top)
    s = string.replace(s, '<TD %s>' % sys.argv[1], '<TD>')
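
As a rough sketch of the same idea in a current Python 3, where the string
method replaces the old string-module function: the name 'cleantd' comes
from the example above, but the second argument (the file to read) and the
function names are assumptions made for illustration.

```python
import sys


def clean_td(text, attrs):
    """Replace '<TD attrs>' with a plain '<TD>' throughout text."""
    return text.replace("<TD %s>" % attrs, "<TD>")


def main(argv):
    # argv[1] is the attribute text to strip; argv[2] is the file to read.
    if len(argv) != 3:
        print("usage: cleantd ATTRS FILE", file=sys.stderr)
        return 2
    with open(argv[2], "r") as source:
        sys.stdout.write(clean_td(source.read(), argv[1]))
    return 0
```

Ending the script with sys.exit(main(sys.argv)) would turn this into a
command that writes the cleaned file to standard output.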

>
> OK, thanks to any help you can provide. I partly was turned on to
> Python by Eric Raymond's article, "How to Become a Hacker" (featured
> on /.). I use Linux at home, but this program would be for use on a
> Windows 95 platform at work, if that makes any difference. I do
> have the latest Python interpreter and editor for Windows here at
> work.
>
> Yours truly,
> Scott
>
> Scott M. Stirling
> Visit the HOLNAM Year 2000 Web Site: http://web/y2k
> Keane - Holnam Year 2000 Project
> Office: 734/529-2411 ext. 2327 fax: 734/529-5066 email: sstirlin@holnam.com
>
>
> --
> http://www.python.org/mailman/listinfo/python-list
HTML "sanitizer" in Python [ In reply to ]
In article <s72703fc.021@holnam.com>,
"Scott Stirling" <SSTirlin@holnam.com> wrote:
> Hi,
>
> I am new to Python. I have an idea of a work-related project I want to do,
> and I was hoping some folks on this list might be able to help me realize it.
> I have Mark Lutz' _Programming Python_ book, and that has been a helpful
> orientation. I like his basic packer and unpacker scripts, but what I want
> to do is something in between that basic program and its later, more complex
> manifestations.
>
> I am on a Y2K project with 14 manufacturing plants, each of which has an
> inventory of plant process components that need to be tested and/or
> replaced. I want to put each plant's current inventory on the corporate
> intranet on a weekly or biweekly basis. All the plant data is in an Access
> database. We are querying the data we need and importing into 14 MS Excel 97
> spreadsheets. Then we are saving the Excel sheets as HTML. The HTML files
> bloat out with a near 100% increase in file size over the original Excel
> files. This is because the HTML converter in Excel adds all kinds of
> unnecessary HTML code, such as <FONT FACE="Times New Roman"> for every
> single cell in the table. Many of these tables have over 1000 cells, and
> this code, along with its accompanying closing FONT tag, add up quick.
> The other main, unnecessary code is the ALIGN="left" attribute in <TD>
> tags (the default alignment _is_ left). The unnecessary tags are
> consistent and easy to identify, and a routine should be writable that
> will automate the removal of them.
>
> I created a Macro in Visual SlickEdit that automatically opens all these
> HTML files, finds and deletes all the tags that can be deleted, saves the
> changes and closes them. I originally wanted to do this in Python, and I
> would still like to know how, but time constraints prevented it at the
> time. Now I want to work on how to create a Python program that will do
> this. Can anyone help? Has anyone written anything like this in Python
> already that they can point me too? I would really appreciate it.
>

Well, it wouldn't be that hard in Python to parse the HTML files and reformat
them in various ways. You can either go the route of straight text
substitution using regular expressions, or you could use htmllib to actually
parse the HTML files into a data structure, and then write them back out
again.
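
A sketch of the parsing route, with two loud assumptions: it targets a
current Python 3, where htmllib no longer exists, so html.parser stands in
for it; and it streams the document back out tag by tag rather than building
a data structure. The class and function names are invented for this example.

```python
from html import escape
from html.parser import HTMLParser


class Sanitizer(HTMLParser):
    """Re-emit a document, minus FONT tags and ALIGN="left" on TD tags."""

    def __init__(self):
        # convert_charrefs=False routes &amp; and &#160; to their own
        # handlers so they can be written back unchanged.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == "font":
            return  # drop the tag, keep whatever it wraps
        if tag == "td":
            kept = [(k, v) for k, v in attrs
                    if not (k == "align" and (v or "").lower() == "left")]
            if len(kept) != len(attrs):
                # Rebuild the tag from the attributes that survive.
                parts = ["td"]
                for k, v in kept:
                    parts.append(k if v is None else '%s="%s"' % (k, escape(v)))
                self.out.append("<%s>" % " ".join(parts))
                return
        self.out.append(self.get_starttag_text())  # original text, untouched

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag != "font":
            self.out.append("</%s>" % tag)  # note: emitted in lowercase

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def sanitize(html):
    parser = Sanitizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(sanitize('<TD ALIGN="left"><FONT FACE="Times New Roman">7 &amp; 8</FONT></TD>'))
```

Compared with plain text substitution, this copes with tags whose case,
spacing, or attribute order varies, at the cost of a good deal more code.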

However, may I suggest a different method?

You've got your original data in Access. There are several different ways to
talk to Access from Python. You could pull your data directly from Access
using Python and skip Excel altogether. And Python's got some great modules
for generating HTML. Heck, add CGI or Zope to the mix and you could generate
your inventory lists at the web server on the fly!
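
To give a feel for the HTML-generating half, here is a small sketch in
current Python 3. The database access is left out because no particular
library is named here, so the rows below are invented stand-ins for query
results, and the function name is hypothetical.

```python
from html import escape


def rows_to_html(headers, rows):
    """Render rows as a bare HTML table with no per-cell FONT or ALIGN noise."""
    lines = ["<TABLE>"]
    lines.append("<TR>" + "".join("<TH>%s</TH>" % escape(str(h)) for h in headers) + "</TR>")
    for row in rows:
        lines.append("<TR>" + "".join("<TD>%s</TD>" % escape(str(c)) for c in row) + "</TR>")
    lines.append("</TABLE>")
    return "\n".join(lines)


# Stand-in data; in practice these tuples would come from a query.
headers = ("Component", "Status")
rows = [("Kiln controller", "Tested"), ("Scale <A&B>", "Replace")]
print(rows_to_html(headers, rows))
```

Escaping each cell keeps characters such as < and & in the data from being
read as markup.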

Ok, I'll calm down now.

-Chris

-----------== Posted via Deja News, The Discussion Network ==----------
http://www.dejanews.com/ Search, Read, Discuss, or Start Your Own
HTML "sanitizer" in Python [ In reply to ]
There's a better (albeit non-Python) way.

Check out http://www.w3.org/People/Raggett/tidy/

Tidy will do wonderful things in terms of making HTML compliant with the
spec (closing tags, cleaning up the crud that Word makes, etc.) As a big
bonus, it will remove all <FONT> tags, etc, and replace them with CSS1 style
sheets. Wow.

It's C, and is also available with a Windows GUI (HTML-Kit) that makes a
pretty good HTML editor as well. On Unix, it's a command line utility, so
you can use it (clumsily) from a Python program.
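
Driving the command line utility from Python might look like the sketch
below, written for current Python 3. It assumes an executable named 'tidy'
on the PATH and a build that accepts the -clean, -quiet and -modify
switches; check your copy's help output before relying on any of them. The
function names are made up.

```python
import shutil
import subprocess


def tidy_command(path):
    """Build the command line for cleaning one file in place.

    Assumed flags: -clean (replace FONT and similar tags with CSS),
    -quiet (less chatter), -modify (rewrite the input file).
    """
    return ["tidy", "-clean", "-quiet", "-modify", path]


def run_tidy(path):
    """Run tidy on path; return its exit status, or None if tidy is absent."""
    if shutil.which("tidy") is None:
        return None
    # A nonzero status may only mean warnings, so it is returned to the
    # caller instead of being raised as an error.
    return subprocess.run(tidy_command(path)).returncode
```

Calling run_tidy once per file in a loop gives a simple batch mode.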

I suppose an extension could also be written; will look into this (or if
anyone does it, please tell me!)






HTML "sanitizer" in Python [ In reply to ]
Um, a vote of confidence here for tidy.

I've rewritten tidy to do several different specialized things.

I am no C hacker, and have been told it's 'awful' code, but I
sure had no problems with it.
just-another-2c-in-the-bucket-ly-yours

--jim

Mark Nottingham wrote:
>
> There's a better (albeit non-Python) way.
>
> Check out http://www.w3.org/People/Raggett/tidy/
>
> Tidy will do wonderful things in terms of making HTML compliant with the
> spec (closing tags, cleaning up the crud that Word makes, etc.) As a big
> bonus, it will remove all <FONT> tags, etc, and replace them with CSS1 style
> sheets. Wow.
>
> It's C, and is also available with a windows GUI (HTML-Kit) that makes a
> pretty good HTML editor as well. On Unix, it's a command line utility, so
> you can use it (clumsily) from a Python program.
>
> I suppose an extension could also be written; will look into this (or if
> anyone does it, please tell me!)
HTML "sanitizer" in Python [ In reply to ]
Thanks, Mark! That is a very cool tool. It will make a nice HTML editor for me here at work.

The only feature I immediately saw lacking (but maybe I missed it--I just downloaded it this AM) is the ability to record macros. For my Excel problem, I really need the ability to batch process the HTML files because there are 14 of them.

Anyway, this is a great reference. Thank you again.

Scott
>>> "Mark Nottingham" <mnot@pobox.com> 04/28 6:17 PM >>>
HTML "sanitizer" in Python [ In reply to ]
Will,

Thank you. So far you are the only person who has offered the kind of practical HOW-TO that I was mainly hoping for! This is not to disparage the many other helpful and interesting suggestions I have received.

I should reiterate that I have 14 fairly large HTML files that I want to _batch process_, taking out a few specific HTML tags that Excel adds unnecessarily. I don't have the time or the inclination to write an HTML generator and process the Access data from scratch. I also have to work with a team of people who don't care at all about doing things smarter or trying out new programming languages.

Besides, someone on the team has already put a lot of effort into writing a VB program that batch processes the Excel sheets from an Access query. And, as I said, I have a Visual SlickEdit macro that does exactly what I need very quickly. I am out to learn a little Python more than anything. So, while any more suggestions and comments are welcome, I will ask some more specific questions in the meantime. And then you can see how far I am from writing even the simplest program in Python!

1) What is the Python syntax for opening a file in MS Windows? I was following Guido's tutorial yesterday, but I could not figure out how to open a file in Windows.

2) How do I find a string of text in the open file and delete it iteratively?

3) How do I save the file in Windows after I have edited it with the Python program? How do I close it?

4) If someone helps me out, I think I should be able to use this info. and the tutorial and the Lutz book to loop the process and make the program run until all *.htm files in a folder have been handled once.
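
The four steps above can be sketched roughly as follows, assuming a current
Python 3 interpreter rather than the one discussed in this thread. The
function names, the default "*.htm" pattern, and the two lists of strings
are illustrative only.

```python
import glob
import os

# Strings to delete outright, and (old, new) pairs to substitute.
DELETE = ['<FONT FACE="Times New Roman">', "</FONT>"]
REPLACE = [('<TD ALIGN="left">', "<TD>")]


def sanitize(text):
    for junk in DELETE:
        text = text.replace(junk, "")
    for old, new in REPLACE:
        text = text.replace(old, new)
    return text


def sanitize_folder(folder, pattern="*.htm"):
    """Rewrite every matching file in folder; return the paths handled."""
    handled = []
    for path in sorted(glob.glob(os.path.join(folder, pattern))):
        with open(path, "r") as f:       # 1) open and read
            text = f.read()
        with open(path, "w") as f:       # 3) save; the with block closes it
            f.write(sanitize(text))      # 2) find and delete
        handled.append(path)
    return handled
```

A call such as sanitize_folder("c:/inventory") would process each file once;
that folder name is only an example.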

What do you say?

Scott
>>> William Park <parkw@better.net> 04/28 3:20 PM >>>
HTML "sanitizer" in Python [ In reply to ]
"Scott Stirling" <SSTirlin@holnam.com> writes:


> 1) What is the Python syntax for opening a file in MS Windows? I was following Guido's tutorial yesterday, but I could not figure out how to open a file in Windows.

??? I don't think it's different on windows than on linux.
Just do:

f = open("my_file.html", "rt")

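(OK, there *is* a difference, I guess: on Windows, text mode translates
the carriage return/line feed pairs for you. Text mode is the default, and
the "t" just makes it explicit; open with "rb" instead and the carriage
returns show up in your data.)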
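
The effect of text mode versus binary mode can be seen with a small
experiment. This is a sketch for current Python 3, where text mode
translates line endings on every platform and not only on Windows; the
temporary file and variable names are invented for the demonstration.

```python
import os
import tempfile

# Write two lines with DOS line endings, byte for byte.
fd, path = tempfile.mkstemp(suffix=".html")
with os.fdopen(fd, "wb") as f:
    f.write(b"<TR>\r\n</TR>\r\n")

with open(path, "r") as f:    # text mode: line endings are translated
    as_text = f.read()
with open(path, "rb") as f:   # binary mode: bytes come back untouched
    as_bytes = f.read()

os.remove(path)
print(repr(as_text))
print(repr(as_bytes))
```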

> 2) How do I find a string of text in the open file and delete it iteratively?

Check out the "string" module.

> 3) How do I save the file in Windows after I have edited it with the Python program? How do I close it?

Well you open a second file, for writing this time:
f2 = open("output.html", "wt")

Then you write to it to your heart's content:
f2.write("blahblahblah")

Then you close it:
f2.close()

But all this is in the Python docs, so perhaps you should try to read them.

> 4) If someone helps me out, I think I should be able to use this info. and the tutorial and the Lutz book to loop the process and make the program run until all *.htm files in a folder have been handled once.

Well, if I understand correctly, the *only* thing you're trying to do
is to remove some specific strings from a bunch of files. Now if I
were you, I wouldn't even bother to use Python on something that
simple; I would just use sed. With sed, you could do:

sed 's/string_to_be_eliminated//g' my_file.html > output.html

Presto, that's it. I think that there is a version of GNU sed for
Windows somewhere out there; do yourself a favour and get it.

Greetings,

Stephan
HTML "sanitizer" in Python [ In reply to ]
On opening files in Windows--I was hoping there was a way to give python the full file path. Everything I have seen so far just tells me how to open a file if it's in the same directory I am running python from.

I don't have sed on my MS Windows PC at work. This was part of the initial explanation--I am working for a company where we have DOS, Windows and Office 97. No sed, no Unix. This is a Y2K Project too, so we are on a budget with little leeway for new ideas that weren't included in the original statement of work and project plan.

Scott
>>> Stephan Houben <stephan@pcrm.win.tue.nl> 04/29 9:49 AM >>>
"Scott Stirling" <SSTirlin@holnam.com> writes:


HTML "sanitizer" in Python [ In reply to ]
On Thu, 29 Apr 1999 12:20:27 -0400, "Scott Stirling" <SSTirlin@holnam.com>
wrote:

>On opening files in Windows--I was hoping there was a way to give python the full file path.
>Everything I have seen so far just tells me how to open a file if it's in the same directory I am
>running python from.
Uuh,

f = open ("c:/my_path/my_file.txt", "r")

Every function in the Python library that has a file name argument
also accepts a full or relative path (except when dealing with path and
name components explicitly)!

Note the forward slashes. With backslashes you would have to write

f = open ("c:\\my_path\\my_file.txt", "r")

or

f = open ( r"c:\my_path\my_file.txt", "r")

because Python uses the backslash as an escape character inside
string literals, which can be suppressed by using "raw" strings with
a leading 'r'.
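
A quick way to convince yourself, sketched for current Python 3. The first
three paths are the ones above; the "h:\temp" string is an invented example
of the trap.

```python
forward = "c:/my_path/my_file.txt"
doubled = "c:\\my_path\\my_file.txt"
raw = r"c:\my_path\my_file.txt"

# The doubled and raw spellings produce the same characters.
print(doubled == raw)

# An unescaped \t is read as a tab, which is how such paths go wrong.
broken = "h:\temp"
print("\t" in broken, len(broken))
```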

Here's a quick outline of some file processing of your kind, which
may give you a first impression (typed without testing):

import string

source = open("/path/file.txt", "r")
dest = open("/path/file.txt", "w")

content = source.read() # read the entire file as a string

# Do some processing, perhaps. string.replace returns a new string,
# so the result has to be assigned back.
content = string.replace (content, "some_substring", "by_another")

dest.write (content)
source.close()
dest.close()


Hope that helps,
Stefan
HTML "sanitizer" in Python [ In reply to ]
Oh, just saw a bug:

On Thu, 29 Apr 1999 18:01:54 GMT, spamfranke@bigfoot.de (Stefan Franke) wrote:

>source = open("/path/file.txt", "r")
>dest = open("/path/file.txt", "w")
One should use different filenames here of course.
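
One way to keep the two names apart while still ending up with the original
file updated is to write to a scratch file and swap it in afterwards. This
is a sketch for current Python 3; the function name and the ".tmp" suffix
are assumptions, not anything prescribed above.

```python
import os


def rewrite_file(path, old, new):
    """Replace old with new in the file at path, via a temporary copy."""
    tmp_path = path + ".tmp"  # hypothetical naming scheme for the scratch file
    with open(path, "r") as source:
        content = source.read()
    with open(tmp_path, "w") as dest:
        dest.write(content.replace(old, new))
    # Swap the finished copy into place only after it is fully written.
    os.replace(tmp_path, path)
```

If the write fails partway through, the original file is still intact.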

>Stefan
HTML "sanitizer" in Python [ In reply to ]
On Thu, 29 Apr 1999 12:20:27 -0400, "Scott Stirling"
<SSTirlin@holnam.com> declaimed the following in comp.lang.python:

> On opening files in Windows--I was hoping there was a way to give python the full file path. Everything I have seen so far just tells me how to open a file if it's in the same directory I am running python from.
>
> I don't have sed on my MS Windows PC at work. This was part of the initial explanation--I am working for a company where we have DOS, Windows and Office 97. No sed, no Unix. This is a Y2K Project too, so we are on a budget with little leeway for new ideas that weren't included in the original statement of work and project plan.
>
Did you try?

Actually, you might have gotten caught on the \ treatment.

> PythonWin 1.5 (#0, Dec 30 1997, 23:24:20) [MSC 32 bit (Intel)] on win32
> Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
> Portions copyright 1994-1998 Mark Hammond (MHammond@skippinet.com.au)
> >>> fo = open("h:\temp\somefile.txt","w")
> Traceback (innermost last):
> File "<interactive input>", line 1, in ?
> IOError: (2, 'No such file or directory')
> >>> fo = open("h:/temp/somefile.txt", "w")
> >>> fo.write("This is a line of text\n")
> >>> fo.close()
> >>>
> >>> fo=open("h:\\temp\\somefile.txt","r")
> >>> for ln in fo.readlines():
> ...     print ln
> ...
> This is a line of text
>
> >>> fo.close()
> >>>

Note the use of the forward slash / on the first successful open, and
the doubled \\ on the second.

--
> ============================================================== <
> wlfraed@ix.netcom.com | Wulfraed Dennis Lee Bieber KD6MOG <
> wulfraed@dm.net | Bestiaria Support Staff <
> ============================================================== <
> Bestiaria Home Page: http://www.beastie.dm.net/ <
> Home Page: http://www.dm.net/~wulfraed/ <
HTML "sanitizer" in Python [ In reply to ]
Scott Stirling wrote:
> > 4) If someone helps me out, I think I should be able to use this info. and the tutorial and the Lutz book to loop the process and make the program run until all *.htm files in a folder have been handled once.
>
> Well, if I understand correctly, the *only* thing you're trying to do
> is to remove some specific strings from a bunch of files. Now if I
> were you, I wouldn't even bother to use Python on something that
> simple; I would just use sed. With sed, you could do:
>
> sed 's/string_to_be_eliminated//g' my_file.html > output.html
>
> Presto, that's it. I think that there is a version for GNU sed for
> Windows somewhere out there; do yourself a favour and get it.

Look for the "user tools" under http://sourceware.cygnus.com/cygwin/

--
=========================================================
Tres Seaver tseaver@palladion.com 713-523-6582
Palladion Software http://www.palladion.com