Mailing List Archive: HTML "sanitizer" in Python

Apr 28, 1999, 9:49 AM

Hi,

I am new to Python. I have an idea for a work-related project, and I was hoping some folks on this list might be able to help me realize it. I have Mark Lutz' _Programming Python_ book, and that has been a helpful orientation. I like his basic packer and unpacker scripts, but what I want to do is something in between that basic program and its later, more complex manifestations.

I am on a Y2K project with 14 manufacturing plants, each of which has an inventory of plant process components that need to be tested and/or replaced. I want to put each plant's current inventory on the corporate intranet on a weekly or biweekly basis. All the plant data is in an Access database. We are querying the data we need and importing it into 14 MS Excel 97 spreadsheets. Then we are saving the Excel sheets as HTML. The HTML files bloat out with a nearly 100% increase in file size over the original Excel files. This is because the HTML converter in Excel adds all kinds of unnecessary HTML code, such as an opening FONT tag for every single cell in the table. Many of these tables have over 1000 cells, and this code, along with its accompanying closing FONT tag, adds up quickly. The other main piece of unnecessary code is the ALIGN="left" attribute in <TD> tags (the default alignment _is_ left). The unnecessary tags are consistent and easy to identify, so it should be possible to write a routine that automates their removal.

I created a macro in Visual SlickEdit that automatically opens all these HTML files, finds and deletes all the tags that can be deleted, saves the changes and closes them. I originally wanted to do this in Python, and I would still like to know how, but time constraints prevented it. Now I want to work out how to create a Python program that will do this. Can anyone help? Has anyone already written anything like this in Python that they can point me to? I would really appreciate it.

Again, the main flow of the program is:

>> Open 14 HTML files, all in the same folder and all with the .html extension.
>> Find certain character strings and delete them from the files. In one case (the <TD> tags) it is easier to find the whole tag with attributes and then _replace_ the original tag with a plain <TD>.
>> Save the files.
>> Close the files.
>> Exit the program.
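
For illustration, the flow above might be sketched roughly as follows. This is only a sketch under assumptions: it is written for a current Python 3 interpreter, the two patterns are guesses at what the Excel 97 output looks like (the exact markup is not shown in this post), and the names clean_html and clean_folder are made up for the example.

```python
import glob
import os
import re

# Assumed patterns; adjust them to the markup Excel actually emits.
# Matches any opening or closing FONT tag, with or without attributes.
FONT_TAG = re.compile(r'</?FONT\b[^>]*>', re.IGNORECASE)
# Matches a whole TD tag that carries ALIGN="left" (quotes optional).
TD_LEFT = re.compile(r'<TD\b[^>]*\bALIGN\s*=\s*"?left\b"?[^>]*>', re.IGNORECASE)


def clean_html(text):
    """Delete FONT tags and replace left-aligned TD tags with a plain <TD>."""
    text = FONT_TAG.sub("", text)
    # Note: this drops every attribute on the matched TD tag, not just ALIGN.
    text = TD_LEFT.sub("<TD>", text)
    return text


def clean_folder(folder):
    """Clean every .html file in folder in place; return the paths changed."""
    changed = []
    for path in sorted(glob.glob(os.path.join(folder, "*.html"))):
        # latin-1 maps every byte to a character, so any file round-trips;
        # newline="" leaves the original line endings untouched.
        with open(path, "r", encoding="latin-1", newline="") as f:
            original = f.read()
        cleaned = clean_html(original)
        if cleaned != original:
            with open(path, "w", encoding="latin-1", newline="") as f:
                f.write(cleaned)
            changed.append(path)
    return changed
```

Calling clean_folder on the folder that holds the 14 files would rewrite them in place, so it is probably wise to try it on copies first. The with blocks take care of closing each file.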

More advanced options would be the ability for the user to set parameters for the program upon running it, to keep from hard-coding the find and replace parms.
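
One possible way to avoid hard-coding the find and replace strings is to read them from the command line. The sketch below assumes a current Python 3 with the standard argparse module; the option name --replace and the helper names parse_args and apply_replacements are hypothetical, chosen only for this example.

```python
import argparse


def parse_args(argv=None):
    """Read the folder and any number of find/replace pairs from argv."""
    parser = argparse.ArgumentParser(
        description="Find and replace literal strings in HTML files.")
    parser.add_argument("folder", help="folder containing the .html files")
    parser.add_argument(
        "-r", "--replace", nargs=2, action="append", default=[],
        metavar=("FIND", "REPLACEMENT"),
        help="literal string to find and its replacement; may be repeated")
    return parser.parse_args(argv)


def apply_replacements(text, pairs):
    """Apply each (find, replacement) pair to text, in the order given."""
    for find, replacement in pairs:
        text = text.replace(find, replacement)
    return text
```

With this, parse_args(["plants", "-r", "</FONT>", ""]) describes deleting every closing FONT tag from the files in a folder named plants. On a real command line the strings would need quoting, since < and > mean redirection to the shell.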

OK, thanks for any help you can provide. I was partly turned on to Python by Eric Raymond's article, "How to Become a Hacker" (featured on /.). I use Linux at home, but this program would be for use on a Windows 95 platform at work, if that makes any difference. I do have the latest Python interpreter and editor for Windows here at work.

Yours truly,
Scott

Scott M. Stirling
Visit the HOLNAM Year 2000 Web Site: http://web/y2k
Keane - Holnam Year 2000 Project
Office: 734/529-2411 ext. 2327 fax: 734/529-5066 email: sstirlin@holnam.com