rmuit/HtmlCleanup

HtmlCleanup

This script was written to create clean HTML output from source files that none of the existing software I tried in 2011 could deal with.

It's written in Python. Much of its code uses the BeautifulSoup library, but first it does raw string manipulation to fix up illegal HTML tag constructs that would confuse BeautifulSoup (or HTMLTidy).

Usage

  • Install BeautifulSoup version 3, which is available for Python 2.x only. (I welcome PRs for porting to version 4.)
  • Run:
cleanup_msfp.py inputfile.html > output.html
  • Check the output and see if it is to your liking.
  • If it could be better and you're a programmer: modify the script. Maybe send in a PR if your modifications are general enough.

Details

History

This was originally written to clean up HTML generated by (possibly a very old version of) Microsoft FrontPage, which:

  1. has a habit of putting unnecessary tags in the HTML (like probably any other webpage authoring software);
  2. creates wonderful constructs like this (excerpted):
<body bgcolor="#FFFFFF" text="#663300" link="#660000" vlink="#999900" alink="#006600">
<!--mstheme--><font face="Book Antiqua">

<div align="center">
  <center>
  <!--mstheme--></font>
  <table border="0" width="600">
    <tr>
      <td><!--mstheme--><font face="Book Antiqua">

<h2 style="line-height: 15.1 pt; mso-line-height-rule: exactly; mso-pagination: widow-orphan; margin-top: 0" align="left"><!--mstheme--><font face="Book Antiqua, Times New Roman, Times" color="#996600">This is the

title of my page<!--mstheme--></font></h2>

 <b><p> Here is some text.</b></p>

The swapped p / b opening and closing tags are something HTMLTidy could deal with, but the braindead font tags caused the whole document structure to be interpreted wrongly. So I wrote my own stuff in 2010/2011 (also as a way to get some real exposure to Python programming).
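To illustrate the kind of raw string manipulation that can neutralize such font tags before a parser sees them, here is a minimal sketch. It is not the code from cleanup_msfp.py: the pattern and the function name `strip_mstheme_fonts` are assumptions made for this example, and it simply drops every `<font>` or `</font>` tag that directly follows an `<!--mstheme-->` marker, discarding its attributes.

```python
import re

# Hypothetical pattern (not taken from the script): an opening or closing
# font tag that directly follows FrontPage's <!--mstheme--> marker comment.
MSTHEME_FONT = re.compile(r'<!--mstheme-->\s*</?font\b[^>]*>', re.IGNORECASE)


def strip_mstheme_fonts(html):
    """Remove theme-generated font tags together with their marker comments."""
    return MSTHEME_FONT.sub('', html)


sample = (
    '<!--mstheme--><font face="Book Antiqua">\n'
    '<div align="center">\n'
    '<!--mstheme--></font>\n'
    '<table><tr><td><!--mstheme--><font face="Book Antiqua">text</td></tr></table>'
)
cleaned = strip_mstheme_fonts(sample)
print(cleaned)
```

Because the mismatched tags are removed as plain text, the remaining markup nests normally and can be handed to a parser. The trade-off in this sketch is that font face and color information is lost.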

In 2017 I looked at it again, cleaned it up and put it on GitHub.

Logic / philosophy

This is not meant to be a 'black box' that can only be manipulated with command-line arguments. It's a script, which can be altered completely if needed. That doesn't mean altering it is super easy, because you still need to read and understand the code / logic. But it's written so that altering it is feasible:

  • cleanup_msfp.py is a sequence of actions which usually only take up a few lines of code, and can be read top-down;
  • cleanup_msfp.py contains more comments than code;
  • Almost all actions contain calls to methods in a helper class, which contains the detailed code.

This hopefully makes it easier to get to know the general workings of the script and be able to change it. The classes are also well commented, but I hope that a lot of customization (if that is necessary) can be done without having to read the details of the code in those classes. Reading the inline documentation should be helpful (and sometimes necessary), though.
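The structure described above can be sketched roughly as follows. This is an illustration of the pattern only: the class name `CleanupHelper`, its methods and the regex-based implementation are all hypothetical and are not the script's real helper classes, which work on a BeautifulSoup tree.

```python
import re


class CleanupHelper(object):
    """Hypothetical helper class: the detailed logic lives here."""

    def __init__(self, html):
        self.html = html

    def remove_comments(self):
        # Drop HTML comments, including ones that span several lines.
        self.html = re.sub(r'<!--.*?-->', '', self.html, flags=re.DOTALL)

    def rename_tag(self, old, new):
        # Rewrite opening and closing tags; attributes are left in place.
        # The \b keeps '<b' from matching '<br' or '<body'.
        pattern = r'<(/?)%s\b' % re.escape(old)
        self.html = re.sub(pattern, lambda m: '<' + m.group(1) + new,
                           self.html, flags=re.IGNORECASE)


# The top-level script is then a readable sequence of short actions.
helper = CleanupHelper('<p><!-- note --><b>Hi</b> <i>there</i></p>')
helper.remove_comments()
helper.rename_tag('b', 'strong')
helper.rename_tag('i', 'em')
print(helper.html)
```

With this layout, customizing the cleanup mostly means adding, removing or reordering the short calls at the bottom, without touching the helper's internals.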

For detailed treatment of whitespace: see the comments at the start of soupcleanup.py.

Actions performed

The logic generally distinguishes between 'inline' tags (e.g. <strong>, <span>) and 'non-inline' tags (like <p>, <li>, <div>). The reason is that whitespace must be treated differently for the two types of tags. 'Non-inline' tags are generally assumed to start a new line in the rendered document.
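As a rough illustration of that distinction, the sketch below classifies a tag name and picks a whitespace policy for it. The tag list and both function names are assumptions for this example; the authoritative rules are in the comments at the start of soupcleanup.py.

```python
# Illustrative, incomplete list of inline tags (an assumption, not the
# script's own list).
INLINE_TAGS = {'strong', 'em', 'span', 'a', 'b', 'i', 'font'}


def is_inline(tag_name):
    """Return True if the tag does not start a new line when rendered."""
    return tag_name.lower() in INLINE_TAGS


def edge_whitespace_action(tag_name):
    """Decide what to do with whitespace at the inner edges of a tag.

    Inside an inline tag the whitespace still separates words, so it is
    moved just outside the tag. Inside a non-inline tag it has no visible
    effect, so it can be removed.
    """
    return 'move outside' if is_inline(tag_name) else 'remove'


for name in ('strong', 'p', 'LI'):
    print(name, '->', edge_whitespace_action(name))
```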

Besides removing completely faulty tags like some <font> ones in the above example, the script can so far do the following:

  • Remove script tags and HTML comments.
  • Replace <b>, <i>, <font> by <strong>, <em>, <span> tags.
  • Remove inline tags without contents.
  • Remove some tags (like <span>) if they wrap, or are wrapped in, a single tag, which means these tags should not be needed. Any attributes are moved from one tag into the other before the tag is removed.
  • Remove alignment attributes (or tags like <center>) in places where they should not be needed.
  • If a double <br> is encountered inside a <p>, split it into two separate <p> tags.
  • Replace a <table> consisting of one single cell, by a <div>.
  • Convert <table>s which have two columns, where the first column only contains images that represent a bullet point, into <ul>/<li> constructs.
  • Unify whitespace usage: move whitespace at the start/end of inline tags, to just outside those tags.
  • Remove whitespace that does not have any effect (e.g. just before or after a <br> or at the start/end of a non-inline tag).
  • De-duplicate superfluous whitespace.
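Three of the actions above (splitting a paragraph on a double `<br>`, moving whitespace out of inline tags, and de-duplicating whitespace) can be approximated with plain regular expressions. This is a simplified sketch under stated assumptions, not the script's implementation: the real code operates on a BeautifulSoup tree, while these hypothetical functions only handle attribute-free, non-nested tags and ignore special cases such as `<pre>`.

```python
import re

DOUBLE_BR = re.compile(r'\s*<br\s*/?>\s*<br\s*/?>\s*', re.IGNORECASE)


def split_paragraph_on_double_br(html):
    """Turn '<p>a<br><br>b</p>' into '<p>a</p><p>b</p>'."""
    def split(match):
        parts = DOUBLE_BR.split(match.group(1))
        return ''.join('<p>%s</p>' % part for part in parts)
    return re.sub(r'<p>(.*?)</p>', split, html,
                  flags=re.DOTALL | re.IGNORECASE)


def move_whitespace_outside(html, tag='strong'):
    """Turn '<strong> x </strong>' into ' <strong>x</strong> '."""
    pattern = re.compile(r'<%s>(\s*)(.*?)(\s*)</%s>' % (tag, tag), re.DOTALL)
    return pattern.sub(
        lambda m: '%s<%s>%s</%s>%s' % (m.group(1), tag, m.group(2), tag,
                                       m.group(3)),
        html)


def collapse_whitespace(html):
    """Replace each run of whitespace with a single space."""
    return re.sub(r'\s+', ' ', html)


print(split_paragraph_on_double_br('<p>First<br><br>Second</p>'))
print(collapse_whitespace(move_whitespace_outside('a <strong> b </strong> c')))
```

The order matters in the second call: moving whitespace outward first creates doubled spaces next to the tag, which the collapse step then reduces to one.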

Authors

  • Roderik Muit - Wyz

License

This software is licensed under the MIT License - see the LICENSE.md file for details.
