Examples: https://www.youtube.com/watch?v=PXW_sWKoHnI
Goes through the HTML of a WordPress math blog post (mainly from Prof. Terry Tao's blog) using a combination of regexes and BeautifulSoup, and spits out a `.tex` file. Errors in the output should be few and easy to fix.
Note: please observe Prof Tao's copyright notice on this page and do not redistribute large numbers of Tao's blogposts without asking him for permission:
Readers are welcome to copy, link to, quote from, or translate reasonable portions of the content of this blog (e.g. a single article) into other media, though for items longer than one or two paragraphs, I would appreciate it if a reference or citation to the URL that the content originates from is provided. If you wish to copy a significantly larger fraction of the content (e.g. an entire series of articles), please contact me about it first.
You need reasonably up-to-date installations of Python 3 and $\rm\LaTeX$. In addition, the following packages must be installed (e.g. via pip):
- lxml
- bs4 (Beautiful Soup)
- requests
- emoji
You could also use a cloud service like Overleaf in lieu of a new $\rm\LaTeX$ installation.
- Clone the repo and install the dependencies. One way to do this is with `pdm install`.
- Go to Terry's blog and find a post you want to convert to $\rm\LaTeX$.
- Copy the URL.
- `cd` to the repo and run `python3 tao2tex.py URL` (if using pdm, use `pdm run python tao2tex.py URL` instead).
- Wait a few seconds and a `.tex` file will be produced.
- Run the `.tex` file in your favourite $\rm\LaTeX$ workflow to create a finished PDF.
For instance, if we copied this URL, we should type `python3 tao2tex.py https://terrytao.wordpress.com/2018/12/09/254a-supplemental-weak-solutions-from-the-perspective-of-nonstandard-analysis-optional/`.
tao2tex also supports a local mode and a batch mode:
- For local mode, save the HTML of the page and then use the name of the file in place of the URL, with the option `-l`, e.g. `python3 tao2tex.py file.html -l`.
- For batch mode, save the list of URLs in a file, e.g. `batch.txt`, and call `python3 tao2tex.py batch.txt -b`. If you have a list of local files, you can use `-b -l`, e.g. with the provided `tested.txt` file. Everything after the first whitespace in each line is ignored, so you can leave comments after a space.
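The batch-file convention above (one entry per line, everything after the first whitespace ignored) can be illustrated with a minimal Python sketch. This is not tao2tex's own code: the function name `read_batch_entries` is hypothetical, and skipping blank lines is an assumption.

```python
def read_batch_entries(text):
    """Return the URL or filename from each non-empty line of a batch file.

    Only the part before the first whitespace is kept, so anything
    after it can serve as a comment.
    """
    entries = []
    for line in text.splitlines():
        parts = line.split(maxsplit=1)
        if parts:  # assumption: blank lines are simply skipped
            entries.append(parts[0])
    return entries


sample = """https://terrytao.wordpress.com front page, just as an example
file.html   a locally saved post

"""
print(read_batch_entries(sample))
```

With this convention a batch file can double as a checklist, since notes about each post can sit on the same line as its URL.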
In addition, you can specify the name of the `.tex` file with the `-o` option; the `-p` option prints the output to the command line, and `-d` enables a rudimentary debugger. If you do not have a specific post in mind, you can run `python3 tao2tex.py -i https://terrytao.wordpress.com` to get a list of blog posts on Prof Tao's front page.
Since the desired output is not precisely defined, we provide a `test.html` file which may be used for debugging (in particular, for adding features, adjusting to breaking changes, or adapting to other blogs). It is a short sample HTML file that can be used to test the output of tao2tex via the command `python3 tao2tex.py test.html -l`.
The easiest way to customise the output is to modify `preamble.tex`. The theorems look very close to how they appear online. This is achieved with `\usepackage[framemethod=tikz]{mdframed}` and the simple style `\mdfdefinestyle{tao}{outerlinewidth = 1,roundcorner=2pt,innertopmargin=0}`. The more standard `amsthm` environments are provided as a commented-out block.
There are a number of keywords in the given `preamble.tex`; they are in all-caps and begin with `TTT-`, e.g. `TTT-BLOG-TITLE`. These are substituted via regex by tao2tex.py to create the `.tex` output. It is possible to create more of these keywords; to make tao2tex see them, you should modify the `preamble_formatter` function.
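To illustrate the general idea of keyword substitution, here is a minimal Python sketch. It is not the actual `preamble_formatter`: the function name `fill_placeholders`, the keyword `TTT-AUTHOR` and the choice to leave unknown keywords untouched are all assumptions made for illustration.

```python
import re


def fill_placeholders(template, values):
    """Replace each TTT-... keyword in `template` using the `values` mapping."""
    # All-caps words joined by hyphens, starting with TTT-.
    pattern = re.compile(r"TTT-[A-Z]+(?:-[A-Z]+)*")

    def substitute(match):
        # Unknown keywords are left as-is so they are easy to spot later.
        return values.get(match.group(0), match.group(0))

    # A function is used as the replacement so that backslashes in the
    # values are inserted literally instead of being read as escapes.
    return pattern.sub(substitute, template)


template = r"\title{TTT-BLOG-TITLE}" + "\n" + r"\author{TTT-AUTHOR}"
print(fill_placeholders(template, {"TTT-BLOG-TITLE": r"A post about \(L^2\)"}))
```

Passing a function rather than a string to `sub` matters here because titles often contain $\rm\LaTeX$ markup with backslashes.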
Emoji that appear (for instance, in certain comments) are processed (e.g. 😂 becomes `\emoji{face_with_tears_of_joy}`); `\emoji` is defined to simply be `\texttt`. To render actual emoji, you can instead load the `emoji` package, and compile with $\rm Lua\LaTeX$.
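As a rough illustration of the emoji step, the following Python sketch turns emoji into `\emoji{...}` macros using only the standard library. It is not how tao2tex does it (the tool lists the `emoji` package as a dependency). The function name `emoji_to_macro`, the code point range used to detect emoji and the use of Unicode character names are all assumptions, and the resulting names may differ from those the `emoji` package produces.

```python
import unicodedata


def emoji_to_macro(text):
    """Replace characters in a common emoji range by \\emoji{name} macros."""
    pieces = []
    for ch in text:
        # Assumption: only this block of pictographs is treated as emoji;
        # real emoji detection covers more ranges and multi-character sequences.
        if 0x1F300 <= ord(ch) <= 0x1FAFF:
            name = unicodedata.name(ch, "")
            if name:
                macro_name = name.lower().replace(" ", "_").replace("-", "_")
                pieces.append(r"\emoji{" + macro_name + "}")
                continue
        pieces.append(ch)
    return "".join(pieces)


print(emoji_to_macro("Nice proof 😂"))
```

With `\emoji` defined as `\texttt`, the output of such a step compiles on any engine, at the cost of showing the name instead of the glyph.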
- The more recent versions (since 2018) of $\rm pdf\LaTeX$ will cope with many unicode symbols (but not all) because UTF-8 is assumed to be the default input encoding. If you do not want to install a newer version, you can try using Overleaf. You might be able to get away with adding `\usepackage[utf8]{inputenc}` or `\usepackage[T1]{fontenc}` to the preamble.
- Sometimes (in section names, theorem names, etc.) the mathematics is skipped. This should be easy to fix once I have time to look into it.
- In `string_formatter`, we escape only a few unicode characters to attempt to please the $\rm\TeX$ engine. We replace Greek characters, which do appear in some of the blog posts, in an arguably naive and counterproductive manner (e.g. α into `\(\alpha\)`). For characters that are not escaped, $\rm{}pdf\LaTeX$ will complain, while $\rm{}Xe\LaTeX$ and $\rm{}Lua\LaTeX$ will work if you switch to a font that has the glyphs (without such a font, these two will still compile).
- Since we pull website data using the `requests` module, we do not see any HTML generated by JavaScript. For example, we are unable to process the occasional polls that Tao makes. However, the rest of the post should work as expected.
- In some posts, e.g. this one, there are so many comments that they are spread over multiple pages, which we check in turn. We skip this when running in `-l`/`--local` mode.
- The heuristics we use for labels are not perfect. However, we definitely include all labelled tags (formatted as `<a name="...">eq. number</a>`). Most issues seem to be easy to regex away after running tao2tex; for example, I had success replacing `end{align}\\label{[a-z-]*}` with `end{align}` globally.
- Most likely, modification of the `BeautifulSoup` part is needed to work with other blogs, even those that are on WordPress. Despite looking quite similar, the precise way that the tags are laid out seems to differ from blog to blog.
- For similar reasons, if Prof Tao ever updates the layout of the blog, this tool will break. Hopefully such a new version will directly support a good print option, but in any case the posts pre-update with the older layout will still be accessible, thanks to the Internet Archive.
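The label cleanup mentioned in the list above can also be scripted rather than done in an editor. The following is a minimal Python sketch under the assumption that the stray labels directly follow `\end{align}`; the function name `strip_labels_after_align` is hypothetical and not part of tao2tex.

```python
import re


def strip_labels_after_align(tex):
    """Remove a \\label{...} that directly follows \\end{align}."""
    # Same character class as the editor regex: lowercase letters and hyphens.
    pattern = re.compile(r"(\\end\{align\})\\label\{[a-z-]*\}")
    # Keep the \end{align} (group 1) and drop the label after it.
    return pattern.sub(lambda m: m.group(1), tex)


sample = r"\begin{align} a &= b \end{align}\label{eq-one} Next sentence."
print(strip_labels_after_align(sample))
```

Labels that contain digits or uppercase letters are not matched by this character class, so it is worth checking the output for any that remain.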