This sub-project contains Python-based code for testing the PDF output produced by earlier CI stages, or by you locally.
The desired properties of the PDFs under test (by default, all PDFs in the project root) are configured in the tests config file. Once that is done to your liking, set up the testing framework and run the tests (or don't, and use CI instead, where everything is already done for you and Just Works™).
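As a rough illustration of the "all PDFs in the project root" default, the following sketch collects candidate files and runs a cheap sanity check on each. The function names `find_pdfs` and `looks_like_pdf` are hypothetical and not taken from this project; the actual tests may discover and validate files differently.

```python
from pathlib import Path


def find_pdfs(root: Path) -> list[Path]:
    """Return all PDF files directly inside `root`, sorted by path."""
    return sorted(p for p in root.glob("*.pdf") if p.is_file())


def looks_like_pdf(path: Path) -> bool:
    """Cheap sanity check: a PDF file starts with the `%PDF-` marker."""
    with path.open("rb") as handle:
        return handle.read(5) == b"%PDF-"


if __name__ == "__main__":
    # Assumption: the script is started from the project root.
    for pdf in find_pdfs(Path.cwd()):
        status = "ok" if looks_like_pdf(pdf) else "not a PDF?"
        print(f"{pdf.name}: {status}")
```

The header check only catches grossly broken files; anything about the content of a PDF needs a real PDF library.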
This is helpful for checking basic properties. A more involved approach is shown here; it includes checking for publisher-specific requirements, allowing us to detect errors and iterate much faster.
There are two steps to this. Unfortunately, this is not as easy as it could be, owing to the nature of Python's ecosystem.
The project uses poetry for dependency management.
Once you have it installed according to their documentation, it is very easy to pull in the dependencies of this project.
In the directory containing the poetry config file, run:
poetry install
This will pull the exact dependency versions recorded in the lock file. If there is no lock file, it resolves versions from the config file instead. That's it... almost.
You will also need a suitable Python interpreter, that is, a suitable version of Python itself. If your system's Python is compatible with what is listed in the config, you do not need to do anything. The easiest way to test this is to just run:
poetry run pytest
and see if it fails.
If it does, poetry will complain to you accordingly.
In such a case, pyenv has worked well for me to set up a suitable local or system-wide Python interpreter of any desired version.
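If you would rather check the interpreter version directly than wait for poetry to complain, a minimal sketch could look like this. The minimum version `(3, 9)` is a placeholder assumption; the real constraint is whatever the project's config lists, and it may be a range rather than a simple lower bound.

```python
import sys

# Hypothetical lower bound; look up the real constraint in the project's config.
MINIMUM_PYTHON = (3, 9)


def python_is_compatible(version_info=sys.version_info, minimum=MINIMUM_PYTHON) -> bool:
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    if python_is_compatible():
        print(f"Python {current} looks compatible.")
    else:
        print(f"Python {current} is too old; consider installing a newer one via pyenv.")
```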
The setup for the CI pipeline (GitHub Actions) is quite different. Take a look if you like, but the steps there are not applicable to local usage.
After the setup, you can simply run:
poetry run pytest
Prepending everything with poetry run will make sure all commands run in the suitable virtual environment with the correct packages installed in the correct versions, as well as using the Python version set up using pyenv, if any.
Any sub-commands or flags after pytest are courtesy of pytest, not poetry.
There, you can for example specify which tests to run.
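To make the split between poetry and pytest arguments concrete, here is a small sketch that assembles such a command. `-k` (select tests by name expression) and `-x` (stop at the first failure) are standard pytest flags; the helper `pytest_command` and the name fragment "metadata" are made up for illustration.

```python
import shlex


def pytest_command(*pytest_args: str) -> list[str]:
    """Build the argv list for running pytest inside poetry's environment.

    Everything passed in `pytest_args` is interpreted by pytest, not poetry.
    """
    return ["poetry", "run", "pytest", *pytest_args]


# Run only tests whose names match "metadata" (a hypothetical name fragment)
# and stop at the first failure.
command = pytest_command("-k", "metadata", "-x")
print(shlex.join(command))  # prints: poetry run pytest -k metadata -x
```

Passing `command` to `subprocess.run(command, check=True)` would execute it, assuming poetry is on your `PATH`.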
Otherwise, the testing procedure is tucked away and made accessible via the Makefile.
It requires you to have GNU make installed.
If you are on Linux you might already have it, since a lot of development workflows rely on it.
The tests can then be run using:
make test
or any of the other targets in that Makefile:
make test-self
make test-pdfs
Sadly, an inherent issue is that PDF parsing and text extraction are incredibly hard. A PDF might look fine to human eyes, but getting a computer to automatically work out what is on the page and parse it correctly is, as of today, basically impossible.
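One common source of trouble is that extracted text rarely matches the source text character for character: ligatures, non-breaking spaces, soft hyphens and line breaks all get in the way. The sketch below shows one possible mitigation, normalizing text before comparing it. It is not necessarily what this project's tests do, and `normalize_extracted_text` is a hypothetical helper.

```python
import re
import unicodedata


def normalize_extracted_text(text: str) -> str:
    """Reduce typical PDF extraction noise so that text can be compared."""
    # NFKC folds compatibility characters, e.g. the "fi" ligature (U+FB01)
    # becomes the two letters "fi", and a non-breaking space becomes a space.
    text = unicodedata.normalize("NFKC", text)
    # Soft hyphens (U+00AD) survive NFKC, so drop them explicitly.
    text = text.replace("\u00ad", "")
    # Collapse line breaks and repeated whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    raw = "The \ufb01rst\n  line\u00a0here"
    print(normalize_extracted_text(raw))  # prints: The first line here
```

Normalization like this only papers over some of the differences; reading order, hyphenation at line ends and text inside figures remain hard to handle reliably.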