A python library for parsing, converting and modifying PageXML files.
pip install pypxml
- Clone repository:
git clone https://github.com/jahtz/pypxml
- Install package:
cd pypxml && pip install .
- Test with
pypxml --version
pypxml [OPTIONS] COMMAND [ARGS]...
PyXML provides a feature rich Python API for working with PageXML files.
from pypxml import PageXML, PageType
pxml = PageXML.from_xml('path_to_pagexml.xml')
text_region = pxml.create_element(PageType.TextRegion, type='paragraph', id='tr_001')
text_region.create_element(PageType.Coords, points='1,2 3,4 5,6 ...')
for region in pxml.regions:
print(region.type)
pxml.to_xml('path_to_output.xml')
Developed at Centre for Philology and Digitality (ZPD), University of Würzburg.