Yet another LaTeX OCR project written in PyTorch, based on ConvNeXt and Transformer.
This project is the backend of CeleryMath.
Give us a star if this project helps you 🤗
Any further developments and contributions are welcome 😄
Follow these instructions to train the model yourself:
- Create a virtual environment.
poetry install
poetry shell
- Create the dataset.
You can download the generated dataset from here (2.05G) or generate it yourself with the following command; generation may be slow:
python -m src.utils.latex2png -i dataset/data/full_math.txt -w dataset/data/full_set -b 1
- Edit the config file.
Edit src/config/config_convnext.json and replace the dataset path with your own.
- Run training.
python -m src.train
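The config edit in the steps above can also be scripted. The following is a minimal sketch, assuming the config is a flat JSON object; the key name `dataset_path` used in the usage comment is hypothetical, so check src/config/config_convnext.json for the actual key before using it. The helper refuses to add unknown keys, so a wrong key name fails loudly instead of silently writing an unused setting.

```python
import json
from pathlib import Path


def set_config_value(config_path, key, value):
    """Load a JSON config, overwrite one existing top-level key, and save it."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    if key not in config:
        # The key name is an assumption; fail instead of adding a new key.
        raise KeyError(f"{key!r} not found in {path}")
    config[key] = value
    path.write_text(
        json.dumps(config, indent=4, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
    return config


# Usage (hypothetical key name, adjust to the real config):
# set_config_value("src/config/config_convnext.json",
#                  "dataset_path", "dataset/data/full_set")
```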
If you have your own LaTeX formula dataset, you can add the formulas to dataset/data/full_math.txt and then regenerate the tokenizer and the images.
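Adding formulas to the file can be sketched as below. This assumes the formula file stores one formula per line, which is not confirmed here; check the existing dataset/data/full_math.txt first. `append_formulas` is a hypothetical helper, not part of this project. It collapses internal whitespace so each formula stays on a single line, and it skips empty entries and duplicates.

```python
from pathlib import Path


def append_formulas(formula_file, new_formulas):
    """Append formulas that are not already in the file; return the count added."""
    path = Path(formula_file)
    text = path.read_text(encoding="utf-8") if path.exists() else ""
    seen = set(text.splitlines())

    added = []
    for formula in new_formulas:
        # Collapse newlines and repeated spaces so one formula is one line.
        formula = " ".join(formula.split())
        if formula and formula not in seen:
            seen.add(formula)
            added.append(formula)

    if added:
        # Start on a fresh line if the file does not end with a newline.
        prefix = "" if (not text or text.endswith("\n")) else "\n"
        with path.open("a", encoding="utf-8") as f:
            f.write(prefix + "\n".join(added) + "\n")
    return len(added)


# Usage:
# append_formulas("dataset/data/full_math.txt", [r"\frac{a}{b}", r"e^{i\pi} + 1 = 0"])
```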
- Regenerate the tokenizer. This project uses Hugging Face tokenizers; if you want to change the formula file or the output file location, edit
src/dataset.py
python -m src.dataset
- Generate the dataset. TeX Live, MiKTeX, or a similar TeX distribution must be installed.
python -m src.utils.latex2png -i dataset/data/full_math.txt -w dataset/data/full_set -b 1
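Because image generation depends on an installed TeX distribution, it can help to check for the required executables before starting a long run. The sketch below uses only the standard library. Which executables src.utils.latex2png actually calls is not stated here, so the default tool name `pdflatex` is an assumption; pass the real names once you have checked the script.

```python
import shutil


def missing_tools(tools=("pdflatex",)):
    """Return the names in `tools` that are not found on the PATH."""
    # The default tool name is an assumption, not taken from latex2png.
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing executables:", ", ".join(missing))
    else:
        print("All required executables were found.")
```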
Open an issue if you have any questions, or a PR if you can fix a problem.
- API
- Desktop Deploy, see CeleryMath
- ONNX
- Use PyTorch Lightning to manage training and evaluation
This project was inspired by the following projects, and some methods and code were borrowed from them. Thanks a lot! 🤝
- LaTeX-OCR, LICENSE: MIT, https://github.com/lukas-blecher/LaTeX-OCR
- LaTeX_OCR_PRO, LICENSE: GPL-3.0, https://github.com/LinXueyuanStdio/LaTeX_OCR_PRO
GPL-3.0, details here