Prerequisites: Python 3.10
Install Dependencies
Linux/macOS
apt-get/yum/brew install libreoffice
Windows
Install LibreOffice
Append "install_dir\LibreOffice\program" to the PATH environment variable
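To confirm that LibreOffice is reachable before installing Magic-Doc, a quick check along these lines may help. This is a minimal sketch: the helper name `find_office_binary` is hypothetical, and it assumes the LibreOffice launcher is named `soffice` or `libreoffice`, which is typical but can vary by package.

```python
import shutil


def find_office_binary(names=("soffice", "libreoffice")):
    """Return the path of the first matching executable on PATH, or None."""
    # Assumption: LibreOffice exposes one of these launcher names.
    for name in names:
        path = shutil.which(name)
        if path:
            return path
    return None


office = find_office_binary()
print(office or "LibreOffice not found on PATH")
```

If nothing is found, recheck the PATH entry described above and open a new terminal so the change takes effect.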
Install Magic-Doc
pip install "fairy-doc[cpu]"  # CPU version (quotes keep shells such as zsh from expanding the brackets)
or
pip install "fairy-doc[gpu]"  # GPU version
Magic-Doc is a lightweight open-source tool that converts multiple file types (PPT/PPTX/DOC/DOCX/PDF) to Markdown. It supports both local files and files stored in S3.
# for a local file (no S3 configuration needed)
from magic_doc.docconv import DocConverter
converter = DocConverter(s3_config=None)
markdown_content, time_cost = converter.convert("some_doc.pptx", conv_timeout=300)
# for a remote file stored in AWS S3
from magic_doc.docconv import DocConverter, S3Config
s3_config = S3Config(ak='${ak}', sk='${sk}', endpoint='${endpoint}')
converter = DocConverter(s3_config=s3_config)
markdown_content, time_cost = converter.convert("s3://some_bucket/some_doc.pptx", conv_timeout=300)
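For more than one file, the same `convert` call can be wrapped in a loop. The following is a minimal sketch, not part of Magic-Doc: `convert_folder` and `SUPPORTED` are hypothetical names, and it assumes that `markdown_content` is a string and that a failed conversion raises an exception.

```python
from pathlib import Path

# File extensions listed in this README as supported.
SUPPORTED = {".ppt", ".pptx", ".doc", ".docx", ".pdf"}


def convert_folder(converter, src_dir, out_dir, conv_timeout=300):
    """Convert each supported file in src_dir and write <file name>.md to out_dir.

    Returns a dict mapping each source file name to its time_cost,
    or to None when the conversion failed.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    results = {}
    for path in sorted(Path(src_dir).iterdir()):
        if not path.is_file() or path.suffix.lower() not in SUPPORTED:
            continue
        try:
            markdown_content, time_cost = converter.convert(
                str(path), conv_timeout=conv_timeout
            )
        except Exception as exc:  # assumption: failures surface as exceptions
            print(f"failed: {path.name}: {exc}")
            results[path.name] = None
            continue
        # Keep the original extension in the output name to avoid collisions
        # between, for example, report.doc and report.docx.
        (out / f"{path.name}.md").write_text(markdown_content, encoding="utf-8")
        results[path.name] = time_cost
    return results


# Usage (requires Magic-Doc to be installed):
# from magic_doc.docconv import DocConverter
# results = convert_folder(DocConverter(s3_config=None), "docs", "markdown_out")
```

Passing the converter in as an argument lets one `DocConverter` instance be reused across all files.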
ENV: AMD EPYC 7742 64-Core Processor, NVIDIA A100, CentOS 7
File Type | Speed (pages/s) |
---|---|
PDF (digital) | 347 |
PDF (ocr) | 2.7 |
PPT | 20 |
PPTX | 149 |
DOC | 600 |
DOCX | 1482 |
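These figures can give a rough idea of how long a batch might take. The sketch below simply divides a page count by the speeds in the table; `estimate_seconds` is a hypothetical helper, and the numbers only apply to the hardware listed above, so treat the result as an order-of-magnitude guide.

```python
# Throughput in pages per second, copied from the table above.
PAGES_PER_SECOND = {
    "PDF (digital)": 347,
    "PDF (ocr)": 2.7,
    "PPT": 20,
    "PPTX": 149,
    "DOC": 600,
    "DOCX": 1482,
}


def estimate_seconds(file_type, pages):
    """Rough conversion time in seconds for `pages` pages of one file type."""
    return pages / PAGES_PER_SECOND[file_type]


# 1000 scanned PDF pages at 2.7 pages/s
print(f"{estimate_seconds('PDF (ocr)', 1000):.0f} s")  # prints "370 s"
```

The gap between digital and OCR PDFs is the main thing to plan around: the same 1000 pages take under 3 seconds when no OCR is needed.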
@misc{2024magic-doc,
title={Magic-Doc: A Toolkit that Converts Multiple File Types to Markdown},
author={Magic-Doc Contributors},
howpublished = {\url{https://github.com/InternLM/magic-doc}},
year={2024}
}
This project is released under the Apache 2.0 license.