Starred repositories

All-in-one text de-duplication

Python 583 69 Updated May 21, 2024

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML

Python 3,422 251 Updated Sep 10, 2024

🥚 Transform PDF to JSON or Markdown with ease and speed 🐣

Python 367 32 Updated Sep 10, 2024
Python 301 22 Updated Jul 26, 2024

MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models

Python 373 33 Updated Feb 1, 2024

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.

Python 1,479 234 Updated Apr 14, 2024

360LayoutAnalysis, a series of document analysis models and datasets developed by 360 AI Research Institute

218 8 Updated Sep 10, 2024

Python library for parsing .docx (Office Open XML) files

Python 52 24 Updated Mar 26, 2020

Python bindings to PDFium

Python 346 15 Updated Aug 26, 2024

Convert PDF to markdown quickly with high accuracy

Python 16,290 916 Updated Sep 7, 2024

Parallel corpus dataset from the MNBVC project

Jupyter Notebook 8 5 Updated Jul 9, 2024

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.

Python 6,319 649 Updated Aug 29, 2024

Open source Python library for converting PDF to DOCX.

Python 2,468 360 Updated Sep 6, 2024

A standalone Java library/command line tool that converts DOC, DOCX, PPT, PPTX and ODT documents to PDF files.

Java 590 242 Updated Mar 27, 2023

General-purpose exam question bank dataset: multiple choice, fill-in-the-blank, short answer

Jupyter Notebook 5 5 Updated Nov 10, 2023

borb is a library for reading, creating and manipulating PDF files in python.

Python 3,364 149 Updated Aug 26, 2024

Convert Word documents (.docx files) to HTML

Python 783 121 Updated Jun 16, 2024

Create and modify Word documents with Python

Python 4,491 1,103 Updated Aug 20, 2024

Community maintained fork of pdfminer - we fathom PDF

Python 5,792 919 Updated Aug 2, 2024

A lightning fast Finite State machine and REgular expression manipulation library.

C++ 1,815 126 Updated Oct 24, 2023

arXiv LaTeX Cleaner: Easily clean the LaTeX code of your paper to submit to arXiv

Python 5,172 324 Updated Jul 21, 2024

Parse LaTeX math expressions

Python 386 162 Updated Jul 11, 2019

Extract text, metadata and references (pdf, url, doi, arxiv) from PDF. Optionally download all referenced PDFs.

Python 1,033 113 Updated Jun 15, 2023

Arxiv Metadata

11 4 Updated Apr 6, 2017

Create and modify Word documents with Python

Python 1 Updated Oct 14, 2019

Grok open release

Python 49,407 8,325 Updated Aug 30, 2024

Diffusion model papers, survey, and taxonomy

2,891 247 Updated Aug 9, 2024

A collection of awesome text-to-image generation studies.

TeX 319 16 Updated Sep 5, 2024

Open-Sora: Democratizing Efficient Video Production for All

Python 21,569 2,072 Updated Aug 9, 2024