sisterdong

4 followers · 6 following

Lists (1)

👀 Interviews

2 repositories

Stars

rom1504 / cc2dataset

Easily convert common crawl to a dataset of caption and document. Image/text Audio/text Video/text, ...

Python 300 23 Updated Dec 9, 2023

DmitryRyumin / CVPR-2023-24-Papers

CVPR 2023-2024 Papers: Dive into advanced research presented at the leading computer vision conference. Keep up to date with the latest developments in computer vision and deep learning. Code inclu…

Python 322 21 Updated Jun 22, 2024

pymupdf / PyMuPDF-Utilities

Demos, examples and utilities using PyMuPDF

Jupyter Notebook 501 140 Updated Jun 13, 2024

EleutherAI / gpt-neox

An implementation of model parallel autoregressive transformers on GPUs, based on the Megatron and DeepSpeed libraries

Python 6,675 972 Updated Jun 21, 2024

EleutherAI / pythia

The hub for EleutherAI's work on interpretability and learning dynamics

Jupyter Notebook 2,108 154 Updated Jun 18, 2024

NielsRogge / Vision-Transformer-papers

This repository contains an overview of important follow-up works based on the original Vision Transformer (ViT) by Google.

125 9 Updated Jan 3, 2022

VikParuchuri / marker

Convert PDF to markdown quickly with high accuracy

Python 13,126 650 Updated Jun 17, 2024

attardi / wikiextractor

A tool for extracting plain text from Wikipedia dumps

Python 3,671 955 Updated May 23, 2024

graphistry / pygraphistry

PyGraphistry is a Python library to quickly load, shape, embed, and explore big graphs with the GPU-accelerated Graphistry visual graph analyzer

Python 2,084 205 Updated Jun 17, 2024

Visualize-ML / Book6_First-Course-in-Data-Science

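Book 6, 《数据有道》 (The Way of Data) | Iris Book series: from basic arithmetic to machine learning. Criticism and corrections are welcome! Readers who submit many corrections will receive a free copy of the book as thanks!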

Jupyter Notebook 1,614 306 Updated Apr 6, 2024

jawah / charset_normalizer

Truly universal encoding detector in pure Python

Python 536 49 Updated Jun 19, 2024

huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.

Python 1,700 101 Updated Jun 22, 2024

OpenMatch / NeuScraper

[ACL 2024] This is the code repo for our ACL’24 paper "Cleaner Pretraining Corpus Curation with Neural Web Scraping".

Python 194 15 Updated Jun 20, 2024

princeton-nlp / SWE-bench

[ICLR 2024] SWE-Bench: Can Language Models Resolve Real-world Github Issues?

Python 1,398 230 Updated Jun 18, 2024

openai / transformer-debugger

Python 3,950 231 Updated Jun 4, 2024

InflectionAI / Inflection-Benchmarks

Public Inflection Benchmarks

67 2 Updated Mar 6, 2024

LDNOOBW / List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words

List of Dirty, Naughty, Obscene, and Otherwise Bad Words

2,811 654 Updated Jun 19, 2024

gkamradt / LLMTest_NeedleInAHaystack

Doing simple retrieval from LLM models at various context lengths to measure accuracy

Jupyter Notebook 1,248 126 Updated Jun 20, 2024

HqWu-HITCS / Awesome-LLM-Survey

An Awesome Collection for LLM Survey

247 23 Updated May 2, 2024

ksOAn6g5 / TaiSu

TaiSu (太素): a large-scale Chinese multimodal dataset (a hundred-million-scale Chinese vision-language pre-training dataset)

Python 164 10 Updated Nov 17, 2023

NiuTrans / Classical-Modern

A very comprehensive parallel corpus of Classical Chinese (literary Chinese) and Modern Chinese

Python 943 204 Updated Apr 21, 2024

kpu / kenlm

KenLM: Faster and Smaller Language Model Queries

C++ 2,430 509 Updated Feb 25, 2024

JessicaTegner / pypandoc

Thin wrapper for "pandoc" (MIT)

Python 831 108 Updated Jun 4, 2024

Zhen-Tan-dmml / LLM4Annotation

187 8 Updated Mar 9, 2024

oscar-project / ungoliant

🕷️ The pipeline for the OSCAR corpus

Rust 153 14 Updated Dec 18, 2023

chatnoir-eu / web-content-extraction-benchmark

Web Content Extraction Benchmark

Python 14 1 Updated May 24, 2024

adbar / trafilatura

Python & command-line tool to gather text on the Web: Crawling & scraping, content extraction, metadata. TXT, Markdown, CSV & XML output.

Python 3,125 234 Updated Jun 19, 2024

modin-project / modin

Modin: Scale your Pandas workflows by changing a single line of code

Python 9,569 647 Updated Jun 19, 2024

Ethan-yt / guwenbert

GuwenBERT: A Pre-trained Language Model for Classical Chinese (Literary Chinese)

477 41 Updated Aug 31, 2021

chujiezheng / chat_templates

Chat Templates for 🤗 HuggingFace Large Language Models

Jinja 309 29 Updated Jun 20, 2024
