Inspired by Google's C4, this is a series of colossal clean data cleaning scripts focused on CommonCrawl data processing, including Chinese data processing and the cleaning methods from MassiveText.

Python 113 12 Updated Jun 7, 2023

Nixtla / nixtla

TimeGPT-1: a production-ready pre-trained time series foundation model for forecasting and anomaly detection. A generative pretrained transformer for time series, trained on over 100B data points. It's …

Jupyter Notebook 2,028 159 Updated Jul 19, 2024

huridocs / pdf-document-layout-analysis

A Docker-powered service for PDF document layout analysis. It provides powerful and flexible PDF analysis and allows for the segmentation and classification of differen…

Python 46 5 Updated Jul 21, 2024

THUDM / CogVLM2

GPT4V-level open-source multi-modal model based on Llama3-8B

Python 1,622 86 Updated Jul 16, 2024

apple / ml-aim

This repository provides the code and model checkpoints of the research paper: Scalable Pre-training of Large Autoregressive Image Models

Python 667 42 Updated May 2, 2024

OpenDriveLab / Vista

A Generalizable World Model for Autonomous Driving

Python 405 20 Updated Jun 17, 2024

cognitivecomputations / kraken

Jupyter Notebook 63 4 Updated May 26, 2024

microsoft / Phi-3CookBook

This is a Phi-3 book for getting started with Phi-3. Phi-3 is a family of open AI models developed by Microsoft. Phi-3 models are the most capable and cost-effective small language models (SLMs) avai…

Jupyter Notebook 1,363 123 Updated Jul 20, 2024

sb-jang / kodialogbench

Code and data for "KoDialogBench: Evaluating Conversational Understanding of Language Models with Korean Dialogue Benchmark" (LREC-COLING 2024)

Python 14 Updated Mar 2, 2024

google-research / pix2seq

Pix2Seq codebase: multi-tasks with generative modeling (autoregressive and diffusion)

Jupyter Notebook 842 69 Updated Nov 7, 2023

OpenBMB / MiniCPM-V

MiniCPM-Llama3-V 2.5: A GPT-4V Level Multimodal LLM on Your Phone

Python 8,036 565 Updated Jul 19, 2024

SalesforceAIResearch / uni2ts

[ICML2024] Unified Training of Universal Time Series Forecasting Transformers

Jupyter Notebook 657 55 Updated Jul 9, 2024

sunlicai / HiCMAE

[Information Fusion 2024] HiCMAE: Hierarchical Contrastive Masked Autoencoder for Self-Supervised Audio-Visual Emotion Recognition

Python 77 7 Updated May 20, 2024

modelscope / FunASR

A Fundamental End-to-End Speech Recognition Toolkit and Open Source SOTA Pretrained Models, Supporting Speech Recognition, Voice Activity Detection, Text Post-processing etc.

Python 4,895 533 Updated Jul 22, 2024