- ShanghaiTech University, Shanghai
- https://scholar.google.com/citations?user=j_8OPwwAAAAJ&hl=en

Starred repositories:
Repo for the paper "Evaluating and Analyzing Relationship Hallucinations in Large Vision-Language Models" (ICML 2024)
[arXiv] Aligning Modalities in Vision Large Language Models via Preference Fine-tuning
[CVPR'24] RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
Source code for InBedder, an instruction-following text embedder
Code for "Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization"
[CVPR 2024] Official Code for the Paper "Compositional Chain-of-Thought Prompting for Large Multimodal Models"
Mixture-of-Experts for Large Vision-Language Models
Anole: An Open, Autoregressive and Native Multimodal Model for Interleaved Image-Text Generation
Harnessing 1.4M GPT4V-synthesized Data for A Lite Vision-Language Model
Code for paper: DivideMix: Learning with Noisy Labels as Semi-supervised Learning
[AAAI 2024] Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training
Repository for Meta Chameleon, a mixed-modal early-fusion foundation model from FAIR.
Autoregressive Model Beats Diffusion: 🦙 Llama for Scalable Image Generation
[CVPR 2024] Code for the paper "Towards Learning a Generalist Model for Embodied Navigation"
Lumina-T2X is a unified framework for Text to Any Modality Generation
🔥🔥🔥 Official Codebase of "DiT-3D: Exploring Plain Diffusion Transformers for 3D Shape Generation"
This is the official repository for Retrieval Augmented Visual Question Answering
[ECCV 2024] official code for "Long-CLIP: Unlocking the Long-Text Capability of CLIP"
[IROS24 Oral] ManipVQA: Injecting Robotic Affordance and Physically Grounded Information into Multi-Modal Large Language Models
Evaluating Vision & Language Pretraining Models with Objects, Attributes and Relations.
[ICLR 2024 & ECCV 2024] The All-Seeing Projects: Towards Panoptic Visual Recognition & Understanding and General Relation Comprehension of the Open World
Official PyTorch Implementation of "SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers"
[CVPR 2024 🔥] Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses that are seamlessly integrated with object segmentation masks.
COLA: Evaluate how well your vision-language model can Compose Objects Localized with Attributes!
[ECCV 2024] BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models
Strong and Open Vision Language Assistant for Mobile Devices