Stars
[NeurIPS 2024] CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs
A simple pip-installable Python tool to generate your own HTML citation world map from your Google Scholar ID.
Fully automated end-to-end framework to extract data from bar plots and other figures in scientific research papers, using tools such as OpenCV and AWS Rekognition.
[NAACL 2024] MMC: Advancing Multimodal Chart Understanding with LLM Instruction Tuning
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
The official repo of Qwen-VL (通义千问-VL) chat & pretrained large vision language model proposed by Alibaba Cloud.
MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning
[CVPR23] DialMAT: Dialogue-Enabled Transformer with Moment-based Adversarial Training
LAVIS - A One-stop Library for Language-Vision Intelligence
[ICCV 2023] Official code repository for ARNOLD benchmark
NOPA: Neurally-guided Online Probabilistic Assistance for Building Socially Intelligent Home Assistants
A benchmark environment for fully cooperative human-AI performance.
Learning for task and motion planning in a 2D kitchen.
2024 recommendations for VPN and censorship-circumvention software in China, with advice on avoiding common pitfalls; stable and reliable. Compares SSR airport services, Lantern, V2Ray, LaoWang VPN, self-hosted VPS proxies, and other circumvention tools, with the latest VPN download recommendations for accessing ChatGPT from China.
Code for EmBERT, a transformer model for embodied, language-guided visual task completion.
awesome grounding: A curated list of research papers in visual grounding
Efficiently Scaling Up Video Annotation with Crowdsourced Marketplaces. IJCV 2012
Code for the paper Watch-And-Help: A Challenge for Social Perception and Human-AI Collaboration
Calibration tests for wearable eye-tracking glasses