
Commit

Merge branch 'inference' of https://github.com/LLaVA-VL/LLaVA-NeXT into inference
ZhangYuanhan-AI committed May 30, 2024
2 parents 22103f3 + 19ddd2e commit 2cdee9f
Showing 6 changed files with 55 additions and 12 deletions.
41 changes: 37 additions & 4 deletions README.md
@@ -6,14 +6,13 @@
[![llava_next-blog](https://img.shields.io/badge/llava_next-blog-green)](https://llava-vl.github.io/blog/)
[![llava_next-demo](https://img.shields.io/badge/llava_next-image_demo-red)](https://llava-next.lmms-lab.com/)
[![llava_next-video_demo](https://img.shields.io/badge/llava_next-video_demo-red)](https://llavanext-video.lmms-lab.com/)
[![llava_next-image_checkpoints](https://img.shields.io/badge/llava_next-image_checkpoints-blue)](https://huggingface.co/collections/lmms-lab/llava-next-6623288e2d61edba3ddbf5ff)
[![llava_next-image_checkpoints](https://img.shields.io/badge/llava_next-image_checkpoints-blue)](https://huggingface.co/lmms-lab)
[![llava_next-video_checkpoints](https://img.shields.io/badge/llava_next-video_checkpoints-blue)](https://huggingface.co/collections/lmms-lab/llava-next-video-661e86f5e8dabc3ff793c944)

## Release
- [2024/05/10] 🔥 **LLaVA-NeXT** (Stronger) models are released, with support for stronger LLMs including Llama-3 (8B) and Qwen-1.5 (72B/110B). Check out [[blog](https://llava-vl.github.io/blog/2024-05-10-llava-next-stronger-llms/)] and [[checkpoints](https://huggingface.co/collections/lmms-lab/llava-next-6623288e2d61edba3ddbf5ff)] to see improved performance!
- [2024/05/10] 🔥 **LLaVA-NeXT** (Video) is released. The image-only-trained LLaVA-NeXT model is surprisingly strong on video tasks with zero-shot modality transfer. DPO training with AI feedback on videos can yield significant improvement. [[Blog](https://llava-vl.github.io/blog/2024-04-30-llava-next-video/)] and [[checkpoints](https://huggingface.co/collections/lmms-lab/llava-next-video-661e86f5e8dabc3ff793c944)]
- [2024/05/10] 🔥 **LLaVA-NeXT** (Stronger) models are released, with support for stronger LLMs including Llama-3 (8B) and Qwen-1.5 (72B/110B). Check out [[blog](https://llava-vl.github.io/blog/2024-05-10-llava-next-stronger-llms/)] and [[checkpoints](https://huggingface.co/lmms-lab)] to see improved performance!
- [2024/05/10] 🔥 **LLaVA-NeXT** (Video) is released. The image-only-trained LLaVA-NeXT model is surprisingly strong on video tasks with zero-shot modality transfer. DPO training with AI feedback on videos can yield significant improvement. [[Blog](https://llava-vl.github.io/blog/2024-04-30-llava-next-video/)], [[checkpoints](https://huggingface.co/collections/lmms-lab/llava-next-video-661e86f5e8dabc3ff793c944)] and [[sglang](https://github.com/sgl-project/sglang)]
- [2024/01/30] 🔥 **LLaVA-NeXT** is out! With additional scaling to LLaVA-1.5, LLaVA-NeXT-34B outperforms Gemini Pro on some benchmarks. It can now process 4x more pixels and perform more tasks/applications than before. Check out the [blog post](https://llava-vl.github.io/blog/2024-01-30-llava-next/), and explore the [demo](https://llava.hliu.cc/)! Models are available in [Model Zoo](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md). Training/eval data and scripts coming soon.

<details>
<summary>More</summary>

@@ -78,6 +77,40 @@ Please checkout the following page for more inference & evaluation details.
#### - LLaVA-NeXT: A Strong Zero-shot Video Understanding Model
- [LLaVA-NeXT-Video](./docs/LLaVA-NeXT-Video.md): for video inference and evaluation scripts.


## SGLang for Speeding Up Inference and Deployment

We use [SGLang](https://github.com/sgl-project/sglang) to speed up inference and deployment of LLaVA-NeXT. With SGLang, you can serve LLaVA-NeXT as a backend API service.

**Prepare Environment**:
Follow the installation instructions in the [sglang repository](https://github.com/sgl-project/sglang?tab=readme-ov-file#install).
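At the time of writing, the SGLang README suggests an install along these lines (treat this as a sketch and defer to the linked instructions if they have changed):

```sh
# Install SGLang with its serving extras (command from the SGLang README at the
# time of writing; see the install link above for the current command).
pip install "sglang[all]"
```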

### LLaVA-NeXT (Image)

Check out the HTTP POST/GET and SRT usage examples at [sglang/examples/usage/llava](https://github.com/sgl-project/sglang/blob/main/examples/usage/llava).
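As a rough sketch of that workflow, you launch an SRT server and then query it with SGLang's frontend language. The model path, port, and file names below are placeholders; follow the linked example for the exact launch flags (e.g. tokenizer path or chat template) for your model.

```sh
# Launch an SRT backend server (placeholder model path and port).
python -m sglang.launch_server --model-path YOUR_MODEL_PATH --port 30000
```

```python
# Query the running server with SGLang's frontend DSL; this mirrors the
# image-QA pattern from the SGLang quick start and is only a sketch.
import sglang as sgl

@sgl.function
def image_qa(s, image_path, question):
    s += sgl.user(sgl.image(image_path) + question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=64))

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
state = image_qa.run(image_path="example.jpg", question="What is shown in this image?")
print(state["answer"])
```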

### LLaVA-NeXT (Video)

**Launch and Run on K Nodes** (a concrete two-node example follows these steps):
- Go to the sglang project directory:
```sh
cd PATH_TO/sglang
```
- First node:
```sh
bash examples/usage/llava_video/srt_example_llava_v.sh K 0 YOUR_VIDEO_PATH YOUR_MODEL_PATH FRAMES_PER_VIDEO
# e.g. bash examples/usage/llava_video/srt_example_llava_v.sh K 0 examples/usage/llava_video/videos/Q98Z4OTh8RwmDonc.mp4 lmms-lab/LLaVA-NeXT-Video-7B-DPO 16
```
- Second node:
```sh
bash examples/usage/llava_video/srt_example_llava_v.sh K 1 YOUR_VIDEO_PATH YOUR_MODEL_PATH FRAMES_PER_VIDEO
```
- The K-th (last) node:
```sh
bash examples/usage/llava_video/srt_example_llava_v.sh K K-1 YOUR_VIDEO_PATH YOUR_MODEL_PATH FRAMES_PER_VIDEO
```
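The positional arguments appear to be the total number of nodes K, this node's 0-indexed rank, the video path, the model path, and the number of frames sampled per video. For example, with two nodes and 16 frames per video (paths are placeholders):

```sh
# On node 0 of 2:
bash examples/usage/llava_video/srt_example_llava_v.sh 2 0 YOUR_VIDEO_PATH YOUR_MODEL_PATH 16
# On node 1 of 2:
bash examples/usage/llava_video/srt_example_llava_v.sh 2 1 YOUR_VIDEO_PATH YOUR_MODEL_PATH 16
```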


## Citation

If you find it useful for your research and applications, please cite related papers/blogs using this BibTeX:
1 change: 1 addition & 0 deletions docs/LLaVA-NeXT.md
@@ -1,6 +1,7 @@
# LLaVA-NeXT: Stronger LLMs Supercharge Multimodal Capabilities in the Wild

## Quick Start With HuggingFace
First, please install our repo with its code and environment: `pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git`

Here is a quick inference example using [`llavanext-llama3-8B`](https://huggingface.co/lmms-lab/llama3-llava-next-8b). You will need to install [`flash-attn`](https://github.com/Dao-AILab/flash-attention) to use this code snippet. If you don't want to install it, you can set `attn_implementation=None` when calling `load_pretrained_model`.
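A minimal sketch of such a call is shown below, assuming the `load_pretrained_model`, `process_images`, and `tokenizer_image_token` helpers and the `llava_llama_3` conversation template from this repo; paths and generation arguments are illustrative.

```python
# Sketch of quick inference with lmms-lab/llama3-llava-next-8b; helper names
# come from this repo, but treat the exact arguments as illustrative.
import copy
import torch
from PIL import Image

from llava.model.builder import load_pretrained_model
from llava.mm_utils import process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates

pretrained = "lmms-lab/llama3-llava-next-8b"
device = "cuda"

# Pass attn_implementation=None here if flash-attn is not installed.
tokenizer, model, image_processor, max_length = load_pretrained_model(
    pretrained, None, "llava_llama3", device_map="auto")
model.eval()

image = Image.open("path/to/image.png")  # placeholder image path
image_tensor = process_images([image], image_processor, model.config)
image_tensor = [t.to(dtype=torch.float16, device=device) for t in image_tensor]

# Build the prompt with the llama-3 chat template and the image placeholder token.
conv = copy.deepcopy(conv_templates["llava_llama_3"])
conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + "\nWhat is shown in this image?")
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX,
                                  return_tensors="pt").unsqueeze(0).to(device)

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensor,
        image_sizes=[image.size],
        do_sample=False,
        max_new_tokens=256,
    )
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])
```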
8 changes: 7 additions & 1 deletion llava/conversation.py
@@ -348,6 +348,12 @@ def dict(self):
sep2="</s>",
)

+try:
+    llama3_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
+except Exception as e:
+    print("Error loading llama3 tokenizer")
+    print(e)

conv_llava_llama_3 = Conversation(
system="You are a helpful language and vision assistant. " "You are able to understand the visual content that the user provides, " "and assist the user with a variety of tasks using natural language.",
roles=("<|start_header_id|>user", "<|start_header_id|>assistant"),
@@ -356,7 +362,7 @@ def dict(self):
    offset=0,
    sep_style=SeparatorStyle.LLAMA_3,
    tokenizer_id="meta-llama/Meta-Llama-3-8B-Instruct",
-    tokenizer=AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct"),
+    tokenizer=llama3_tokenizer,
    stop_token_ids=[128009],
)

5 changes: 2 additions & 3 deletions llava/model/__init__.py
@@ -14,8 +14,7 @@
try:
    exec(f"from .language_model.{model_name} import {model_classes}")
except ImportError:
-    import traceback
-
-    traceback.print_exc()
+    # import traceback
+    # traceback.print_exc()
    print(f"Failed to import {model_name} from llava.language_model.{model_name}")
    pass
4 changes: 4 additions & 0 deletions llava/model/multimodal_encoder/clip_encoder.py
@@ -108,3 +108,7 @@ def num_patches(self):
if "cls_patch" in self.select_feature:
_num_patches += 1
return _num_patches

@property
def image_size(self):
return self.config.image_size
8 changes: 4 additions & 4 deletions llavavid/model/builder.py
@@ -46,7 +46,7 @@ def load_pretrained_model(model_path, model_base, model_name, load_8bit=False, l
tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False)
print("Loading LLaVA from base model...")
if "mixtral" in model_name.lower():
-    model = LlavaMixtralForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=lora_cfg_pretrained, use_flash_attention_2=False, **kwargs)
+    model = LlavaMixtralForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=lora_cfg_pretrained, **kwargs)
else:
    model = LlavaLlamaForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=lora_cfg_pretrained, **kwargs)
token_num, tokem_dim = model.lm_head.out_features, model.lm_head.in_features
@@ -105,7 +105,7 @@ def load_from_hf(repo_id, filename, subfolder=None):
    model = LlavaMptForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, **kwargs)
elif "mixtral" in model_name.lower() and "vicuna" not in model_name.lower() and "mistral" not in model_name.lower():
    tokenizer = AutoTokenizer.from_pretrained(model_path)
-    model = LlavaMixtralForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, use_flash_attention_2=True, **kwargs)
+    model = LlavaMixtralForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, **kwargs)
elif "mistral" in model_name.lower() or "zephyr" in model_name.lower():
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    cfg_pretrained = AutoConfig.from_pretrained(model_path)
@@ -114,15 +114,15 @@ def load_from_hf(repo_id, filename, subfolder=None):
print(f"Overwriting config with {overwrite_config}")
for k, v in overwrite_config.items():
setattr(cfg_pretrained, k, v)
model = LlavaMistralForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, use_flash_attention_2=True, config=cfg_pretrained, **kwargs)
model = LlavaMistralForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, config=cfg_pretrained, **kwargs)
else:
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
cfg_pretrained = AutoConfig.from_pretrained(model_path)
if overwrite_config is not None:
print(f"Overwriting config with {overwrite_config}")
for k, v in overwrite_config.items():
setattr(cfg_pretrained, k, v)
model = LlavaLlamaForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, use_flash_attention_2=True, config=cfg_pretrained, **kwargs)
model = LlavaLlamaForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, config=cfg_pretrained, **kwargs)
else:
# Load language model
if model_base is not None:
