[Feat] Add llava qwen, llava mistral #419
Conversation
@kcz358 This is awesome! When can I use this? |
Thanks @kcz358 for this PR. I added some test functions here (used for debugging while hosting the demo). You can add the file somewhere and name it `test_httpserver_llava_llama3.py`:

```python
"""
Usage:
# Endpoint Service CLI: CUDA_VISIBLE_DEVICES=0,1,2,3 python -m sglang.launch_server --model-path lmms-lab/llama3-llava-next-8b --tokenizer-path lmms-lab/llama3-llava-next-8b-tokenizer --port=30000 --host="127.0.0.1" --tp-size=4
python3 test_httpserver_llava_llama3.py
Output:
"Stylish Feline: A Cat's Chic Adventure in a Pink Hoodie and Sunglasses"
"""
import argparse
import asyncio
import json
import time
import aiohttp
import requests
from llava.conversation import (
default_conversation,
conv_templates,
SeparatorStyle,
conv_llava_llama_3,
conv_qwen,
)
# installing latest llava-next: pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git

async def send_request(url, data, delay=0):
    await asyncio.sleep(delay)
    async with aiohttp.ClientSession() as session:
        async with session.post(url, json=data) as resp:
            output = await resp.json()
    return output

async def test_concurrent(args):
    url = f"{args.host}:{args.port}"

    response = []
    for i in range(1):
        response.append(
            send_request(
                url + "/generate",
                {
                    "text": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n<image>\nPlease generate caption towards this image.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
                    "image_data": "/mnt/bn/vl-research/workspace/boli01/projects/demos/sglang_codebase/examples/quick_start/images/cat.jpeg",
                    "sampling_params": {
                        "max_new_tokens": 1024,
                        "temperature": 0,
                        "top_p": 1.0,
                        "presence_penalty": 2,
                        "frequency_penalty": 2,
                        "stop": "<|eot_id|>",
                    },
                },
            )
        )

    rets = await asyncio.gather(*response)
    for ret in rets:
        print(ret["text"])

def test_streaming(args):
    url = f"{args.host}:{args.port}"
    pload = {
        "text": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n<image>\nPlease generate caption towards this image.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
        "sampling_params": {
            "max_new_tokens": 1024,
            "temperature": 0,
            "top_p": 1.0,
            "presence_penalty": 2,
            "frequency_penalty": 2,
            "stop": "<|eot_id|>",
        },
        "image_data": "/mnt/bn/vl-research/workspace/boli01/projects/demos/sglang_codebase/examples/quick_start/images/cat.jpeg",
        "stream": True,
    }
    response = requests.post(
        url + "/generate",
        json=pload,
        stream=True,
    )

    prev = 0
    for chunk in response.iter_lines(decode_unicode=False):
        chunk = chunk.decode("utf-8")
        if chunk and chunk.startswith("data:"):
            if chunk == "data: [DONE]":
                break
            data = json.loads(chunk[5:].strip("\n"))
            output = data["text"].strip()
            print(output[prev:], end="", flush=True)
            prev = len(output)
    print("")
# Endpoint Service CLI: CUDA_VISIBLE_DEVICES=0,1,2,3 python -m sglang.launch_server --model-path lmms-lab/llama3-llava-next-8b --tokenizer-path lmms-lab/llama3-llava-next-8b-tokenizer --port=30000 --host="127.0.0.1" --tp-size=4

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--host", type=str, default="http://127.0.0.1")
    parser.add_argument("--port", type=int, default=30000)
    args = parser.parse_args()

    asyncio.run(test_concurrent(args))
    test_streaming(args)
```
|
Hi @Iven2132, you can refer to the example above. |
Which example? I don't think it's been merged yet. I want to deploy the 110B and 72B models. |
Hi, @Luodian Can you give me an example without streaming? Just a simple code example. |
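For reference, a non-streaming call is simply a plain POST to /generate without "stream": True. The sketch below is illustrative, not the author's script: it assumes a server is already running at http://127.0.0.1:30000 (launched as in the CLI command quoted above), and the short prompt and local image path are placeholders.

```python
# Minimal non-streaming sketch (assumes the sglang server from the command above is running).
import requests

url = "http://127.0.0.1:30000/generate"
payload = {
    # For best results, use the full llama3 chat-template prompt shown in the script above;
    # this short prompt is only a placeholder.
    "text": "<image>\nPlease generate a caption for this image.",
    "image_data": "examples/quick_start/images/cat.jpeg",  # local path on the server machine
    "sampling_params": {"max_new_tokens": 128, "temperature": 0},
}
resp = requests.post(url, json=payload)
print(resp.json()["text"])
```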
@Luodian It's just logging "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained." but nothing is happening after that. Can you help?
|
You can use |
Also, you need to use a local path for |
What if I have a remote image URL or a base64 string? Can you tell me what the correct script should look like? |
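One possible approach, sketched below, is to download the remote image and pass it along as a base64 string. This assumes the /generate endpoint accepts base64-encoded image_data, which is not confirmed in this thread; if it only accepts local paths, write the bytes to a temporary file and pass that path instead. The URL is a hypothetical placeholder.

```python
# Hedged sketch: fetch a remote image and send it as base64.
# Assumption: the server accepts a base64 string in "image_data";
# otherwise, save the bytes to a temp file and pass that path.
import base64
import requests

image_url = "https://example.com/cat.jpeg"  # hypothetical remote image URL
image_b64 = base64.b64encode(requests.get(image_url).content).decode("utf-8")

payload = {
    "text": "<image>\nPlease generate a caption for this image.",
    "image_data": image_b64,
    "sampling_params": {"max_new_tokens": 128, "temperature": 0},
}
resp = requests.post("http://127.0.0.1:30000/generate", json=payload)
print(resp.json()["text"])
```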
@Iven2132, if you just want a simple demo, you can just use the example script for llava in the main branch. The pipeline is the same. You just need to change the model path and tokenizer path to the paths we provided and choose the correct chat template. |
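A sketch of that route, using sglang's frontend against an already-running server, is shown below. It mirrors the quick-start llava example pattern rather than the exact script in the main branch; the question text and image path are placeholders, and the chat-template handling for llava-next models may need adjustment.

```python
# Hedged sketch of the quick-start llava pipeline against a running server
# (launched with the `sglang.launch_server` command quoted earlier in this thread).
import sglang as sgl


@sgl.function
def image_qa(s, image_path, question):
    s += sgl.user(sgl.image(image_path) + question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=128))


sgl.set_default_backend(sgl.RuntimeEndpoint("http://127.0.0.1:30000"))

state = image_qa.run(
    image_path="examples/quick_start/images/cat.jpeg",  # placeholder local path
    question="Please generate a caption for this image.",
)
print(state["answer"])
```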
@kcz358 I tried this but have the same issue: it just says "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained." and gives no output. Logs:
Code:
|
Oh hey @kcz358, I have two questions: 1) Is it possible to directly pass a remote image URL? 2) How can I serve the model directly so it doesn't have to load every time? I just want to load the model once in my start_engine and use the generate function to get the output; I think this will be faster. Currently, my code is not printing anything. Here is my current code:
and here are the logs:
|
Hi @Iven2132, I am not sure how to do it with the OpenAI format, but based on my understanding of @Luodian's code, I believe you can put the payload in JSON and POST it to the URL. If you don't want to reload the model every time, could you set up the server with one script and then query it from another script? |
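Alternatively, if an in-process engine is preferred over a separate server process, a sketch along these lines keeps the model resident after a single load. It assumes sglang's Runtime frontend as used in the quick-start examples; the 8B model/tokenizer paths come from this thread, while the caption prompt and generate wrapper are illustrative.

```python
# Hedged sketch: load the model once (e.g. inside a start_engine hook) and reuse it.
# Assumes sglang's in-process Runtime; paths are the 8B ones quoted earlier in the thread.
import sglang as sgl


@sgl.function
def caption(s, image_path):
    s += sgl.user(sgl.image(image_path) + "Please generate a caption for this image.")
    s += sgl.assistant(sgl.gen("caption", max_tokens=128))


runtime = sgl.Runtime(
    model_path="lmms-lab/llama3-llava-next-8b",
    tokenizer_path="lmms-lab/llama3-llava-next-8b-tokenizer",
)
sgl.set_default_backend(runtime)  # done once; later calls reuse the loaded weights


def generate(image_path):
    state = caption.run(image_path=image_path)
    return state["caption"]
```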
I don't think I can do this on Modal, but does SGLang have any serve mode or feature where I can load the model once? We can do this with lmdeploy.
|
Hey @Luodian I tried your code example but it's not responding. Can you please help? Here is my code:
|
@kcz358 Can you please check my code? |
Now I am getting this:
|
This error is mainly caused by flash-attn and has nothing to do with sglang or llava. You might want to clean up your CUDA installation and reinstall. |
Hi~ Can you help check whether this PR can be merged? |
I don't think so; my code works when I comment out all the code that uses sglang and llava. |
@kcz358 @Luodian Here are the full logs:
|
@Iven2132, that's because llava and sglang both use flash-attn. The main cause is still the CUDA version flash-attn was built against. You can refer to oobabooga/text-generation-webui#4182 |
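A quick way to narrow down a flash-attn/CUDA mismatch is sketched below; which versions are correct depends on your environment, so treat this as a diagnostic rather than a fix.

```python
# Hedged diagnostic sketch: print the versions involved in a flash-attn / CUDA mismatch.
# If torch's CUDA build does not match the toolkit flash-attn was compiled against,
# reinstalling flash-attn (e.g. `pip install flash-attn --no-build-isolation`) usually helps.
import torch

print("torch:", torch.__version__)
print("torch built with CUDA:", torch.version.cuda)
print("GPU visible:", torch.cuda.is_available())

try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
except ImportError as err:
    print("flash-attn import failed:", err)
```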
@kcz358 @Luodian Are Qwen 72B and 110B supported? I'm using it like this:
But getting these errors:
|
@kcz358 @Luodian Also, the example code in examples/usage/llava/http_qwen_llava_test.py doesn't seem to work. It gives no response at all; the model gets loaded, and after that it just keeps logging. |
@Luodian Can you please run this code on Modal? https://modal.com/
|
@Luodian When generating the prompt, what conversation format were you using? Was it this:
|
@Luodian How long does it take for you to load the 72B model? When I run "python -m sglang.launch_server --model-path lmms-lab/llava-next-72b --tokenizer-path lmms-lab/llavanext-qwen-tokenizer --port=30000 --host="127.0.0.1" --tp-size=4" on 4*A100 80G, it doesn't even load the model after 5 minutes. |
Yes, we use this. |
I could use 4*A100 to run it. The bottleneck may be disk read? You could check whether the disk is actively reading the checkpoints; the checkpoint has around 30 safetensors shards. |
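A hedged way to check whether the shards are even on disk yet is sketched below; the Hugging Face cache path assumes the default hub layout and may differ on your system.

```python
# Hedged sketch: confirm the 72B checkpoint shards exist locally and how large they are.
# The cache path assumes the default Hugging Face hub layout; adjust if you set HF_HOME.
# While the server is loading, watching this total grow (or running `iostat -x 1`)
# indicates whether disk read is the bottleneck.
import glob
import os

pattern = os.path.expanduser(
    "~/.cache/huggingface/hub/models--lmms-lab--llava-next-72b/**/*.safetensors"
)
shards = glob.glob(pattern, recursive=True)
total_gb = sum(os.path.getsize(p) for p in shards) / 1e9
print(f"{len(shards)} safetensors shards, {total_gb:.1f} GB on disk")
```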
@Luodian How much disk space is needed to load the model? |
@kcz358 The two files |
OK! Thanks for this suggestion. Let me do the fix. |
This PR adds the following models:
Allowing people to use sglang to serve LLaVA-NeXT-Qwen 72B and 110B
Tokenizer:
About:
LLaVA-NeXT (stronger)
On January 30, 2024, we unveiled LLaVA-NeXT, a state-of-the-art Large Multimodal Model (LMM) developed using a cost-effective training method leveraging open resources.
Today, we expand LLaVA-NeXT with recent, stronger open LLMs, reporting our findings on more capable language models: