
[Feat] Add llava qwen, llava mistral #419

Merged: 9 commits merged into sgl-project:main on May 14, 2024

Conversation

kcz358
Contributor

@kcz358 kcz358 commented May 11, 2024

This PR adds the following models:

  • llava_qwen
  • llava_mistral

This allows people to use sglang to serve LLaVA-NeXT-Qwen 72B and 110B.

Tokenizer:


About: LLaVA-NeXT (stronger)

On January 30, 2024, we unveiled LLaVA-NeXT, a state-of-the-art Large Multimodal Model (LMM) developed using a cost-effective training method leveraging open resources.

Today, we expand LLaVA-NeXT with recent, stronger open LLMs and report our findings on more capable language models:

  1. Increasing multimodal capabilities with stronger and larger language models, up to 3x the model size. This allows LMMs to present better visual world knowledge and logical reasoning inherited from the LLM. It supports LLaMA3 (8B) and Qwen-1.5 (72B and 110B).
  2. Better visual chat for more real-life scenarios, covering different applications. To evaluate the improved multimodal capabilities in the wild, we collect and develop a new evaluation dataset, LLaVA-Bench (Wilder), which inherits the spirit of LLaVA-Bench (in-the-wild) to study daily-life visual chat and enlarges the data size for comprehensive evaluation.

@Iven2132

@kcz358 This is awesome! When can I use this?

@Luodian
Contributor

Luodian commented May 11, 2024

Thanks @kcz358 for this PR. I added some test functions here (used to debug during demo hosting). You can add them somewhere and name the file httpserver_llama3_llavanext.py

"""
Usage:
# Endpoint Service CLI: CUDA_VISIBLE_DEVICES=0,1,2,3 python -m sglang.launch_server --model-path lmms-lab/llama3-llava-next-8b --tokenizer-path lmms-lab/llama3-llava-next-8b-tokenizer --port=30000 --host="127.0.0.1" --tp-size=4

python3 test_httpserver_llava_llama3.py

Output:
"Stylish Feline: A Cat's Chic Adventure in a Pink Hoodie and Sunglasses"
"""

import argparse
import asyncio
import json
import time

import aiohttp
import requests

from llava.conversation import (
    default_conversation,
    conv_templates,
    SeparatorStyle,
    conv_llava_llama_3,
    conv_qwen,
)

# installing latest llava-next: pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git

async def send_request(url, data, delay=0):
    await asyncio.sleep(delay)
    async with aiohttp.ClientSession() as session:
        async with session.post(url, json=data) as resp:
            output = await resp.json()
    return output


async def test_concurrent(args):
    url = f"{args.host}:{args.port}"

    response = []
    for i in range(1):
        response.append(
            send_request(
                url + "/generate",
                {
                    "text": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language.<|eot_id|><|start_header_id|><|start_header_id|>user<|end_header_id|>\n\n<image>\nPlease generate caption towards this image.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
                    "image_data": "/mnt/bn/vl-research/workspace/boli01/projects/demos/sglang_codebase/examples/quick_start/images/cat.jpeg",
                    "sampling_params": {
                        "max_new_tokens": 1024,
                        "temperature": 0,
                        "top_p": 1.0,
                        "presence_penalty": 2,
                        "frequency_penalty": 2,
                        "stop": "<|eot_id|>",
                    },
                },
            )
        )

    rets = await asyncio.gather(*response)
    for ret in rets:
        print(ret["text"])


def test_streaming(args):
    url = f"{args.host}:{args.port}"
    pload = {
        "text": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language.<|eot_id|><|start_header_id|><|start_header_id|>user<|end_header_id|>\n\n<image>\nPlease generate caption towards this image.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
        "sampling_params": {
            "max_new_tokens": 1024,
            "temperature": 0,
            "top_p": 1.0,
            "presence_penalty": 2,
            "frequency_penalty": 2,
            "stop": "<|eot_id|>",
        },
        "image_data": "/mnt/bn/vl-research/workspace/boli01/projects/demos/sglang_codebase/examples/quick_start/images/cat.jpeg",
        "stream": True,
    }
    response = requests.post(
        url + "/generate",
        json=pload,
        stream=True,
    )

    prev = 0
    for chunk in response.iter_lines(decode_unicode=False):
        chunk = chunk.decode("utf-8")
        if chunk and chunk.startswith("data:"):
            if chunk == "data: [DONE]":
                break
            data = json.loads(chunk[5:].strip("\n"))
            output = data["text"].strip()
            print(output[prev:], end="", flush=True)
            prev = len(output)
    print("")

# Endpoint Service CLI: CUDA_VISIBLE_DEVICES=0,1,2,3 python -m sglang.launch_server --model-path lmms-lab/llama3-llava-next-8b --tokenizer-path lmms-lab/llama3-llava-next-8b-tokenizer --port=30000 --host="127.0.0.1" --tp-size=4
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--host", type=str, default="https://127.0.0.1")
    parser.add_argument("--port", type=int, default=30000)
    args = parser.parse_args()
    asyncio.run(test_concurrent(args))
    test_streaming(args)

@Luodian
Contributor

Luodian commented May 11, 2024

[image]

@kcz358
Contributor Author

kcz358 commented May 11, 2024

Hi @Iven2132 , you can refer to the example above.

@Iven2132

Iven2132 commented May 11, 2024

Hi @Iven2132 , you can refer to the example above.

Which example? I don't think it's been merged; I want to deploy the 110B and 72B models.

@Iven2132

Thanks @kcz358 for this PR. I added some test functions here (used to debug during demo hosting). …

Hi, @Luodian Can you give me an example without streaming? Just a simple code example.

@Iven2132

@Luodian It's just logging "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained." but nothing is happening after that. Can you help?

from pathlib import Path
from modal import Mount, asgi_app
import os
import time
import modal

GPU_CONFIG = modal.gpu.A100(memory=80, count=2)

vllm_image = (
    modal.Image.from_registry("nvidia/cuda:11.8.0-devel-ubuntu22.04", add_python="3.10")
    .apt_install("git", "wget", "cmake")
    .pip_install(
        "wheel==0.43.0",
        "torch==2.2.1",
        "torchvision==0.17.1",
        "transformers==4.40.0",
        "timm==0.9.12",
        "Pillow==10.3.0",
        "peft==0.8.2",
        "lmdeploy==0.4.1",
        "hf-transfer==0.1.6",
        "huggingface_hub==0.22.2",
        "nvidia-nccl-cu11==2.21.5"
    )
    .env({"HF_HUB_ENABLE_HF_TRANSFER": "1"})
    .run_commands("pip install flash-attn==2.5.2 --no-build-isolation")
    .run_commands("pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git")
    .run_commands("git clone https://github.com/sgl-project/sglang.git && cd sglang && pip install -e 'python[all]'")
)

app = modal.App("my-app")


@app.cls(
    gpu=GPU_CONFIG,
    timeout=1200,
    container_idle_timeout=1200,
    allow_concurrent_inputs=10,
    image=vllm_image,
)
class Model:
    @modal.enter()
    async def start_engine(self):
        import subprocess
        cmd = [
            'python', '-m', 'sglang.launch_server',
            '--model-path', 'lmms-lab/llama3-llava-next-8b',
            '--tokenizer-path', 'lmms-lab/llama3-llava-next-8b-tokenizer',
            '--port', '30000',
            '--host', '127.0.0.1',
            '--tp-size', '4'
        ]

        subprocess.run(cmd)
        
    @modal.method()
    async def generate(self, messages):
        import argparse
        import asyncio
        import json
        import time

        import aiohttp
        import requests

        from llava.conversation import (
            default_conversation,
            conv_templates,
            SeparatorStyle,
            conv_llava_llama_3,
            conv_qwen,
        )

        async def send_request(url, data, delay=0):
            await asyncio.sleep(delay)
            async with aiohttp.ClientSession() as session:
                async with session.post(url, json=data) as resp:
                    output = await resp.json()
            return output

        def test_streaming(args):
            url = "https://127.0.0.1:30000"
            pload = {
                "text": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language.<|eot_id|><|start_header_id|><|start_header_id|>user<|end_header_id|>\n\n<image>\nPlease generate caption towards this image.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
                "sampling_params": {
                    "max_new_tokens": 1024,
                    "temperature": 0,
                    "top_p": 1.0,
                    "presence_penalty": 2,
                    "frequency_penalty": 2,
                    "stop": "<|eot_id|>",
                },
                "image_data": "/mnt/bn/vl-research/workspace/boli01/projects/demos/sglang_codebase/examples/quick_start/images/cat.jpeg",
            }
            response = requests.post(
                url + "/generate",
                json=pload
            )

            prev = 0
            for chunk in response.iter_lines(decode_unicode=False):
                chunk = chunk.decode("utf-8")
                if chunk and chunk.startswith("data:"):
                    if chunk == "data: [DONE]":
                        break
                    data = json.loads(chunk[5:].strip("\n"))
                    output = data["text"].strip()
                    print(output[prev:], end="", flush=True)
                    prev = len(output)
            print("")

@Luodian
Contributor

Luodian commented May 11, 2024

asyncio.run(test_concurrent(args))

You can use asyncio.run(test_concurrent(args)) to avoid streaming output.
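
For anyone who just wants the simplest possible non-streaming call, the test_concurrent path above boils down to a single requests.post without the stream flag. A minimal sketch (the prompt is abbreviated and the image path is a placeholder; adjust both for your setup):

import requests

# Assumes the sglang server launched with the CLI above is listening on 127.0.0.1:30000.
url = "http://127.0.0.1:30000/generate"
payload = {
    "text": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful language and vision assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n<image>\nPlease generate caption towards this image.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
    "image_data": "/path/to/cat.jpeg",  # placeholder: point this at a local image
    "sampling_params": {"max_new_tokens": 128, "temperature": 0, "stop": "<|eot_id|>"},
}
resp = requests.post(url, json=payload)  # no "stream": True, so the full result comes back as one JSON object
print(resp.json()["text"])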

@Luodian
Contributor

Luodian commented May 11, 2024

@Luodian It's just logging "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained." but nothing is happening after that. Can you help? …

Also, you need to use a local path for image_data.
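
If the image only exists as a remote URL, one workaround is to download it to a local file first and point image_data at that path. A rough sketch (the URL and target path below are placeholders):

import requests

image_url = "https://example.com/cat.jpeg"  # placeholder remote image
local_path = "/tmp/cat.jpeg"                # any writable local path

resp = requests.get(image_url)
resp.raise_for_status()
with open(local_path, "wb") as f:
    f.write(resp.content)

# Then reference the downloaded file in the /generate payload:
# payload["image_data"] = local_path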

@Iven2132

Also, you need to use a local path for image_data.

What if I have a remote image URL or base64 data? Can you tell me what the correct script should look like?

@kcz358
Contributor Author

kcz358 commented May 11, 2024

@Iven2132, if you just want a simple demo, you can just use the example script for llava in the main branch. The pipeline is the same. You just need to change the model path and tokenizer path to the paths we provided and choose the correct chat template.
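
Concretely, that means something like the quick-start example with only the checkpoint paths swapped in. A sketch based on the snippet later in this thread, using the llama-3 checkpoints as an illustration (the chat template you prompt with has to match the underlying LLM):

import sglang as sgl

@sgl.function
def image_qa(s, image_path, question):
    # The prompt is rendered with the backend's chat template, so pick the one matching the model.
    s += sgl.user(sgl.image(image_path) + question)
    s += sgl.assistant(sgl.gen("answer"))

runtime = sgl.Runtime(
    model_path="lmms-lab/llama3-llava-next-8b",
    tokenizer_path="lmms-lab/llama3-llava-next-8b-tokenizer",
)
sgl.set_default_backend(runtime)

state = image_qa.run(image_path="./images/nyc.png", question="What is this?", max_new_tokens=64)
print(state["answer"])
runtime.shutdown()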

@Iven2132

@Iven2132, if you just want a simple demo, you can just use the example script for llava in the main branch. The pipeline is the same. You just need to change the model path and tokenizer path to the paths we provided and choose the correct chat template.

@kcz358 I tried this but got the same issue: it just says "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained." and gives no output.

Logs:

/usr/local/lib/python3.10/site-packages/transformers/models/llava/configuration_llava.py:103: FutureWarning: The `vocab_size` argument is deprecated and will be removed in v4.42, since it can be inferred from the `text_config`. Passing this argument has no effect
  warnings.warn(
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'PreTrainedTokenizerFast'. 
The class this function is called from is 'LlamaTokenizerFast'.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
server started on [0.0.0.0]:10005
server started on [0.0.0.0]:10004
server started on [0.0.0.0]:10006
server started on [0.0.0.0]:10007
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'PreTrainedTokenizerFast'. 
The class this function is called from is 'LlamaTokenizerFast'.
accepted ('127.0.0.1', 24790) with fd 44
welcome ('127.0.0.1', 24790)
accepted ('127.0.0.1', 58376) with fd 33
welcome ('127.0.0.1', 58376)
accepted ('127.0.0.1', 28615) with fd 33
welcome ('127.0.0.1', 28615)
accepted ('127.0.0.1', 23064) with fd 33
welcome ('127.0.0.1', 23064)
/usr/local/lib/python3.10/site-packages/transformers/models/llava/configuration_llava.py:143: FutureWarning: The `vocab_size` attribute is deprecated and will be removed in v4.42, Please use `text_config.vocab_size` instead.
  warnings.warn(
/usr/local/lib/python3.10/site-packages/transformers/models/llava/configuration_llava.py:143: FutureWarning: The `vocab_size` attribute is deprecated and will be removed in v4.42, Please use `text_config.vocab_size` instead.
  warnings.warn(
/usr/local/lib/python3.10/site-packages/transformers/models/llava/configuration_llava.py:143: FutureWarning: The `vocab_size` attribute is deprecated and will be removed in v4.42, Please use `text_config.vocab_size` instead.
  warnings.warn(
/usr/local/lib/python3.10/site-packages/transformers/models/llava/configuration_llava.py:143: FutureWarning: The `vocab_size` attribute is deprecated and will be removed in v4.42, Please use `text_config.vocab_size` instead.
  warnings.warn(
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

Code:

import modal

GPU_CONFIG = modal.gpu.A100(memory=80, count=2)

vllm_image = (
    modal.Image.from_registry(
        "nvidia/cuda:11.8.0-devel-ubuntu22.04", add_python="3.10")
    .apt_install("git", "wget", "cmake")
    .pip_install(
        "wheel==0.43.0",
        "torch==2.2.1",
        "torchvision==0.17.1",
        "transformers==4.40.0",
        "timm==0.9.12",
        "Pillow==10.3.0",
        "peft==0.8.2",
        "hf-transfer==0.1.6",
        "huggingface_hub==0.22.2",
        "nvidia-nccl-cu11==2.21.5"
    )
    .env({"HF_HUB_ENABLE_HF_TRANSFER": "1"})
    .run_commands("pip install flash-attn==2.5.2 --no-build-isolation")
    .run_commands("pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git")
    .run_commands('pip install "sglang[all]"')
    # .run_commands("git clone https://github.com/sgl-project/sglang.git && cd sglang && pip install -e 'python[all]'")
)

app = modal.App("app")


@app.cls(
    gpu=GPU_CONFIG,
    timeout=120,
    container_idle_timeout=120,
    allow_concurrent_inputs=10,
    image=vllm_image,
)
class Model:
    @modal.enter()
    async def start_engine(self):
        import sglang as sgl
        import requests

        response = requests.get(
            "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")
        with open("./images/nyc.png", 'wb') as file:
            file.write(response.content)

        @sgl.function
        def image_qa(s, image_path, question):
            s += sgl.user(sgl.image(image_path) + question)
            s += sgl.assistant(sgl.gen("answer"))

        runtime = sgl.Runtime(model_path="lmms-lab/llama3-llava-next-8b",
                              tokenizer_path="lmms-lab/llama3-llava-next-8b-tokenizer")
        sgl.set_default_backend(runtime)

        state = image_qa.run(
            image_path="./images/nyc.png",
            question="What is this?",
            max_new_tokens=64)
        print(state["answer"], "\n")

        runtime.shutdown()

    @modal.method()
    async def generate(self):
        print("HI")

        
@app.local_entrypoint()
def main():
    Model().generate.remote()

@Iven2132

Iven2132 commented May 11, 2024

Oh hey @kcz358, I have two questions: 1) Is it possible to directly pass a remote image URL? 2) How can I serve the model so it doesn't have to load every time? I just want to load the model once in start_engine and use the generate function to get the output. I think this will be faster.

Currently, my code is not printing anything.

Here is my current code:

class Model:
    @modal.enter()
    async def start_engine(self):
        import sglang as sgl
        import requests
        import subprocess

        # response = requests.get(
        #     "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")
        # with open("./nyc.png", 'wb') as file:
        #     file.write(response.content)

        command = 'python -m sglang.launch_server --model-path lmms-lab/llama3-llava-next-8b --tokenizer-path lmms-lab/llama3-llava-next-8b-tokenizer --port=30000 --host="127.0.0.1"'
        subprocess.run(command, shell=True)


    @modal.method()
    async def generate(self):
        print("Generating")
        import openai
        client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")
        response = client.chat.completions.create(
            model="default",
            messages=[
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": "Where is this located?"},
                        {
                            "type": "image_url",
                            "image_url": {
                                "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                            },
                        },
                    ],
                }
            ],
            temperature=0,
            max_tokens=64,
        )
        print(response)

and here are the logs:

/usr/local/lib/python3.10/site-packages/transformers/models/llava/configuration_llava.py:103: FutureWarning: The `vocab_size` argument is deprecated and will be removed in v4.42, since it can be inferred from the `text_config`. Passing this argument has no effect
  warnings.warn(
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'PreTrainedTokenizerFast'. 
The class this function is called from is 'LlamaTokenizerFast'.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
/usr/local/lib/python3.10/site-packages/transformers/models/llava/configuration_llava.py:143: FutureWarning: The `vocab_size` attribute is deprecated and will be removed in v4.42, Please use `text_config.vocab_size` instead.
  warnings.warn(
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'PreTrainedTokenizerFast'. 
The class this function is called from is 'LlamaTokenizerFast'.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Rank 0: load weight begin.
/usr/local/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
INFO 05-11 17:24:09 weight_utils.py:177] Using model weights format ['*.safetensors']
INFO 05-11 17:25:40 weight_utils.py:177] Using model weights format ['*.safetensors']
Rank 0: load weight end.
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'PreTrainedTokenizerFast'. 
The class this function is called from is 'LlamaTokenizerFast'.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Rank 0: max_total_num_token=452279, max_prefill_num_token=75379, context_len=8192, 
disable_radix_cache=False, enable_flashinfer=False, disable_regex_jump_forward=False, disable_disk_cache=False, attention_reduce_in_fp32=False
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO:     Started server process [25]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on https://127.0.0.1:30000 (Press CTRL+C to quit)
INFO:     127.0.0.1:18252 - "GET /get_model_info HTTP/1.1" 200 OK
new fill batch. #seq: 1. #cached_token: 0. #new_token: 9. #remaining_req: 0. #running_req: 0. tree_cache_hit_rate: 0.00%.
/usr/local/lib/python3.10/site-packages/transformers/models/llava/configuration_llava.py:143: FutureWarning: The `vocab_size` attribute is deprecated and will be removed in v4.42, Please use `text_config.vocab_size` instead.
  warnings.warn(
/usr/local/lib/python3.10/site-packages/transformers/models/llava/configuration_llava.py:143: FutureWarning: The `vocab_size` attribute is deprecated and will be removed in v4.42, Please use `text_config.vocab_size` instead.
  warnings.warn(
INFO:     127.0.0.1:52481 - "POST /generate HTTP/1.1" 200 OK

@kcz358
Contributor Author

kcz358 commented May 12, 2024

Hi @Iven2132, I am not sure how to do it with the openai format. But based on my understanding of the code from @Luodian, I believe you can put the payload in JSON and post it to the URL. If you don't want to reload the model every time, is it possible to start the server with one script and then query it from another script?
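
As a rough sketch of that two-script split (both halves already appear elsewhere in this thread; the prompt and image path are placeholders):

# Script 1: start the server once and leave it running, e.g.
#   python -m sglang.launch_server --model-path lmms-lab/llama3-llava-next-8b \
#       --tokenizer-path lmms-lab/llama3-llava-next-8b-tokenizer --port 30000 --host 127.0.0.1

# Script 2: query the already-loaded model as many times as needed.
import requests

payload = {
    "text": "<prompt built with the matching chat template>",  # placeholder
    "image_data": "/path/to/image.jpg",                        # placeholder local path
    "sampling_params": {"max_new_tokens": 64, "temperature": 0},
}
print(requests.post("http://127.0.0.1:30000/generate", json=payload).json()["text"])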

@Iven2132

Hi @Iven2132, I am not sure how to do it with the openai format. But based on my understanding of the code from @Luodian, I believe you can put the payload in JSON and post it to the URL. If you don't want to reload the model every time, is it possible to start the server with one script and then query it from another script?

I don't think I can do this on Modal, but does sglang have any serve mode or feature where I can load the model once? We can do this with lmdeploy:

def start_engine(self):
    from lmdeploy import serve, ChatTemplateConfig
    self.server = serve('OpenGVLab/InternVL-Chat-V1-2-Plus',
      chat_template_config=ChatTemplateConfig(model_name='internvl-zh-hermes2'),
      server_name='0.0.0.0',
      server_port=23333)

@Iven2132

Hey @Luodian I tried your code example but it's not responding. Can you please help?

Here is my code:

import modal

GPU_CONFIG = modal.gpu.A100(memory=80, count=2)

vllm_image = (
    modal.Image.from_registry(
        "nvidia/cuda:11.8.0-devel-ubuntu22.04", add_python="3.10")
    .apt_install("git", "wget", "cmake")
    .pip_install(
        "wheel==0.43.0",
        "torch==2.2.1",
        "torchvision==0.17.1",
        "transformers==4.40.0",
        "timm==0.9.12",
        "Pillow==10.3.0",
        "peft==0.8.2",
        "hf-transfer==0.1.6",
        "huggingface_hub==0.22.2",
        "nvidia-nccl-cu11==2.21.5"
    )
    .env({"HF_HUB_ENABLE_HF_TRANSFER": "1"})
    .run_commands("pip install flash-attn==2.5.2 --no-build-isolation")
    .run_commands("pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git")
    .run_commands("git clone https://github.com/sgl-project/sglang.git && cd sglang && pip install -e 'python[all]'")
)

app = modal.App("test-sgl-app")

@app.cls(
    gpu=GPU_CONFIG,
    timeout=1200,
    container_idle_timeout=1200,
    allow_concurrent_inputs=10,
    image=vllm_image,
)
class Model:
    @modal.enter()
    async def start_engine(self):
        import subprocess

        command = 'python -m sglang.launch_server --model-path lmms-lab/llama3-llava-next-8b --tokenizer-path lmms-lab/llama3-llava-next-8b-tokenizer --port=30000 --host="127.0.0.1"'
        subprocess.run(command, shell=True)


    @modal.method()
    async def generate(self):
        import requests
        import copy

        from llava.conversation import (
            conv_llava_llama_3,
        )

        url = "https://127.0.0.1:30000/generate"

        prompt = "<image>\nPlease generate caption towards this image."
        conv_template = copy.deepcopy(conv_llava_llama_3)
        conv_template.append_message(role="user", message=prompt)
        prompt_with_template = conv_template.get_prompt()
        data = {
            "text": prompt_with_template,
            "image_data": "https://farm4.staticflickr.com/3175/2653711032_804ff86d81_z.jpg",
            "sampling_params": {
                "max_new_tokens": 30,
                "temperature": 0,
                "top_p": 1.0,
                "presence_penalty": 2,
                "frequency_penalty": 2,
                "stop": "",
            },
        }
        response = requests.post(url, json=data)
        data = response.json()
        print(data["text"])
        

        
@app.local_entrypoint()
def main():
    Model().generate.remote()

@Iven2132

Hey @Luodian I tried your code example but it's not responding. Can you please help? …

@kcz358 Can you please check my code?

@Iven2132

Iven2132 commented May 12, 2024

now i am getting this

 Failed to import transformers.models.llama.modeling_llama because of the following error (look up to see its traceback):
/usr/local/lib/python3.10/site-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: 
_ZN2at4_ops15sum_IntList_out4callERKNS_6TensorEN3c1016OptionalArrayRefIlEEbSt8optionalINS5_10ScalarTypeEERS2_

@kcz358
Contributor Author

kcz358 commented May 13, 2024

now i am getting this

 Failed to import transformers.models.llama.modeling_llama because of the following error (look up to see its traceback):
/usr/local/lib/python3.10/site-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: 
_ZN2at4_ops15sum_IntList_out4callERKNS_6TensorEN3c1016OptionalArrayRefIlEEbSt8optionalINS5_10ScalarTypeEERS2_

This error is mainly caused by flash-attn and has nothing to do with sglang or llava. You might want to clean up your cuda and reinstall.

@Luodian
Contributor

Luodian commented May 13, 2024

@Qubitium @merrymercy

Hi~ Can you help check whether this PR can be merged?

@Iven2132

now i am getting this

 Failed to import transformers.models.llama.modeling_llama because of the following error (look up to see its traceback):
/usr/local/lib/python3.10/site-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: 
_ZN2at4_ops15sum_IntList_out4callERKNS_6TensorEN3c1016OptionalArrayRefIlEEbSt8optionalINS5_10ScalarTypeEERS2_

This error is mainly caused by flash-attn and has nothing to do with sglang or llava. You might want to clean up your cuda and reinstall.

I don't think so; my code works when I comment out all the code that uses sglang and llava.

@Iven2132

now i am getting this

 Failed to import transformers.models.llama.modeling_llama because of the following error (look up to see its traceback):
/usr/local/lib/python3.10/site-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: 
_ZN2at4_ops15sum_IntList_out4callERKNS_6TensorEN3c1016OptionalArrayRefIlEEbSt8optionalINS5_10ScalarTypeEERS2_

@kcz358 @Luodian Here are the full logs:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/transformers/utils/import_utils.py", line 1510, in _get_module
    return importlib.import_module("." + module_name, self.__name__)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1204, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1176, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1147, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 940, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/local/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 55, in <module>
    from flash_attn import flash_attn_func, flash_attn_varlen_func
  File "/usr/local/lib/python3.11/site-packages/flash_attn/__init__.py", line 3, in <module>
    from flash_attn.flash_attn_interface import (
  File "/usr/local/lib/python3.11/site-packages/flash_attn/flash_attn_interface.py", line 10, in <module>
    import flash_attn_2_cuda as flash_attn_cuda
ImportError: /usr/local/lib/python3.11/site-packages/flash_attn_2_cuda.cpython-311-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda9SetDeviceEi

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/pkg/modal/_container_io_manager.py", line 458, in handle_user_exception
    yield
  File "/pkg/modal/_container_entrypoint.py", line 493, in call_lifecycle_functions
    event_loop.run(res)
  File "/pkg/modal/_container_entrypoint.py", line 162, in run
    return self.loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/base_events.py", line 653, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/root/main.py", line 199, in start_engine
    from llava.conversation import (
  File "/usr/local/lib/python3.11/site-packages/llava/__init__.py", line 1, in <module>
    from .model import LlavaLlamaForCausalLM
  File "/usr/local/lib/python3.11/site-packages/llava/model/__init__.py", line 15, in <module>
    exec(f"from .language_model.{model_name} import {model_classes}")
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python3.11/site-packages/llava/model/language_model/llava_llama.py", line 25, in <module>
    from transformers import LlamaModel, LlamaForCausalLM
  File "<frozen importlib._bootstrap>", line 1229, in _handle_fromlist
  File "/usr/local/lib/python3.11/site-packages/transformers/utils/import_utils.py", line 1501, in __getattr__
    value = getattr(module, name)
            ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/transformers/utils/import_utils.py", line 1500, in __getattr__
    module = self._get_module(self._class_to_module[name])
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/transformers/utils/import_utils.py", line 1512, in _get_module
    raise RuntimeError(
RuntimeError: Failed to import transformers.models.llama.modeling_llama because of the following error (look up to see its traceback):
/usr/local/lib/python3.11/site-packages/flash_attn_2_cuda.cpython-311-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda9SetDeviceEi

@kcz358
Contributor Author

kcz358 commented May 14, 2024

@Iven2132, that's because you are using llava and sglang, which both use flash-attn. The main cause is still the CUDA version mismatch for flash-attn. You can refer to oobabooga/text-generation-webui#4182
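
A quick diagnostic is to compare the torch/CUDA build in the environment with the flash-attn wheel that got installed; undefined-symbol errors like this usually mean the two were built against different versions (a sketch, not specific to Modal):

import torch

print("torch:", torch.__version__, "built for CUDA:", torch.version.cuda)

try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
except ImportError as e:
    # Typical fix: uninstall flash-attn and reinstall it against the installed torch/CUDA combination.
    print("flash-attn import failed:", e)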

@Iven2132

@Iven2132, that's because you are using llava and sglang, which both use flash-attn. The main cause is still the CUDA version mismatch for flash-attn. You can refer to oobabooga/text-generation-webui#4182

@kcz358 @Luodian are Qwen 72B and 110B supported? I'm using it like this:

       tokenizer_path="lmms-lab/llavanext-qwen-tokenizer")
       sgl.set_default_backend(runtime)

But getting these errors:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/transformers/models/auto/configuration_auto.py", line 945, in from_pretrained
    config_class = CONFIG_MAPPING[config_dict["model_type"]]
                   ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/transformers/models/auto/configuration_auto.py", line 647, in __getitem__
    raise KeyError(key)
KeyError: 'llava_qwen'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/sglang/python/sglang/srt/server.py", line 140, in launch_server
    tokenizer_manager = TokenizerManager(server_args, port_args)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sglang/python/sglang/srt/managers/tokenizer_manager.py", line 98, in __init__
    self.hf_config = get_config(
                     ^^^^^^^^^^^
  File "/sglang/python/sglang/srt/hf_transformers_utils.py", line 34, in get_config
    config = AutoConfig.from_pretrained(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/transformers/models/auto/configuration_auto.py", line 947, in from_pretrained
    raise ValueError(
ValueError: The checkpoint you are trying to load has model type `llava_qwen` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.
Traceback (most recent call last):
  File "/pkg/modal/_container_io_manager.py", line 458, in handle_user_exception
    yield
  File "/pkg/modal/_container_entrypoint.py", line 493, in call_lifecycle_functions
    event_loop.run(res)
  File "/pkg/modal/_container_entrypoint.py", line 162, in run
    return self.loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/base_events.py", line 653, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/root/main.py", line 204, in start_engine
    runtime = sgl.Runtime(model_path="lmms-lab/llava-next-72b",
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sglang/python/sglang/api.py", line 38, in Runtime
    return Runtime(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sglang/python/sglang/srt/server.py", line 265, in __init__
    raise RuntimeError("Initialization failed. Please see the error messages above.")
RuntimeError: Initialization failed. Please see the error messages above.

@merrymercy merrymercy merged commit 664287b into sgl-project:main May 14, 2024
@Iven2132

@kcz358 @Luodian Also, the example code in examples/usage/llava/http_qwen_llava_test.py doesn't seem to work. It's not giving any response at all; after the model loads, it just logs /usr/local/lib/python3.11/site-packages/transformers/models/llava/configuration_llava.py:143: FutureWarning: The `vocab_size` attribute is deprecated and will be removed in v4.42, Please use `text_config.vocab_size` instead.

@Luodian
Contributor

Luodian commented May 14, 2024

It seems there are some issues related to the merge and your local environment. Please try installing the packages following these steps, clean the huggingface cache folder, and uninstall flash-attn (if you run into problems).

# Installing latest llava-next: pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git
# Installing latest sglang: cd ~/sglang; pip install -e "python[all]"
# Installing latest vllm: pip install vllm==0.4.2
# Installing latest flashinfer (in case any error): pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.3/

# Endpoint Service CLI: 
python3 -m sglang.launch_server --model-path lmms-lab/llava-next-72b --tokenizer-path lmms-lab/llavanext-qwen-tokenizer --port=30000 --host="127.0.0.1" --tp-size=4

# In another tmux
python3 http_qwen_llava_test.py

Output:
"Two children pose with a large teddy bear, one holding a smaller stuffed bear, in a room with an American flag and potted plants."

Here's my output (don't mind the warnings; --port needs to be set according to the endpoint message, which tells you which port the model endpoint starts on):
[image]

@Iven2132

It seems there are some issues related to the merge and your local environment. …

@Luodian Can you please run this code on Modal? https://modal.com/

import modal
from modal import asgi_app

GPU_CONFIG = modal.gpu.A100(memory=80, count=2)

vllm_image = (
    modal.Image.from_registry(
        "nvidia/cuda:12.2.0-devel-ubuntu22.04", add_python="3.11"
    )
    .apt_install("git", "wget", "cmake")
    .pip_install(
        "wheel==0.43.0",
        "torch==2.2.1",
        "torchvision==0.17.1",
        "transformers==4.40.2",
        "timm==0.9.12",
        "Pillow==10.3.0",
        "peft==0.8.2",
        "hf-transfer==0.1.6",
        "huggingface_hub==0.22.2",
        "requests==2.31.0",
    )
    .env({"HF_HUB_ENABLE_HF_TRANSFER": "1"})
    .run_commands("pip install flash-attn --no-build-isolation")
    .run_commands("pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git")
    .run_commands("git clone https://github.com/sgl-project/sglang.git && cd sglang && pip install -e 'python[all]'")
)

app = modal.App("test-app-sgl")

@app.cls(
    gpu=GPU_CONFIG,
    timeout=1200,
    container_idle_timeout=1200,
    allow_concurrent_inputs=10,
    image=vllm_image,
)
class Model:
    @modal.enter()
    async def start_engine(self):
        import subprocess
        command = [
            "python", "-m", "sglang.launch_server",
            "--model-path", "lmms-lab/llama3-llava-next-8b",
            "--tokenizer-path", "lmms-lab/llama3-llava-next-8b-tokenizer",
            "--port=30000",
            "--host=127.0.0.1",
            "--tp-size=4"
        ]
        result = subprocess.run(command)
        print("Standard Output:", result)

    @modal.method()
    async def generate(self):
        print("Generating")

        import requests

        url = "https://127.0.0.1:30000/generate"

        data = {
            "text": "Hi",
            "image_data": "https://farm4.staticflickr.com/3175/2653711032_804ff86d81_z.jpg",
            "sampling_params": {
                "max_new_tokens": 30,
                "temperature": 0,
                "top_p": 1.0,
                "presence_penalty": 2,
                "frequency_penalty": 2,
                "stop": "",
            },
        }
        response = requests.post(url, json=data)
        print(response)


@app.local_entrypoint()
def main():
    Model().generate.remote()

@vedantroy

@Luodian When generating the prompt, what conversation format were you using? Was it this:

conv_llava_llama_3 = Conversation(
    system="You are a helpful language and vision assistant. " "You are able to understand the visual content that the user provides, " "and assist the user with a variety of tasks using natural language.",
    roles=("<|start_header_id|>user", "<|start_header_id|>assistant"),
    version="llama_v3",
    messages=[],
    offset=0,
    sep_style=SeparatorStyle.LLAMA_3,
    tokenizer_id="meta-llama/Meta-Llama-3-8B-Instruct",
    tokenizer=AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct"),
    stop_token_ids=[128009],
)

@Iven2132

@Luodian How much time does it take for you to load the 72B model when you run "python -m sglang.launch_server --model-path lmms-lab/llava-next-72b --tokenizer-path lmms-lab/llavanext-qwen-tokenizer --port=30000 --host="127.0.0.1" --tp-size=4"?

I'm running 4x A100 80G and it doesn't even load the model after 5 minutes.

@Luodian
Contributor

Luodian commented May 16, 2024

@Luodian When generating the prompt, what conversation format were you using? Was it this: …

Yes, we use this.

@Luodian
Contributor

Luodian commented May 16, 2024

@Luodian How much time does it take for you to load the 72B model when you run "python -m sglang.launch_server --model-path lmms-lab/llava-next-72b --tokenizer-path lmms-lab/llavanext-qwen-tokenizer --port=30000 --host="127.0.0.1" --tp-size=4"?

I'm running 4x A100 80G and it doesn't even load the model after 5 minutes.

I was able to run it with 4x A100. The bottleneck may be disk read? You could check whether the disk is actively reading the checkpoints; there are around 30 safetensors.
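
One simple way to check whether the load is disk-bound is to watch the disk read counters while the server is starting up. A sketch using psutil (an extra dependency, not part of sglang):

import time

import psutil  # pip install psutil

before = psutil.disk_io_counters().read_bytes
time.sleep(5)
after = psutil.disk_io_counters().read_bytes
# A rate near zero while the weights are still "loading" suggests the bottleneck is elsewhere.
print(f"disk read rate: {(after - before) / 5 / 1e6:.1f} MB/s")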

@Iven2132

I was able to run it with 4x A100. The bottleneck may be disk read? You could check whether the disk is actively reading the checkpoints; there are around 30 safetensors.

@Luodian What should be the disk size to load the model?

@merrymercy
Contributor

merrymercy commented May 21, 2024

@kcz358 The two files python/sglang/srt/models/llava_mistral.py and python/sglang/srt/models/llava_qwen.py are almost identical to python/sglang/srt/models/llava.py. We should try to reduce the redundancy here. Can you refactor LlavaQwenForCausalLM and LlavaMistralForCausalLM to be subclasses of LlavaLlamaForCausalLM and reuse as many functions as possible?
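
For reference, the shape of the requested refactor is roughly the following: keep all the shared multimodal logic in LlavaLlamaForCausalLM and have the subclasses only swap out the underlying language model. This is a schematic sketch with placeholder classes, not the actual sglang code (the real constructors take more arguments):

# Placeholders standing in for the real language-model classes used by sglang.
class LlamaForCausalLM: ...
class Qwen2ForCausalLM: ...
class MistralForCausalLM: ...


class LlavaLlamaForCausalLM:
    """Holds the shared LLaVA logic (vision tower, projector, image-feature handling, forward)."""

    def __init__(self, config):
        self.config = config
        self.language_model = self.build_language_model(config)

    def build_language_model(self, config):
        return LlamaForCausalLM()


class LlavaQwenForCausalLM(LlavaLlamaForCausalLM):
    def build_language_model(self, config):
        return Qwen2ForCausalLM()


class LlavaMistralForCausalLM(LlavaLlamaForCausalLM):
    def build_language_model(self, config):
        return MistralForCausalLM()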

@Luodian
Contributor

Luodian commented May 21, 2024

@kcz358 The two files python/sglang/srt/models/llava_mistral.py and python/sglang/srt/models/llava_qwen.py are almost identical to python/sglang/srt/models/llava.py. We should try to reduce the redundancy here. Can you refactor LlavaQwenForCausalLM and LlavaMistralForCausalLM to be subclasses of LlavaLlamaForCausalLM and reuse as many functions as possible?

OK! Thanks for this suggestion. Let me do the fix.
