
[Feat] Add llava qwen, llava mistral #419

Merged: 9 commits merged into sgl-project:main on May 14, 2024

Conversation

kcz358
Contributor

@kcz358 kcz358 commented May 11, 2024

This PR adds the following models:

  • llava_qwen
  • llava_mistral

This allows people to use sglang to serve LLaVA-NeXT-Qwen 72B and 110B.

Tokenizer:


About: LLaVA-NeXT (stronger)

On January 30, 2024, we unveiled LLaVA-NeXT, a state-of-the-art Large Multimodal Model (LMM) developed using a cost-effective training method leveraging open resources.

Today, we expand LLaVA-NeXT with recent, stronger open LLMs and report our findings on more capable language models:

  1. Increasing multimodal capabilities with stronger and larger language models, up to 3x the model size. This allows LMMs to present better visual world knowledge and logical reasoning inherited from the LLM. It supports LLaMA3 (8B) and Qwen-1.5 (72B and 110B).
  2. Better visual chat for more real-life scenarios, covering different applications. To evaluate the improved multimodal capabilities in the wild, we collect and develop a new evaluation dataset, LLaVA-Bench (Wilder), which inherits the spirit of LLaVA-Bench (in-the-wild) to study daily-life visual chat and enlarges the data size for comprehensive evaluation.

@Iven2132

@kcz358 This is awesome! When can I use this?

@Luodian
Contributor

Luodian commented May 11, 2024

Thanks @kcz358 for this PR. I added some test functions here (used to debug during demo hosting). You can add them somewhere and name the file httpserver_llama3_llavanext.py

"""
Usage:
# Endpoint Service CLI: CUDA_VISIBLE_DEVICES=0,1,2,3 python -m sglang.launch_server --model-path lmms-lab/llama3-llava-next-8b --tokenizer-path lmms-lab/llama3-llava-next-8b-tokenizer --port=30000 --host="127.0.0.1" --tp-size=4

python3 test_httpserver_llava_llama3.py

Output:
"Stylish Feline: A Cat's Chic Adventure in a Pink Hoodie and Sunglasses"
"""

import argparse
import asyncio
import json
import time

import aiohttp
import requests

from llava.conversation import (
    default_conversation,
    conv_templates,
    SeparatorStyle,
    conv_llava_llama_3,
    conv_qwen,
)

# installing latest llava-next: pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git

async def send_request(url, data, delay=0):
    await asyncio.sleep(delay)
    async with aiohttp.ClientSession() as session:
        async with session.post(url, json=data) as resp:
            output = await resp.json()
    return output


async def test_concurrent(args):
    url = f"{args.host}:{args.port}"

    response = []
    for i in range(1):
        response.append(
            send_request(
                url + "/generate",
                {
                    "text": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language.<|eot_id|><|start_header_id|><|start_header_id|>user<|end_header_id|>\n\n<image>\nPlease generate caption towards this image.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
                    "image_data": "/mnt/bn/vl-research/workspace/boli01/projects/demos/sglang_codebase/examples/quick_start/images/cat.jpeg",
                    "sampling_params": {
                        "max_new_tokens": 1024,
                        "temperature": 0,
                        "top_p": 1.0,
                        "presence_penalty": 2,
                        "frequency_penalty": 2,
                        "stop": "<|eot_id|>",
                    },
                },
            )
        )

    rets = await asyncio.gather(*response)
    for ret in rets:
        print(ret["text"])


def test_streaming(args):
    url = f"{args.host}:{args.port}"
    pload = {
        "text": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language.<|eot_id|><|start_header_id|><|start_header_id|>user<|end_header_id|>\n\n<image>\nPlease generate caption towards this image.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
        "sampling_params": {
            "max_new_tokens": 1024,
            "temperature": 0,
            "top_p": 1.0,
            "presence_penalty": 2,
            "frequency_penalty": 2,
            "stop": "<|eot_id|>",
        },
        "image_data": "/mnt/bn/vl-research/workspace/boli01/projects/demos/sglang_codebase/examples/quick_start/images/cat.jpeg",
        "stream": True,
    }
    response = requests.post(
        url + "/generate",
        json=pload,
        stream=True,
    )

    prev = 0
    for chunk in response.iter_lines(decode_unicode=False):
        chunk = chunk.decode("utf-8")
        if chunk and chunk.startswith("data:"):
            if chunk == "data: [DONE]":
                break
            data = json.loads(chunk[5:].strip("\n"))
            output = data["text"].strip()
            print(output[prev:], end="", flush=True)
            prev = len(output)
    print("")

# Endpoint Service CLI: CUDA_VISIBLE_DEVICES=0,1,2,3 python -m sglang.launch_server --model-path lmms-lab/llama3-llava-next-8b --tokenizer-path lmms-lab/llama3-llava-next-8b-tokenizer --port=30000 --host="127.0.0.1" --tp-size=4
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--host", type=str, default="https://127.0.0.1")
    parser.add_argument("--port", type=int, default=30000)
    args = parser.parse_args()
    asyncio.run(test_concurrent(args))
    test_streaming(args)

@Luodian
Contributor

Luodian commented May 11, 2024

[image]

@kcz358
Contributor Author

kcz358 commented May 11, 2024

Hi @Iven2132 , you can refer to the example above.

@Iven2132

Iven2132 commented May 11, 2024

Hi @Iven2132 , you can refer to the example above.

Which example? I don't think it's been merged; I want to deploy the 110B and 72B models.

@Iven2132

Thanks @kcz358 for this PR. I added some test functions here (used to debug during demo hosting). …

Hi, @Luodian Can you give me an example without streaming? Just a simple code example.

@Iven2132

@Luodian It's just logging "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained." but nothing is happening after that. Can you help?

from pathlib import Path
from modal import Mount, asgi_app
import os
import time
import modal

GPU_CONFIG = modal.gpu.A100(memory=80, count=2)

vllm_image = (
    modal.Image.from_registry("nvidia/cuda:11.8.0-devel-ubuntu22.04", add_python="3.10")
    .apt_install("git", "wget", "cmake")
    .pip_install(
        "wheel==0.43.0",
        "torch==2.2.1",
        "torchvision==0.17.1",
        "transformers==4.40.0",
        "timm==0.9.12",
        "Pillow==10.3.0",
        "peft==0.8.2",
        "lmdeploy==0.4.1",
        "hf-transfer==0.1.6",
        "huggingface_hub==0.22.2",
        "nvidia-nccl-cu11==2.21.5"
    )
    .env({"HF_HUB_ENABLE_HF_TRANSFER": "1"})
    .run_commands("pip install flash-attn==2.5.2 --no-build-isolation")
    .run_commands("pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git")
    .run_commands("git clone https://github.com/sgl-project/sglang.git && cd sglang && pip install -e 'python[all]'")
)

app = modal.App("my-app")


@app.cls(
    gpu=GPU_CONFIG,
    timeout=1200,
    container_idle_timeout=1200,
    allow_concurrent_inputs=10,
    image=vllm_image,
)
class Model:
    @modal.enter()
    async def start_engine(self):
        import subprocess
        cmd = [
            'python', '-m', 'sglang.launch_server',
            '--model-path', 'lmms-lab/llama3-llava-next-8b',
            '--tokenizer-path', 'lmms-lab/llama3-llava-next-8b-tokenizer',
            '--port', '30000',
            '--host', '127.0.0.1',
            '--tp-size', '4'
        ]

        subprocess.run(cmd)
        
    @modal.method()
    async def generate(self, messages):
        import argparse
        import asyncio
        import json
        import time

        import aiohttp
        import requests

        from llava.conversation import (
            default_conversation,
            conv_templates,
            SeparatorStyle,
            conv_llava_llama_3,
            conv_qwen,
        )

        async def send_request(url, data, delay=0):
            await asyncio.sleep(delay)
            async with aiohttp.ClientSession() as session:
                async with session.post(url, json=data) as resp:
                    output = await resp.json()
            return output

        def test_streaming(args):
            url = "https://127.0.0.1:30000"
            pload = {
                "text": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language.<|eot_id|><|start_header_id|><|start_header_id|>user<|end_header_id|>\n\n<image>\nPlease generate caption towards this image.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
                "sampling_params": {
                    "max_new_tokens": 1024,
                    "temperature": 0,
                    "top_p": 1.0,
                    "presence_penalty": 2,
                    "frequency_penalty": 2,
                    "stop": "<|eot_id|>",
                },
                "image_data": "/mnt/bn/vl-research/workspace/boli01/projects/demos/sglang_codebase/examples/quick_start/images/cat.jpeg",
            }
            response = requests.post(
                url + "/generate",
                json=pload
            )

            prev = 0
            for chunk in response.iter_lines(decode_unicode=False):
                chunk = chunk.decode("utf-8")
                if chunk and chunk.startswith("data:"):
                    if chunk == "data: [DONE]":
                        break
                    data = json.loads(chunk[5:].strip("\n"))
                    output = data["text"].strip()
                    print(output[prev:], end="", flush=True)
                    prev = len(output)
            print("")

@Luodian
Contributor

Luodian commented May 11, 2024

asyncio.run(test_concurrent(args))

You can use asyncio.run(test_concurrent(args)) to avoid streaming output.
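
For anyone who just wants the simplest possible non-streaming call, the test_concurrent path above boils down to a single requests.post without the stream flag. A minimal sketch (the prompt is abbreviated and the image path is a placeholder; adjust both for your setup):

import requests

# Assumes the sglang server launched with the CLI above is listening on 127.0.0.1:30000.
url = "http://127.0.0.1:30000/generate"
payload = {
    "text": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful language and vision assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n<image>\nPlease generate caption towards this image.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
    "image_data": "/path/to/cat.jpeg",  # placeholder: point this at a local image
    "sampling_params": {"max_new_tokens": 128, "temperature": 0, "stop": "<|eot_id|>"},
}
resp = requests.post(url, json=payload)  # no "stream": True, so the full result comes back as one JSON object
print(resp.json()["text"])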

@Luodian
Contributor

Luodian commented May 11, 2024

@Luodian It's just logging "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained." but nothing is happening after that. Can you help? …

Also, you need to use a local path for image_data.
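
If the image only exists as a remote URL, one workaround is to download it to a local file first and point image_data at that path. A rough sketch (the URL and target path below are placeholders):

import requests

image_url = "https://example.com/cat.jpeg"  # placeholder remote image
local_path = "/tmp/cat.jpeg"                # any writable local path

resp = requests.get(image_url)
resp.raise_for_status()
with open(local_path, "wb") as f:
    f.write(resp.content)

# Then reference the downloaded file in the /generate payload:
# payload["image_data"] = local_path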

@Iven2132

Also, you need to use a local path for image_data.

What if I have a remote image URL or base64 data? Can you tell me what the correct script should look like?

@kcz358
Contributor Author

kcz358 commented May 11, 2024

@Iven2132, if you just want a simple demo, you can just use the example script for llava in the main branch. The pipeline is the same. You just need to change the model path and tokenizer path to the paths we provided and choose the correct chat template.
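
Concretely, that means something like the quick-start example with only the checkpoint paths swapped in. A sketch based on the snippet later in this thread, using the llama-3 checkpoints as an illustration (the chat template you prompt with has to match the underlying LLM):

import sglang as sgl

@sgl.function
def image_qa(s, image_path, question):
    # The prompt is rendered with the backend's chat template, so pick the one matching the model.
    s += sgl.user(sgl.image(image_path) + question)
    s += sgl.assistant(sgl.gen("answer"))

runtime = sgl.Runtime(
    model_path="lmms-lab/llama3-llava-next-8b",
    tokenizer_path="lmms-lab/llama3-llava-next-8b-tokenizer",
)
sgl.set_default_backend(runtime)

state = image_qa.run(image_path="./images/nyc.png", question="What is this?", max_new_tokens=64)
print(state["answer"])
runtime.shutdown()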

@Iven2132

@Iven2132, if you just want a simple demo, you can just use the example script for llava in the main branch. The pipeline is the same. You just need to change the model path and tokenizer path to the paths we provided and choose the correct chat template.

@kcz358 I tried this but got the same issue: it just says "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained." and gives no output.

Logs:

/usr/local/lib/python3.10/site-packages/transformers/models/llava/configuration_llava.py:103: FutureWarning: The `vocab_size` argument is deprecated and will be removed in v4.42, since it can be inferred from the `text_config`. Passing this argument has no effect
  warnings.warn(
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'PreTrainedTokenizerFast'. 
The class this function is called from is 'LlamaTokenizerFast'.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
server started on [0.0.0.0]:10005
server started on [0.0.0.0]:10004
server started on [0.0.0.0]:10006
server started on [0.0.0.0]:10007
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'PreTrainedTokenizerFast'. 
The class this function is called from is 'LlamaTokenizerFast'.
accepted ('127.0.0.1', 24790) with fd 44
welcome ('127.0.0.1', 24790)
accepted ('127.0.0.1', 58376) with fd 33
welcome ('127.0.0.1', 58376)
accepted ('127.0.0.1', 28615) with fd 33
welcome ('127.0.0.1', 28615)
accepted ('127.0.0.1', 23064) with fd 33
welcome ('127.0.0.1', 23064)
/usr/local/lib/python3.10/site-packages/transformers/models/llava/configuration_llava.py:143: FutureWarning: The `vocab_size` attribute is deprecated and will be removed in v4.42, Please use `text_config.vocab_size` instead.
  warnings.warn(
/usr/local/lib/python3.10/site-packages/transformers/models/llava/configuration_llava.py:143: FutureWarning: The `vocab_size` attribute is deprecated and will be removed in v4.42, Please use `text_config.vocab_size` instead.
  warnings.warn(
/usr/local/lib/python3.10/site-packages/transformers/models/llava/configuration_llava.py:143: FutureWarning: The `vocab_size` attribute is deprecated and will be removed in v4.42, Please use `text_config.vocab_size` instead.
  warnings.warn(
/usr/local/lib/python3.10/site-packages/transformers/models/llava/configuration_llava.py:143: FutureWarning: The `vocab_size` attribute is deprecated and will be removed in v4.42, Please use `text_config.vocab_size` instead.
  warnings.warn(
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

Code:

import modal

GPU_CONFIG = modal.gpu.A100(memory=80, count=2)

vllm_image = (
    modal.Image.from_registry(
        "nvidia/cuda:11.8.0-devel-ubuntu22.04", add_python="3.10")
    .apt_install("git", "wget", "cmake")
    .pip_install(
        "wheel==0.43.0",
        "torch==2.2.1",
        "torchvision==0.17.1",
        "transformers==4.40.0",
        "timm==0.9.12",
        "Pillow==10.3.0",
        "peft==0.8.2",
        "hf-transfer==0.1.6",
        "huggingface_hub==0.22.2",
        "nvidia-nccl-cu11==2.21.5"
    )
    .env({"HF_HUB_ENABLE_HF_TRANSFER": "1"})
    .run_commands("pip install flash-attn==2.5.2 --no-build-isolation")
    .run_commands("pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git")
    .run_commands('pip install "sglang[all]"')
    # .run_commands("git clone https://github.com/sgl-project/sglang.git && cd sglang && pip install -e 'python[all]'")
)

app = modal.App("app")


@app.cls(
    gpu=GPU_CONFIG,
    timeout=120,
    container_idle_timeout=120,
    allow_concurrent_inputs=10,
    image=vllm_image,
)
class Model:
    @modal.enter()
    async def start_engine(self):
        import sglang as sgl
        import requests

        response = requests.get(
            "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")
        with open("./images/nyc.png", 'wb') as file:
            file.write(response.content)

        @sgl.function
        def image_qa(s, image_path, question):
            s += sgl.user(sgl.image(image_path) + question)
            s += sgl.assistant(sgl.gen("answer"))

        runtime = sgl.Runtime(model_path="lmms-lab/llama3-llava-next-8b",
                              tokenizer_path="lmms-lab/llama3-llava-next-8b-tokenizer")
        sgl.set_default_backend(runtime)

        state = image_qa.run(
            image_path="./images/nyc.png",
            question="What is this?",
            max_new_tokens=64)
        print(state["answer"], "\n")

        runtime.shutdown()

    @modal.method()
    async def generate(self):
        print("HI")

        
@app.local_entrypoint()
def main():
    Model().generate.remote()

@Iven2132

Iven2132 commented May 11, 2024

Oh hey @kcz358, I have two questions: 1) Is it possible to directly pass a remote image URL? 2) How can I serve the model so it doesn't have to load every time? I just want to load the model once in start_engine and use the generate function to get the output. I think this will be faster.

Currently, my code is not printing anything.

Here is my current code:

class Model:
    @modal.enter()
    async def start_engine(self):
        import sglang as sgl
        import requests
        import subprocess

        # response = requests.get(
        #     "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")
        # with open("./nyc.png", 'wb') as file:
        #     file.write(response.content)

        command = 'python -m sglang.launch_server --model-path lmms-lab/llama3-llava-next-8b --tokenizer-path lmms-lab/llama3-llava-next-8b-tokenizer --port=30000 --host="127.0.0.1"'
        subprocess.run(command, shell=True)


    @modal.method()
    async def generate(self):
        print("Generating")
        import openai
        client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")
        response = client.chat.completions.create(
            model="default",
            messages=[
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": "Where is this located?"},
                        {
                            "type": "image_url",
                            "image_url": {
                                "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                            },
                        },
                    ],
                }
            ],
            temperature=0,
            max_tokens=64,
        )
        print(response)

and here are the logs:

/usr/local/lib/python3.10/site-packages/transformers/models/llava/configuration_llava.py:103: FutureWarning: The `vocab_size` argument is deprecated and will be removed in v4.42, since it can be inferred from the `text_config`. Passing this argument has no effect
  warnings.warn(
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'PreTrainedTokenizerFast'. 
The class this function is called from is 'LlamaTokenizerFast'.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
/usr/local/lib/python3.10/site-packages/transformers/models/llava/configuration_llava.py:143: FutureWarning: The `vocab_size` attribute is deprecated and will be removed in v4.42, Please use `text_config.vocab_size` instead.
  warnings.warn(
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'PreTrainedTokenizerFast'. 
The class this function is called from is 'LlamaTokenizerFast'.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Rank 0: load weight begin.
/usr/local/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
INFO 05-11 17:24:09 weight_utils.py:177] Using model weights format ['*.safetensors']
INFO 05-11 17:25:40 weight_utils.py:177] Using model weights format ['*.safetensors']
Rank 0: load weight end.
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'PreTrainedTokenizerFast'. 
The class this function is called from is 'LlamaTokenizerFast'.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Rank 0: max_total_num_token=452279, max_prefill_num_token=75379, context_len=8192, 
disable_radix_cache=False, enable_flashinfer=False, disable_regex_jump_forward=False, disable_disk_cache=False, attention_reduce_in_fp32=False
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO:     Started server process [25]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on https://127.0.0.1:30000 (Press CTRL+C to quit)
INFO:     127.0.0.1:18252 - "GET /get_model_info HTTP/1.1" 200 OK
new fill batch. #seq: 1. #cached_token: 0. #new_token: 9. #remaining_req: 0. #running_req: 0. tree_cache_hit_rate: 0.00%.
/usr/local/lib/python3.10/site-packages/transformers/models/llava/configuration_llava.py:143: FutureWarning: The `vocab_size` attribute is deprecated and will be removed in v4.42, Please use `text_config.vocab_size` instead.
  warnings.warn(
/usr/local/lib/python3.10/site-packages/transformers/models/llava/configuration_llava.py:143: FutureWarning: The `vocab_size` attribute is deprecated and will be removed in v4.42, Please use `text_config.vocab_size` instead.
  warnings.warn(
INFO:     127.0.0.1:52481 - "POST /generate HTTP/1.1" 200 OK

@kcz358
Contributor Author

kcz358 commented May 12, 2024

Hi @Iven2132, I am not sure how to do it with the openai format. But based on my understanding of the code from @Luodian, I believe you can put the payload in JSON and post it to the URL. If you don't want to reload the model every time, is it possible to start the server with one script and then query it from another script?
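
As a rough sketch of that two-script split (both halves already appear elsewhere in this thread; the prompt and image path are placeholders):

# Script 1: start the server once and leave it running, e.g.
#   python -m sglang.launch_server --model-path lmms-lab/llama3-llava-next-8b \
#       --tokenizer-path lmms-lab/llama3-llava-next-8b-tokenizer --port 30000 --host 127.0.0.1

# Script 2: query the already-loaded model as many times as needed.
import requests

payload = {
    "text": "<prompt built with the matching chat template>",  # placeholder
    "image_data": "/path/to/image.jpg",                        # placeholder local path
    "sampling_params": {"max_new_tokens": 64, "temperature": 0},
}
print(requests.post("http://127.0.0.1:30000/generate", json=payload).json()["text"])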

@Iven2132

Hi @Iven2132, I am not sure how to do it with the openai format. But based on my understanding of the code from @Luodian, I believe you can put the payload in JSON and post it to the URL. If you don't want to reload the model every time, is it possible to start the server with one script and then query it from another script?

I don't think I can do this on Modal, but does sglang have any serve mode or feature where I can load the model once? We can do this with lmdeploy:

def start_engine(self):
    from lmdeploy import serve, ChatTemplateConfig
    self.server = serve('OpenGVLab/InternVL-Chat-V1-2-Plus',
      chat_template_config=ChatTemplateConfig(model_name='internvl-zh-hermes2'),
      server_name='0.0.0.0',
      server_port=23333)

@Iven2132

Hey @Luodian I tried your code example but it's not responding. Can you please help?

Here is my code:

import modal

GPU_CONFIG = modal.gpu.A100(memory=80, count=2)

vllm_image = (
    modal.Image.from_registry(
        "nvidia/cuda:11.8.0-devel-ubuntu22.04", add_python="3.10")
    .apt_install("git", "wget", "cmake")
    .pip_install(
        "wheel==0.43.0",
        "torch==2.2.1",
        "torchvision==0.17.1",
        "transformers==4.40.0",
        "timm==0.9.12",
        "Pillow==10.3.0",
        "peft==0.8.2",
        "hf-transfer==0.1.6",
        "huggingface_hub==0.22.2",
        "nvidia-nccl-cu11==2.21.5"
    )
    .env({"HF_HUB_ENABLE_HF_TRANSFER": "1"})
    .run_commands("pip install flash-attn==2.5.2 --no-build-isolation")
    .run_commands("pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git")
    .run_commands("git clone https://github.com/sgl-project/sglang.git && cd sglang && pip install -e 'python[all]'")
)

app = modal.App("test-sgl-app")

@app.cls(
    gpu=GPU_CONFIG,
    timeout=1200,
    container_idle_timeout=1200,
    allow_concurrent_inputs=10,
    image=vllm_image,
)
class Model:
    @modal.enter()
    async def start_engine(self):
        import subprocess

        command = 'python -m sglang.launch_server --model-path lmms-lab/llama3-llava-next-8b --tokenizer-path lmms-lab/llama3-llava-next-8b-tokenizer --port=30000 --host="127.0.0.1"'
        subprocess.run(command, shell=True)


    @modal.method()
    async def generate(self):
        import requests
        import copy

        from llava.conversation import (
            conv_llava_llama_3,
        )

        url = "https://127.0.0.1:30000/generate"

        prompt = "<image>\nPlease generate caption towards this image."
        conv_template = copy.deepcopy(conv_llava_llama_3)
        conv_template.append_message(role="user", message=prompt)
        prompt_with_template = conv_template.get_prompt()
        data = {
            "text": prompt_with_template,
            "image_data": "https://farm4.staticflickr.com/3175/2653711032_804ff86d81_z.jpg",
            "sampling_params": {
                "max_new_tokens": 30,
                "temperature": 0,
                "top_p": 1.0,
                "presence_penalty": 2,
                "frequency_penalty": 2,
                "stop": "",
            },
        }
        response = requests.post(url, json=data)
        data = response.json()
        print(data["text"])
        

        
@app.local_entrypoint()
def main():
    Model().generate.remote()

@Iven2132

Hey @Luodian I tried your code example but it's not responding. Can you please help? …

@kcz358 Can you please check my code?

@Iven2132

Iven2132 commented May 12, 2024

now i am getting this

 Failed to import transformers.models.llama.modeling_llama because of the following error (look up to see its traceback):
/usr/local/lib/python3.10/site-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: 
_ZN2at4_ops15sum_IntList_out4callERKNS_6TensorEN3c1016OptionalArrayRefIlEEbSt8optionalINS5_10ScalarTypeEERS2_

@kcz358
Contributor Author

kcz358 commented May 13, 2024

now i am getting this

 Failed to import transformers.models.llama.modeling_llama because of the following error (look up to see its traceback):
/usr/local/lib/python3.10/site-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: 
_ZN2at4_ops15sum_IntList_out4callERKNS_6TensorEN3c1016OptionalArrayRefIlEEbSt8optionalINS5_10ScalarTypeEERS2_

This error is mainly caused by flash-attn and has nothing to do with sglang or llava. You might want to clean up your cuda and reinstall.

@Luodian
Contributor

Luodian commented May 13, 2024

@Qubitium @merrymercy

Hi~ Can you help check whether this PR can be merged?

@Iven2132

now i am getting this

 Failed to import transformers.models.llama.modeling_llama because of the following error (look up to see its traceback):
/usr/local/lib/python3.10/site-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: 
_ZN2at4_ops15sum_IntList_out4callERKNS_6TensorEN3c1016OptionalArrayRefIlEEbSt8optionalINS5_10ScalarTypeEERS2_

This error is mainly caused by flash-attn and has nothing to do with sglang or llava. You might want to clean up your cuda and reinstall.

I don't think so; my code works when I comment out all the code that uses sglang and llava.

@Iven2132

now i am getting this

 Failed to import transformers.models.llama.modeling_llama because of the following error (look up to see its traceback):
/usr/local/lib/python3.10/site-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: 
_ZN2at4_ops15sum_IntList_out4callERKNS_6TensorEN3c1016OptionalArrayRefIlEEbSt8optionalINS5_10ScalarTypeEERS2_

@kcz358 @Luodian Here are the full logs:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/transformers/utils/import_utils.py", line 1510, in _get_module
    return importlib.import_module("." + module_name, self.__name__)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1204, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1176, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1147, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 940, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/local/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 55, in <module>
    from flash_attn import flash_attn_func, flash_attn_varlen_func
  File "/usr/local/lib/python3.11/site-packages/flash_attn/__init__.py", line 3, in <module>
    from flash_attn.flash_attn_interface import (
  File "/usr/local/lib/python3.11/site-packages/flash_attn/flash_attn_interface.py", line 10, in <module>
    import flash_attn_2_cuda as flash_attn_cuda
ImportError: /usr/local/lib/python3.11/site-packages/flash_attn_2_cuda.cpython-311-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda9SetDeviceEi

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/pkg/modal/_container_io_manager.py", line 458, in handle_user_exception
    yield
  File "/pkg/modal/_container_entrypoint.py", line 493, in call_lifecycle_functions
    event_loop.run(res)
  File "/pkg/modal/_container_entrypoint.py", line 162, in run
    return self.loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/base_events.py", line 653, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/root/main.py", line 199, in start_engine
    from llava.conversation import (
  File "/usr/local/lib/python3.11/site-packages/llava/__init__.py", line 1, in <module>
    from .model import LlavaLlamaForCausalLM
  File "/usr/local/lib/python3.11/site-packages/llava/model/__init__.py", line 15, in <module>
    exec(f"from .language_model.{model_name} import {model_classes}")
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python3.11/site-packages/llava/model/language_model/llava_llama.py", line 25, in <module>
    from transformers import LlamaModel, LlamaForCausalLM
  File "<frozen importlib._bootstrap>", line 1229, in _handle_fromlist
  File "/usr/local/lib/python3.11/site-packages/transformers/utils/import_utils.py", line 1501, in __getattr__
    value = getattr(module, name)
            ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/transformers/utils/import_utils.py", line 1500, in __getattr__
    module = self._get_module(self._class_to_module[name])
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/transformers/utils/import_utils.py", line 1512, in _get_module
    raise RuntimeError(
RuntimeError: Failed to import transformers.models.llama.modeling_llama because of the following error (look up to see its traceback):
/usr/local/lib/python3.11/site-packages/flash_attn_2_cuda.cpython-311-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda9SetDeviceEi

@kcz358
Contributor Author

kcz358 commented May 14, 2024

@Iven2132, that's because you are using llava and sglang, which both use flash-attn. The main cause is still the CUDA version mismatch for flash-attn. You can refer to oobabooga/text-generation-webui#4182
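
A quick diagnostic is to compare the torch/CUDA build in the environment with the flash-attn wheel that got installed; undefined-symbol errors like this usually mean the two were built against different versions (a sketch, not specific to Modal):

import torch

print("torch:", torch.__version__, "built for CUDA:", torch.version.cuda)

try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
except ImportError as e:
    # Typical fix: uninstall flash-attn and reinstall it against the installed torch/CUDA combination.
    print("flash-attn import failed:", e)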

@Iven2132

@Iven2132, that's because you are using llava and sglang, which both use flash-attn. The main cause is still the CUDA version mismatch for flash-attn. You can refer to oobabooga/text-generation-webui#4182

@kcz358 @Luodian are Qwen 72B and 110B supported? I'm using it like this:

       tokenizer_path="lmms-lab/llavanext-qwen-tokenizer")
       sgl.set_default_backend(runtime)

But getting these errors:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/transformers/models/auto/configuration_auto.py", line 945, in from_pretrained
    config_class = CONFIG_MAPPING[config_dict["model_type"]]
                   ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/transformers/models/auto/configuration_auto.py", line 647, in __getitem__
    raise KeyError(key)
KeyError: 'llava_qwen'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/sglang/python/sglang/srt/server.py", line 140, in launch_server
    tokenizer_manager = TokenizerManager(server_args, port_args)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sglang/python/sglang/srt/managers/tokenizer_manager.py", line 98, in __init__
    self.hf_config = get_config(
                     ^^^^^^^^^^^
  File "/sglang/python/sglang/srt/hf_transformers_utils.py", line 34, in get_config
    config = AutoConfig.from_pretrained(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/transformers/models/auto/configuration_auto.py", line 947, in from_pretrained
    raise ValueError(
ValueError: The checkpoint you are trying to load has model type `llava_qwen` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.
Traceback (most recent call last):
  File "/pkg/modal/_container_io_manager.py", line 458, in handle_user_exception
    yield
  File "/pkg/modal/_container_entrypoint.py", line 493, in call_lifecycle_functions
    event_loop.run(res)
  File "/pkg/modal/_container_entrypoint.py", line 162, in run
    return self.loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/base_events.py", line 653, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/root/main.py", line 204, in start_engine
    runtime = sgl.Runtime(model_path="lmms-lab/llava-next-72b",
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sglang/python/sglang/api.py", line 38, in Runtime
    return Runtime(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sglang/python/sglang/srt/server.py", line 265, in __init__
    raise RuntimeError("Initialization failed. Please see the error messages above.")
RuntimeError: Initialization failed. Please see the error messages above.

@merrymercy merrymercy merged commit 664287b into sgl-project:main May 14, 2024
@Iven2132

@kcz358 @Luodian Also, the example code in examples/usage/llava/http_qwen_llava_test.py doesn't seem to work. It's not giving any response at all; after the model loads, it just logs /usr/local/lib/python3.11/site-packages/transformers/models/llava/configuration_llava.py:143: FutureWarning: The `vocab_size` attribute is deprecated and will be removed in v4.42, Please use `text_config.vocab_size` instead.

@Luodian
Contributor

Luodian commented May 14, 2024

It seems there are some issues related to the merge and your local environment. Please try installing the packages following these steps, clean the huggingface cache folder, and uninstall flash-attn (if you run into problems).

# Installing latest llava-next: pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git
# Installing latest sglang: cd ~/sglang; pip install -e "python[all]"
# Installing latest vllm: pip install vllm==0.4.2
# Installing latest flashinfer (in case any error): pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.3/

# Endpoint Service CLI: 
python3 -m sglang.launch_server --model-path lmms-lab/llava-next-72b --tokenizer-path lmms-lab/llavanext-qwen-tokenizer --port=30000 --host="127.0.0.1" --tp-size=4

# In another tmux
python3 http_qwen_llava_test.py

Output:
"Two children pose with a large teddy bear, one holding a smaller stuffed bear, in a room with an American flag and potted plants."

Here's my output (don't mind the warnings; --port needs to be set according to the endpoint message, which tells you which port the model endpoint starts on):
[image]

@Iven2132

It seems there are some issues related to the merge and your local environment. …

@Luodian Can you please run this code on Modal? https://modal.com/

import modal
from modal import asgi_app

GPU_CONFIG = modal.gpu.A100(memory=80, count=2)

vllm_image = (
    modal.Image.from_registry(
        "nvidia/cuda:12.2.0-devel-ubuntu22.04", add_python="3.11"
    )
    .apt_install("git", "wget", "cmake")
    .pip_install(
        "wheel==0.43.0",
        "torch==2.2.1",
        "torchvision==0.17.1",
        "transformers==4.40.2",
        "timm==0.9.12",
        "Pillow==10.3.0",
        "peft==0.8.2",
        "hf-transfer==0.1.6",
        "huggingface_hub==0.22.2",
        "requests==2.31.0",
    )
    .env({"HF_HUB_ENABLE_HF_TRANSFER": "1"})
    .run_commands("pip install flash-attn --no-build-isolation")
    .run_commands("pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git")
    .run_commands("git clone https://github.com/sgl-project/sglang.git && cd sglang && pip install -e 'python[all]'")
)

app = modal.App("test-app-sgl")

@app.cls(
    gpu=GPU_CONFIG,
    timeout=1200,
    container_idle_timeout=1200,
    allow_concurrent_inputs=10,
    image=vllm_image,
)
class Model:
    @modal.enter()
    async def start_engine(self):
        import subprocess
        command = [
            "python", "-m", "sglang.launch_server",
            "--model-path", "lmms-lab/llama3-llava-next-8b",
            "--tokenizer-path", "lmms-lab/llama3-llava-next-8b-tokenizer",
            "--port=30000",
            "--host=127.0.0.1",
            "--tp-size=4"
        ]
        result = subprocess.run(command)
        print("Standard Output:", result)

    @modal.method()
    async def generate(self):
        print("Generating")

        import requests

        url = "https://127.0.0.1:30000/generate"

        data = {
            "text": "Hi",
            "image_data": "https://farm4.staticflickr.com/3175/2653711032_804ff86d81_z.jpg",
            "sampling_params": {
                "max_new_tokens": 30,
                "temperature": 0,
                "top_p": 1.0,
                "presence_penalty": 2,
                "frequency_penalty": 2,
                "stop": "",
            },
        }
        response = requests.post(url, json=data)
        print(response)


@app.local_entrypoint()
def main():
    Model().generate.remote()

@vedantroy

@Luodian When generating the prompt, what conversation format were you using? Was it this:

conv_llava_llama_3 = Conversation(
    system="You are a helpful language and vision assistant. " "You are able to understand the visual content that the user provides, " "and assist the user with a variety of tasks using natural language.",
    roles=("<|start_header_id|>user", "<|start_header_id|>assistant"),
    version="llama_v3",
    messages=[],
    offset=0,
    sep_style=SeparatorStyle.LLAMA_3,
    tokenizer_id="meta-llama/Meta-Llama-3-8B-Instruct",
    tokenizer=AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct"),
    stop_token_ids=[128009],
)

@Iven2132

@Luodian How much time does it take for you to load the 72B model when you run "python -m sglang.launch_server --model-path lmms-lab/llava-next-72b --tokenizer-path lmms-lab/llavanext-qwen-tokenizer --port=30000 --host="127.0.0.1" --tp-size=4"?

I'm running 4x A100 80G and it doesn't even load the model after 5 minutes.

@Luodian
Contributor

Luodian commented May 16, 2024

@Luodian When generating the prompt, what conversation format were you using? Was it this: …

Yes, we use this.

@Luodian
Contributor

Luodian commented May 16, 2024

@Luodian How much time does it take for you to load the 72B model when you run "python -m sglang.launch_server --model-path lmms-lab/llava-next-72b --tokenizer-path lmms-lab/llavanext-qwen-tokenizer --port=30000 --host="127.0.0.1" --tp-size=4"?

I'm running 4x A100 80G and it doesn't even load the model after 5 minutes.

I was able to run it with 4x A100. The bottleneck may be disk read? You could check whether the disk is actively reading the checkpoints; there are around 30 safetensors.
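
One simple way to check whether the load is disk-bound is to watch the disk read counters while the server is starting up. A sketch using psutil (an extra dependency, not part of sglang):

import time

import psutil  # pip install psutil

before = psutil.disk_io_counters().read_bytes
time.sleep(5)
after = psutil.disk_io_counters().read_bytes
# A rate near zero while the weights are still "loading" suggests the bottleneck is elsewhere.
print(f"disk read rate: {(after - before) / 5 / 1e6:.1f} MB/s")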

@Iven2132

I was able to run it with 4x A100. The bottleneck may be disk read? You could check whether the disk is actively reading the checkpoints; there are around 30 safetensors.

@Luodian What should be the disk size to load the model?

@merrymercy
Contributor

merrymercy commented May 21, 2024

@kcz358 The two files python/sglang/srt/models/llava_mistral.py and python/sglang/srt/models/llava_qwen.py are almost identical to python/sglang/srt/models/llava.py. We should try to reduce the redundancy here. Can you refactor LlavaQwenForCausalLM and LlavaMistralForCausalLM to be subclasses of LlavaLlamaForCausalLM and reuse as many functions as possible?
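
For reference, the shape of the requested refactor is roughly the following: keep all the shared multimodal logic in LlavaLlamaForCausalLM and have the subclasses only swap out the underlying language model. This is a schematic sketch with placeholder classes, not the actual sglang code (the real constructors take more arguments):

# Placeholders standing in for the real language-model classes used by sglang.
class LlamaForCausalLM: ...
class Qwen2ForCausalLM: ...
class MistralForCausalLM: ...


class LlavaLlamaForCausalLM:
    """Holds the shared LLaVA logic (vision tower, projector, image-feature handling, forward)."""

    def __init__(self, config):
        self.config = config
        self.language_model = self.build_language_model(config)

    def build_language_model(self, config):
        return LlamaForCausalLM()


class LlavaQwenForCausalLM(LlavaLlamaForCausalLM):
    def build_language_model(self, config):
        return Qwen2ForCausalLM()


class LlavaMistralForCausalLM(LlavaLlamaForCausalLM):
    def build_language_model(self, config):
        return MistralForCausalLM()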

@Luodian
Contributor

Luodian commented May 21, 2024

@kcz358 The two files python/sglang/srt/models/llava_mistral.py and python/sglang/srt/models/llava_qwen.py are almost identical to python/sglang/srt/models/llava.py. We should try to reduce the redundancy here. Can you refactor LlavaQwenForCausalLM and LlavaMistralForCausalLM to be subclasses of LlavaLlamaForCausalLM and reuse as many functions as possible?

OK! Thanks for this suggestion. Let me do the fix.
