Inconsistent evaluation results with Chat Template #1841

Open
shiweijiezero opened this issue May 14, 2024 · 3 comments

@shiweijiezero (Contributor)

I evaluated Llama-3-8B-Instruct on the gsm8k benchmark and found some interesting phenomena.

  1. HuggingFace and vLLM give similar results.
    [screenshot]

  2. If I start an API service with vLLM and evaluate gsm8k through lm-eval's local chat mode, the accuracy is different.
    [screenshot]

  3. I browsed the lm-eval source code and found that the API path applies the chat template, while the HuggingFace and vLLM modes do not (i.e. no special tokens are added).

  4. I tried applying the chat template from the tokenizer and logged some intermediate results (see the sketch after this list). Could you give me some insight into this?
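As a rough illustration of the difference between the two prompt formats (a minimal sketch, not lm-eval's actual code; the tokenizer checkpoint and variable names are my own):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
    question = "Natalia sold clips to 48 of her friends in April, ..."  # a gsm8k-style question

    # What the HF/vLLM modes effectively send: the raw prompt, no special tokens.
    plain_ids = tokenizer(question, add_special_tokens=False).input_ids

    # What the chat-completions path effectively sends: the same prompt wrapped in the chat
    # template, which inserts <|begin_of_text|>, <|start_header_id|>user<|end_header_id|>, etc.
    chat_ids = tokenizer.apply_chat_template(
        [{"role": "user", "content": question}],
        add_generation_prompt=True,
    )

    print(tokenizer.decode(plain_ids))
    print(tokenizer.decode(chat_ids))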

Output of vLLM in lm-eval (no chat template):
[screenshot]

Output of vLLM using Llama's chat template, without a system prompt:
[screenshot]

Output of vLLM using Llama's chat template, with the system prompt "you are a helpful assistant":
[screenshot]

In the end, the accuracy of vLLM with the chat template drops significantly!
[screenshot]

Do you have any idea about this?

I think people should use chat templates more often in evaluation, since that is closer to real usage scenarios.

@mohit-rag

Did you figure out how to use the chat template with vllm?

@shiweijiezero (Contributor, Author)

Did you figure out how to use the chat template with vllm?

Yeah, it's simple:

    from transformers import AutoTokenizer
    from vllm import LLM, SamplingParams

    # model_name, all_prompt, system_prompt, add_generation_prompt, until,
    # use_completion_flag and use_vllm_flag are defined elsewhere in my script.
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    if use_completion_flag:
        all_prompt_ids = tokenizer(all_prompt, add_special_tokens=False).input_ids
    else:
        # Variant without a system prompt:
        # all_prompt_ids = [tokenizer.apply_chat_template(
        #     conversation=[
        #         {"role": "user", "content": prompt},
        #     ],
        #     add_generation_prompt=False,
        # ) for prompt in all_prompt]
        all_prompt_ids = [tokenizer.apply_chat_template(
            conversation=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": prompt},
            ],
            add_generation_prompt=add_generation_prompt,
        ) for prompt in all_prompt]

    if use_vllm_flag:
        sampling_params = SamplingParams(
            temperature=0.6,
            top_p=0.9,
            # top_k=5,
            repetition_penalty=1.1,
            max_tokens=1024,
            # min_tokens=20,
            stop=until,
        )
        # Create an LLM; gpu_memory_utilization caps the fraction of GPU memory vLLM may use.
        llm = LLM(
            # enforce_eager=True,
            model=model_name, dtype="bfloat16",
            gpu_memory_utilization=0.45,
            max_model_len=4096,
            # tensor_parallel_size=8
        )
        # Generate texts from the prompts. The output is a list of RequestOutput objects
        # that contain the prompt, generated text, and other information.
        outputs = llm.generate(
            # all_prompt,
            prompt_token_ids=all_prompt_ids,  # the chat template must be applied before this step
            sampling_params=sampling_params,
        )

@haileyschoelkopf (Contributor)

Hi! After #2034 both HF and vllm should support chat templating via --apply_chat_template and related CLI flags!
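
For example, an invocation along these lines should exercise the chat template with the vLLM backend (a sketch; check the current lm-eval docs for the exact flags and model_args):

    lm_eval --model vllm \
        --model_args pretrained=meta-llama/Meta-Llama-3-8B-Instruct,dtype=bfloat16,gpu_memory_utilization=0.45,max_model_len=4096 \
        --tasks gsm8k \
        --num_fewshot 5 \
        --apply_chat_template \
        --fewshot_as_multiturn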
