Inconsistent evaluation results with Chat Template #1841

Open
shiweijiezero opened this issue May 14, 2024 · 3 comments

@shiweijiezero (Contributor)

I evaluated Llama-3-8B-Instruct on the gsm8k benchmark and found some interesting phenomena.

  1. HuggingFace and vLLM give similar results.
    [screenshot]

  2. If I start an API service with vLLM and evaluate gsm8k through lm-eval's local chat mode, the accuracy is different.
    [screenshot]

  3. I browsed the lm-eval source code and found that the API path applies the chat template, while the HuggingFace and vLLM modes do not (i.e. no special tokens are added).

  4. I tried applying the chat template from the tokenizer and logged some intermediate results (see the sketch after this list). Could you give me some insight into this?
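As a rough illustration of the difference between the two prompt formats (a minimal sketch, not lm-eval's actual code; the tokenizer checkpoint and variable names are my own):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
    question = "Natalia sold clips to 48 of her friends in April, ..."  # a gsm8k-style question

    # What the HF/vLLM modes effectively send: the raw prompt, no special tokens.
    plain_ids = tokenizer(question, add_special_tokens=False).input_ids

    # What the chat-completions path effectively sends: the same prompt wrapped in the chat
    # template, which inserts <|begin_of_text|>, <|start_header_id|>user<|end_header_id|>, etc.
    chat_ids = tokenizer.apply_chat_template(
        [{"role": "user", "content": question}],
        add_generation_prompt=True,
    )

    print(tokenizer.decode(plain_ids))
    print(tokenizer.decode(chat_ids))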

Output of vLLM in lm-eval (no chat template):
[screenshot]

Output of vLLM using Llama's chat template, without a system prompt:
[screenshot]

Output of vLLM using Llama's chat template, with the system prompt "you are a helpful assistant":
[screenshot]

In the end, the accuracy of vLLM with the chat template drops significantly!
[screenshot]

Do you have any idea about this?

I think people should use chat templates more often in evaluation, since that is closer to real usage scenarios.

@mohit-rag

Did you figure out how to use the chat template with vllm?

@shiweijiezero (Contributor, Author)

Did you figure out how to use the chat template with vllm?

Yeah, it's simple:

    from transformers import AutoTokenizer
    from vllm import LLM, SamplingParams

    # model_name, all_prompt, system_prompt, add_generation_prompt, until,
    # use_completion_flag and use_vllm_flag are defined elsewhere in my script.
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    if use_completion_flag:
        all_prompt_ids = tokenizer(all_prompt, add_special_tokens=False).input_ids
    else:
        # Variant without a system prompt:
        # all_prompt_ids = [tokenizer.apply_chat_template(
        #     conversation=[
        #         {"role": "user", "content": prompt},
        #     ],
        #     add_generation_prompt=False,
        # ) for prompt in all_prompt]
        all_prompt_ids = [tokenizer.apply_chat_template(
            conversation=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": prompt},
            ],
            add_generation_prompt=add_generation_prompt,
        ) for prompt in all_prompt]

    if use_vllm_flag:
        sampling_params = SamplingParams(
            temperature=0.6,
            top_p=0.9,
            # top_k=5,
            repetition_penalty=1.1,
            max_tokens=1024,
            # min_tokens=20,
            stop=until,
        )
        # Create an LLM; gpu_memory_utilization caps the fraction of GPU memory vLLM may use.
        llm = LLM(
            # enforce_eager=True,
            model=model_name, dtype="bfloat16",
            gpu_memory_utilization=0.45,
            max_model_len=4096,
            # tensor_parallel_size=8
        )
        # Generate texts from the prompts. The output is a list of RequestOutput objects
        # that contain the prompt, generated text, and other information.
        outputs = llm.generate(
            # all_prompt,
            prompt_token_ids=all_prompt_ids,  # the chat template must be applied before this step
            sampling_params=sampling_params,
        )

@haileyschoelkopf (Contributor)

Hi! After #2034 both HF and vllm should support chat templating via --apply_chat_template and related CLI flags!
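
For example, an invocation along these lines should exercise the chat template with the vLLM backend (a sketch; check the current lm-eval docs for the exact flags and model_args):

    lm_eval --model vllm \
        --model_args pretrained=meta-llama/Meta-Llama-3-8B-Instruct,dtype=bfloat16,gpu_memory_utilization=0.45,max_model_len=4096 \
        --tasks gsm8k \
        --num_fewshot 5 \
        --apply_chat_template \
        --fewshot_as_multiturn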
