
[QUESTION] Questions on the performance of EE-LLM on HELM benchmark #18

Open · Marmot-C opened this issue Jul 11, 2024 · 11 comments

@Marmot-C

Hi, I'm trying to recreate Figure 8 in the EE-LLM paper using the 7B checkpoint. Here are some of the problems I encountered during my experiments.

  1. The HELM framework needs the tokenizer used by the model to perform evaluation. What type of tokenizer does EE-LLM use? Does HELM currently support it?

  2. While testing the model, there seems to be a performance gap between my results and those shown in Figure 8. I got a ROUGE-L score of around 0.2 on the CNN/DM dataset with early_exit_thres=1.0 and around 0.15 on the XSum dataset under the same setting. I'm using the huggingface/gpt2 tokenizer because of the problem mentioned in question 1, so I'm not sure if that's part of the cause. I've also only sampled 100 examples from each dataset. Even so, the performance gap seems quite large. (With early_exit_thres ranging from 0.2 to 0.8 on CNN/DM, I'm able to get results similar to Figure 8, but not so much on XSum.)

  3. Does EE-LLM use fp32? I tried to deploy EE-LLM on a single GPU but got an OOM error; two or more GPUs work fine. A 7B model in fp16 should normally fit on a GPU with 24 GB of memory (a rough estimate is sketched below). Is there any way to convert the checkpoint?
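(For reference, here is a rough back-of-envelope estimate of weight memory at different precisions. This is my own arithmetic, not a number from the paper.)

```python
# Rough estimate of weight-only memory for a 7B-parameter model at different
# precisions; activations and the KV cache add further overhead on top.
n_params = 7e9

for precision, bytes_per_param in [("fp32", 4), ("bf16/fp16", 2)]:
    gb = n_params * bytes_per_param / 1e9
    print(f"{precision:10s}: ~{gb:.0f} GB of weights")
# fp32      : ~28 GB  -> already exceeds a 24 GB GPU
# bf16/fp16 : ~14 GB  -> fits, leaving room for activations and KV cache
```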

Thanks a lot!

@pan-x-c (Owner) commented Jul 11, 2024

  1. We use the Llama tokenizer, which is provided along with the checkpoint files (a loading sketch is below).
  2. We set --max-eval-instances to 500.
  3. We use A100 GPUs with bf16 in our experiments.

You can refer to #11 and #15 for more details.
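In case it helps, here is a minimal sketch of loading the bundled Llama tokenizer via Hugging Face transformers. The path and the tokenizer.model filename are placeholders for wherever the tokenizer shipped with your checkpoint lives; this is an illustration, not an official EE-LLM snippet.

```python
# Minimal sketch (assumes the `transformers` and `sentencepiece` packages):
# load the Llama SentencePiece tokenizer that ships alongside the checkpoint.
# The path/filename below are placeholders -- adjust them to your checkpoint.
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer(vocab_file="/path/to/EE-LLM-7B/tokenizer.model")
print(tokenizer.tokenize("Early exit lets a token leave the network before the last layer."))
```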

@Marmot-C (Author)

Thanks for the reply! With --max-eval-instances 500, I'm still having trouble recreating the XSum results in Figure 8. Are there any further modifications to be made to the HELM framework, or any specific hyperparameter settings?

@Marmot-C (Author)

I've noticed that there's another task in HELM named summarization_xsum_sampled (instead of summarization_xsum), and with that task I'm able to get results closer to the original Figure 8. Thanks!

@pan-x-c (Owner) commented Jul 15, 2024

Have you enabled pipeline parallelism with PP=4 in the benchmark? The KV recomputation method implemented for the single-GPU scenario in EE-LLM has certain efficiency defects: you may find that when early_exit_thres is very small, inference becomes slower. You can uncomment lines 24-25 of ./megatron/core/inference_params.py to improve performance.
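(Side note for readers: below is a toy, self-contained sketch of the bookkeeping behind this, not EE-LLM's actual code. When a token exits early, the deeper layers have no KV entries for it, so if a later token does run those layers, the missing entries have to be recomputed; that recomputation is the single-GPU overhead that grows as early_exit_thres shrinks. All layer math here is a dummy stand-in.)

```python
# Toy illustration of KV backfilling after an early exit (not EE-LLM's code).
NUM_LAYERS = 4

def dummy_layer(layer_idx, hidden):
    """Stand-in for a transformer layer: new hidden state plus a K/V pair."""
    return hidden + layer_idx, ("K", layer_idx), ("V", layer_idx)

def decode_token(hidden, exit_layer, kv_cache):
    """Run one token up to `exit_layer`, caching K/V only for those layers."""
    for i in range(exit_layer):
        hidden, k, v = dummy_layer(i, hidden)
        kv_cache[i].append((k, v))
    return hidden

def backfill(kv_cache, skipped_hiddens, from_layer):
    """Recompute K/V for early-exited tokens in the layers they skipped."""
    for i in range(from_layer, NUM_LAYERS):
        for h in skipped_hiddens:
            _, k, v = dummy_layer(i, h)
            kv_cache[i].append((k, v))

cache = {i: [] for i in range(NUM_LAYERS)}
h1 = decode_token(hidden=1.0, exit_layer=2, kv_cache=cache)  # token exits at layer 2
backfill(cache, skipped_hiddens=[h1], from_layer=2)          # next token needs layers 2-3 K/V
print({i: len(entries) for i, entries in cache.items()})     # {0: 1, 1: 1, 2: 1, 3: 1}
```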

@yanxi-chen (Collaborator)

> Thanks for the reply! With --max-eval-instances 500, I'm still having trouble recreating the XSum results in Figure 8. Are there any further modifications to be made to the HELM framework, or any specific hyperparameter settings?

Hi @Marmot-C, could you provide more details about how your experiment results differ from those in Figure 8 (e.g. lower scores, lower speedup, or both), and your experiment configurations (e.g. which EE inference method, TP/PP degree, whether the tokenizer choice mentioned in your initial post has been fixed, etc.)? This might help us better pinpoint the causes, besides the one mentioned by Xuchen :)

@Marmot-C (Author)

Yeah, sure. The problem is mainly lower scores on summarization_xsum, where I'm only able to get a ROUGE-L score of ~0.15 with threshold=0.8/1.0. However, on summarization_xsum_sampled, I'm able to reproduce the results shown in Figure 8. Scores on the CNN/DailyMail dataset are also slightly lower than in Figure 8: I get a ROUGE-L score of ~0.2 with threshold=0.8/1.0. Other than that, I can reproduce most of the results shown in Figure 8.

For the tokenizer, I used meta-llama/Llama-2-7b-hf as provided by HELM. I'm using TP=1, PP=2 and TP=1, PP=4; both produced similar results. The script was taken directly from examples/ee_inference/ee_inference_server.sh, with modifications corresponding to the TP and PP settings.

@pan-x-c (Owner) commented Jul 15, 2024

It may be because HELM updated the dataset. Our experiments are based on an old version of HELM (commit id 33ca6e62).

In addition, the default evaluation metric of XSUM and CNN/DM in HELM is ROUGE-2, which we changed to ROUGE-L in our experiments.
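For anyone trying to reproduce this setup: pinning HELM to that commit (e.g. git checkout 33ca6e62 after cloning the HELM repository) and scoring with ROUGE-L should get closer to the paper's numbers. Below is a quick standalone comparison of ROUGE-2 vs ROUGE-L using the rouge_score package, as an illustration only; it is not necessarily the exact scoring path inside HELM.

```python
# Standalone ROUGE-2 vs ROUGE-L comparison on a single example, using the
# `rouge_score` package (pip install rouge-score). Illustrative only.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge2", "rougeL"], use_stemmer=True)
reference = "The government announced new funding for rural schools."
prediction = "New funding for rural schools was announced by the government."

for name, score in scorer.score(reference, prediction).items():
    print(f"{name}: F1 = {score.fmeasure:.3f}")
```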

@pan-x-c (Owner) commented Jul 15, 2024

Here is the raw output generated by HELM for the 7B-EE model in our experiment. You can check if there are any mismatched settings.
result.zip

@Marmot-C (Author)

Hi, thank you very much for your help! Recently I've been reading the EE-Tuning paper; may I ask which subjects were used in the MMLU evaluation? Thanks.

@pan-x-c (Owner) commented Aug 27, 2024

We use all available subjects in MMLU, and --max-eval-instances is set to 500.
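In case it helps others, here is a hypothetical way to generate one HELM-style run entry per MMLU subject. The subject list is truncated and the model name is a placeholder; adapt both to your deployment.

```python
# Hypothetical sketch: emit one HELM run entry per MMLU subject so that all
# subjects are covered. Subject list truncated; model name is a placeholder.
subjects = [
    "abstract_algebra", "anatomy", "astronomy",  # ... MMLU has 57 subjects in total
]
entries = [f"mmlu:subject={s},model=ee-llm-7b" for s in subjects]

with open("run_entries_mmlu.conf", "w") as f:
    f.write("\n".join(entries) + "\n")

print(f"wrote {len(entries)} run entries")
```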

@github-actions (bot)

Marking as stale. No activity in 60 days.

github-actions bot added the stale label Oct 26, 2024