
[QUESTION] Questions on the performance of EE-LLM on HELM benchmark #18

Open · Marmot-C opened this issue Jul 11, 2024 · 11 comments

@Marmot-C

Hi, I'm trying to recreate Figure 8 in the EE-LLM paper using the 7B checkpoint. Here are some of the problems I encountered during my experiments.

  1. The HELM framework needs the tokenizer used by the model to perform evaluation. What type of tokenizer does EE-LLM use? Does HELM currently support it?

  2. While testing the model, there seems to be a performance gap between my results and those shown in Figure 8. I got a ROUGE-L score of around 0.2 on the CNN/DM dataset with early_exit_thres=1.0 and around 0.15 on the XSum dataset under the same setting. I'm using the huggingface/gpt2 tokenizer because of the problem mentioned in question 1, so I'm not sure if that's part of the cause. I've also only sampled 100 examples from each dataset. Even so, the performance gap seems quite large. (With early_exit_thres ranging from 0.2 to 0.8 on CNN/DM, I'm able to get results similar to Figure 8, but not so much on XSum.)

  3. Does EE-LLM use fp32? I tried to deploy EE-LLM on a single GPU but got an OOM error; two or more GPUs work fine. A 7B model in fp16 should normally fit on a GPU with 24 GB of memory (a rough estimate is sketched below). Is there any way to convert the checkpoint?
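(For reference, here is a rough back-of-envelope estimate of weight memory at different precisions. This is my own arithmetic, not a number from the paper.)

```python
# Rough estimate of weight-only memory for a 7B-parameter model at different
# precisions; activations and the KV cache add further overhead on top.
n_params = 7e9

for precision, bytes_per_param in [("fp32", 4), ("bf16/fp16", 2)]:
    gb = n_params * bytes_per_param / 1e9
    print(f"{precision:10s}: ~{gb:.0f} GB of weights")
# fp32      : ~28 GB  -> already exceeds a 24 GB GPU
# bf16/fp16 : ~14 GB  -> fits, leaving room for activations and KV cache
```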

Thanks a lot!

@pan-x-c (Owner) commented Jul 11, 2024

  1. We use the Llama tokenizer, which is provided along with the checkpoint files (a loading sketch is below).
  2. We set --max-eval-instances to 500.
  3. We use A100 GPUs with bf16 in our experiments.

You can refer to #11 and #15 for more details.
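In case it helps, here is a minimal sketch of loading the bundled Llama tokenizer via Hugging Face transformers. The path and the tokenizer.model filename are placeholders for wherever the tokenizer shipped with your checkpoint lives; this is an illustration, not an official EE-LLM snippet.

```python
# Minimal sketch (assumes the `transformers` and `sentencepiece` packages):
# load the Llama SentencePiece tokenizer that ships alongside the checkpoint.
# The path/filename below are placeholders -- adjust them to your checkpoint.
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer(vocab_file="/path/to/EE-LLM-7B/tokenizer.model")
print(tokenizer.tokenize("Early exit lets a token leave the network before the last layer."))
```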

@Marmot-C (Author)

Thanks for the reply! With --max-eval-instances 500, I'm still having trouble recreating the XSum results in Figure 8. Are there any further modifications to be made to the HELM framework, or any specific hyperparameter settings?

@Marmot-C (Author)

I've noticed that there's another task in HELM named summarization_xsum_sampled (instead of summarization_xsum), and with that task I'm able to get results closer to the original Figure 8. Thanks!

@pan-x-c (Owner) commented Jul 15, 2024

Have you enabled pipeline parallelism with PP=4 in the benchmark? The KV recomputation method implemented for the single-GPU scenario in EE-LLM has certain efficiency defects: you may find that when early_exit_thres is very small, inference becomes slower. You can uncomment lines 24-25 of ./megatron/core/inference_params.py to improve performance.
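(Side note for readers: below is a toy, self-contained sketch of the bookkeeping behind this, not EE-LLM's actual code. When a token exits early, the deeper layers have no KV entries for it, so if a later token does run those layers, the missing entries have to be recomputed; that recomputation is the single-GPU overhead that grows as early_exit_thres shrinks. All layer math here is a dummy stand-in.)

```python
# Toy illustration of KV backfilling after an early exit (not EE-LLM's code).
NUM_LAYERS = 4

def dummy_layer(layer_idx, hidden):
    """Stand-in for a transformer layer: new hidden state plus a K/V pair."""
    return hidden + layer_idx, ("K", layer_idx), ("V", layer_idx)

def decode_token(hidden, exit_layer, kv_cache):
    """Run one token up to `exit_layer`, caching K/V only for those layers."""
    for i in range(exit_layer):
        hidden, k, v = dummy_layer(i, hidden)
        kv_cache[i].append((k, v))
    return hidden

def backfill(kv_cache, skipped_hiddens, from_layer):
    """Recompute K/V for early-exited tokens in the layers they skipped."""
    for i in range(from_layer, NUM_LAYERS):
        for h in skipped_hiddens:
            _, k, v = dummy_layer(i, h)
            kv_cache[i].append((k, v))

cache = {i: [] for i in range(NUM_LAYERS)}
h1 = decode_token(hidden=1.0, exit_layer=2, kv_cache=cache)  # token exits at layer 2
backfill(cache, skipped_hiddens=[h1], from_layer=2)          # next token needs layers 2-3 K/V
print({i: len(entries) for i, entries in cache.items()})     # {0: 1, 1: 1, 2: 1, 3: 1}
```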

@yanxi-chen (Collaborator)

> Thanks for the reply! With --max-eval-instances 500, I'm still having trouble recreating the XSum results in Figure 8. Are there any further modifications to be made to the HELM framework, or any specific hyperparameter settings?

Hi @Marmot-C, could you provide more details about how your experiment results differ from those in Figure 8 (e.g. lower scores, lower speedup, or both), and your experiment configurations (e.g. which EE inference method, TP/PP degree, whether the tokenizer choice mentioned in your initial post has been fixed, etc.)? This might help us better pinpoint the causes, besides the one mentioned by Xuchen :)

@Marmot-C (Author)

Yeah, sure. The problem is mainly lower scores on summarization_xsum, where I'm only able to get a ROUGE-L score of ~0.15 with threshold=0.8/1.0. However, on summarization_xsum_sampled, I'm able to reproduce the results shown in Figure 8. Scores on the CNN/DailyMail dataset are also slightly lower than in Figure 8: I get a ROUGE-L score of ~0.2 with threshold=0.8/1.0. Other than that, I can reproduce most of the results shown in Figure 8.

For the tokenizer, I used meta-llama/Llama-2-7b-hf as provided by HELM. I'm using TP=1, PP=2 and TP=1, PP=4; both produced similar results. The script was taken directly from examples/ee_inference/ee_inference_server.sh, with modifications corresponding to the TP and PP settings.

@pan-x-c (Owner) commented Jul 15, 2024

It may be because HELM updated the dataset. Our experiments are based on an old version of HELM (commit id 33ca6e62).

In addition, the default evaluation metric of XSUM and CNN/DM in HELM is ROUGE-2, which we changed to ROUGE-L in our experiments.
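For anyone trying to reproduce this setup: pinning HELM to that commit (e.g. git checkout 33ca6e62 after cloning the HELM repository) and scoring with ROUGE-L should get closer to the paper's numbers. Below is a quick standalone comparison of ROUGE-2 vs ROUGE-L using the rouge_score package, as an illustration only; it is not necessarily the exact scoring path inside HELM.

```python
# Standalone ROUGE-2 vs ROUGE-L comparison on a single example, using the
# `rouge_score` package (pip install rouge-score). Illustrative only.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge2", "rougeL"], use_stemmer=True)
reference = "The government announced new funding for rural schools."
prediction = "New funding for rural schools was announced by the government."

for name, score in scorer.score(reference, prediction).items():
    print(f"{name}: F1 = {score.fmeasure:.3f}")
```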

@pan-x-c (Owner) commented Jul 15, 2024

Here is the raw output generated by HELM for the 7B-EE model in our experiment. You can check if there are any mismatched settings.
result.zip

@Marmot-C (Author)

Hi, thank you very much for your help! Recently I've been reading the EE-Tuning paper; may I ask which subjects were used in the MMLU evaluation? Thanks.

@pan-x-c (Owner) commented Aug 27, 2024

We use all available subjects in MMLU, and --max-eval-instances is set to 500.
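In case it helps others, here is a hypothetical way to generate one HELM-style run entry per MMLU subject. The subject list is truncated and the model name is a placeholder; adapt both to your deployment.

```python
# Hypothetical sketch: emit one HELM run entry per MMLU subject so that all
# subjects are covered. Subject list truncated; model name is a placeholder.
subjects = [
    "abstract_algebra", "anatomy", "astronomy",  # ... MMLU has 57 subjects in total
]
entries = [f"mmlu:subject={s},model=ee-llm-7b" for s in subjects]

with open("run_entries_mmlu.conf", "w") as f:
    f.write("\n".join(entries) + "\n")

print(f"wrote {len(entries)} run entries")
```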

@github-actions (bot)

Marking as stale. No activity in 60 days.

github-actions bot added the stale label Oct 26, 2024