[QUESTION] Questions on the performance of EE-LLM on HELM benchmark #18
Comments
Thanks for the reply! With --max-eval-instances 500, I'm still having trouble recreating the XSum results in Figure 8. Are there any further modifications to be made to the HELM framework, or any specific hyperparameter settings?
I've noticed that there's another task in HELM named summarization_xsum_sampled (instead of summarization_xsum), and with that task I'm able to get results closer to the original Figure 8. Thanks!
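For concreteness, the kind of HELM run being discussed here would look roughly like the sketch below. Run-entry syntax and flag names vary across HELM versions, and the model identifier is only a placeholder for however the EE-LLM server is registered with HELM.

```bash
# Hedged sketch of a HELM run over the sampled XSUM scenario with 500 instances.
# Flag names and run-entry syntax differ between HELM versions; "eellm/7b-ee"
# is a placeholder model identifier, not a real registered model.
helm-run \
  --run-entries "summarization_xsum_sampled:model=eellm/7b-ee" \
  --max-eval-instances 500 \
  --suite ee-llm-repro
helm-summarize --suite ee-llm-repro   # aggregate the metrics for the suite
```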
Have you enabled pipeline parallelism with PP=4 in the benchmark? The KV recomputation method implemented for the single-GPU scenario in EE-LLM has some efficiency drawbacks; you may find that when early_exit_thres is very small, inference actually becomes slower. You can uncomment lines 24-25 of
Hi @Marmot-C, could you provide more details about how your experiment results differ from those in Figure 8 (e.g. lower scores, lower speedup, or both), and your configurations of experiments (e.g. which EE inference method, TP/PP degree, whether the choice of tokenizer mentioned in your initial post has been fixed, etc.)? This might help us better pinpoint the causes, besides the one mentioned by Xuchen :)
Yeah, sure. The problem is mainly lower scores on summarization_xsum, where I'm only able to get a ROUGE-L score of ~0.15 with threshold=0.8/1.0. However, on summarization_xsum_sampled I'm able to reproduce the results shown in Figure 8. Scores on the CNN/DailyMail dataset are also slightly lower than Figure 8, where I get a ROUGE-L score of ~0.2 with threshold=0.8/1.0. Other than that, I can reproduce most results shown in Figure 8. For the tokenizer I used meta-llama/Llama-2-7b-hf as provided by HELM. I'm using TP=1, PP=2 and TP=1, PP=4; both produced similar results. The script was taken directly from examples/ee_inference/ee_inference_server.sh, with modifications corresponding to the TP and PP settings.
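For reference, the TP/PP modification described above amounts to something like the sketch below. The parallel-size flags are standard Megatron-LM options that EE-LLM inherits; the launcher, script path, and checkpoint location are assumptions, and the actual examples/ee_inference/ee_inference_server.sh may organize these differently.

```bash
# Hedged sketch of the parallelism settings for the EE inference server.
# Only the TP/PP-related pieces are shown; the rest of the original script
# (model size, tokenizer, server port, etc.) is omitted.
TP=1                                  # tensor-model-parallel degree
PP=4                                  # pipeline-model-parallel degree
CHECKPOINT_PATH=/path/to/EE-LLM-7B    # placeholder checkpoint location

torchrun --nproc_per_node=$((TP * PP)) tools/run_text_generation_server.py \
    --tensor-model-parallel-size "$TP" \
    --pipeline-model-parallel-size "$PP" \
    --load "$CHECKPOINT_PATH"
```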
It may be because HELM updated the dataset. Our experiments are based on an old version of HELM (commit id ). In addition, the default evaluation metric for XSUM and CNN/DM in HELM is ROUGE-2, which we changed to ROUGE-L in our experiments.
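To sanity-check which metric a reported number corresponds to, ROUGE-2 and ROUGE-L can be compared directly with the standalone rouge-score package; this is an independent check, not the code path HELM itself uses.

```bash
# Compare ROUGE-2 and ROUGE-L on the same reference/prediction pair using
# the rouge-score package (independent of HELM's own metric pipeline).
pip install rouge-score
python - <<'EOF'
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge2", "rougeL"], use_stemmer=True)
reference = "the cat sat quietly on the mat"
prediction = "a cat was sitting on the mat"
print(scorer.score(reference, prediction))  # dict of Score(precision, recall, fmeasure)
EOF
```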
Here is the raw output generated by HELM for the 7B-EE model in our experiment. You can check if there are any mismatched settings.
Hi, thank you very much for your help! Recently I've been reading the EE-Tuning paper; may I ask which subjects were used in the MMLU evaluation? Thanks.
We use all available subjects in MMLU, and
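For completeness, the MMLU subject configurations can be enumerated from the Hugging Face copy of the benchmark; this is an independent listing, and the exact set of configs (the 57 subjects plus aggregate splits such as "all") depends on the dataset mirror.

```bash
# List MMLU subject configs from the cais/mmlu dataset on the Hugging Face hub.
python - <<'EOF'
from datasets import get_dataset_config_names
print(get_dataset_config_names("cais/mmlu"))
EOF
```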
Hi, I'm trying to recreate Figure 8 in the EE-LLM paper using the 7B checkpoint. Here are some of the problems I encountered during the experiments.
The HELM framework needs the tokenizer used by the model to perform evaluation. What type of tokenizer does EE-LLM use? Does the HELM framework currently support it? (A tokenizer-loading sketch follows this post.)
While testing the model, there seems to be a performance gap between my results and those shown in Figure 8. I got a ROUGE-L score of around 0.2 on the CNN/DM dataset with early_exit_thres=1.0, and around 0.15 on the XSum dataset under the same setting. I'm using the huggingface/gpt2 tokenizer because of the problem mentioned in question 1, and I'm not sure if that's part of the cause. Also, I've only sampled 100 examples from the dataset. However, the performance gap still seems quite large. (With early_exit_thres ranging from 0.2 to 0.8 on CNN/DM, I'm able to get results similar to those shown in Figure 8. Not so much on the XSum dataset, though.)
Does EE-LLM use fp32? I tried to deploy EE-LLM on a single GPU but got OOM; two or more GPUs work fine. Usually a 7B model in fp16 should fit on a GPU with 24 GB of memory. Is there any way to convert the checkpoint?
Thanks a lot!
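On the tokenizer question, the later replies in this thread use the meta-llama/Llama-2-7b-hf tokenizer provided by HELM. A quick local check that this tokenizer loads and tokenizes differently from gpt2 might look like the sketch below; it assumes access to the gated meta-llama repository.

```bash
# Hedged sketch: compare token counts from the Llama-2 tokenizer and gpt2
# on a sample string. Assumes the gated meta-llama repo is accessible.
python - <<'EOF'
from transformers import AutoTokenizer

text = "Early-exit inference trades a little accuracy for speed."
for name in ["meta-llama/Llama-2-7b-hf", "gpt2"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(name, len(tok(text).input_ids))
EOF
```

On the fp32/OOM question, the arithmetic alone is consistent with what was observed: 7B parameters take roughly 28 GB in fp32 but only about 14 GB in fp16/bf16, so the weights fit on a single 24 GB GPU only after conversion to half precision.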