
[QUESTION] How can I convert checkpoint tuned by EE-Tuning to Huggingface format? #15

Open · Mr-lonely0 opened this issue Jun 11, 2024 · 12 comments

@Mr-lonely0

I have fine-tuned the llama-7b model using EE-Tuning, and I now need to convert the checkpoint to the Hugging Face format to proceed with the evaluation process. How should I do this?

@pan-x-c (Owner) commented Jun 12, 2024

Same as #10 and #7. Currently there is no way to convert the checkpoint to the Hugging Face format.

@Mr-lonely0 (Author)

Thanks for the information!

I am also curious about how I can reproduce the results demonstrated in your paper and perform the downstream evaluation on the HELM benchmark. Could you please provide more details on this?

@pan-x-c (Owner) commented Jun 12, 2024

We modified the MegatronClient, adding parameters related to EE-LLM. All other parts are directly inherited from HELM.
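
For reference, the core idea might look roughly like the sketch below. This is not the actual EE-LLM MegatronClient code; the server URL and the fields use_early_exit / early_exit_thres are illustrative assumptions. The point is simply that the client forwards extra early-exit generation parameters to a Megatron text-generation server, while the rest of the evaluation pipeline is inherited from HELM.

    # Minimal sketch only -- not the actual EE-LLM MegatronClient.
    # The server URL and the "use_early_exit" / "early_exit_thres" fields are
    # hypothetical; everything else follows the usual pattern of posting a
    # generation request to a local Megatron server.
    import requests

    def ee_generate(prompt: str, max_tokens: int = 128,
                    server_url: str = "http://localhost:5000/api") -> dict:
        payload = {
            "prompts": [prompt],
            "tokens_to_generate": max_tokens,
            # EE-LLM-specific knobs added on top of the stock Megatron request:
            "use_early_exit": True,       # hypothetical flag
            "early_exit_thres": 0.8,      # hypothetical confidence threshold
        }
        resp = requests.post(server_url, json=payload)
        resp.raise_for_status()
        return resp.json()  # later wrapped into a HELM RequestResult by the client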

@Mr-lonely0 (Author)

Actually, I'm not familiar with HELM. Could you provide a demo or some guidance on how to use the MegatronClient script?

@pan-x-c (Owner) commented Jun 12, 2024

You can refer to the demo in data-juicer.
Note that HELM itself is a heavy evaluation framework, and installing and using it can be difficult. You may need to turn to the official HELM repository for help.

@Mr-lonely0 (Author)

I really appreciate your help!
I'll check the demo you mentioned and give it a try.
Thanks again for your time!

@Mr-lonely0 (Author)

Hello again!

I have tried the evaluation framework proposed in data-juicer and obtained some benchmark results, such as ROUGE-2 on CNN/DM, F1 on NarrativeQA, and EM on MMLU. However, I'm confused about how I can get efficiency results, such as inference time, for the generation process.

What should I modify in mymodel_example.yaml to parse the corresponding metric from HELM output?

I would greatly appreciate your help and look forward to your prompt response.

@pan-x-c (Owner) commented Jun 17, 2024

If you use the HELM provided by Data-Juicer, you can modify src/helm/benchmark/static/schema.yaml to adjust the metrics. For example, we modified the efficiency item to:

  - name: efficiency
    display_name: Efficiency
    metrics:
    - name: inference_runtime
      split: ${main_split}

The inference_runtime is the metric used in our paper.

In addition, you also need to modify your megatron_client.py to return the new metric in your response. For example,

        # In megatron_client.py, the client's request handler wraps the server
        # response into HELM's RequestResult:
        return RequestResult(
            success=True,
            cached=cached,
            request_time=response['request_time'],          # wall-clock generation time in seconds
            request_datetime=response['request_datetime'],
            completions=completions,
            embedding=[]
        )

HELM will use the request_time field to calculate the inference_runtime metric.
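
A minimal sketch of how request_time could be populated, assuming the hypothetical ee_generate helper from the earlier sketch; the point is simply to record wall-clock seconds around the generation call and attach them to the response dict that the client later wraps into a RequestResult:

    # Illustrative only: measure wall-clock generation time in megatron_client.py.
    import time
    from datetime import datetime

    start = time.time()
    response = ee_generate("Summarize the following article: ...")
    response['request_time'] = time.time() - start                   # seconds
    response['request_datetime'] = int(datetime.now().timestamp())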

Note that the demo script provided by Data-Juicer is not for EE models; it only records some metrics for pretraining.
To view the full evaluation results, you should follow the standard HELM usage process, e.g. helm-server after helm-summarize.
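
Roughly, the standard sequence looks like the sketch below. Exact flag names differ between HELM versions, and run_entries.conf and the suite name are placeholders for your own run spec:

    # Rough outline of the standard HELM workflow; flags may vary by version.
    helm-run --conf-paths run_entries.conf --suite ee-eval --max-eval-instances 100
    helm-summarize --suite ee-eval
    helm-server   # then open the printed local URL to browse the full results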

@Mr-lonely0 (Author) commented Jun 18, 2024

Thank you!

I have tested the standard HELM usage process with the original Llama-2. On the website generated by helm-server, I noticed that no efficiency metrics are recorded in the leaderboard presented by HELM.
[screenshot: HELM leaderboard without efficiency metrics]

However, I found the Observed inference runtime (s) in the Predictions section for the corresponding dataset (cnn_dailymail, as shown below).
image

If the former is correct, could you please clarify how I can obtain the efficiency metrics? Alternatively, if the latter is correct, why is there only one count in the Predictions section when I set --max-eval-instances=100?

@pan-x-c (Owner) commented Jun 19, 2024

Your client must return those metrics in its response before HELM can summarize them, so you need to modify your client first, as shown in my previous comment.
For example, HELM will use the request_time field in the response to calculate the inference_runtime metric.

In our paper's experiments, we set --max-eval-instances to 500.

@Mr-lonely0 (Author)

Thanks!
I have figured it out. Really appreciate your time!

@github-actions (bot)

Marking as stale. No activity in 60 days.

github-actions bot added the stale label on Aug 18, 2024