Integrate Semantic Answer Similarity (SAS) into the evaluation metrics. #1703
Comments
Hi! Thank you for your interest in the library and contributing! We'd be glad to support this! It's worth a discussion on the best way to implement it, though -- how large are the typical evaluation encoder models you will want to be evaluating with? The best current way to get this added to the library is to create a custom Filter subclass (https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/filters) which, when instantiated, loads your model, runs the SAS metric, and returns the score; a custom metric would then simply return those scores, I think. If you have any questions about this, or other ideas, happy to answer them! Separately, we'd like to add LLM-as-a-judge support and will likely implement that as a 2-step process (first, create LM outputs and save them following the existing workflows, then run a secondary grading script -- which could also implement this model-based evaluation metric too).
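As a rough illustration of that suggestion (not a definitive implementation), such a filter might look like the sketch below. It assumes the `Filter` base class in `lm_eval/filters` exposes an `apply(resps, docs)` hook; the exact response structure, the encoder checkpoint, and the `answer` field are placeholders to be checked against the current codebase.

```python
# Rough sketch only: the Filter base class and its apply(resps, docs) hook are
# assumed from lm_eval/filters; the response structure, the encoder checkpoint,
# and the "answer" field name are placeholders.
from lm_eval.api.filter import Filter
from sentence_transformers import SentenceTransformer, util


class SASFilter(Filter):
    """Loads a sentence encoder once and replaces each response with its SAS score."""

    def __init__(self, model_name: str = "sentence-transformers/all-MiniLM-L6-v2", **kwargs):
        super().__init__(**kwargs)
        # The encoder is loaded when the filter is instantiated, as suggested above.
        self.encoder = SentenceTransformer(model_name)

    def apply(self, resps, docs):
        scored = []
        for resp, doc in zip(resps, docs):
            # resp may be a list of generations per document; score the first one here.
            prediction = resp[0] if isinstance(resp, (list, tuple)) else resp
            reference = doc["answer"]  # placeholder field name
            pred_emb, ref_emb = self.encoder.encode(
                [prediction, reference], convert_to_tensor=True
            )
            scored.append(util.cos_sim(pred_emb, ref_emb).item())
        # A pass-through metric downstream would then just aggregate these scores.
        return scored
```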
Hi Hailey! Thanks for the response and your interest in our proposal. We truly appreciate it! 😄 Regarding the size of the models, they typically range from around 100 MB to 500 MB, and at most ~2.5 GB in 32-bit precision. We've experimented with three different approaches to the implementation.
For the time being, we have chosen to pursue the third option within this fork, specifically in the "aquas" branch. In this branch we have defined a new task to compute both types of SAS. This task has been instrumental in our testing, and it has also helped us understand your library in more depth. The implementation of the metrics can be found there.
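Purely as an illustration, and not a copy of the code in that branch, the two SAS flavours described in the paper could be computed with `sentence-transformers` roughly as follows (the checkpoints are only example choices):

```python
# Illustrative sketch only, not the implementation from the "aquas" branch.
# It shows the two SAS flavours from the paper: bi-encoder cosine similarity
# and a cross-encoder similarity score. Model checkpoints are example choices.
from sentence_transformers import CrossEncoder, SentenceTransformer, util

bi_encoder = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")
cross_encoder = CrossEncoder("cross-encoder/stsb-roberta-large")


def sas_bi_encoder(prediction: str, reference: str) -> float:
    """Cosine similarity between independently encoded prediction and reference."""
    pred_emb, ref_emb = bi_encoder.encode([prediction, reference], convert_to_tensor=True)
    return util.cos_sim(pred_emb, ref_emb).item()


def sas_cross_encoder(prediction: str, reference: str) -> float:
    """Similarity predicted by a cross-encoder that reads both texts jointly."""
    return float(cross_encoder.predict([(prediction, reference)])[0])


# Example:
# sas_bi_encoder("Madrid is the capital of Spain.", "The capital of Spain is Madrid.")
```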
Once again, thank you for your interest. We hope you find this feedback valuable and that it helps clarify the best approach for implementing this metric within the harness. 🙂
The Semantic Answer Similarity (SAS) metric (https://arxiv.org/abs/2108.06130) employs pretrained encoders to gauge the semantic similarity between two types of texts: predictions and references. The metric can be computed in several ways, for example with a bi-encoder (cosine similarity of the two embeddings) or with a cross-encoder that scores the pair jointly.
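For instance, the bi-encoder variant reduces to the cosine similarity between the encoder embeddings of the prediction $p$ and the reference $r$, where $E(\cdot)$ denotes the pretrained sentence encoder:

$$\mathrm{SAS}_{\text{bi}}(p, r) = \cos\!\big(E(p), E(r)\big) = \frac{E(p) \cdot E(r)}{\lVert E(p) \rVert \, \lVert E(r) \rVert}$$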
Considering the following aspects:
It could be advantageous to offer the community a metric capable of comparing texts at the semantic level, particularly for tasks that measure responses from models designed to be more interactive, such as assistants.
At IIC, we are collaborating with Hugging Face and SomosNLP to create the first Spanish generative LLM leaderboard, using the lm-evaluation-harness library as the evaluation suite. The leaderboard will include QA tasks with long, complex answers evaluated with the SAS metric.
We believe that the community could also benefit from this metric. If you think this is a useful proposal, I would be delighted to open a Pull Request, following the documentation on how to add new tasks and the task guide, to implement the Semantic Answer Similarity metric and enable the creation of complex, subjective text generation evaluation tasks.
Congratulations on your work! :) We will follow the progress of this project, which is so useful for the open-source community, with great interest.