Add Data Leakage Assessment #1866
Labels
enhancement
New feature or request
Comments
Hi @dt-ahmed-touila, thanks for sharing this request! Since we're blackbox application-focused, we don't currently develop tooling for training/fine-tuning related evaluation. We'll re-open this if we get to that domain one day.
🚀 Feature Request
Measure data leakage based on a subset of samples/conversations from the training/fine-tuning dataset
🔈 Motivation
To apply generative models effectively to a specific domain or application (the legal field is a good example), companies turn to fine-tuning or alignment on proprietary/confidential datasets. Once fine-tuning is done, and given the nature of these models, there is a significant risk of training-data leakage at inference time. This leakage stems from two capacities of LLMs: memorization and association.
🛰 Alternatives