
Add Data Leakage Assessment #1866

Closed
dt-ahmed-touila opened this issue Mar 26, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@dt-ahmed-touila

dt-ahmed-touila commented Mar 26, 2024

🚀 Feature Request

Measure data leakage based on a subset of samples/conversations from the training/fine-tuning dataset.

🔈 Motivation

To apply generative models efficiently to a specific domain or application (the legal field is a good example), companies turn to fine-tuning/alignment on proprietary or confidential datasets. Once fine-tuning is done, and given the nature of these models, there is a significant risk of training data leaking at inference time. This leakage stems from two capabilities of LLMs: memorization and association.
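The memorization side of such an assessment could be sketched as a verbatim-extraction probe: prompt the fine-tuned model with prefixes drawn from the confidential training set and measure how often it reproduces the held-out continuation. The sketch below assumes nothing about the requested feature's actual design; `generate` and `toy_model` are hypothetical stand-ins for the fine-tuned model's inference call.

```python
# Minimal sketch of a verbatim-memorization probe, assuming access to a
# text-generation callable. `generate` is a hypothetical stand-in for the
# fine-tuned model's inference API.
from typing import Callable, List, Tuple


def memorization_rate(
    samples: List[Tuple[str, str]],
    generate: Callable[[str], str],
) -> float:
    """Fraction of training samples whose confidential continuation the
    model reproduces verbatim when prompted with the prefix alone."""
    if not samples:
        return 0.0
    hits = 0
    for prefix, continuation in samples:
        completion = generate(prefix)
        # Count a leak if the continuation appears verbatim in the output.
        if continuation.strip() and continuation.strip() in completion:
            hits += 1
    return hits / len(samples)


# Hypothetical toy model that has "memorized" one training record.
def toy_model(prompt: str) -> str:
    if prompt.startswith("Client name:"):
        return "ACME Corp, case #1234"  # leaked training data
    return "I cannot share that."


rate = memorization_rate(
    [("Client name:", "ACME Corp"), ("Invoice total:", "$9,999")],
    toy_model,
)
# rate == 0.5: one of the two confidential continuations leaked
```

A real assessment would also need to cover the association side (the model revealing a confidential fact when asked about a related entity), which exact-match probing like this does not capture.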

🛰 Alternatives

  • Ad hoc implementations 🐰
@dt-ahmed-touila dt-ahmed-touila added the enhancement New feature or request label Mar 26, 2024
@luca-martial
Contributor

Hi @dt-ahmed-touila, thanks for sharing this request! Since we're focused on black-box applications, we don't currently develop tooling for training/fine-tuning-related evaluation. We'll reopen this if we get to that domain one day.

@luca-martial closed this as not planned Apr 16, 2024
2 participants