🚨 Implementation of the paper "Extracting Training Data from Large Language Models" (Carlini et al., 2020)

snoop2head/Language_Model_Memorization

 
 

How to Run

  1. (Optional) Change the model type and hyperparameters in config.yaml.
  2. Sample text from the victim language model:
    • Run python inference.py for single-GPU generation.
    • Run python parallel_inference.py for faster multi-GPU generation (gpt2-large only).
  3. Run python rerank.py to rerank the generated samples and retrieve the sequences most likely to be memorized.
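
The reranking step can be illustrated with one of the membership-inference metrics described by Carlini et al.: the ratio of a sample's zlib-compressed size to the log of its perplexity under the victim model. The sketch below is a minimal, pure-Python illustration and not the code in rerank.py. The function names, the (text, token_logprobs) input format, and the choice of this particular metric are assumptions made for the example.

```python
import math
import zlib


def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood) over the tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def zlib_entropy(text):
    """Size in bytes of the zlib-compressed UTF-8 text."""
    return len(zlib.compress(text.encode("utf-8")))


def rerank(candidates):
    """Sort (text, token_logprobs) pairs, most suspicious first.

    Hypothetical helper: score = zlib_entropy / log(perplexity). The score is
    high when the model finds the text easy (low perplexity) even though the
    text is not trivially compressible, which filters out repetitive samples.
    """
    scored = []
    for text, logprobs in candidates:
        log_ppl = math.log(perplexity(logprobs))
        if log_ppl <= 0:
            # Degenerate input (every token has probability 1); skip it.
            continue
        scored.append((zlib_entropy(text) / log_ppl, text))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored
```

In practice the per-token log-probabilities would come from scoring each generated sample with the victim model; here they are passed in directly so the example stays self-contained.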

References

Contribution

  • Prevents oversampling during prefix selection
  • Speeds up inference with parallel multi-GPU generation (gpt2-large only)
  • Frees GPU memory (VRAM) after each task completes
  • Filters out low-quality, repetitive generations with a repetition penalty and an n-gram restriction
  • Supports T5 (encoder-decoder) as the victim model
  • Speeds up reranking with parallel multi-GPU execution
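
To make the n-gram restriction concrete, the sketch below computes which next tokens would have to be blocked so that no n-gram appears twice in a generated sequence. It is a simplified, pure-Python illustration of the idea, not this repository's code; the function name banned_next_tokens is hypothetical. Hugging Face Transformers exposes the same idea through the no_repeat_ngram_size argument of generate (alongside repetition_penalty), and it is an assumption that the repository relies on those arguments.

```python
def banned_next_tokens(tokens, n):
    """Return the tokens that would complete an n-gram already seen in `tokens`.

    Hypothetical helper: a decoder would set the probability of every
    returned token to zero before sampling the next token.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # The last n-1 generated tokens form the prefix of the next n-gram.
    prefix = tuple(tokens[len(tokens) - (n - 1):])
    banned = set()
    # Scan every n-gram already in the sequence; if its first n-1 tokens
    # match the current prefix, its last token must not be generated again.
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n - 1]) == prefix:
            banned.add(tokens[i + n - 1])
    return banned


# Token ids are used here, but any hashable tokens work.
history = [1, 2, 3, 1, 2]
blocked = banned_next_tokens(history, 3)  # the trigram (1, 2, 3) already occurred
```

With n = 3, generating token 3 after the history above would repeat the trigram (1, 2, 3), so 3 is the only blocked token.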
