GitHub - snoop2head/Language_Model_Memorization: 🚨 Implementation of the paper "Extracting Training Data from Large Language Models"(Carlini et al, 2020)

Implementation of the paper "Extracting Training Data from Large Language Models" (Carlini et al., 2020)

(Optional) Change the model type and hyperparameters in config.yaml.
Sample text from the victim language model:
- Run python inference.py for single-GPU generation from the victim language model.
- Run python parallel_inference.py for faster multi-GPU generation from the victim language model.
Run python rerank.py to retrieve text sequence candidates that are possibly memorized.
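The reranking step can be illustrated with a minimal sketch. Carlini et al. score each generated sample with membership-inference metrics such as perplexity and the ratio of zlib entropy to log-perplexity. The code below assumes the per-token log-probabilities of each candidate are already available from the victim model; the function names are hypothetical and are not taken from rerank.py.

```python
import math
import zlib


def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood (natural log)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def zlib_entropy(text):
    """Size in bytes of the zlib-compressed UTF-8 text."""
    return len(zlib.compress(text.encode("utf-8")))


def zlib_score(text, token_logprobs):
    """Ratio of zlib entropy to log-perplexity.

    A high score means the model finds the text very likely even though
    it does not compress well, which hints at memorization rather than
    trivial repetition.
    """
    log_ppl = math.log(perplexity(token_logprobs))
    return zlib_entropy(text) / max(log_ppl, 1e-8)  # guard against log_ppl == 0


def rerank(candidates, top_k=10):
    """Sort (text, token_logprobs) pairs by descending zlib score."""
    scored = [(zlib_score(text, lps), text) for text, lps in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hypothetical candidates: (generated text, per-token log-probabilities).
candidates = [
    ("hello world bar", [-2.0, -2.0, -2.0]),
    ("hello world foo", [-0.5, -0.5, -0.5]),
]
for score, text in rerank(candidates):
    print(f"{score:8.2f}  {text}")
```

Candidates at the top of the list are the ones worth checking by hand against the training data.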

Prevents oversampling during prefix selection
Speeds up inference with parallel multi-GPU usage (only for gpt2-large)
Frees GPU VRAM after each task completes
Rules out low-quality repeated generations with a repetition penalty and an n-gram restriction
Supports T5 encoder-decoder as the victim model
Speeds up reranking with parallel multi-GPU usage
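The repetition penalty and n-gram restriction mentioned above can be sketched in plain Python. This is an illustration of the general techniques (the same ideas that Hugging Face `generate` exposes as `repetition_penalty` and `no_repeat_ngram_size`), not code from this repository; whether the repository uses those exact options is an assumption, and the function names are hypothetical.

```python
def apply_repetition_penalty(logits, generated, penalty):
    """Make already-generated tokens less likely.

    Positive logits are divided by `penalty` and negative logits are
    multiplied by it, so the score always moves toward "less likely".
    """
    out = list(logits)
    for tok in set(generated):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


def banned_next_tokens(generated, n):
    """Tokens that would repeat an n-gram already present in `generated`."""
    if n < 1 or len(generated) < n:
        return set()
    # The last n-1 tokens form the prefix of the n-gram being completed.
    prefix = tuple(generated[len(generated) - (n - 1):])
    banned = set()
    for i in range(len(generated) - n + 1):
        ngram = generated[i:i + n]
        if tuple(ngram[:-1]) == prefix:
            banned.add(ngram[-1])
    return banned


def filter_logits(logits, generated, penalty=1.2, no_repeat_ngram_size=3):
    """Apply both restrictions before sampling the next token."""
    out = apply_repetition_penalty(logits, generated, penalty)
    for tok in banned_next_tokens(generated, no_repeat_ngram_size):
        out[tok] = float("-inf")  # this token can never be sampled
    return out


# Toy vocabulary of 4 token ids; the sequence so far ends in (1, 2),
# and the trigram (1, 2, 3) has already been generated.
history = [1, 2, 3, 1, 2]
print(filter_logits([0.5, 2.0, -1.0, 1.5], history, penalty=2.0))
```

In the toy example, token 3 is blocked outright because it would repeat the trigram (1, 2, 3), while tokens 1 and 2 are only down-weighted.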

Repository contents (26 commits):
- results
- .gitignore
- LICENSE
- README.MD
- config.yaml
- dataset.py
- inference.py
- metric.py
- models.py
- parallel_inference.py
- requirements.txt
- rerank.py
- utils.py