Sizhe-Chen/StruQ

StruQ: Defending Against Prompt Injection with Structured Queries

Description

Environment

Packages

  • Clone this repo and cd StruQ.
  • Create the conda env by running conda create -n struq python==3.10. If you use another env name, specify it with the -e option in run.py.
  • Install dependencies by running pip install -r requirements.txt.

Data

Base LLMs

Undefended SFT LLMs (optional)

StruQ LLMs (optional)

Script for downloading

pip install gdown
mkdir models
python download_models.py
mv llama-7b* mistral* models/

Training

  • The run.py script automatically trains and tests multiple models: it generates slurm scripts, runs them, and then deletes them.
  • nohup python -u run.py -m llama-7b -train TextTextText_None SpclSpclSpcl_NaiveCompletion -test none naive ignore completion_real > run.log 2>&1 & trains two models and then tests both. The first model is trained with three text delimiters (### instruction:) and the None attack (undefended model). The second model is trained with three special delimiters ([MARK] [INST] [COLN]) and the Naive+Completion attacks (StruQ-defended model). Both models are tested on the naive, ignore, and completion_real attacks.
  • The training set always contains 52K samples, of which 26K are guaranteed to be unchanged. In the remaining 26K, samples without an input are also left unchanged. Samples with an input are prompt-injected using another random sample, with injection methods Naive:Completion = 1:1.
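The data mix described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the repo's actual code: the function names, the choice of the first half as the unchanged half, the Alpaca-style instruction/input/output fields, and the exact injected strings (including the "### response:" delimiter) are all hypothetical.

```python
import random


def naive_inject(sample, other):
    # Naive attack (simplified assumption): append another sample's
    # instruction directly to the data part.
    return {**sample, "input": sample["input"] + " " + other["instruction"]}


def completion_inject(sample, other):
    # Completion attack (simplified assumption): fake a finished response,
    # then start a new instruction inside the data part.
    fake = (
        "\n\n### response:\n" + sample["output"]
        + "\n\n### instruction:\n" + other["instruction"]
    )
    return {**sample, "input": sample["input"] + fake}


def build_training_set(samples, seed=0):
    """Return a list of the same size as `samples` with the mix described above."""
    rng = random.Random(seed)
    half = len(samples) // 2
    # Assumption: the guaranteed-unchanged half is simply the first half.
    out = list(samples[:half])
    for i in range(half, len(samples)):
        sample = samples[i]
        if not sample["input"]:
            # Samples without an input are left unchanged.
            out.append(sample)
            continue
        # Pick another random sample (never the sample itself).
        j = rng.randrange(len(samples) - 1)
        if j >= i:
            j += 1
        # Naive:Completion = 1:1 in expectation.
        attack = rng.choice([naive_inject, completion_inject])
        out.append(attack(sample, samples[j]))
    return out
```

Drawing the attack with rng.choice gives the 1:1 ratio only in expectation; alternating the two attacks would give an exact split if that matters.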

Testing

  • Running run.py should trigger testing (on utility and security) at the end, once the model is saved.
  • Run only testing by python test.py -m models/llama-7b_SpclSpclSpcl_NaiveCompletion_2024-02-02-00-00-00 -a none naive ignore completion_real.
  • Run GCG testing by python test_gcg.py -m models/llama-7b_SpclSpclSpcl_NaiveCompletion_2024-02-02-00-00-00 --sample_ids 0 1 2. Leaving --sample_ids at its default (None) runs all samples.
  • All training and testing logs are saved to, e.g., logs/llama-7b_SpclSpclSpcl_NaiveCompletion_2024-02-02-00-00-00.
  • Note that the attack success rate (ASR) numbers reported by the code are higher than the actual ASR, which should be calculated after manually removing false positive samples.
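The manual ASR correction mentioned above amounts to simple bookkeeping, sketched below. The helper is hypothetical and not part of the repo. It assumes you have collected the IDs of the samples the code flagged as successful attacks, plus the IDs you judged to be false positives after reading the outputs.

```python
def corrected_asr(flagged_ids, false_positive_ids, num_samples):
    """Attack success rate after dropping manually identified false positives.

    flagged_ids: sample IDs the automatic check counted as successful attacks.
    false_positive_ids: the subset of flagged IDs judged to be false positives.
    num_samples: total number of tested samples.
    """
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")
    flagged = set(flagged_ids)
    false_positives = set(false_positive_ids)
    # A false positive must have been flagged in the first place.
    unknown = false_positives - flagged
    if unknown:
        raise ValueError(f"not flagged as successes: {sorted(unknown)}")
    return len(flagged - false_positives) / num_samples
```

For example, if 4 of 10 samples are flagged and one of them turns out to be a false positive, the reported ASR of 0.4 drops to 0.3.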
