SmoothLLM

This is the official source code for "SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks" by Alex Robey, Eric Wong, Hamed Hassani, and George J. Pappas. To learn more about our work, see our blog post.

Installation

Step 1: Create an empty virtual environment.

conda create -n smooth-llm python=3.10
conda activate smooth-llm

Step 2: Install the source code for "Universal and Transferable Adversarial Attacks on Aligned Language Models."

git clone https://github.com/llm-attacks/llm-attacks.git
cd llm-attacks
pip install -e .

Step 3: Download the weights for Vicuna and/or Llama2 from HuggingFace.
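One way to fetch the weights is the `snapshot_download` function from the `huggingface_hub` package. The sketch below is only an illustration: the repository IDs, the `./weights` directory, and the helper names `local_dir_for` and `download_weights` are assumptions, not part of this codebase. The Llama2 checkpoint is gated, so you will likely need to accept its license on HuggingFace and log in before downloading.

```python
from pathlib import Path

# Assumed Hugging Face Hub repository IDs; swap in the checkpoints you want.
REPO_IDS = {
    "vicuna": "lmsys/vicuna-13b-v1.5",
    "llama2": "meta-llama/Llama-2-7b-chat-hf",
}


def local_dir_for(model_name: str, root: str = "./weights") -> Path:
    """Return the (hypothetical) local directory for a model's weights."""
    repo_id = REPO_IDS[model_name]
    return Path(root) / repo_id.split("/")[-1]


def download_weights(model_name: str, root: str = "./weights") -> Path:
    """Download one checkpoint. Requires `pip install huggingface_hub` and network access."""
    # Imported lazily so the rest of this file works without the package.
    from huggingface_hub import snapshot_download

    target = local_dir_for(model_name, root)
    snapshot_download(repo_id=REPO_IDS[model_name], local_dir=str(target))
    return target


# Example (not run here): download_weights("vicuna")
```

The directory returned by `download_weights` is the value you would then use for the paths in Step 4.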

Step 4: Change the paths to the model and tokenizer in lib/model_configs.py depending on which set(s) of weights you downloaded in Step 3.

MODELS = {
    'llama2': {
        'model_path': '/shared_data0/arobey1/llama-2-7b-chat-hf',
        'tokenizer_path': '/shared_data0/arobey1/llama-2-7b-chat-hf',
        'conversation_template': 'llama-2'
    },
    'vicuna': {
        'model_path': '/shared_data0/arobey1/vicuna-13b-v1.5',
        'tokenizer_path': '/shared_data0/arobey1/vicuna-13b-v1.5',
        'conversation_template': 'vicuna'
    }
}

The conversation_template value is used to initialize a fastchat conversation template.
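Before launching an experiment, it can help to confirm that the paths you entered actually exist. The following is a minimal sketch, assuming the MODELS dictionary has the shape shown above; the helper `missing_paths` is hypothetical and not part of this repository.

```python
from pathlib import Path


def missing_paths(models: dict) -> list[str]:
    """Return 'name.key' labels for model/tokenizer paths that are not directories."""
    missing = []
    for name, cfg in models.items():
        for key in ("model_path", "tokenizer_path"):
            if not Path(cfg[key]).is_dir():
                missing.append(f"{name}.{key}")
    return missing


# Example, assuming you run from the repository root and lib/ is importable:
#   from lib.model_configs import MODELS
#   print(missing_paths(MODELS))  # an empty list means every path was found
```

If you downloaded only one set of weights, the entry for the other model will be reported as missing, which is harmless as long as you do not select it with --target_model.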

Experiments

We provide ten adversarial suffixes, generated by running GCG against Vicuna and Llama2, in the data/ directory. You can run SmoothLLM with the following command:

python main.py \
    --results_dir ./results \
    --target_model vicuna \
    --attack GCG \
    --attack_logfile data/GCG/vicuna_behaviors.json \
    --smoothllm_pert_type RandomSwapPerturbation \
    --smoothllm_pert_pct 10 \
    --smoothllm_num_copies 10
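The command above can also be assembled programmatically, which is convenient when trying several settings in a row. The sketch below only builds the argument lists; it uses the flags shown above, while the helper names and the grid values are illustrative assumptions.

```python
import itertools
import shlex

# Perturbation class names accepted by --smoothllm_pert_type.
PERT_TYPES = [
    "RandomSwapPerturbation",
    "RandomPatchPerturbation",
    "RandomInsertPerturbation",
]


def build_command(pert_type, pert_pct, num_copies,
                  target_model="vicuna",
                  attack_logfile="data/GCG/vicuna_behaviors.json",
                  results_dir="./results"):
    """Return the main.py invocation as an argument list."""
    return [
        "python", "main.py",
        "--results_dir", results_dir,
        "--target_model", target_model,
        "--attack", "GCG",
        "--attack_logfile", attack_logfile,
        "--smoothllm_pert_type", pert_type,
        "--smoothllm_pert_pct", str(pert_pct),
        "--smoothllm_num_copies", str(num_copies),
    ]


def sweep_commands(pert_pcts=(5, 10), num_copies=(2, 10)):
    """Build one command per combination (grid values are illustrative)."""
    grid = itertools.product(PERT_TYPES, pert_pcts, num_copies)
    return [build_command(t, q, n) for t, q, n in grid]


# Print the commands; pass each list to subprocess.run(cmd, check=True) to execute.
for cmd in sweep_commands():
    print(shlex.join(cmd))
```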

You can also change SmoothLLM's hyperparameters (the number of copies, the perturbation percentage, and the perturbation function) by changing the corresponding named arguments: --smoothllm_num_copies, --smoothllm_pert_pct, and --smoothllm_pert_type. At present, we support three kinds of perturbations: swaps, patches, and insertions. For more details, see Algorithm 2 in our paper. To select a perturbation, set --smoothllm_pert_type to RandomSwapPerturbation, RandomPatchPerturbation, or RandomInsertPerturbation.
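To give a feel for what the three perturbation types do to a prompt, here is a minimal character-level sketch. It is not the code in lib/; the function names, the `string.printable` alphabet, and the rounding of the percentage are assumptions made for illustration.

```python
import random
import string

ALPHABET = string.printable  # assumed alphabet for sampling replacement characters


def random_swap(prompt: str, pct: float, rng: random.Random) -> str:
    """Replace pct% of the characters, at random positions, with random characters."""
    chars = list(prompt)
    k = int(len(chars) * pct / 100)
    for i in rng.sample(range(len(chars)), k):
        chars[i] = rng.choice(ALPHABET)
    return "".join(chars)


def random_patch(prompt: str, pct: float, rng: random.Random) -> str:
    """Replace one contiguous block covering pct% of the prompt with random characters."""
    width = int(len(prompt) * pct / 100)
    start = rng.randint(0, len(prompt) - width)
    patch = "".join(rng.choice(ALPHABET) for _ in range(width))
    return prompt[:start] + patch + prompt[start + width:]


def random_insert(prompt: str, pct: float, rng: random.Random) -> str:
    """Insert random characters at pct% of the positions, lengthening the prompt."""
    chars = list(prompt)
    k = int(len(chars) * pct / 100)
    # Insert from right to left so the remaining sampled indices stay valid.
    for i in sorted(rng.sample(range(len(chars)), k), reverse=True):
        chars.insert(i, rng.choice(ALPHABET))
    return "".join(chars)


rng = random.Random(0)
prompt = "Tell me a story about a robot. " * 3
# The number of copies corresponds to --smoothllm_num_copies.
copies = [random_swap(prompt, 10, rng) for _ in range(10)]
print(copies[0])
```

Swaps and patches keep the prompt length fixed, while insertions grow it by the chosen percentage.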

Reproducibility

The following codebases have reimplemented our results:

https://gist.github.com/deadbits/4ab3f807441d72a2cf3105d0aea9de48

Citation

If you find this codebase useful in your research, please consider citing:

@article{robey2023smoothllm,
  title={SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks},
  author={Robey, Alexander and Wong, Eric and Hassani, Hamed and Pappas, George J},
  journal={arXiv preprint arXiv:2310.03684},
  year={2023}
}

License

smooth-llm is licensed under the terms of the MIT license. See LICENSE for more details.
