security
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Language Models [NeurIPS 2024 Datasets and Benchmarks Track]
Robust recipes to align language models with human and AI preferences
Papers and resources related to the security and privacy of LLMs 🤖
Interpretability for sequence generation models 🐛 🔍
jailbreak-evaluation is an easy-to-use Python package for evaluating language model jailbreaks.
A reading list on the safety, security, and privacy of large models (including Awesome LLM Security, Safety, etc.).
[NAACL 2024] Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey
A curated collection of awesome tools, documents, and projects about LLM security.
ICLR 2024 paper showing properties of safety tuning and exaggerated safety.