Resources related to the safety, security, and privacy (SSP) of large models (LMs). Here, LMs include large language models (LLMs), large vision-language models (LVLMs), diffusion models, and so on.
- This repo is in progress 🔥 (currently manually collected).
- Welcome to recommend resources to us (via Issue/Pull request/Email/...)!
- [2024/01] MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance
- [2023/12] Adversarial Attacks on GPT-4 via Simple Random Search
- [2023/11] FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts
- [2023/10] Adversarial Attacks on LLMs
- [2023/10] AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models
- [2023/10] AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models
- [2023/10] Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation
- [2023/10] GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
- [2023/10] Jailbreaking Black Box Large Language Models in Twenty Queries
- [2023/09] Open Sesame! Universal Black Box Jailbreaking of Large Language Models
- [2023/07] MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots
- [2023/07] Jailbroken: How Does LLM Safety Training Fail?
- [2023/07] Universal and Transferable Adversarial Attacks on Aligned Language Models
- [2024/01] A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity
- [2023/12] Exploiting Novel GPT-4 APIs
- [2023/10] Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
- [2023/10] Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models
- [2023/10] UltraFeedback: Boosting Language Models with High-quality Feedback
- [2023/08] You Only Prompt Once: On the Capabilities of Prompt Learning on Large Language Models to Tackle Toxic Content
- [2023/05] Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes From Text-To-Image Models
- [2023/03] MGTBench: Benchmarking Machine-Generated Text Detection
- [2022/10] DE-FAKE: Detection and Attribution of Fake Images Generated by Text-to-Image Generation Models
- [2024/01] InstructTA: Instruction-Tuned Targeted Attack for Large Vision-Language Models
- [2023/08] Robustness Over Time: Understanding Adversarial Examples' Effectiveness on Longitudinal Versions of Large Language Models
- [2023/06] PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts
- [2022/05] Diffusion Models for Adversarial Purification
- [2023/02] Large Language Models for Code: Security Hardening and Adversarial Testing
- [2022/11] Do Users Write More Insecure Code with AI Assistants?
- [2023/05] Instructions as Backdoors: Backdoor Vulnerabilities of Instruction Tuning for Large Language Models
- [2023/08] LMSanitator: Defending Prompt-Tuning Against Task-Agnostic Backdoors
- [2023/11] Scalable Extraction of Training Data from (Production) Language Models
- [2023/01] Extracting Training Data from Diffusion Models
- [2020/12] Extracting Training Data from Large Language Models
- Coming soon!