Introduction

The resources related to the trustworthiness of large models (LMs) across multiple dimensions (e.g., safety, security, and privacy), with a special focus on multi-modal LMs (e.g., vision-language models and diffusion models).

This repo is in progress 🌱 (currently manually collected).
Badges:
- Model:
- Comment:
- Venue (Continuous update): or
🌻 Welcome to recommend resources to us via Issues with the following format (please fill in this table):

Title	Link	Code	Venue	Classification	Model	Comment
aa	arxiv	github	bb'23	A1. Jailbreak	LLM	Agent

News

[2023.01.20] 🔥 We collect 3 related papers from NDSS'24!
[2023.01.17] 🔥 We collect 108 related papers from ICLR'24!
[2023.01.09] 🔥 LM-SSP is released!

Book

[2024/01] NIST Trustworthy and Responsible AI Reports

Competition

[2024/03] Large Language Model Capture-the-Flag (LLM CTF) Competition @ SaTML 2024
[2024/02] LLM - Detect AI Generated Text
[2024/02] Find the Trojan: Universal Backdoor Detection in Aligned Large Language Models @ SaTML 2024
[2023/01] Training Data Extraction Challenge @ SaTML 2023
[2022/12] Machine Learning Model Attribution Challenge @ SaTML 2023

Leaderboard

[2024/01] LLM Safety Leaderboard
[2024/01] Hallucinations Leaderboard

Toolkit

[2024/02] EasyJailbreak
[2023/05] Ragas
[2023/03] AutoGen

Survey

[2024/02] Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey
[2024/02] A Survey of Text Watermarking in the Era of Large Language Models
[2024/02] Safety of Multimodal Large Language Models on Images and Text
[2024/02] A Survey on Hallucination in Large Vision-Language Models
[2024/01] Security and Privacy Challenges of Large Language Models: A Survey
[2024/01] Black-Box Access Is Insufficient for Rigorous AI Audits
[2024/01] Red Teaming Visual Language Models
[2024/01] Risk Taxonomy, Mitigation, and Assessment Benchmarks of Large Language Model Systems
[2024/01] TrustLLM: Trustworthiness in Large Language Models
[2023/12] Privacy Issues in Large Language Models: A Survey
[2023/12] A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly
[2023/10] Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks
[2023/09] AgentBench: Evaluating LLMs as Agents
[2023/08] Trustworthy LLMs: A Survey and Guideline for Evaluating Large Language Models' Alignment
[2023/07] A Comprehensive Overview of Large Language Models
[2023/06] DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models
[2023/05] ChatGPT Needs SPADE (Sustainability, PrivAcy, Digital Divide, and Ethics) Evaluation: A Review
[2023/04] Safety Assessment of Chinese Large Language Models
[2023/03] A Survey of Large Language Models
[2022/11] Holistic Evaluation of Language Models
[2022/08] Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
[2022/06] Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models
[2021/11] Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models

Paper

A. Safety

B. Security

B1. Adversarial Examples

[2024/02] Fast Adversarial Attacks on Language Models in One GPU Minute
[2024/02] Stealthy Attack on Large Language Model Based Recommendation
[2024/02] BSPA: Exploring Black-Box Stealthy Prompt Attacks Against Image Generators
[2024/02] Stop Reasoning! When Multimodal LLMs With Chain-of-Thought Reasoning Meets Adversarial Images
[2024/02] The Wolf Within: Covert Injection of Malice Into MLLM Societies via an MLLM Operative
[2024/02] Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-Shot LLM Assessment
[2024/02] Groot: Adversarial Testing for Generative Text-to-Image Models With Tree-Based Semantic Transformation
[2024/02] Exploring the Adversarial Capabilities of Large Language Models
[2024/02] Prompt Perturbation in Retrieval-Augmented Generation Based Large Language Models
[2024/02] Adversarial Text Purification: A Large Language Model Approach for Defense
[2024/02] Cheating Suffix: Targeted Attack to Text-to-Image Diffusion Models With Multi-Modal Priors
[2024/01] Revealing Vulnerabilities in Stable Diffusion via Targeted Attacks
[2024/01] Exploring Adversarial Attacks Against Latent Diffusion Model From the Perspective of Adversarial Transferability
[2024/01] Adversarial Examples Are Misaligned in Diffusion Model Manifolds
[2024/01] INSTRUCTTA: Instruction-Tuned Targeted Attack for Large Vision-Language Models
[2023/12] On the Robustness of Large Multimodal Models Against Image Adversarial Attacks
[2023/12] Causality Analysis for Evaluating the Security of Large Language Models
[2023/12] Hijacking Context in Large Multi-Modal Models
[2023/11] Improving the Robustness of Transformer-Based Large Language Models With Dynamic Attention
[2023/11] Unveiling Safety Vulnerabilities of Large Language Models
[2023/11] Can Protective Perturbation Safeguard Personal Data From Being Exploited by Stable Diffusion?
[2023/11] DiffAttack: Evasion Attacks Against Diffusion-Based Adversarial Purification
[2023/11] How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs
[2023/10] Misusing Tools in Large Language Models With Visual Adversarial Examples
[2023/09] Inducing High Energy-Latency of Large Vision-Language Models With Verbose Images
[2023/09] An Image Is Worth 1000 Lies: Transferability of Adversarial Images Across Prompts on Vision-Language Models
[2023/09] An LLM Can Fool Itself: A Prompt-Based Adversarial Attack
[2023/09] Language Model Detectors Are Easily Optimized Against
[2023/09] Leveraging Optimization for Adaptive Attacks on Image Watermarks
[2023/09] Training Socially Aligned Language Models on Simulated Social Interactions
[2023/09] How Robust Is Google's Bard to Adversarial Image Attacks?
[2023/09] Image Hijacks: Adversarial Images Can Control Generative Models at Runtime
[2023/08] Ceci N'est Pas Une Pomme: Adversarial Illusions in Multi-Modal Embeddings
[2023/08] On the Adversarial Robustness of Multi-Modal Foundation Models
[2023/08] Robustness Over Time: Understanding Adversarial Examples' Effectiveness on Longitudinal Versions of Large Language Models
[2023/07] Certified Robustness for Large Language Models With Self-Denoising
[2023/06] Adversarial Examples in the Age of ChatGPT
[2023/06] Are Aligned Neural Networks Adversarially Aligned?
[2023/06] PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts
[2023/06] Stable Diffusion Is Unstable
[2023/06] Unlearnable Examples for Diffusion Models: Protect Data From Unauthorized Exploitation
[2023/06] Visual Adversarial Examples Jailbreak Large Language Models
[2023/05] Assessing Hidden Risks of LLMs: An Empirical Study on Robustness, Consistency, and Credibility
[2023/05] On Evaluating Adversarial Robustness of Large Vision-Language Models
[2023/03] Anti-DreamBooth: Protecting Users From Personalized Text-to-Image Synthesis
[2023/02] Large Language Models Can Be Easily Distracted by Irrelevant Context
[2023/02] On the Robustness of ChatGPT: An Adversarial and Out-of-Distribution Perspective
[2023/02] Adversarial Example Does Good: Preventing Painting Imitation From Diffusion Models via Adversarial Examples
[2023/02] Raising the Cost of Malicious AI-Powered Image Editing
[2023/01] On Robustness of Prompt-Based Semantic Parsing With Large Pre-Trained Language Model: An Empirical Study on Codex
[2022/12] Understanding Zero-Shot Adversarial Robustness for Large-Scale Model

B2. Poisoning

[2024/02] On Trojan Signatures in Large Language Models of Code
[2024/02] WIPI: A New Web Threat for LLM-Driven Web Agents
[2024/02] VL-Trojan: Multimodal Instruction Backdoor Attacks Against Autoregressive Visual Language Models
[2024/02] Universal Vulnerabilities in Large Language Models: Backdoor Attacks for in-Context Learning
[2024/02] Learning to Poison Large Language Models During Instruction Tuning
[2024/02] Defending Against Weight-Poisoning Backdoor Attacks for Parameter-Efficient Fine-Tuning
[2024/02] Acquiring Clean Language Models From Backdoor Poisoned Datasets by Downscaling Frequency Space
[2024/02] Rapid Adoption, Hidden Risks: The Dual Impact of Large Language Model Customization
[2024/02] Secret Collusion Among Generative AI Agents
[2024/02] Test-Time Backdoor Attacks on Multimodal Large Language Models
[2024/02] PoisonedRAG: Knowledge Poisoning Attacks to Retrieval-Augmented Generation of Large Language Models
[2024/02] Shadowcast: Stealthy Data Poisoning Attacks Against Vision-Language Models
[2024/01] Universal Vulnerabilities in Large Language Models: In-Context Learning Backdoor Attacks
[2024/01] Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training
[2023/12] Poisoned ChatGPT Finds Work for Idle Hands: Exploring Developers' Coding Practices With Insecure Suggestions From Poisoned AI Models
[2023/12] Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections
[2023/12] Unleashing Cheapfakes Through Trojan Plugins of Large Language Models
[2023/11] Test-Time Backdoor Mitigation for Black-Box Large Language Models With Defensive Demonstrations
[2023/10] Leveraging Diffusion-Based Image Variations for Robust Training on Poisoned Data
[2023/10] Large Language Models Are Better Adversaries: Exploring Generative Clean-Label Backdoor Attacks Against Text Classifiers
[2023/10] PoisonPrompt: Backdoor Attack on Prompt-Based Large Language Models
[2023/10] Prompt-Specific Poisoning Attacks on Text-to-Image Generative Models
[2023/10] Composite Backdoor Attacks Against Large Language Models
[2023/09] BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models
[2023/09] BadEdit: Backdooring Large Language Models by Model Editing
[2023/09] Universal Jailbreak Backdoors From Poisoned Human Feedback
[2023/08] LMSanitator: Defending Prompt-Tuning Against Task-Agnostic Backdoors
[2023/08] The Poison of Alignment
[2023/07] Backdooring Instruction-Tuned Large Language Models With Virtual Prompt Injection
[2023/06] On the Exploitability of Instruction Tuning
[2023/05] Instructions as Backdoors: Backdoor Vulnerabilities of Instruction Tuning for Large Language Models
[2023/05] Poisoning Language Models During Instruction Tuning
[2022/11] Rickrolling the Artist: Injecting Backdoors Into Text Encoders for Text-to-Image Synthesis

C. Privacy

C1. Contamination

[2024/02] Leak, Cheat, Repeat: Data Contamination and Evaluation Malpractices in Closed-Source LLMs
[2023/09] Proving Test Set Contamination for Black-Box Language Models
[2023/09] Time Travel in LLMs: Tracing Data Contamination in Large Language Models
[2023/09] To the Cutoff... And Beyond? A Longitudinal Perspective on LLM Data Contamination
[2023/09] DyVal: Graph-Informed Dynamic Evaluation of Large Language Models

C2. Copyright

[2024/02] Generative Models Are Self-Watermarked: Declaring Model Authentication Through Re-Generation
[2024/02] Attacking LLM Watermarks by Exploiting Their Strengths
[2024/02] Double-I Watermark: Protecting Model Copyright for LLM Fine-Tuning
[2024/02] Watermarking Makes Language Models Radioactive
[2024/02] A First Look at GPT Apps: Landscape and Vulnerability
[2024/02] Can Watermarks Survive Translation? On the Cross-Lingual Consistency of Text Watermark for Large Language Models
[2024/02] Proving Membership in LLM Pretraining Data via Data Watermarks
[2024/02] Resilient Watermarking for LLM-Generated Codes
[2024/02] Permute-and-Flip: An Optimally Robust and Watermarkable Decoder for LLMs
[2024/01] Adaptive Text Watermark for Large Language Models
[2024/01] Instructional Fingerprinting of Large Language Models
[2024/01] Generative AI Has a Visual Plagiarism Problem
[2023/12] Human-Readable Fingerprint for Large Language Models
[2023/12] Mark My Words: Analyzing and Evaluating Language Model Watermarks
[2023/11] A Robust Semantics-Based Watermark for Large Language Model Against Paraphrasing
[2023/11] Protecting Intellectual Property of Large Language Model-Based Code Generation APIs via Watermarks
[2023/09] A Private Watermark for Large Language Models
[2023/09] A Semantic Invariant Robust Watermark for Large Language Models
[2023/09] Provable Robust Watermarking for AI-Generated Text
[2023/09] SILO Language Models: Isolating Legal Risk in a Nonparametric Datastore
[2023/08] PromptCARE: Prompt Copyright Protection by Watermark Injection and Verification
[2023/06] Generative Watermarking Against Unauthorized Subject-Driven Image Synthesis
[2023/05] Tree-Ring Watermarks: Fingerprints for Diffusion Images That Are Invisible and Robust
[2023/05] Watermarking Diffusion Model
[2023/03] A Recipe for Watermarking Diffusion Models
[2023/02] Glaze: Protecting Artists From Style Mimicry by Text-to-Image Models
[2023/01] A Watermark for Large Language Models

C3. Data Reconstruction

[2024/02] Conversation Reconstruction Attack Against GPT Models
[2024/01] Text Embedding Inversion Attacks on Multilingual Language Models
[2023/11] Scalable Extraction of Training Data From (Production) Language Models
[2023/09] Intriguing Properties of Data Attribution on Diffusion Models
[2023/09] Teach LLMs to Phish: Stealing Private Information From Language Models
[2023/09] Language Model Inversion
[2023/07] Prompts Should Not Be Seen as Secrets: Systematically Measuring Prompt Extraction Attack Success
[2023/02] Prompt Stealing Attacks Against Text-to-Image Generation Models
[2023/01] Extracting Training Data From Diffusion Models
[2020/12] Extracting Training Data From Large Language Models

C4. Extraction

[2024/02] Prompt Stealing Attacks Against Large Language Models
[2024/02] Recovering the Pre-Fine-Tuning Weights of Generative Models
[2023/03] On Extracting Specialized Code Abilities From Large Language Models: A Feasibility Study
[2023/03] Stealing the Decoding Algorithms of Language Models

C5. Inference

[2024/02] The Good and the Bad: Exploring Privacy Issues in Retrieval-Augmented Generation (RAG)
[2024/02] Pandora's White-Box: Increased Training Data Leakage in Open LLMs
[2024/02] Large Language Models Are Advanced Anonymizers
[2024/02] Privacy-Preserving Language Model Inference With Instance Obfuscation
[2024/02] Do Membership Inference Attacks Work on Large Language Models?
[2024/02] PromptCrypt: Prompt Encryption for Secure Communication With Large Language Models
[2024/01] Excuse Me, Sir? Your Language Model Is Leaking (Information)
[2024/01] R-Judge: Benchmarking Safety Risk Awareness for LLM Agents
[2023/12] Black-Box Membership Inference Attacks Against Fine-Tuned Diffusion Models
[2023/11] Practical Membership Inference Attacks Against Fine-Tuned Large Language Models via Self-Prompt Calibration
[2023/10] User Inference Attacks on Large Language Models
[2023/10] Last One Standing: A Comparative Analysis of Security and Privacy of Soft Prompt Tuning, LoRA, and in-Context Learning
[2023/09] An Efficient Membership Inference Attack for the Diffusion Model by Proximal Initialization
[2023/09] Beyond Memorization: Violating Privacy via Inference With Large Language Models
[2023/09] Can LLMs Keep a Secret? Testing Privacy Implications of Language Models via Contextual Integrity Theory
[2023/09] Identifying the Risks of LM Agents With an LM-Emulated Sandbox
[2023/09] Privacy Side Channels in Machine Learning Systems
[2023/08] White-Box Membership Inference Attacks Against Diffusion Models
[2023/07] ProPILE: Probing Privacy Leakage in Large Language Models
[2023/03] Class Attribute Inference Attacks: Inferring Sensitive Class Information by Diffusion-Based Attribute Manipulations
[2022/10] Membership Inference Attacks Against Text-to-Image Generation Models

C6. Privacy-Preserving Computation

[2024/02] LLM-based Privacy Data Augmentation Guided by Knowledge Distillation With a Distribution Tutor for Medical Text Classification
[2023/10] Locally Differentially Private Document Generation Using Zero Shot Prompting
[2023/09] Differentially Private Synthetic Data via Foundation Model APIs 1: Images
[2023/09] DP-OPT: Make Large Language Model Your Differentially-Private Prompt Engineer
[2023/09] Enhancing Small Medical Learners With Privacy-Preserving Contextual Prompting
[2023/09] Improving LoRA in Privacy-Preserving Federated Learning
[2023/09] Privacy-Preserving in-Context Learning for Large Language Models
[2023/09] Privacy-Preserving in-Context Learning With Differentially Private Few-Shot Generation
[2023/09] Privately Aligning Language Models With Reinforcement Learning
[2023/09] DP-Forward: Fine-Tuning and Inference on Language Models With Differential Privacy in Forward Pass
[2023/08] SIGMA: Secure GPT Inference With Function Secret Sharing
[2023/07] CipherGPT: Secure Two-Party GPT Inference
[2023/05] Privacy-Preserving Prompt Tuning for Large Language Model Services
[2023/05] Privacy-Preserving Recommender Systems With Synthetic Query Generation Using Differentially Private Large Language Models
[2022/10] EW-Tune: A Framework for Privately Fine-Tuning Large Language Models With Differential Privacy

C7. Unlearning

[2024/02] Machine Unlearning of Pre-Trained Large Language Models
[2024/02] Rethinking Machine Unlearning for Large Language Models
[2024/02] Selective Forgetting: Advancing Machine Unlearning Techniques and Evaluation in Language Models
[2024/02] In-Context Learning Can Re-Learn Forbidden Tasks
[2024/02] Machine Unlearning for Image-to-Image Generative Models
[2024/01] TOFU: A Task of Fictitious Unlearning for LLMs
[2023/10] In-Context Unlearning: Language Models as Few Shot Unlearners
[2023/10] Large Language Model Unlearning
[2023/10] Unlearn What You Want to Forget: Efficient Unlearning for LLMs
[2023/10] Who's Harry Potter? Approximate Unlearning in LLMs
[2023/09] Can Sensitive Information Be Deleted From LLMs? Objectives for Defending Against Extraction Attacks
[2023/09] Detecting Pretraining Data From Large Language Models
[2023/09] Ring-a-Bell! How Reliable Are Concept Removal Methods for Diffusion Models?
[2023/09] SalUn: Empowering Machine Unlearning via Gradient-Based Weight Saliency in Both Image Classification and Generation
[2023/07] Right to Be Forgotten in the Era of Large Language Models: Implications, Challenges, and Solutions
[2023/03] Erasing Concepts From Diffusion Models

Star History

Acknowledgement

Organizers: Tianshuo Cong, Xinlei He, Zhengyu Zhao, Yugeng Liu
This project is inspired by LLM Security, Awesome LLM Security, LLM Security & Privacy, UR2-LLMs, PLMpapers, EvaluationPapers4ChatGPT

Name		Name	Last commit message	Last commit date
Latest commit History 200 Commits
figure		figure
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Introduction

News

Book

Competition

Leaderboard

Toolkit

Survey

Paper

A. Safety

A1. Jailbreak

A2. Alignment

A3. Deepfake

A4. Ethics

A5. Fairness

A6. Hallucination

A7. Toxicity

B. Security

B1. Adversarial Examples

B2. Poisoning

C. Privacy

C1. Contamination

C2. Copyright

C3. Data Reconstruction

C4. Extraction

C5. Inference

C6. Privacy-Preserving Computation

C7. Unlearning

Star History

Acknowledgement

About

Releases

Packages

License

daoyuan14/lm-ssp

Folders and files

Latest commit

History

Repository files navigation

Introduction

News

Book

Competition

Leaderboard

Toolkit

Survey

Paper

A. Safety

A1. Jailbreak

A2. Alignment

A3. Deepfake

A4. Ethics

A5. Fairness

A6. Hallucination

A7. Toxicity

B. Security

B1. Adversarial Examples

B2. Poisoning

C. Privacy

C1. Contamination

C2. Copyright

C3. Data Reconstruction

C4. Extraction

C5. Inference

C6. Privacy-Preserving Computation

C7. Unlearning

Star History

Acknowledgement

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Packages