Resources related to the safety, security, and privacy (SSP) of large models (LMs). Here, LMs include large language models (LLMs), large vision-language models (LVLMs), diffusion models, and so on.
- This repo is in progress 🔥 (currently manually collected).
- Welcome to recommend resources to us (via Issue/Pull request/Email/...)!
- [2024/01] MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance
- [2023/12] Adversarial Attacks on GPT-4 via Simple Random Search
- [2023/11] FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts
- [2023/10] Adversarial Attacks on LLMs
- [2023/10] AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models
- [2023/10] AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models
- [2023/10] Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation
- [2023/10] GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
- [2023/10] Jailbreaking Black Box Large Language Models in Twenty Queries
- [2023/09] Open Sesame! Universal Black Box Jailbreaking of Large Language Models
- [2023/07] MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots
- [2023/07] Jailbroken: How Does LLM Safety Training Fail?
- [2023/07] Universal and Transferable Adversarial Attacks on Aligned Language Models
- [2024/01] A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity
- [2023/12] Exploiting Novel GPT-4 APIs
- [2023/10] Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
- [2023/10] Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models
- [2023/10] UltraFeedback: Boosting Language Models with High-quality Feedback
- [2023/08] You Only Prompt Once: On the Capabilities of Prompt Learning on Large Language Models to Tackle Toxic Content
- [2023/05] Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes From Text-To-Image Models
- [2023/03] MGTBench: Benchmarking Machine-Generated Text Detection
- [2022/10] DE-FAKE: Detection and Attribution of Fake Images Generated by Text-to-Image Generation Models
- [2024/01] InstructTA: Instruction-Tuned Targeted Attack for Large Vision-Language Models
- [2023/08] Robustness Over Time: Understanding Adversarial Examples' Effectiveness on Longitudinal Versions of Large Language Models
- [2023/06] PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts
- [2022/05] Diffusion Models for Adversarial Purification
- [2023/02] Large Language Models for Code: Security Hardening and Adversarial Testing
- [2022/11] Do Users Write More Insecure Code with AI Assistants?
- [2023/05] Instructions as Backdoors: Backdoor Vulnerabilities of Instruction Tuning for Large Language Models
- [2023/08] LMSanitator: Defending Prompt-Tuning Against Task-Agnostic Backdoors
- [2023/11] Scalable Extraction of Training Data from (Production) Language Models
- [2023/01] Extracting Training Data from Diffusion Models
- [2020/12] Extracting Training Data from Large Language Models
- Coming soon!