LM-SSP

A collection of resources on the safety, security, and privacy (SSP) of large models (LMs). Here, LM covers large language models (LLMs), large vision-language models (LVLMs), diffusion models, and so on.

  • This repo is a work in progress 🔥 (resources are currently collected manually).

  • We welcome recommendations of new resources (via Issue / Pull request / Email / ...)!

Books

  • [2024/01] NIST Trustworthy and Responsible AI Reports

Papers

A. Safety

A1. Jailbreak

  • [2024/01] MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance
  • [2023/12] Adversarial Attacks on GPT-4 via Simple Random Search
  • [2023/11] FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts
  • [2023/10] Adversarial Attacks on LLMs
  • [2023/10] AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models
  • [2023/10] AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models
  • [2023/10] Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation
  • [2023/10] GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
  • [2023/10] Jailbreaking Black Box Large Language Models in Twenty Queries
  • [2023/09] Open Sesame! Universal Black Box Jailbreaking of Large Language Models
  • [2023/07] MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots
  • [2023/07] Jailbroken: How Does LLM Safety Training Fail?
  • [2023/07] Universal and Transferable Adversarial Attacks on Aligned Language Models

A2. Safety Alignment

  • [2024/01] A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity
  • [2023/12] Exploiting Novel GPT-4 APIs
  • [2023/10] Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
  • [2023/10] Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models
  • [2023/10] UltraFeedback: Boosting Language Models with High-quality Feedback

A3. Toxicity

  • [2023/08] You Only Prompt Once: On the Capabilities of Prompt Learning on Large Language Models to Tackle Toxic Content
  • [2023/05] Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes From Text-To-Image Models

A4. Deepfake

  • [2023/03] MGTBench: Benchmarking Machine-Generated Text Detection
  • [2022/10] DE-FAKE: Detection and Attribution of Fake Images Generated by Text-to-Image Generation Models

A5. Agent

  • [2023/11] Evil Geniuses: Delving into the Safety of LLM-based Agents

B. Security

B1. Adversarial Attacks

  • [2024/01] INSTRUCTTA: Instruction-Tuned Targeted Attack for Large Vision-Language Models
  • [2023/08] Robustness Over Time: Understanding Adversarial Examples' Effectiveness on Longitudinal Versions of Large Language Models
  • [2023/06] PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts
  • [2022/05] Diffusion Models for Adversarial Purification

B2. Code Generation

  • [2023/02] Large Language Models for Code: Security Hardening and Adversarial Testing
  • [2022/11] Do Users Write More Insecure Code with AI Assistants?

B3. Backdoor/Poisoning

  • [2023/08] LMSanitator: Defending Prompt-Tuning Against Task-Agnostic Backdoors
  • [2023/05] Instructions as Backdoors: Backdoor Vulnerabilities of Instruction Tuning for Large Language Models

C. Privacy

C1. Data Reconstruction

  • [2023/11] Scalable Extraction of Training Data from (Production) Language Models
  • [2023/01] Extracting Training Data from Diffusion Models
  • [2020/12] Extracting Training Data from Large Language Models

C2. Membership Inference

  • Coming soon!

C3. Property Inference

  • [2023/10] Beyond Memorization: Violating Privacy Via Inference with Large Language Models

C4. Model Extraction

  • [2023/03] Stealing the Decoding Algorithms of Language Models

C5. Unlearning

  • [2023/10] Unlearn What You Want to Forget: Efficient Unlearning for LLMs
  • [2023/10] Who's Harry Potter? Approximate Unlearning in LLMs
  • [2023/03] Erasing Concepts from Diffusion Models

C6. Copyright

  • [2024/01] Generative AI Has a Visual Plagiarism Problem
  • [2023/11] Protecting Intellectual Property of Large Language Model-Based Code Generation APIs via Watermarks
