- [2024/03] Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes
- [2024/02] Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction
- [2024/02] DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers
- [2024/02] GUARD: Role-playing to Generate Natural-language Jailbreakings to Test Guideline Adherence of Large Language Models
- [2024/02] CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models
- [2024/02] PRP: Propagating Universal Perturbations to Attack Large Language Model Guard-Rails
- [2024/02] Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing
- [2024/02] LLMs Can Defend Themselves Against Jailbreaking in a Practical Manner: A Vision Paper
- [2024/02] From Noise to Clarity: Unraveling the Adversarial Suffix of Large Language Model Attacks via Translation of Text Embeddings
- [2024/02] Semantic Mirror Jailbreak: Genetic Algorithm Based Jailbreak Prompts Against Open-source LLMs
- [2024/02] Is the System Message Really Important to Jailbreaks in Large Language Models?
- [2024/02] Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement
- [2024/02] How (un)ethical are instruction-centric responses of LLMs? Unveiling the vulnerabilities of safety guardrails to harmful queries
- [2024/02] Mitigating Fine-tuning Jailbreak Attack with Backdoor Enhanced Alignment
- [2024/02] LLM Jailbreak Attack versus Defense Techniques -- A Comprehensive Study
- [2024/02] Coercing LLMs to do and reveal (almost) anything
- [2024/02] GradSafe: Detecting Unsafe Prompts for LLMs via Safety-Critical Gradient Analysis
- [2024/02] Query-Based Adversarial Prompt Generation
- [2024/02] ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs
- [2024/02] SPML: A DSL for Defending Language Models Against Prompt Attacks
- [2024/02] A StrongREJECT for Empty Jailbreaks
- [2024/02] Jailbreaking Proprietary Large Language Models using Word Substitution Cipher
- [2024/02] ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages
- [2024/02] PAL: Proxy-Guided Black-Box Attack on Large Language Models
- [2024/02] Attacking Large Language Models with Projected Gradient Descent
- [2024/02] SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding
- [2024/02] Play Guessing Game with LLM: Indirect Jailbreak Attack with Implicit Clues
- [2024/02] COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability
- [2024/02] Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast
- [2024/02] Pandora: Jailbreak GPTs by Retrieval Augmented Generation Poisoning
- [2024/02] Comprehensive Assessment of Jailbreak Attacks Against LLMs
- [2024/02] Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models
- [2024/02] HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
- [2024/02] Jailbreaking Attack against Multimodal Large Language Model
- [2024/02] Prompt-Driven LLM Safeguarding via Directed Representation Optimization
- [2024/01] A Cross-Language Investigation into Jailbreak Attacks in Large Language Models
- [2024/01] Weak-to-Strong Jailbreaking on Large Language Models
- [2024/01] Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks
- [2024/01] Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts
- [2024/01] PsySafe: A Comprehensive Framework for Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety
- [2024/01] Sowing the Wind, Reaping the Whirlwind: The Impact of Editing Language Models
- [2024/01] Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning
- [2024/01] All in How You Ask for It: Simple Black-Box Method for Jailbreak Attacks
- [2024/01] AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models
- [2024/01] Intention Analysis Prompting Makes Large Language Models A Good Jailbreak Defender
- [2024/01] How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs
- [2023/12] A Comprehensive Survey of Attack Techniques, Implementation, and Mitigation Strategies in Large Language Models
- [2023/12] Analyzing the Inherent Response Tendency of LLMs: Real-World Instructions-Driven Jailbreak
- [2023/12] Goal-Oriented Prompt Attack and Safety Evaluation for LLMs
- [2023/12] Tree of Attacks: Jailbreaking Black-Box LLMs Automatically
- [2023/12] Safety Alignment in NLP Tasks: Weakly Aligned Summarization as an In-Context Attack
- [2023/12] A Mutation-Based Method for Multi-Modal Jailbreaking Attack Detection
- [2023/12] Adversarial Attacks on GPT-4 via Simple Random Search
- [2023/12] Make Them Spill the Beans! Coercive Knowledge Extraction from (Production) LLMs
- [2023/11] Query-Relevant Images Jailbreak Large Multi-Modal Models
- [2023/11] A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts Can Fool Large Language Models Easily
- [2023/11] Exploiting Large Language Models (LLMs) through Deception Techniques and Persuasion Principles
- [2023/11] Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks
- [2023/11] MART: Improving LLM Safety with Multi-round Automatic Red-Teaming
- [2023/11] Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation
- [2023/11] SneakyPrompt: Jailbreaking Text-to-image Generative Models
- [2023/11] DeepInception: Hypnotize Large Language Model to Be Jailbreaker
- [2023/11] Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs through a Global Scale Prompt Hacking Competition
- [2023/11] Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming in the Wild
- [2023/11] Evil Geniuses: Delving into the Safety of LLM-based Agents
- [2023/11] FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts
- [2023/10] Attack Prompt Generation for Red Teaming and Defending Large Language Models
- [2023/10] Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attack
- [2023/10] Low-Resource Languages Jailbreak GPT-4
- [2023/10] SC-Safety: A Multi-round Open-ended Question Adversarial Safety Benchmark for Large Language Models in Chinese
- [2023/10] SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
- [2023/10] Adversarial Attacks on LLMs
- [2023/10] AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models
- [2023/10] Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations
- [2023/10] Jailbreaking Black Box Large Language Models in Twenty Queries
- [2023/09] Baseline Defenses for Adversarial Attacks Against Aligned Language Models
- [2023/09] Certifying LLM Safety against Adversarial Prompting
- [2023/09] SurrogatePrompt: Bypassing the Safety Filter of Text-To-Image Models via Substitution
- [2023/09] Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation
- [2023/09] AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models
- [2023/09] GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher
- [2023/09] Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models
- [2023/09] Multilingual Jailbreak Challenges in Large Language Models
- [2023/09] On the Humanity of Conversational AI: Evaluating the Psychological Portrayal of LLMs
- [2023/09] RAIN: Your Language Models Can Align Themselves without Finetuning
- [2023/09] Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions
- [2023/09] Understanding Hidden Context in Preference Learning: Consequences for RLHF
- [2023/09] Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM
- [2023/09] FuzzLLM: A Novel and Universal Fuzzing Framework for Proactively Discovering Jailbreak Vulnerabilities in Large Language Models
- [2023/09] GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
- [2023/09] Open Sesame! Universal Black Box Jailbreaking of Large Language Models
- [2023/08] Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment
- [2023/08] XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models
- [2023/08] “Do Anything Now”: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models
- [2023/08] Detecting Language Model Attacks with Perplexity
- [2023/07] From ChatGPT to ThreatGPT: Impact of Generative AI in Cybersecurity and Privacy
- [2023/07] LLM Censorship: A Machine Learning Challenge Or A Computer Security Problem?
- [2023/07] Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models
- [2023/07] Jailbroken: How Does LLM Safety Training Fail?
- [2023/07] MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots
- [2023/07] Universal and Transferable Adversarial Attacks on Aligned Language Models
- [2023/06] Visual Adversarial Examples Jailbreak Aligned Large Language Models
- [2023/05] Adversarial Demonstration Attacks on Large Language Models
- [2023/05] Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study
- [2023/05] Tricking LLMs into Disobedience: Understanding, Analyzing, and Preventing Jailbreaks
- [2023/04] Multi-step Jailbreaking Privacy Attacks on ChatGPT
- [2023/03] Automatically Auditing Large Language Models via Discrete Optimization