- [2024/03] Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes
- [2024/02] Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction
- [2024/02] DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers
- [2024/02] GUARD: Role-playing to Generate Natural-language Jailbreakings to Test Guideline Adherence of Large Language Models
- [2024/02] CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models
- [2024/02] PRP: Propagating Universal Perturbations to Attack Large Language Model Guard-Rails
- [2024/02] Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing
- [2024/02] LLMs Can Defend Themselves Against Jailbreaking in a Practical Manner: A Vision Paper
- [2024/02] From Noise to Clarity: Unraveling the Adversarial Suffix of Large Language Model Attacks via Translation of Text Embeddings
- [2024/02] Semantic Mirror Jailbreak: Genetic Algorithm Based Jailbreak Prompts Against Open-source LLMs
- [2024/02] Is the System Message Really Important to Jailbreaks in Large Language Models?
- [2024/02] Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement
- [2024/02] How (un)ethical are instruction-centric responses of LLMs? Unveiling the vulnerabilities of safety guardrails to harmful queries
- [2024/02] Mitigating Fine-tuning Jailbreak Attack with Backdoor Enhanced Alignment
- [2024/02] LLM Jailbreak Attack versus Defense Techniques -- A Comprehensive Study
- [2024/02] Coercing LLMs to do and reveal (almost) anything
- [2024/02] GradSafe: Detecting Unsafe Prompts for LLMs via Safety-Critical Gradient Analysis
- [2024/02] Query-Based Adversarial Prompt Generation
- [2024/02] ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs
- [2024/02] SPML: A DSL for Defending Language Models Against Prompt Attacks
- [2024/02] A StrongREJECT for Empty Jailbreaks
- [2024/02] Jailbreaking Proprietary Large Language Models using Word Substitution Cipher
- [2024/02] ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages
- [2024/02] PAL: Proxy-Guided Black-Box Attack on Large Language Models
- [2024/02] Attacking Large Language Models with Projected Gradient Descent
- [2024/02] SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding
- [2024/02] Play Guessing Game with LLM: Indirect Jailbreak Attack with Implicit Clues
- [2024/02] COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability
- [2024/02] Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast
- [2024/02] Pandora: Jailbreak GPTs by Retrieval Augmented Generation Poisoning
- [2024/02] Comprehensive Assessment of Jailbreak Attacks Against LLMs
- [2024/02] Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models
- [2024/02] HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
- [2024/02] Jailbreaking Attack against Multimodal Large Language Model
- [2024/02] Prompt-Driven LLM Safeguarding via Directed Representation Optimization
- [2024/01] A Cross-Language Investigation into Jailbreak Attacks in Large Language Models
- [2024/01] Weak-to-Strong Jailbreaking on Large Language Models
- [2024/01] Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks
- [2024/01] Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts
- [2024/01] PsySafe: A Comprehensive Framework for Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety
- [2024/01] Sowing the Wind, Reaping the Whirlwind: The Impact of Editing Language Models
- [2024/01] Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning
- [2024/01] All in How You Ask for It: Simple Black-Box Method for Jailbreak Attacks
- [2024/01] AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models
- [2024/01] Intention Analysis Prompting Makes Large Language Models A Good Jailbreak Defender
- [2024/01] How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs
- [2023/12] A Comprehensive Survey of Attack Techniques, Implementation, and Mitigation Strategies in Large Language Models
- [2023/12] Analyzing the Inherent Response Tendency of LLMs: Real-World Instructions-Driven Jailbreak
- [2023/12] Goal-Oriented Prompt Attack and Safety Evaluation for LLMs
- [2023/12] Tree of Attacks: Jailbreaking Black-Box LLMs Automatically
- [2023/12] Safety Alignment in NLP Tasks: Weakly Aligned Summarization as an In-Context Attack
- [2023/12] A Mutation-Based Method for Multi-Modal Jailbreaking Attack Detection
- [2023/12] Adversarial Attacks on GPT-4 via Simple Random Search
- [2023/12] Make Them Spill the Beans! Coercive Knowledge Extraction from (Production) LLMs
- [2023/11] Query-Relevant Images Jailbreak Large Multi-Modal Models
- [2023/11] A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts Can Fool Large Language Models Easily
- [2023/11] Exploiting Large Language Models (LLMs) through Deception Techniques and Persuasion Principles
- [2023/11] Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks
- [2023/11] MART: Improving LLM Safety with Multi-round Automatic Red-Teaming
- [2023/11] Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation
- [2023/11] SneakyPrompt: Jailbreaking Text-to-image Generative Models
- [2023/11] DeepInception: Hypnotize Large Language Model to Be Jailbreaker
- [2023/11] Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs through a Global Scale Prompt Hacking Competition
- [2023/11] Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming in the Wild
- [2023/11] Evil Geniuses: Delving into the Safety of LLM-based Agents
- [2023/11] FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts
- [2023/10] Attack Prompt Generation for Red Teaming and Defending Large Language Models
- [2023/10] Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attack
- [2023/10] Low-Resource Languages Jailbreak GPT-4
- [2023/10] SC-Safety: A Multi-round Open-ended Question Adversarial Safety Benchmark for Large Language Models in Chinese
- [2023/10] SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
- [2023/10] Adversarial Attacks on LLMs
- [2023/10] AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models
- [2023/10] Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations
- [2023/10] Jailbreaking Black Box Large Language Models in Twenty Queries
- [2023/09] Baseline Defenses for Adversarial Attacks Against Aligned Language Models
- [2023/09] Certifying LLM Safety against Adversarial Prompting
- [2023/09] SurrogatePrompt: Bypassing the Safety Filter of Text-To-Image Models via Substitution
- [2023/09] Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation
- [2023/09] AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models
- [2023/09] GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher
- [2023/09] Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models
- [2023/09] Multilingual Jailbreak Challenges in Large Language Models
- [2023/09] On the Humanity of Conversational AI: Evaluating the Psychological Portrayal of LLMs
- [2023/09] RAIN: Your Language Models Can Align Themselves without Finetuning
- [2023/09] Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions
- [2023/09] Understanding Hidden Context in Preference Learning: Consequences for RLHF
- [2023/09] Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM
- [2023/09] FuzzLLM: A Novel and Universal Fuzzing Framework for Proactively Discovering Jailbreak Vulnerabilities in Large Language Models
- [2023/09] GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
- [2023/09] Open Sesame! Universal Black Box Jailbreaking of Large Language Models
- [2023/08] Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment
- [2023/08] XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models
- [2023/08] “Do Anything Now”: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models
- [2023/08] Detecting Language Model Attacks with Perplexity
- [2023/07] From ChatGPT to ThreatGPT: Impact of Generative AI in Cybersecurity and Privacy
- [2023/07] LLM Censorship: A Machine Learning Challenge Or A Computer Security Problem?
- [2023/07] Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models
- [2023/07] Jailbroken: How Does LLM Safety Training Fail?
- [2023/07] MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots
- [2023/07] Universal and Transferable Adversarial Attacks on Aligned Language Models
- [2023/06] Visual Adversarial Examples Jailbreak Aligned Large Language Models
- [2023/05] Adversarial Demonstration Attacks on Large Language Models
- [2023/05] Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study
- [2023/05] Tricking LLMs into Disobedience: Understanding, Analyzing, and Preventing Jailbreaks
- [2023/04] Multi-step Jailbreaking Privacy Attacks on ChatGPT
- [2023/03] Automatically Auditing Large Language Models via Discrete Optimization