- Preface
- MM-LLM
- VLM-Defense
- Removing NSFW Concepts from Vision-and-Language Models for Text-to-Image Retrieval and Generation
- Safety Alignment for Vision Language Models
- AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting
- Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation
- MLLM-Protector: Ensuring MLLM’s Safety without Hurting Performance
- Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models
- SAFEGEN: Mitigating Unsafe Content Generation in Text-to-Image Models
- Moderating Illicit Online Image Promotion for Unsafe User-Generated Content Games Using Large Vision-Language Models
- Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding
- UNDERSTANDING ZERO-SHOT ADVERSARIAL ROBUSTNESS FOR LARGE-SCALE MODELS
- A Mutation-Based Method for Multi-Modal Jailbreaking Attack Detection
- CleanCLIP: Mitigating Data Poisoning Attacks in Multimodal Contrastive Learning
- Image Safeguarding: Reasoning with Conditional Vision Language Model and Obfuscating Unsafe Content
- Towards Safe Self-Distillation of Internet-Scale Text-to-Image Diffusion Models
- Typographic Attacks in Large Multimodal Models Can be Alleviated by More Informative Prompts
- On the Robustness of Large Multimodal Models Against Image Adversarial Attacks
- Partially Recentralization Softmax Loss for Vision-Language Models Robustness
- Adversarial Prompt Tuning for Vision-Language Models
- Defense-Prefix for Preventing Typographic Attacks on CLIP
- RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
- How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts
- EFUF: Efficient Fine-grained Unlearning Framework for Mitigating Hallucinations in Multimodal Large Language Models
- Aligning Modalities in Vision Large Language Models via Preference Fine-tuning
- Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models
- Machine Vision Therapy: Multimodal Large Language Models Can Enhance Visual Robustness via Denoising In-Context Learning
- Robust Contrastive Language-Image Pre-training against Data Poisoning and Backdoor Attacks
- HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data
- VLM
- VLM-Attack
- Circumventing Concept Erasure Methods For Text-to-Image Generative Models
- Efficient LLM-Jailbreaking by Introducing Visual Modality
- From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking
- Adversarial Attacks on Multimodal Agents
- Visual-RolePlay: Universal Jailbreak Attack on MultiModal Large Language Models via Role-playing Image Character
- Cross-Modality Jailbreak and Mismatched Attacks on Medical Multimodal Large Language Models
- Typography Leads Semantic Diversifying: Amplifying Adversarial Transferability across Multimodal Large Language Models
- White-box Multimodal Jailbreaks Against Large Vision-Language Models
- Red Teaming Visual Language Models
- Private Attribute Inference from Images with Vision-Language Models
- Assessment of Multimodal Large Language Models in Alignment with Human Values
- Privacy-Aware Visual Language Models
- Learning To See But Forgetting To Follow: Visual Instruction Tuning Makes LLMs More Prone To Jailbreak Attacks
- Vision-LLMs Can Fool Themselves with Self-Generated Typographic Attacks
- Adversarial Illusions in Multi-Modal Embeddings
- Universal Prompt Optimizer for Safe Text-to-Image Generation
- On the Proactive Generation of Unsafe Images From Text-To-Image Models Using Benign Prompts
- Stop Reasoning! When Multimodal LLMs with Chain-of-Thought Reasoning Meets Adversarial Images
- INSTRUCTTA: Instruction-Tuned Targeted Attack for Large Vision-Language Models
- On the Robustness of Large Multimodal Models Against Image Adversarial Attacks
- Hijacking Context in Large Multi-modal Models
- Transferable Multimodal Attack on Vision-Language Pre-training Models
- Images are Achilles’ Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models
- AN IMAGE IS WORTH 1000 LIES: ADVERSARIAL TRANSFERABILITY ACROSS PROMPTS ON VISION-LANGUAGE MODELS
- Test-Time Backdoor Attacks on Multimodal Large Language Models
- JAILBREAK IN PIECES: COMPOSITIONAL ADVERSARIAL ATTACKS ON MULTI-MODAL LANGUAGE MODELS
- Jailbreaking Attack against Multimodal Large Language Model
- Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts
- IMAGE HIJACKS: ADVERSARIAL IMAGES CAN CONTROL GENERATIVE MODELS AT RUNTIME
- VISUAL ADVERSARIAL EXAMPLES JAILBREAK ALIGNED LARGE LANGUAGE MODELS
- Query-Relevant Images Jailbreak Large Multi-Modal Models
- Towards Adversarial Attack on Vision-Language Pre-training Models
- How Many Are Unicorns in This Image? A Safety Evaluation Benchmark for Vision LLMs
- SA-Attack: Improving Adversarial Transferability of Vision-Language Pre-training Models via Self-Augment
- MISUSING TOOLS IN LARGE LANGUAGE MODELS WITH VISUAL ADVERSARIAL EXAMPLES
- VL-Trojan: Multimodal Instruction Backdoor Attacks against Autoregressive Visual Language Models
- Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Models
- Shadowcast: STEALTHY DATA POISONING ATTACKS AGAINST VISION-LANGUAGE MODELS
- FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts
- THE WOLF WITHIN: COVERT INJECTION OF MALICE INTO MLLM SOCIETIES VIA AN MLLM OPERATIVE
- Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast
- How Robust is Google’s Bard to Adversarial Image Attacks?
- On Evaluating Adversarial Robustness of Large Vision-Language Models
- On the Adversarial Robustness of Multi-Modal Foundation Models
- Are aligned neural networks adversarially aligned?
- READING ISN’T BELIEVING: ADVERSARIAL ATTACKS ON MULTI-MODAL NEURONS
- Black Box Adversarial Prompting for Foundation Models
- Evaluation and Analysis of Hallucination in Large Vision-Language Models
- FOOL YOUR (VISION AND) LANGUAGE MODEL WITH EMBARRASSINGLY SIMPLE PERMUTATIONS
- BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive Learning
- AdvCLIP: Downstream-agnostic Adversarial Examples in Multimodal Contrastive Learning
- T2I-Attack
- On Copyright Risks of Text-to-Image Diffusion Models
- ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users
- On the Proactive Generation of Unsafe Images From Text-To-Image Models Using Benign Prompts
- Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts
- SneakyPrompt: Jailbreaking Text-to-image Generative Models
- The Stronger the Diffusion Model, the Easier the Backdoor: Data Poisoning to Induce Copyright Breaches Without Adjusting Finetuning Pipeline
- Discovering Universal Semantic Triggers for Text-to-Image Synthesis
- Automatic Jailbreaking of the Text-to-Image Generative AI Systems
- Survey
- Generative AI Security: Challenges and Countermeasures
- Beyond Boundaries: A Comprehensive Survey of Transferable Attacks on AI Systems
- Current state of LLM Risks and AI Guardrails
- Security of AI Agents
- Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents
- Exploring Vulnerabilities and Protections in Large Language Models: A Survey
- Unveiling Hallucination in Text, Image, Video, and Audio Foundation Models: A Comprehensive Survey
- Unbridled Icarus: A Survey of the Potential Perils of Image Inputs in Multimodal Large Language Model Security
- SafetyPrompts: a Systematic Review of Open Datasets for Evaluating and Improving Large Language Model Safety
- Safety of Multimodal Large Language Models on Images and Text
- LLM Jailbreak Attack versus Defense Techniques - A Comprehensive Study
- Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks
- A Survey on Safe Multi-Modal Learning System
- TRUSTWORTHY LARGE MODELS IN VISION: A SURVEY
- A Pathway Towards Responsible AI Generated Content
- A Survey of Hallucination in “Large” Foundation Models
- An Early Categorization of Prompt Injection Attacks on Large Language Models
- Comprehensive Assessment of Jailbreak Attacks Against LLMs
- A Comprehensive Overview of Backdoor Attacks in Large Language Models within Communication Networks
- Adversarial Machine Learning for Social Good: Reframing the Adversary as an Ally
- Red-Teaming for Generative AI: Silver Bullet or Security Theater?
- A STRONGREJECT for Empty Jailbreaks
- LVM-Attack
- For Good
- Benchmark
- HALLUSIONBENCH: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models
- OpenEval: Benchmarking Chinese LLMs across Capability, Alignment and Safety
- ToViLaG: Your Visual-Language Generative Model is Also An Evildoer
- HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
- S-Eval: Automatic and Adaptive Test Generation for Benchmarking Safety Evaluation of Large Language Models
- UnsafeBench: Benchmarking Image Safety Classifiers on Real-World and AI-Generated Images
- JailBreakV-28K: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks
- JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
- Constructing Benchmarks and Interventions for Combating Hallucinations in LLMs
- ALERT: A Comprehensive Benchmark for Assessing Large Language Models’ Safety through Red Teaming
- Benchmarking Llama2, Mistral, Gemma and GPT for Factuality, Toxicity, Bias and Propensity for Hallucinations
- INJECAGENT: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents
- AVIBench: Towards Evaluating the Robustness of Large Vision-Language Model on Adversarial Visual-Instructions
- ALL LANGUAGES MATTER: ON THE MULTILINGUAL SAFETY OF LARGE LANGUAGE MODELS
- Why Should Adversarial Perturbations be Imperceptible? Rethink the Research Paradigm in Adversarial NLP
- Red Teaming Visual Language Models
- Unified Hallucination Detection for Multimodal Large Language Models
- MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark
- Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
- CAN LANGUAGE MODELS BE INSTRUCTED TO PROTECT PERSONAL INFORMATION?
- Detecting and Preventing Hallucinations in Large Vision Language Models
- DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback
- SC-Safety: A Multi-round Open-ended Question Adversarial Safety Benchmark for Large Language Models
- PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts
- Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs
- Explainability
- Privacy-Defense
- Privacy-Attack
- PANDORA’S WHITE-BOX: INCREASED TRAINING DATA LEAKAGE IN OPEN LLMS
- Membership Inference Attacks against Large Language Models via Self-prompt Calibration
- LANGUAGE MODEL INVERSION
- Effective Prompt Extraction from Language Models
- Prompt Stealing Attacks Against Large Language Models
- Stealing Part of a Production Language Model
- Practical Membership Inference Attacks against Fine-tuned Large Language Models via Self-prompt Calibration
- PRSA: Prompt Reverse Stealing Attacks against Large Language Models
- Low-Resource Languages Jailbreak GPT-4
- Scalable Extraction of Training Data from (Production) Language Models
- Others
- INFERRING OFFENSIVENESS IN IMAGES FROM NATURAL LANGUAGE SUPERVISION
- An LLM-Assisted Easy-to-Trigger Backdoor Attack on Code Completion Models: Injecting Disguised Vulnerabilities against Strong Detection
- More RLHF, More Trust? On The Impact of Human Preference Alignment On Language Model Trustworthiness
- AI SAFETY: A CLIMB TO ARMAGEDDON?
- AI RISK MANAGEMENT SHOULD INCORPORATE BOTH SAFETY AND SECURITY
- Defending Against Social Engineering Attacks in the Age of LLMs
- Adversarial Perturbations Cannot Reliably Protect Artists From Generative AI
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
- Deduplicating Training Data Makes Language Models Better
- MITIGATING TEXT TOXICITY WITH COUNTERFACTUAL GENERATION
- The First to Know: How Token Distributions Reveal Hidden Knowledge in Large Vision-Language Models?
- Mitigating Hallucinations in Large Language Models via Self-Refinement-Enhanced Knowledge Retrieval
- Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?
- Mitigating LLM Hallucinations via Conformal Abstention
- Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback
- Can ChatGPT Detect DeepFakes? A Study of Using Multimodal Large Language Models for Media Forensics
- An Analysis of Recent Advances in Deepfake Image Detection in an Evolving Threat Landscape
- Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding
- LARGE LANGUAGE MODELS AS AUTOMATED ALIGNERS FOR BENCHMARKING VISION-LANGUAGE MODELS
- PoLLMgraph: Unraveling Hallucinations in Large Language Models via State Transition Dynamics
- Reducing hallucination in structured outputs via Retrieval-Augmented Generation
- Moderating Illicit Online Image Promotion for Unsafe User-Generated Content Games Using Large Vision-Language Models
- Attacking LLM Watermarks by Exploiting Their Strengths
- The Butterfly Effect of Altering Prompts: How Small Changes and Jailbreaks Affect Large Language Model Performance
- TOFU: A Task of Fictitious Unlearning for LLMs
- Learning and Forgetting Unsafe Examples in Large Language Models
- Exploring Adversarial Attacks against Latent Diffusion Model from the Perspective of Adversarial Transferability
- TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space
- In Search of Truth: An Interrogation Approach to Hallucination Detection
- Fact-Checking the Output of Large Language Models via Token-Level Uncertainty Quantification
- Unsupervised Real-Time Hallucination Detection based on the Internal States of Large Language Models
- Locating and Mitigating Gender Bias in Large Language Models
- Learning to Edit: Aligning LLMs with Knowledge Editing
- Quantitative Analysis of AI-Generated Texts in Academic Research: A Study of AI Presence in Arxiv Su
- Does DETECTGPT Fully Utilize Perturbation? Bridge Selective Perturbation to Fine-tuned Contrastive Learning Detector Can Go Further
- TELLER: A Trustworthy Framework for Explainable, Generalizable and Controllable Fake News Detection
- SPOTTING LLMS WITH BINOCULARS: ZERO-SHOT DETECTION OF MACHINE-GENERATED TEXT
- LLM-as-a-Coauthor: The Challenges of Detecting LLM-Human Mixcase
- WHAT’S IN MY BIG DATA?
- UNDERSTANDING CATASTROPHIC FORGETTING IN LANGUAGE MODELS VIA IMPLICIT INFERENCE
- Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes From Text-To-Image Models
- Toxicity in CHATGPT: Analyzing Persona-assigned Language Models
- MemeCraft: Contextual and Stance-Driven Multimodal Meme Generation
- Moderating New Waves of Online Hate with Chain-of-Thought Reasoning in Large Language Models
- Poisoned ChatGPT Finds Work for Idle Hands: Exploring Developers’ Coding Practices with Insecure Suggestions from Poisoned AI Models
- Zero shot VLMs for hate meme detection: Are we there yet?
- ANALYZING AND MITIGATING OBJECT HALLUCINATION IN LARGE VISION-LANGUAGE MODELS
- MITIGATING HALLUCINATION IN LARGE MULTIMODAL MODELS VIA ROBUST INSTRUCTION TUNING
- DENEVIL: TOWARDS DECIPHERING AND NAVIGATING THE ETHICAL VALUES OF LARGE LANGUAGE MODELS VIA INSTRUCTION LEARNING
- Disentangling Perceptions of Offensiveness: Cultural and Moral Correlates
- Red teaming ChatGPT via Jailbreaking: Bias, Robustness, Reliability and Toxicity
- Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection
- InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance
- CAN LANGUAGE MODELS BE INSTRUCTED TO PROTECT PERSONAL INFORMATION?
- AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications
- Prompt Injection Attacks and Defenses in LLM-Integrated Applications
- Removing RLHF Protections in GPT-4 via Fine-Tuning
- SPML: A DSL for Defending Language Models Against Prompt Attacks
- Stealthy Attack on Large Language Model based Recommendation
- Large Language Models Sometimes Generate Purely Negatively-Reinforced Text
- On the Robustness of ChatGPT: An Adversarial and Out-of-distribution Perspective
- Learning from data in the mixed adversarial non-adversarial case: Finding the helpers and ignoring the trolls
- longhorns at DADC 2022: How many linguists does it take to fool a Question Answering model? A systematic approach to adversarial attacks
- A Prompt Array Keeps the Bias Away: Debiasing Vision-Language Models with Adversarial Learning
- Adversarial Examples Generation for Reducing Implicit Gender Bias in Pre-trained Models
- Discovering the Hidden Vocabulary of DALLE-2
- Raising the Cost of Malicious AI-Powered Image Editing
- Negating Negatives: Alignment without Human Positive Samples via Distributional Dispreference Optimization
- ALIGNERS: DECOUPLING LLMS AND ALIGNMENT
- CAN LLM-GENERATED MISINFORMATION BE DETECTED?
- On the Risk of Misinformation Pollution with Large Language Models
- Evading Watermark based Detection of AI-Generated Content
- Mitigating Inappropriateness in Image Generation: Can there be Value in Reflecting the World’s Ugliness?
- Privacy-Preserving Instructions for Aligning Large Language Models
- TOWARDS UNDERSTANDING THE INTERPLAY OF GENERATIVE ARTIFICIAL INTELLIGENCE AND THE INTERNET
- Evaluating the Social Impact of Generative AI Systems in Systems and Society
- Transformation vs Tradition: Artificial General Intelligence (AGI) for Arts and Humanities
- TOWARDS RESPONSIBLE AI IN THE ERA OF GENERATIVE AI: A REFERENCE ARCHITECTURE FOR DESIGNING FOUNDATION MODEL BASED SYSTEMS
- RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment
- Intent-aligned AI systems deplete human agency: the need for agency foundations research in AI safet
- Risk Assessment and Statistical Significance in the Age of Foundation Models
- The Foundation Model Transparency Index
- The Privacy Pillar - A Conceptual Framework for Foundation Model-based Systems
- A Baseline Analysis of Reward Models’ Ability To Accurately Analyze Foundation Models Under Distribution Shift
- Foundational Moral Values for AI Alignment
- Hazards from Increasingly Accessible Fine-Tuning of Downloadable Foundation Models
- ON CATASTROPHIC INHERITANCE OF LARGE FOUNDATION MODELS
- Foundation Model Sherpas: Guiding Foundation Models through Knowledge and Reasoning
- Rewards-in-Context: Multi-objective Alignment of Foundation Models with Dynamic Preference Adjustment
- Foundation Model Transparency Reports
- SECURING RELIABILITY: A BRIEF OVERVIEW ON ENHANCING IN-CONTEXT LEARNING FOR FOUNDATION MODELS
- EXPLORING THE ADVERSARIAL CAPABILITIES OF LARGE LANGUAGE MODELS
- TRAP: Targeted Random Adversarial Prompt Honeypot for Black-Box Identification
- LLM-Resistant Math Word Problem Generation via Adversarial Attacks
- Efficient Black-Box Adversarial Attacks on Neural Text Detectors
- Adversarial Preference Optimization
- Combating Adversarial Attacks with Multi-Agent Debate
- How the Advent of Ubiquitous Large Language Models both Stymie and Turbocharge Dynamic Adversarial Question Generation
- L-AutoDA: Leveraging Large Language Models for Automated Decision-based Adversarial Attacks
- Hidding the Ghostwriters: An Adversarial Evaluation of AI-Generated Student Essay Detection
- What Does the Bot Say? Opportunities and Risks of Large Language Models in Social Media Bot Detection
- Prompted Contextual Vectors for Spear-Phishing Detection
- Token-Ensemble Text Generation: On Attacking the Automatic AI-Generated Text Detection
- Recursive Chain-of-Feedback Prevents Performance Degradation from Redundant Prompting
- Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents
- RADAR: Robust AI-Text Detection via Adversarial Learning
- OUTFOX: LLM-Generated Essay Detection Through In-Context Learning with Adversarially Generated Examples
- Why do universal adversarial attacks work on large language models?: Geometry might be the answer
- J-Guard: Journalism Guided Adversarially Robust Detection of AI-generated News
- Distilling Adversarial Prompts from Safety Benchmarks: Report for the Adversarial Nibbler Challenge
- Detoxifying Large Language Models via Knowledge Editing
- Healing Unsafe Dialogue Responses with Weak Supervision Signals
- LLM-Attack
- Hacc-Man: An Arcade Game for Jailbreaking LLMs
- Forcing Generative Models to Degenerate Ones: The Power of Data Poisoning Attacks
- DeceptPrompt: Exploiting LLM-driven Code Generation via Adversarial Natural Language Instructions
- Analyzing the Inherent Response Tendency of LLMs: Real-World Instructions-Driven Jailbreak
- Hijacking Large Language Models via Adversarial In-Context Learning
- Make Them Spill the Beans! Coercive Knowledge Extraction from (Production) LLMs
- DALA: A Distribution-Aware LoRA-Based Adversarial Attack against Language Models
- FRONTIER LANGUAGE MODELS ARE NOT ROBUST TO ADVERSARIAL ARITHMETIC, OR “WHAT DO I NEED TO SAY SO YOU AGREE 2+2=5?”
- Backdoor Activation Attack: Attack Large Language Models using Activation Steering for Safety-Alignment
- Evil Geniuses: Delving into the Safety of LLM-based Agents
- BadLlama: cheaply removing safety fine-tuning from Llama 2-Chat 13B
- SHADOW ALIGNMENT: THE EASE OF SUBVERTING SAFELY-ALIGNED LANGUAGE MODELS
- Prompt as Triggers for Backdoor Attack: Examining the Vulnerability in Language Models
- ChatGPT as an Attack Tool: Stealthy Textual Backdoor Attack via Blackbox Generative Model Trigger
- Tastle: Distract Large Language Models for Automatic Jailbreak Attack
- Speak Out of Turn: Safety Vulnerability of Large Language Models in Multi-turn Dialogue
- Learning to Poison Large Language Models During Instruction Tuning
- TALK TOO MUCH: Poisoning Large Language Models under Token Limit
- Don’t Say No: Jailbreaking LLM by Suppressing Refusal
- Goal-guided Generative Prompt Injection Attack on Large Language Models
- Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models
- BadAgent: Inserting and Activating Backdoor Attacks in LLM Agents
- AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens
- QROA: A Black-Box Query-Response Optimization Attack on LLMs
- BadRAG: Identifying Vulnerabilities in Retrieval Augmented Generation of Large Language Models
- Improved Generation of Adversarial Examples Against Safety-aligned LLMs
- Exploring Backdoor Attacks against Large Language Model-based Decision Making
- Jailbreak Paradox: The Achilles’ Heel of LLMs
- Stealth edits for provably fixing or attacking large language models
- IS POISONING A REAL THREAT TO LLM ALIGNMENT? MAYBE MORE SO THAN YOU THINK
- Knowledge-to-Jailbreak: One Knowledge Point Worth One Attack
- “Not Aligned” is Not “Malicious”: Being Careful about Hallucinations of Large Language Models’ Jailbreak
- Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis
- Emerging Safety Attack and Defense in Federated Instruction Tuning of Large Language Models
- Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs
- StructuralSleight: Automated Jailbreak Attacks on Large Language Models Utilizing Uncommon Text-Encoded Structure
- WHEN LLM MEETS DRL: ADVANCING JAILBREAKING EFFICIENCY VIA DRL-GUIDED SEARCH
- Context Injection Attacks on Large Language Models
- Enhancing Jailbreak Attack Against Large Language Models through Silent Tokens
- Phantom: General Trigger Attacks on Retrieval Augmented Language Generation
- On Trojans in Refined Language Models
- A Survey of Backdoor Attacks and Defenses on Large Language Models: Implications for Security Measures
- How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States
- JAILBREAKING AS A REWARD MISSPECIFICATION PROBLEM
- ObscurePrompt: Jailbreaking Large Language Models via Obscure Input
- ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates
- Improved Techniques for Optimization-Based Jailbreaking on Large Language Models
- Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses
- AutoBreach: Universal and Adaptive Jailbreaking with Efficient Wordplay-Guided Optimization
- Chain of Attack: a Semantic-Driven Contextual Multi-Turn attacker for LLM
- CAN LLMS DEEPLY DETECT COMPLEX MALICIOUS QUERIES? A FRAMEWORK FOR JAILBREAKING VIA OBFUSCATING INTENT
- Hidden You Malicious Goal Into Benign Narratives: Jailbreak Large Language Models through Logic Chain Injection
- JailbreakLens: Visual Analysis of Jailbreak Attacks Against Large Language Models
- AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs
- SANDWICH ATTACK: MULTI-LANGUAGE MIXTURE ADAPTIVE ATTACK ON LLMS
- Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks
- Using Hallucinations to Bypass RLHF Filters
- TARGET: Template-Transferable Backdoor Attack Against Prompt-based NLP Models via GPT4
- OPEN SESAME! UNIVERSAL BLACK BOX JAILBREAKING OF LARGE LANGUAGE MODELS
- Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts
- Weak-to-Strong Jailbreaking on Large Language Models
- Punctuation Matters! Stealthy Backdoor Attack for Language Models
- BYPASSING THE SAFETY TRAINING OF OPEN-SOURCE LLMS WITH PRIMING ATTACKS
- Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections
- A Semantic, Syntactic, And Context-Aware Natural Language Adversarial Example Generator
- Fast Adversarial Attacks on Language Models In One GPU Minute
- Don’t Listen To Me: Understanding and Exploring Jailbreak Prompts of Large Language Models
- Scaling Behavior of Machine Translation with Large Language Models under Prompt Injection Attacks
- Automatic and Universal Prompt Injection Attacks against Large Language Models
- Prompt Injection Attacks and Defenses in LLM-Integrated Applications
- TENSOR TRUST: INTERPRETABLE PROMPT INJECTION ATTACKS FROM AN ONLINE GAME
- DPP-Based Adversarial Prompt Searching for Language Models
- Large Language Models Are Better Adversaries: Exploring Generative Clean-Label Backdoor Attacks Against Text Classifiers
- Rapid Adoption, Hidden Risks: The Dual Impact of Large Language Model Customization
- Neural Exec: Learning (and Learning from) Execution Triggers for Prompt Injection Attacks
- Prompt Injection attack against LLM-integrated Applications
- FINE-TUNING ALIGNED LANGUAGE MODELS COMPROMISES SAFETY, EVEN WHEN USERS DO NOT INTEND TO!
- CATASTROPHIC JAILBREAK OF OPEN-SOURCE LLMS VIA EXPLOITING GENERATION
- EVALUATING THE SUSCEPTIBILITY OF PRE-TRAINED LANGUAGE MODELS VIA HANDCRAFTED ADVERSARIAL EXAMPLES
- EMULATED DISALIGNMENT: SAFETY ALIGNMENT FOR LARGE LANGUAGE MODELS MAY BACKFIRE!
- GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
- ON THE SAFETY OF OPEN-SOURCED LARGE LANGUAGE MODELS: DOES ALIGNMENT REALLY PREVENT THEM FROM BEING MISUSED?
- Unveiling the Implicit Toxicity in Large Language Models
- ALIGNMENT IS NOT SUFFICIENT TO PREVENT LARGE LANGUAGE MODELS FROM GENERATING HARMFUL INFORMATION: A PSYCHOANALYTIC PERSPECTIVE
- LANGUAGE MODEL UNALIGNMENT: PARAMETRIC RED-TEAMING TO EXPOSE HIDDEN HARMS AND BIASES
- Composite Backdoor Attacks Against Large Language Models
- A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily
- ALL IN HOW YOU ASK FOR IT: SIMPLE BLACK-BOX METHOD FOR JAILBREAK ATTACKS
- MULTIVERSE: Exposing Large Language Model Alignment Problems in Diverse Worlds
- Universal and Transferable Adversarial Attacks on Aligned Language Models
- COERCING LLMS TO DO AND REVEAL (ALMOST) ANYTHING
- Generating Valid and Natural Adversarial Examples with Large Language Models
- Scaling Laws for Adversarial Attacks on Language Model Activations
- Ignore Previous Prompt: Attack Techniques For Language Models
- ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages
- A New Era in LLM Security: Exploring Security Concerns in Real-World LLM-based Systems
- ATTACKING LARGE LANGUAGE MODELS WITH PROJECTED GRADIENT DESCENT
- Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack
- Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embedding Space
- Query-Based Adversarial Prompt Generation
- Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment
- DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers
- From Noise to Clarity: Unraveling the Adversarial Suffix of Large Language Model Attacks via Translation of Text Embeddings
- CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models
- Large Language Models are Vulnerable to Bait-and-Switch Attacks for Generating Harmful Content
- PRP: Propagating Universal Perturbations to Attack Large Language Model Guard-Rails
- A Cross-Language Investigation into Jailbreak Attacks in Large Language Models
- LoRA-as-an-Attack! Piercing LLM Safety Under The Share-and-Play Scenario
- Best-of-Venom: Attacking RLHF by Injecting Poisoned Preference Data
- SHORTCUTS ARISING FROM CONTRAST: EFFECTIVE AND COVERT CLEAN-LABEL ATTACKS IN PROMPT-BASED LEARNING
- What’s in Your “Safe” Data?: Identifying Benign Data that Breaks Safety
- Attacking LLM Watermarks by Exploiting Their Strengths
- DeepInception: Hypnotize Large Language Model to Be Jailbreaker
- EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models
- LinkPrompt: Natural and Universal Adversarial Attacks on Prompt-based Language Models
- Syntactic Ghost: An Imperceptible General-purpose Backdoor Attacks on Pre-trained Language Models
- Conversation Reconstruction Attack Against GPT Models
- PoisonedRAG: Knowledge Poisoning Attacks to Retrieval-Augmented Generation of Large Language Models
- COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability
- Play Guessing Game with LLM: Indirect Jailbreak Attack with Implicit Clues
- Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks
- UNIVERSAL JAILBREAK BACKDOORS FROM POISONED HUMAN FEEDBACK
- Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking
- Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation
- POISONPROMPT: BACKDOOR ATTACK ON PROMPT-BASED LARGE LANGUAGE MODELS
- BACKDOORING INSTRUCTION-TUNED LARGE LANGUAGE MODELS WITH VIRTUAL PROMPT INJECTION
- Backdoor Attacks for In-Context Learning with Language Models
- Instructions as Backdoors: Backdoor Vulnerabilities of Instruction Tuning for Large Language Models
- UOR: Universal Backdoor Attacks on Pre-trained Language Models
- Fake Alignment: Are LLMs Really Aligned Well?
- Imperio: Language-Guided Backdoor Attacks for Arbitrary Model Control
- Universal Vulnerabilities in Large Language Models: Backdoor Attacks for In-context Learning
- BADCHAIN: BACKDOOR CHAIN-OF-THOUGHT PROMPTING FOR LARGE LANGUAGE MODELS
- AUTODAN: INTERPRETABLE GRADIENT-BASED ADVERSARIAL ATTACKS ON LARGE LANGUAGE MODELS
- AN LLM CAN FOOL ITSELF: A PROMPT-BASED ADVERSARIAL ATTACK
- AUTOMATIC HALLUCINATION ASSESSMENT FOR ALIGNED LARGE LANGUAGE MODELS VIA TRANSFERABLE ADVERSARIAL ATTACKS
- LLM LIES: HALLUCINATIONS ARE NOT BUGS, BUT FEATURES AS ADVERSARIAL EXAMPLES
- LOFT: LOCAL PROXY FINE-TUNING FOR IMPROVING TRANSFERABILITY OF ADVERSARIAL ATTACKS AGAINST LARGE LANGUAGE MODELS
- Robustness Over Time: Understanding Adversarial Examples’ Effectiveness on Longitudinal Versions of Large Language Models
- Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction
- Adversarial Demonstration Attacks on Large Language Models
- COVER: A Heuristic Greedy Adversarial Attack on Prompt-based Learning in Language Models
- The Butterfly Effect of Altering Prompts: How Small Changes and Jailbreaks Affect Large Language Model Performance
- Open the Pandora’s Box of LLMs: Jailbreaking LLMs through Representation Engineering
- How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs
- Sowing the Wind, Reaping the Whirlwind: The Impact of Editing Language Models
- PANDORA: Jailbreak GPTs by Retrieval Augmented Generation Poisoning
- Jailbreaking Proprietary Large Language Models using Word Substitution Cipher
- Semantic Mirror Jailbreak: Genetic Algorithm Based Jailbreak Prompts Against Open-source LLMs
- Adversarial Attacks and Defenses in Large Language Models: Old and New Threats
- Jailbroken: How Does LLM Safety Training Fail?
- ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs
- GUARD: Role-playing to Generate Natural-language Jailbreakings to Test Guideline Adherence of Large Language Models
- Exploring Safety Generalization Challenges of Large Language Models via Code
- BADEDIT: BACKDOORING LARGE LANGUAGE MODELS BY MODEL EDITING
- THE POISON OF ALIGNMENT
- The Philosopher’s Stone: Trojaning Plugins of Large Language Models
- RAPID OPTIMIZATION FOR JAILBREAKING LLMS VIA SUBCONSCIOUS EXPLOITATION AND ECHOPRAXIA
- RED TEAMING GPT-4V: ARE GPT-4V SAFE AGAINST UNI/MULTI-MODAL JAILBREAK ATTACKS?
- PAL: Proxy-Guided Black-Box Attack on Large Language Models
- INCREASED LLM VULNERABILITIES FROM FINETUNING AND QUANTIZATION
- Rethinking How to Evaluate Language Model Jailbreak
- GPT-4 Jailbreaks Itself with Near-Perfect Success Using Self-Explanation
- Lockpicking LLMs: A Logit-Based Jailbreak Using Token-level Manipulation
- Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs
- AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs
- Universal Adversarial Triggers Are Not Universal
- LLM-Defense
- garak : A Framework for Security Probing Large Language Models
- Defending Against Weight-Poisoning Backdoor Attacks for Parameter-Efficient Fine-Tuning
- Trojan Detection in Large Language Models: Insights from The Trojan Detection Challenge
- PromptFix: Few-shot Backdoor Removal via Adversarial Prompt Tuning
- The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
- BELLS: A Framework Towards Future Proof Benchmarks for the Evaluation of LLM Safeguards
- Cross-Task Defense: Instruction-Tuning LLMs for Content Safety
- Efficient Adversarial Training in LLMs with Continuous Attacks
- StruQ: Defending Against Prompt Injection with Structured Queries
- Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning
- GradSafe: Detecting Unsafe Prompts for LLMs via Safety-Critical Gradient Analysis
- Defending Jailbreak Prompts via In-Context Adversarial Game
- Bergeron: Combating Adversarial Attacks through a Conscience-Based Alignment Framework
- Jailbreaker in Jail: Moving Target Defense for Large Language Models
- DEFENDING AGAINST ALIGNMENT-BREAKING ATTACKS VIA ROBUSTLY ALIGNED LLM
- Causality Analysis for Evaluating the Security of Large Language Models
- AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks
- Jailbreaking is Best Solved by Definition
- RIGORLLM: RESILIENT GUARDRAILS FOR LARGE LANGUAGE MODELS AGAINST UNDESIRED CONTENT
- LANGUAGE MODELS ARE HOMER SIMPSON! Safety Re-Alignment of Fine-tuned Language Models through Task Arithmetic
- Defending Against Indirect Prompt Injection Attacks With Spotlighting
- LLMGuard: Guarding against Unsafe LLM Behavior
- Test-time Backdoor Mitigation for Black-Box Large Language Models with Defensive Demonstrations
- ON TROJAN SIGNATURES IN LARGE LANGUAGE MODELS OF CODE
- Acquiring Clean Language Models from Backdoor Poisoned Datasets by Downscaling Frequency Space
- Detoxifying Large Language Models via Knowledge Editing
- MART: Improving LLM Safety with Multi-round Automatic Red-Teaming
- THE POISON OF ALIGNMENT
- ROSE: Robust Selective Fine-tuning for Pre-trained Language Models
- GAINING WISDOM FROM SETBACKS: ALIGNING LARGE LANGUAGE MODELS VIA MISTAKE ANALYSIS
- Making Harmful Behaviors Unlearnable for Large Language Models
- Fake Alignment: Are LLMs Really Aligned Well?
- Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment
- Vaccine: Perturbation-aware Alignment for Large Language Model
- DEFENDING LARGE LANGUAGE MODELS AGAINST JAILBREAK ATTACKS VIA SEMANTIC SMOOTHING
- Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement
- BASELINE DEFENSES FOR ADVERSARIAL ATTACKS AGAINST ALIGNED LANGUAGE MODELS
- Mitigating Fine-tuning Jailbreak Attack with Backdoor Enhanced Alignment
- LLMs Can Defend Themselves Against Jailbreaking in a Practical Manner: A Vision Paper
- Detoxifying Text with MARCO: Controllable Revision with Experts and Anti-Experts
- Self-Destructing Models: Increasing the Costs of Harmful Dual Uses of Foundation Models
- Whispers that Shake Foundations: Analyzing and Mitigating False Premise Hallucinations in Large Language Models
- CAMOUFLAGE IS ALL YOU NEED: EVALUATING AND ENHANCING LANGUAGE MODEL ROBUSTNESS AGAINST CAMOUFLAGE ADVERSARIAL ATTACKS
- Defending LLMs against Jailbreaking Attacks via Backtranslation
- IMMUNIZATION AGAINST HARMFUL FINE-TUNING ATTACKS
- Robust Safety Classifier for Large Language Models: Adversarial Prompt Shield
- JAB: Joint Adversarial Prompting and Belief Augmentation
- TOKEN-LEVEL ADVERSARIAL PROMPT DETECTION BASED ON PERPLEXITY MEASURES AND CONTEXTUAL INFORMATION
- Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks
- Studious Bob Fight Back Against Jailbreaking via Prompt Adversarial Tuning
- Improving the Robustness of Large Language Models via Consistency Alignment
- SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding
- Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization
- Defending Pre-trained Language Models as Few-shot Learners against Backdoor Attacks
- LMSanitator: Defending Prompt-Tuning Against Task-Agnostic Backdoors
- Diffusion Theory as a Scalpel: Detecting and Purifying Poisonous Dimensions in Pre-trained Language Models Caused by Backdoor or Bias
- Analyzing And Editing Inner Mechanisms of Backdoored Language Models
- Setting the Trap: Capturing and Defeating Backdoors in Pretrained Language Models through Honeypots
- ROBUSTIFYING LANGUAGE MODELS WITH TEST-TIME ADAPTATION
- DETECTING LANGUAGE MODEL ATTACKS WITH PERPLEXITY
- Adversarial Fine-Tuning of Language Models: An Iterative Optimisation Approach for the Generation and Detection of Problematic Content
- From Adversarial Arms Race to Model-centric Evaluation: Motivating a Unified Automatic Robustness Evaluation Framework
- LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked
- Intention Analysis Makes LLMs A Good Jailbreak Defender
- Defending Against Disinformation Attacks in Open-Domain Question Answering
- Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes
- Round Trip Translation Defence against Large Language Model Jailbreaking Attacks
- How Should Pre-Trained Language Models Be Fine-Tuned Towards Adversarial Robustness?
- SELF-GUARD: Empower the LLM to Safeguard Itself
- Jatmo: Prompt Injection Defense by Task-Specific Finetuning
- Precisely the Point: Adversarial Augmentations for Faithful and Informative Text Generation
- Adversarial Text Purification: A Large Language Model Approach for Defense
- Signed-Prompt: A New Approach to Prevent Prompt Injection Attacks Against LLM-Integrated Applications
- Is the System Message Really Important to Jailbreaks in Large Language Models?
- AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts
- Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge