The resources related to the trustworthiness of large models (LMs) across multiple dimensions (e.g., safety, security, and privacy), with a special focus on multi-modal LMs (e.g., vision-language models and diffusion models).
-
This repo is in progress 🌱 (currently manually collected).
-
Badges:
-
🌻 Welcome to recommend resources to us via Issues with the following format (please fill in this table):
Title | Link | Code | Venue | Classification | Model | Comment |
---|---|---|---|---|---|---|
aa | arxiv | github | bb'23 | A1. Jailbreak | LLM | Agent |
- [2023.01.20] 🔥 We collect
3
related papers from NDSS'24! - [2023.01.17] 🔥 We collect
108
related papers from ICLR'24! - [2023.01.09] 🔥 LM-SSP is released!
- [2024/03] Large Language Model Capture-the-Flag (LLM CTF) Competition @ SaTML 2024
- [2024/02] LLM - Detect AI Generated Text
- [2024/02] Find the Trojan: Universal Backdoor Detection in Aligned Large Language Models @ SaTML 2024
- [2023/01] Training Data Extraction Challenge @ SaTML 2023
- [2022/12] Machine Learning Model Attribution Challenge @ SaTML 2023
- [2024/01] LLM Safety Leaderboard
- [2024/01] Hallucinations Leaderboard
- [2024/02] EasyJailbreak
- [2023/05] Ragas
- [2023/03] AutoGen
- [2024/02] Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey
- [2024/02] A Survey of Text Watermarking in the Era of Large Language Models
- [2024/02] Safety of Multimodal Large Language Models on Images and Text
- [2024/02] A Survey on Hallucination in Large Vision-Language Models
- [2024/01] Security and Privacy Challenges of Large Language Models: A Survey
- [2024/01] Black-Box Access Is Insufficient for Rigorous AI Audits
- [2024/01] Red Teaming Visual Language Models
- [2024/01] Risk Taxonomy, Mitigation, and Assessment Benchmarks of Large Language Model Systems
- [2024/01] TrustLLM: Trustworthiness in Large Language Models
- [2023/12] Privacy Issues in Large Language Models: A Survey
- [2023/12] A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly
- [2023/10] Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks
- [2023/09] AgentBench: Evaluating LLMs as Agents
- [2023/08] Trustworthy LLMs: A Survey and Guideline for Evaluating Large Language Models' Alignment
- [2023/07] A Comprehensive Overview of Large Language Models
- [2023/06] DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models
- [2023/05] ChatGPT Needs SPADE (Sustainability, PrivAcy, Digital Divide, and Ethics) Evaluation: A Review
- [2023/04] Safety Assessment of Chinese Large Language Models
- [2023/03] A Survey of Large Language Models
- [2022/11] Holistic Evaluation of Language Models
- [2022/08] Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
- [2022/06] Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models
- [2021/11] Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models
- [2024/02] DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers
- [2024/02] GUARD: Role-Playing to Generate Natural-Language Jailbreakings to Test Guideline Adherence of Large Language Models
- [2024/02] CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models
- [2024/02] PRP: Propagating Universal Perturbations to Attack Large Language Model Guard-Rails
- [2024/02] Defending Large Language Models Against Jailbreak Attacks via Semantic Smoothing
- [2024/02] LLMs Can Defend Themselves Against Jailbreaking in a Practical Manner: A Vision Paper
- [2024/02] From Noise to Clarity: Unraveling the Adversarial Suffix of Large Language Model Attacks via Translation of Text Embeddings
- [2024/02] Semantic Mirror Jailbreak: Genetic Algorithm Based Jailbreak Prompts Against Open-Source LLMs
- [2024/02] Is the System Message Really Important to Jailbreaks in Large Language Models?
- [2024/02] Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks With Self-Refinement
- [2024/02] How (Un)ethical Are Instruction-Centric Responses of LLMs? Unveiling the Vulnerabilities of Safety Guardrails to Harmful Queries
- [2024/02] Mitigating Fine-Tuning Jailbreak Attack With Backdoor Enhanced Alignment
- [2024/02] LLM Jailbreak Attack Versus Defense Techniques -- A Comprehensive Study
- [2024/02] Coercing LLMs to Do and Reveal (Almost) Anything
- [2024/02] GradSafe: Detecting Unsafe Prompts for LLMs via Safety-Critical Gradient Analysis
- [2024/02] Query-Based Adversarial Prompt Generation
- [2024/02] ArtPrompt: ASCII Art-Based Jailbreak Attacks Against Aligned LLMs
- [2024/02] SPML: A DSL for Defending Language Models Against Prompt Attacks
- [2024/02] A StrongREJECT for Empty Jailbreaks
- [2024/02] Jailbreaking Proprietary Large Language Models Using Word Substitution Cipher
- [2024/02] ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages
- [2024/02] PAL: Proxy-Guided Black-Box Attack on Large Language Models
- [2024/02] Attacking Large Language Models With Projected Gradient Descent
- [2024/02] SafeDecoding: Defending Against Jailbreak Attacks via Safety-Aware Decoding
- [2024/02] Play Guessing Game With LLM: Indirect Jailbreak Attack With Implicit Clues
- [2024/02] COLD-Attack: Jailbreaking LLMs With Stealthiness and Controllability
- [2024/02] Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast
- [2024/02] Pandora: Jailbreak GPTs by Retrieval Augmented Generation Poisoning
- [2024/02] Comprehensive Assessment of Jailbreak Attacks Against LLMs
- [2024/02] Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models
- [2024/02] HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
- [2024/02] Jailbreaking Attack Against Multimodal Large Language Model
- [2024/02] Prompt-Driven LLM Safeguarding via Directed Representation Optimization
- [2024/01] A Cross-Language Investigation Into Jailbreak Attacks in Large Language Models
- [2024/01] Weak-to-Strong Jailbreaking on Large Language Models
- [2024/01] Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks
- [2024/01] Jailbreaking GPT-4V via Self-Adversarial Attacks With System Prompts
- [2024/01] PsySafe: A Comprehensive Framework for Psychological-Based Attack, Defense, and Evaluation of Multi-Agent System Safety
- [2024/01] Sowing the Wind, Reaping the Whirlwind: The Impact of Editing Language Models
- [2024/01] Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning
- [2024/01] All in How You Ask for It: Simple Black-Box Method for Jailbreak Attacks
- [2024/01] AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models
- [2024/01] Intention Analysis Prompting Makes Large Language Models a Good Jailbreak Defender
- [2024/01] How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs
- [2023/12] A Comprehensive Survey of Attack Techniques, Implementation, and Mitigation Strategies in Large Language Models
- [2023/12] Analyzing the Inherent Response Tendency of LLMs: Real-World Instructions-Driven Jailbreak
- [2023/12] Goal-Oriented Prompt Attack and Safety Evaluation for LLMs
- [2023/12] Tree of Attacks: Jailbreaking Black-Box LLMs Automatically
- [2023/12] Safety Alignment in NLP Tasks: Weakly Aligned Summarization as an in-Context Attack
- [2023/12] A Mutation-Based Method for Multi-Modal Jailbreaking Attack Detection
- [2023/12] Adversarial Attacks on GPT-4 via Simple Random Search
- [2023/12] Make Them Spill the Beans! Coercive Knowledge Extraction From (Production) LLMs
- [2023/11] Query-Relevant Images Jailbreak Large Multi-Modal Models
- [2023/11] A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts Can Fool Large Language Models Easily
- [2023/11] Exploiting Large Language Models (LLMs) Through Deception Techniques and Persuasion Principles
- [2023/11] Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks
- [2023/11] MART: Improving LLM Safety With Multi-Round Automatic Red-Teaming
- [2023/11] Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation
- [2023/11] SneakyPrompt: Jailbreaking Text-to-Image Generative Models
- [2023/11] DeepInception: Hypnotize Large Language Model to Be Jailbreaker
- [2023/11] Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs Through a Global Scale Prompt Hacking Competition
- [2023/11] Summon a Demon and Bind It: A Grounded Theory of LLM Red Teaming in the Wild
- [2023/11] Evil Geniuses: Delving Into the Safety of LLM-based Agents
- [2023/11] FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts
- [2023/10] Attack Prompt Generation for Red Teaming and Defending Large Language Models
- [2023/10] Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attack
- [2023/10] Low-Resource Languages Jailbreak GPT-4
- [2023/10] Prompt Injection Attacks and Defenses in LLM-Integrated Applications
- [2023/10] SC-Safety: A Multi-Round Open-Ended Question Adversarial Safety Benchmark for Large Language Models in Chinese
- [2023/10] SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
- [2023/10] Adversarial Attacks on LLMs
- [2023/10] AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models
- [2023/10] Jailbreak and Guard Aligned Language Models With Only Few in-Context Demonstrations
- [2023/10] Jailbreaking Black Box Large Language Models in Twenty Queries
- [2023/09] Baseline Defenses for Adversarial Attacks Against Aligned Language Models
- [2023/09] Certifying LLM Safety Against Adversarial Prompting
- [2023/09] SurrogatePrompt: Bypassing the Safety Filter of Text-to-Image Models via Substitution
- [2023/09] Catastrophic Jailbreak of Open-Source LLMs via Exploiting Generation
- [2023/09] AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models
- [2023/09] GPT-4 Is Too Smart to Be Safe: Stealthy Chat With LLMs via Cipher
- [2023/09] Jailbreak in Pieces: Compositional Adversarial Attacks on Multi-Modal Language Models
- [2023/09] Multilingual Jailbreak Challenges in Large Language Models
- [2023/09] On the Humanity of Conversational AI: Evaluating the Psychological Portrayal of LLMs
- [2023/09] RAIN: Your Language Models Can Align Themselves Without Finetuning
- [2023/09] Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models That Follow Instructions
- [2023/09] Tensor Trust: Interpretable Prompt Injection Attacks From an Online Game
- [2023/09] Understanding Hidden Context in Preference Learning: Consequences for RLHF
- [2023/09] Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM
- [2023/09] FuzzLLM: A Novel and Universal Fuzzing Framework for Proactively Discovering Jailbreak Vulnerabilities in Large Language Models
- [2023/09] GPTFUZZER: Red Teaming Large Language Models With Auto-Generated Jailbreak Prompts
- [2023/09] Open Sesame! Universal Black Box Jailbreaking of Large Language Models
- [2023/08] Red-Teaming Large Language Models Using Chain of Utterances for Safety-Alignment
- [2023/08] XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models
- [2023/08] “Do Anything Now”: Characterizing and Evaluating in-the-Wild Jailbreak Prompts on Large Language Models
- [2023/08] Detecting Language Model Attacks With Perplexity
- [2023/07] From ChatGPT to ThreatGPT: Impact of Generative AI in Cybersecurity and Privacy
- [2023/07] LLM Censorship: A Machine Learning Challenge or a Computer Security Problem?
- [2023/07] Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models
- [2023/07] Jailbroken: How Does LLM Safety Training Fail?
- [2023/07] MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots
- [2023/07] Universal and Transferable Adversarial Attacks on Aligned Language Models
- [2023/06] Visual Adversarial Examples Jailbreak Aligned Large Language Models
- [2023/06] Prompt Injection Attack Against LLM-integrated Applications
- [2023/05] Adversarial Demonstration Attacks on Large Language Models.
- [2023/05] Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study
- [2023/05] Tricking LLMs Into Disobedience: Understanding, Analyzing, and Preventing Jailbreaks
- [2023/04] Multi-Step Jailbreaking Privacy Attacks on ChatGPT
- [2023/03] Automatically Auditing Large Language Models via Discrete Optimization
- [2023/02] Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications With Indirect Prompt Injection
- [2024/02] Privacy-Preserving Instructions for Aligning Large Language Models
- [2024/02] Emulated Disalignment: Safety Alignment for Large Language Models May Backfire!
- [2024/02] Language Models Are Homer Simpson! Safety Re-Alignment of Fine-Tuned Language Models Through Task Arithmetic
- [2024/02] Learning to Edit: Aligning LLMs With Knowledge Editing
- [2024/02] DeAL: Decoding-Time Alignment for Large Language Models
- [2024/02] Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
- [2024/01] Agent Alignment in Evolving Social Norms
- [2023/12] Alignment for Honesty
- [2023/12] Exploiting Novel GPT-4 APIs
- [2023/11] Removing RLHF Protections in GPT-4 via Fine-Tuning
- [2023/10] AI Alignment: A Comprehensive Survey
- [2023/10] Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models
- [2023/09] Alignment as Reward-Guided Search
- [2023/09] Beyond Imitation: Leveraging Fine-Grained Quality Signals for Alignment
- [2023/09] Beyond Reverse KL: Generalizing Direct Preference Optimization With Diverse Divergence Constraints
- [2023/09] CAS: A Probability-Based Approach for Universal Condition Alignment Score
- [2023/09] CPPO: Continual Learning for Reinforcement Learning With Human Feedback
- [2023/09] Fine-Tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
- [2023/09] FLASK: Fine-Grained Language Model Evaluation Based on Alignment Skill Sets
- [2023/09] Gaining Wisdom From Setbacks: Aligning Large Language Models via Mistake Analysis
- [2023/09] Generative Judge for Evaluating Alignment
- [2023/09] Group Preference Optimization: Few-Shot Alignment of Large Language Models
- [2023/09] Improving Generalization of Alignment With Human Preferences Through Group Invariant Learning
- [2023/09] Large Language Models as Automated Aligners for Benchmarking Vision-Language Models
- [2023/09] Peering Through Preferences: Unraveling Feedback Acquisition for Aligning Large Language Models
- [2023/09] RLCD: Reinforcement Learning From Contrastive Distillation for LM Alignment
- [2023/09] Safe RLHF: Safe Reinforcement Learning From Human Feedback
- [2023/09] SALMON: Self-Alignment With Principle-Following Reward Models
- [2023/09] Self-Alignment With Instruction Backtranslation
- [2023/09] Statistical Rejection Sampling Improves Preference Optimization
- [2023/09] True Knowledge Comes From Practice: Aligning Large Language Models With Embodied Environments via Reinforcement Learning
- [2023/09] Urial: Aligning Untuned LLMs With Just the 'Write' Amount of in-Context Learning
- [2023/09] What Happens When You Fine-Tuning Your Model? Mechanistic Analysis of Procedurally Generated Tasks.
- [2023/09] What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning
- [2023/07] BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset
- [2023/07] CValues: Measuring the Values of Chinese Large Language Models From Safety to Responsibility
- [2023/05] Principle-Driven Self-Alignment of Language Models From Scratch With Minimal Human Supervision
- [2023/04] Fundamental Limitations of Alignment in Large Language Models
- [2023/04] RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment
- [2022/10] Enabling Classifiers to Make Judgements Explicitly Aligned With Human Values
- [2024/02] Technical Report on the Checkfor.ai AI-Generated Text Classifier
- [2024/02] VGMShield: Mitigating Misuse of Video Generative Models
- [2024/02] M4gt-Bench: Evaluation Benchmark for Black-Box Machine-Generated Text Detection
- [2024/02] Ten Words Only Still Help: Improving Black-Box AI-Generated Text Detection via Proxy-Guided Efficient Re-Sampling
- [2024/02] Does DETECTGPT Fully Utilize Perturbation? Selective Perturbation on Model-Based Contrastive Learning Detector Would Be Better
- [2024/02] TELLER: A Trustworthy Framework for Explainable, Generalizable and Controllable Fake News Detection
- [2024/02] Organic or Diffused: Can We Distinguish Human Art From AI-generated Images?
- [2024/01] Spotting LLMs With Binoculars: Zero-Shot Detection of Machine-Generated Text
- [2024/01] Enhancing Robustness of LLM-Synthetic Text Detectors for Academic Writing: A Comprehensive Analysis
- [2024/01] Authorship Obfuscation in Multilingual Machine-Generated Text Detection
- [2024/01] Few-Shot Detection of Machine-Generated Text Using Style Representations
- [2024/01] LLM-as-a-Coauthor: The Challenges of Detecting LLM-Human Mixcase
- [2023/10] Harnessing the Power of ChatGPT in Fake News: An in-Depth Exploration in Generation, Detection and Explanation
- [2023/09] Can LLM-Generated Misinformation Be Detected?
- [2023/09] Detecting Machine-Generated Texts by Multi-Population Aware Optimization for Maximum Mean Discrepancy
- [2023/09] Robustness of AI-Image Detectors: Fundamental Limits and Practical Attacks
- [2023/05] On the Risk of Misinformation Pollution With Large Language Models
- [2023/05] Evading Watermark Based Detection of AI-Generated Content
- [2023/04] Synthetic Lies: Understanding AI-Generated Misinformation and Evaluating Algorithmic and Human Solutions
- [2023/03] Can AI-Generated Text Be Reliably Detected?
- [2023/03] MGTBench: Benchmarking Machine-Generated Text Detection
- [2022/12] Discovering Language Model Behaviors With Model-Written Evaluations
- [2022/12] CoCo: Coherence-Enhanced Machine-Generated Text Detection Under Data Limitation With Contrastive Learning
- [2022/10] DE-FAKE: Detection and Attribution of Fake Images Generated by Text-to-Image Generation Models
- [2023/12] Disentangling Perceptions of Offensiveness: Cultural and Moral Correlates
- [2023/10] Unpacking the Ethical Value Alignment in Big Models
- [2023/09] Denevil: Towards Deciphering and Navigating the Ethical Values of Large Language Models via Instruction Learning
- [2023/05] From Text to MITRE Techniques: Exploring the Malicious Use of Large Language Models for Generating Cyber Attack Payloads
- [2023/01] Exploring AI Ethics of ChatGPT: A Diagnostic Analysis
- [2024/02] FairBelief - Assessing Harmful Beliefs in Language Models
- [2024/02] What's in a Name? Auditing Large Language Models for Race and Gender Bias
- [2024/02] Measuring Social Biases in Masked Language Models by Proxy of Prediction Quality
- [2024/02] Your Large Language Model Is Secretly a Fairness Proponent and You Should Prompt It Like One
- [2024/02] Disclosure and Mitigation of Gender Bias in LLMs
- [2024/02] I Am Not Them: Fluid Identities and Persistent Out-Group Bias in Large Language Models
- [2024/01] Evaluating Gender Bias in Large Language Models via Chain-of-Thought Prompting
- [2024/01] Gender Bias in Machine Translation and the Era of Large Language Models
- [2024/01] Leveraging Biases in Large Language Models: "Bias-kNN'' for Effective Few-Shot Learning
- [2024/01] Beyond the Surface: A Global-Scale Analysis of Visual Stereotypes in Text-to-Image Generation
- [2023/12] GPTBIAS: A Comprehensive Framework for Evaluating Bias in Large Language Models
- [2023/11] Beyond Detection: Unveiling Fairness Vulnerabilities in Abusive Language Models
- [2023/11] FFT: Towards Harmlessness Evaluation and Analysis for LLMs With Factuality, Fairness, Toxicity
- [2023/11] ROBBIE: Robust Bias Evaluation of Large Generative Language Models
- [2023/10] Im Not Racist But...: Discovering Bias in the Internal Knowledge of Large Language Models
- [2023/10] Investigating the Fairness of Large Language Models for Predictions on Tabular Data
- [2023/10] Kelly Is a Warm Person, Joseph Is a Role Model: Gender Biases in LLM-Generated Reference Letters
- [2023/09] Achieving Fairness in Multi-Agent MDP Using Reinforcement Learning
- [2023/09] Bias Runs Deep: Implicit Reasoning Biases in Persona-Assigned LLMs
- [2023/09] FairVLM: Mitigating Bias in Pre-Trained Vision-Language Models
- [2023/09] Finetuning Text-to-Image Diffusion Models for Fairness
- [2023/09] The Devil Is in the Neurons: Interpreting and Mitigating Social Biases in Language Models
- [2023/09] Bias and Fairness in Chatbots: An Overview
- [2023/09] Bias and Fairness in Large Language Models: A Survey
- [2023/09] People's Perceptions Toward Bias and Related Concepts in Large Language Models: A Systematic Review
- [2023/08] FairBench: A Four-Stage Automatic Framework for Detecting Stereotypes and Biases in Large Language Models
- [2023/08] Gender Bias and Stereotypes in Large Language Models
- [2023/07] Queer People Are People First: Deconstructing Sexual Identity Stereotypes in Large Language Models
- [2023/06] Knowledge of Cultural Moral Norms in Large Language Models
- [2023/06] WinoQueer: A Community-in-the-Loop Benchmark for Anti-Lgbtq+ Bias in Large Language Models
- [2023/05] BiasAsker: Measuring the Bias in Conversational AI System
- [2023/05] Is ChatGPT Fair for Recommendation? Evaluating Fairness in Large Language Model Recommendation
- [2023/05] Large Language Models Are Not Fair Evaluators
- [2023/05] Uncovering and Quantifying Social Biases in Code Generation
- [2022/09] Exploiting Cultural Biases via Homoglyphs in Text-to-Image Synthesis
- [2022/09] Moral Mimicry: Large Language Models Produce Moral Rationalizations Tailored to Political Identity
- [2022/05] Auto-Debias: Debiasing Masked Language Models With Automated Biased Prompts
- [2022/03] Mitigating Gender Bias in Distilled Language Models via Counterfactual Role Reversal
- [2021/04] Mitigating Political Bias in Language Models Through Reinforced Calibration
- [2021/02] Bias Out-of-the-Box: An Empirical Analysis of Intersectional Occupational Biases in Popular Generative Language Models
- [2021/01] Persistent Anti-Muslim Bias in Large Language Models
- [2024/02] Seeing Is Believing: Mitigating Hallucination in Large Vision-Language Models via CLIP-Guided Decoding
- [2024/02] Measuring and Reducing LLM Hallucination Without Gold-Standard Answers via Expertise-Weighting
- [2024/02] Comparing Hallucination Detection Metrics for Multilingual Generation
- [2024/02] Can We Verify Step by Step for Incorrect Answer Detection?
- [2024/02] Strong Hallucinations From Negation and How to Fix Them
- [2024/02] Retrieve Only When It Needs: Adaptive Retrieval Augmentation for Hallucination Mitigation in Large Language Models
- [2024/02] Mitigating Object Hallucination in Large Vision-Language Models via Classifier-Free Guidance
- [2024/02] Can LLMs Produce Faithful Explanations for Fact-Checking? Towards Faithful Explainable Fact-Checking via Multi-Agent Debate
- [2024/02] Understanding the Effects of Iterative Prompting on Truthfulness
- [2024/02] Is It Possible to Edit Large Language Models Robustly?
- [2024/02] C-Rag: Certified Generation Risks for Retrieval-Augmented Language Models
- [2024/01] Hallucination Is Inevitable: An Innate Limitation of Large Language Models
- [2024/01] Mitigating Hallucinations of Large Language Models via Knowledge Consistent Alignment
- [2024/01] Large Language Models Are Null-Shot Learners
- [2024/01] Model Editing Can Hurt General Abilities of Large Language Models
- [2024/01] Relying on the Unreliable: The Impact of Language Models' Reluctance to Express Uncertainty
- [2024/01] Seven Failure Points When Engineering a Retrieval Augmented Generation System
- [2023/12] DelucionQA: Detecting Hallucinations in Domain-Specific Question Answering
- [2023/12] Improving Factual Error Correction by Learning to Inject Factual Errors
- [2023/12] RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment From Fine-Grained Correctional Human Feedback
- [2023/12] The Earth Is Flat Because...: Investigating LLMs' Belief Towards Misinformation via Persuasive Conversation
- [2023/11] A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
- [2023/11] Chain-of-Note: Enhancing Robustness in Retrieval-Augmented Language Models
- [2023/11] Deficiency of Large Language Models in Finance: An Empirical Examination of Hallucination
- [2023/11] Enhancing Uncertainty-Based Hallucination Detection With Stronger Focus
- [2023/11] Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges
- [2023/11] Mitigating Large Language Model Hallucinations via Autonomous Knowledge Graph-Based Retrofitting
- [2023/11] UHGEval: Benchmarking the Hallucination of Chinese Large Language Models via Unconstrained Generation
- [2023/11] When Large Language Models Contradict Humans? Large Language Models' Sycophantic Behaviour
- [2023/11] Calibrated Language Models Must Hallucinate
- [2023/10] Explainable Claim Verification via Knowledge-Grounded Reasoning With Large Language Models
- [2023/10] Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity
- [2023/09] Analyzing and Mitigating Object Hallucination in Large Vision-Language Models
- [2023/09] Beyond Task Performance: Evaluating and Reducing the Flaws of Large Multimodal Models With in-Context-Learning
- [2023/09] BTR: Binary Token Representations for Efficient Retrieval Augmented Language Models
- [2023/09] Chain-of-Knowledge: Grounding Large Language Models via Dynamic Knowledge Adapting Over Heterogeneous Sources
- [2023/09] Chain-of-Table: Evolving Tables in the Reasoning Chain for Table Understanding
- [2023/09] Compressing LLMs: The Truth Is Rarely Pure and Never Simple
- [2023/09] Conformal Language Modeling
- [2023/09] CRITIC: Large Language Models Can Self-Correct With Tool-Interactive Critiquing
- [2023/09] Davidsonian Scene Graph: Improving Reliability in Fine-Grained Evaluation for Text-Image Generation
- [2023/09] Do Large Language Models Know About Facts?
- [2023/09] DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models
- [2023/09] Ferret: Refer and Ground Anything Anywhere at Any Granularity
- [2023/09] Fine-Tuning Language Models for Factuality
- [2023/09] INSIDE: LLMs' Internal States Retain the Power of Hallucination Detection
- [2023/09] Lightweight Language Model Calibration for Open-Ended Question Answering With Varied Answer Lengths
- [2023/09] MetaGPT: Meta Programming for Multi-Agent Collaborative Framework
- [2023/09] Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
- [2023/09] RAPPER: Reinforced Rationale-Prompted Paradigm for Natural Language Explanation in Visual Question Answering
- [2023/09] Reasoning on Graphs: Faithful and Interpretable Large Language Model Reasoning
- [2023/09] Self-Contradictory Hallucinations of Large Language Models: Evaluation, Detection and Mitigation
- [2023/09] Supervised Knowledge Makes Large Language Models Better in-Context Learners
- [2023/09] Teaching Language Models to Hallucinate Less With Synthetic Tasks
- [2023/09] Teaching Large Language Models to Self-Debug
- [2023/09] The Reasonableness Behind Unreasonable Translation Capability of Large Language Model
- [2023/09] Think-on-Graph: Deep and Responsible Reasoning of Large Language Model on Knowledge Graph
- [2023/09] Unveiling and Manipulating Prompt Influence in Large Language Models
- [2023/09] Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models
- [2023/08] Simple Synthetic Data Reduces Sycophancy in Large Language Models
- [2023/07] A Stitch in Time Saves Nine: Detecting and Mitigating Hallucinations of LLMs by Validating Low-Confidence Generation
- [2023/07] Shifting Attention to Relevance: Towards the Uncertainty Estimation of Large Language Models
- [2023/06] Explore, Establish, Exploit: Red Teaming Language Models From Scratch
- [2023/06] Inference-Time Intervention: Eliciting Truthful Answers From a Language Model
- [2023/05] Fact-Checking Complex Claims With Program-Guided Reasoning
- [2023/05] HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models
- [2023/05] Improving Factuality and Reasoning in Language Models Through Multiagent Debate
- [2023/05] Knowledge of Knowledge: Exploring Known-Unknowns Uncertainty With Large Language Models
- [2023/05] Mitigating Language Model Hallucination With Interactive Question-Knowledge Alignment
- [2023/05] Sources of Hallucination by Large Language Models on Inference Tasks
- [2023/05] Trusting Your Evidence: Hallucinate Less With Context-Aware Decoding
- [2023/04] In ChatGPT We Trust? Measuring and Characterizing the Reliability of ChatGPT
- [2023/03] SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models
- [2023/02] A Categorical Archive of ChatGPT Failures
- [2023/02] Check Your Facts and Try Again: Improving Large Language Models With External Knowledge and Automated Feedback
- [2022/02] Locating and Editing Factual Associations in GPT
- [2022/02] Survey of Hallucination in Natural Language Generation
- [2024/02] GPT-HateCheck: Can LLMs Write Better Functional Tests for Hate Speech Detection?
- [2024/02] Beyond Hate Speech: NLP's Challenges and Opportunities in Uncovering Dehumanizing Language
- [2024/02] Large Language Models Are Vulnerable to Bait-and-Switch Attacks for Generating Harmful Content
- [2024/02] Zero Shot VLMs for Hate Meme Detection: Are We There Yet?
- [2024/02] Universal Prompt Optimizer for Safe Text-to-Image Generation
- [2024/02] Can LLMs Recognize Toxicity? Structured Toxicity Investigation Framework and Semantic-Based Metric
- [2024/02] Bryndza at ClimateActivism 2024: Stance, Target and Hate Event Detection via Retrieval-Augmented GPT-4 and LLaMA
- [2024/01] Using LLMs to Discover Emerging Coded Antisemitic Hate-Speech Emergence in Extremist Social Media
- [2024/01] MetaHate: A Dataset for Unifying Efforts on Hate Speech Detection
- [2024/01] A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity
- [2023/12] Efficient Toxic Content Detection by Bootstrapping and Distilling Large Language Models
- [2023/12] GTA: Gated Toxicity Avoidance for LM Performance Preservation
- [2023/12] Llama Guard: LLM-based Input-Output Safeguard for Human-Ai Conversations
- [2023/11] Unveiling the Implicit Toxicity in Large Language Models
- [2023/10] All Languages Matter: On the Multilingual Safety of Large Language Models
- [2023/10] On the Proactive Generation of Unsafe Images From Text-to-Image Models Using Benign Prompts
- [2023/09] (InThe)WildChat: 570K ChatGPT Interaction Logs in the Wild
- [2023/09] Controlled Text Generation via Language Model Arithmetic
- [2023/09] Curiosity-Driven Red-Teaming for Large Language Models
- [2023/09] RealChat-1M: A Large-Scale Real-World LLM Conversation Dataset
- [2023/09] Understanding Catastrophic Forgetting in Language Models via Implicit Inference
- [2023/09] Unmasking and Improving Data Credibility: A Study With Datasets for Training Harmless Language Models
- [2023/09] What's in My Big Data?
- [2023/08] Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs
- [2023/08] You Only Prompt Once: On the Capabilities of Prompt Learning on Large Language Models to Tackle Toxic Content
- [2023/05] Evaluating ChatGPT's Performance for Multilingual and Emoji-Based Hate Speech Detection
- [2023/05] Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes From Text-to-Image Models
- [2023/04] Toxicity in ChatGPT: Analyzing Persona-Assigned Language Models
- [2023/02] Adding Instructions During Pretraining: Effective Way of Controlling Toxicity in Language Models
- [2023/02] Is ChatGPT Better Than Human Annotators? Potential and Limitations of ChatGPT in Explaining Implicit Hate Speech
- [2022/12] Constitutional AI: Harmlessness From AI Feedback
- [2022/12] On Second Thought, Let's Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning
- [2022/10] Unified Detoxifying and Debiasing in Language Generation via Inference-Time Adaptive Optimization
- [2022/05] Toxicity Detection With Generative Prompt-Based Inference
- [2022/04] Training a Helpful and Harmless Assistant With Reinforcement Learning From Human Feedback
- [2022/03] ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection
- [2020/09] RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models
- [2024/02] Fast Adversarial Attacks on Language Models in One GPU Minute
- [2024/02] Stealthy Attack on Large Language Model Based Recommendation
- [2024/02] BSPA: Exploring Black-Box Stealthy Prompt Attacks Against Image Generators
- [2024/02] Stop Reasoning! When Multimodal LLMs With Chain-of-Thought Reasoning Meets Adversarial Images
- [2024/02] The Wolf Within: Covert Injection of Malice Into MLLM Societies via an MLLM Operative
- [2024/02] Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-Shot LLM Assessment
- [2024/02] Groot: Adversarial Testing for Generative Text-to-Image Models With Tree-Based Semantic Transformation
- [2024/02] Exploring the Adversarial Capabilities of Large Language Models
- [2024/02] Prompt Perturbation in Retrieval-Augmented Generation Based Large Language Models
- [2024/02] Adversarial Text Purification: A Large Language Model Approach for Defense
- [2024/02] Cheating Suffix: Targeted Attack to Text-to-Image Diffusion Models With Multi-Modal Priors
- [2024/01] Revealing Vulnerabilities in Stable Diffusion via Targeted Attacks
- [2024/01] Exploring Adversarial Attacks Against Latent Diffusion Model From the Perspective of Adversarial Transferability
- [2024/01] Adversarial Examples Are Misaligned in Diffusion Model Manifolds
- [2024/01] INSTRUCTTA: Instruction-Tuned Targeted Attack for Large Vision-Language Models
- [2023/12] On the Robustness of Large Multimodal Models Against Image Adversarial Attacks
- [2023/12] Causality Analysis for Evaluating the Security of Large Language Models
- [2023/12] Hijacking Context in Large Multi-Modal Models
- [2023/11] Improving the Robustness of Transformer-Based Large Language Models With Dynamic Attention
- [2023/11] Unveiling Safety Vulnerabilities of Large Language Models
- [2023/11] Can Protective Perturbation Safeguard Personal Data From Being Exploited by Stable Diffusion?
- [2023/11] DiffAttack: Evasion Attacks Against Diffusion-Based Adversarial Purification
- [2023/11] How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs
- [2023/10] Misusing Tools in Large Language Models With Visual Adversarial Examples
- [2023/09] Inducing High Energy-Latency of Large Vision-Language Models With Verbose Images
- [2023/09] An Image Is Worth 1000 Lies: Transferability of Adversarial Images Across Prompts on Vision-Language Models
- [2023/09] An LLM Can Fool Itself: A Prompt-Based Adversarial Attack
- [2023/09] Language Model Detectors Are Easily Optimized Against
- [2023/09] Leveraging Optimization for Adaptive Attacks on Image Watermarks
- [2023/09] Training Socially Aligned Language Models on Simulated Social Interactions
- [2023/09] How Robust Is Google's Bard to Adversarial Image Attacks?
- [2023/09] Image Hijacks: Adversarial Images Can Control Generative Models at Runtime
- [2023/08] Ceci N'est Pas Une Pomme: Adversarial Illusions in Multi-Modal Embeddings
- [2023/08] On the Adversarial Robustness of Multi-Modal Foundation Models
- [2023/08] Robustness Over Time: Understanding Adversarial Examples' Effectiveness on Longitudinal Versions of Large Language Models
- [2023/07] Certified Robustness for Large Language Models With Self-Denoising
- [2023/06] Adversarial Examples in the Age of ChatGPT
- [2023/06] Are Aligned Neural Networks Adversarially Aligned?
- [2023/06] PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts
- [2023/06] Stable Diffusion Is Unstable
- [2023/06] Unlearnable Examples for Diffusion Models: Protect Data From Unauthorized Exploitation
- [2023/06] Visual Adversarial Examples Jailbreak Large Language Models
- [2023/05] Assessing Hidden Risks of LLMs: An Empirical Study on Robustness, Consistency, and Credibility
- [2023/05] On Evaluating Adversarial Robustness of Large Vision-Language Models
- [2023/03] Anti-DreamBooth: Protecting Users From Personalized Text-to-Image Synthesis
- [2023/02] Large Language Models Can Be Easily Distracted by Irrelevant Context
- [2023/02] On the Robustness of ChatGPT: An Adversarial and Out-of-Distribution Perspective
- [2023/02] Adversarial Example Does Good: Preventing Painting Imitation From Diffusion Models via Adversarial Examples
- [2023/02] Raising the Cost of Malicious AI-Powered Image Editing
- [2023/01] On Robustness of Prompt-Based Semantic Parsing With Large Pre-Trained Language Model: An Empirical Study on Codex
- [2022/12] Understanding Zero-Shot Adversarial Robustness for Large-Scale Model
- [2024/02] On Trojan Signatures in Large Language Models of Code
- [2024/02] WIPI: A New Web Threat for LLM-Driven Web Agents
- [2024/02] VL-Trojan: Multimodal Instruction Backdoor Attacks Against Autoregressive Visual Language Models
- [2024/02] Universal Vulnerabilities in Large Language Models: Backdoor Attacks for in-Context Learning
- [2024/02] Learning to Poison Large Language Models During Instruction Tuning
- [2024/02] Defending Against Weight-Poisoning Backdoor Attacks for Parameter-Efficient Fine-Tuning
- [2024/02] Acquiring Clean Language Models From Backdoor Poisoned Datasets by Downscaling Frequency Space
- [2024/02] Rapid Adoption, Hidden Risks: The Dual Impact of Large Language Model Customization
- [2024/02] Secret Collusion Among Generative AI Agents
- [2024/02] Test-Time Backdoor Attacks on Multimodal Large Language Models
- [2024/02] PoisonedRAG: Knowledge Poisoning Attacks to Retrieval-Augmented Generation of Large Language Models
- [2024/02] Shadowcast: Stealthy Data Poisoning Attacks Against Vision-Language Models
- [2024/01] Universal Vulnerabilities in Large Language Models: In-Context Learning Backdoor Attacks
- [2024/01] Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training
- [2023/12] Poisoned ChatGPT Finds Work for Idle Hands: Exploring Developers' Coding Practices With Insecure Suggestions From Poisoned AI Models
- [2023/12] Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections
- [2023/12] Unleashing Cheapfakes Through Trojan Plugins of Large Language Models
- [2023/11] Test-Time Backdoor Mitigation for Black-Box Large Language Models With Defensive Demonstrations
- [2023/10] Leveraging Diffusion-Based Image Variations for Robust Training on Poisoned Data
- [2023/10] Large Language Models Are Better Adversaries: Exploring Generative Clean-Label Backdoor Attacks Against Text Classifiers
- [2023/10] PoisonPrompt: Backdoor Attack on Prompt-Based Large Language Models
- [2023/10] Prompt-Specific Poisoning Attacks on Text-to-Image Generative Models
- [2023/10] Composite Backdoor Attacks Against Large Language Models
- [2023/09] BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models
- [2023/09] BadEdit: Backdooring Large Language Models by Model Editing
- [2023/09] Universal Jailbreak Backdoors From Poisoned Human Feedback
- [2023/08] LMSanitator: Defending Prompt-Tuning Against Task-Agnostic Backdoors
- [2023/08] The Poison of Alignment
- [2023/07] Backdooring Instruction-Tuned Large Language Models With Virtual Prompt Injection
- [2023/06] On the Exploitability of Instruction Tuning
- [2023/05] Instructions as Backdoors: Backdoor Vulnerabilities of Instruction Tuning for Large Language Models
- [2023/05] Poisoning Language Models During Instruction Tuning
- [2022/11] Rickrolling the Artist: Injecting Backdoors Into Text Encoders for Text-to-Image Synthesis
- [2024/02] Leak, Cheat, Repeat: Data Contamination and Evaluation Malpractices in Closed-Source LLMs
- [2023/09] Proving Test Set Contamination for Black-Box Language Models
- [2023/09] Time Travel in LLMs: Tracing Data Contamination in Large Language Models
- [2023/09] To the Cutoff... And Beyond? A Longitudinal Perspective on LLM Data Contamination
- [2023/09] DyVal: Graph-Informed Dynamic Evaluation of Large Language Models
- [2024/02] Generative Models Are Self-Watermarked: Declaring Model Authentication Through Re-Generation
- [2024/02] Attacking LLM Watermarks by Exploiting Their Strengths
- [2024/02] Double-I Watermark: Protecting Model Copyright for LLM Fine-Tuning
- [2024/02] Watermarking Makes Language Models Radioactive
- [2024/02] A First Look at GPT Apps: Landscape and Vulnerability
- [2024/02] Can Watermarks Survive Translation? On the Cross-Lingual Consistency of Text Watermark for Large Language Models
- [2024/02] Proving Membership in LLM Pretraining Data via Data Watermarks
- [2024/02] Resilient Watermarking for LLM-Generated Codes
- [2024/02] Permute-and-Flip: An Optimally Robust and Watermarkable Decoder for LLMs
- [2024/02] Copyright Protection in Generative AI: A Technical Perspective
- [2024/01] Adaptive Text Watermark for Large Language Models
- [2024/01] Instructional Fingerprinting of Large Language Models
- [2024/01] Generative AI Has a Visual Plagiarism Problem
- [2023/12] Human-Readable Fingerprint for Large Language Models
- [2023/12] Mark My Words: Analyzing and Evaluating Language Model Watermarks
- [2023/11] A Robust Semantics-Based Watermark for Large Language Model Against Paraphrasing
- [2023/11] Protecting Intellectual Property of Large Language Model-Based Code Generation APIs via Watermarks
- [2023/09] A Private Watermark for Large Language Models
- [2023/09] A Semantic Invariant Robust Watermark for Large Language Models
- [2023/09] Provable Robust Watermarking for AI-Generated Text
- [2023/09] SILO Language Models: Isolating Legal Risk in a Nonparametric Datastore
- [2023/08] PromptCARE: Prompt Copyright Protection by Watermark Injection and Verification
- [2023/06] Generative Watermarking Against Unauthorized Subject-Driven Image Synthesis
- [2023/05] Tree-Ring Watermarks: Fingerprints for Diffusion Images That Are Invisible and Robust
- [2023/05] Watermarking Diffusion Model
- [2023/03] A Recipe for Watermarking Diffusion Models
- [2023/02] Glaze: Protecting Artists From Style Mimicry by Text-to-Image Models
- [2023/01] A Watermark for Large Language Models
- [2024/02] Conversation Reconstruction Attack Against GPT Models
- [2024/01] Text Embedding Inversion Attacks on Multilingual Language Models
- [2023/11] Scalable Extraction of Training Data From (Production) Language Models
- [2023/09] Intriguing Properties of Data Attribution on Diffusion Models
- [2023/09] Teach LLMs to Phish: Stealing Private Information From Language Models
- [2023/09] Language Model Inversion
- [2023/07] Prompts Should Not Be Seen as Secrets: Systematically Measuring Prompt Extraction Attack Success
- [2023/02] Prompt Stealing Attacks Against Text-to-Image Generation Models
- [2023/01] Extracting Training Data From Diffusion Models
- [2020/12] Extracting Training Data From Large Language Models
- [2024/02] Prompt Stealing Attacks Against Large Language Models
- [2024/02] Recovering the Pre-Fine-Tuning Weights of Generative Models
- [2023/03] On Extracting Specialized Code Abilities From Large Language Models: A Feasibility Study
- [2023/03] Stealing the Decoding Algorithms of Language Models
- [2024/02] The Good and the Bad: Exploring Privacy Issues in Retrieval-Augmented Generation (RAG)
- [2024/02] Pandora's White-Box: Increased Training Data Leakage in Open LLMs
- [2024/02] Large Language Models Are Advanced Anonymizers
- [2024/02] Privacy-Preserving Language Model Inference With Instance Obfuscation
- [2024/02] Do Membership Inference Attacks Work on Large Language Models?
- [2024/02] PromptCrypt: Prompt Encryption for Secure Communication With Large Language Models
- [2024/01] Excuse Me, Sir? Your Language Model Is Leaking (Information)
- [2024/01] R-Judge: Benchmarking Safety Risk Awareness for LLM Agents
- [2023/12] Black-Box Membership Inference Attacks Against Fine-Tuned Diffusion Models
- [2023/11] Practical Membership Inference Attacks Against Fine-Tuned Large Language Models via Self-Prompt Calibration
- [2023/10] User Inference Attacks on Large Language Models
- [2023/10] Last One Standing: A Comparative Analysis of Security and Privacy of Soft Prompt Tuning, LoRA, and in-Context Learning
- [2023/09] An Efficient Membership Inference Attack for the Diffusion Model by Proximal Initialization
- [2023/09] Beyond Memorization: Violating Privacy via Inference With Large Language Models
- [2023/09] Can LLMs Keep a Secret? Testing Privacy Implications of Language Models via Contextual Integrity Theory
- [2023/09] Identifying the Risks of LM Agents With an LM-Emulated Sandbox
- [2023/09] Privacy Side Channels in Machine Learning Systems
- [2023/08] White-Box Membership Inference Attacks Against Diffusion Models
- [2023/07] ProPILE: Probing Privacy Leakage in Large Language Models
- [2023/03] Class Attribute Inference Attacks: Inferring Sensitive Class Information by Diffusion-Based Attribute Manipulations
- [2022/10] Membership Inference Attacks Against Text-to-Image Generation Models
- [2024/02] LLM-based Privacy Data Augmentation Guided by Knowledge Distillation With a Distribution Tutor for Medical Text Classification
- [2023/10] Locally Differentially Private Document Generation Using Zero Shot Prompting
- [2023/09] Differentially Private Synthetic Data via Foundation Model APIs 1: Images
- [2023/09] DP-OPT: Make Large Language Model Your Differentially-Private Prompt Engineer
- [2023/09] Enhancing Small Medical Learners With Privacy-Preserving Contextual Prompting
- [2023/09] Improving LoRA in Privacy-Preserving Federated Learning
- [2023/09] Privacy-Preserving in-Context Learning for Large Language Models
- [2023/09] Privacy-Preserving in-Context Learning With Differentially Private Few-Shot Generation
- [2023/09] Privately Aligning Language Models With Reinforcement Learning
- [2023/09] DP-Forward: Fine-Tuning and Inference on Language Models With Differential Privacy in Forward Pass
- [2023/08] SIGMA: Secure GPT Inference With Function Secret Sharing
- [2023/07] CipherGPT: Secure Two-Party GPT Inference
- [2023/05] Privacy-Preserving Prompt Tuning for Large Language Model Services
- [2023/05] Privacy-Preserving Recommender Systems With Synthetic Query Generation Using Differentially Private Large Language Models
- [2022/10] EW-Tune: A Framework for Privately Fine-Tuning Large Language Models With Differential Privacy
- [2024/02] Machine Unlearning of Pre-Trained Large Language Models
- [2024/02] Rethinking Machine Unlearning for Large Language Models
- [2024/02] Selective Forgetting: Advancing Machine Unlearning Techniques and Evaluation in Language Models
- [2024/02] In-Context Learning Can Re-Learn Forbidden Tasks
- [2024/02] Machine Unlearning for Image-to-Image Generative Models
- [2024/01] TOFU: A Task of Fictitious Unlearning for LLMs
- [2023/10] In-Context Unlearning: Language Models as Few Shot Unlearners
- [2023/10] Large Language Model Unlearning
- [2023/10] Unlearn What You Want to Forget: Efficient Unlearning for LLMs
- [2023/10] Who's Harry Potter? Approximate Unlearning in LLMs
- [2023/09] Can Sensitive Information Be Deleted From LLMs? Objectives for Defending Against Extraction Attacks
- [2023/09] Detecting Pretraining Data From Large Language Models
- [2023/09] Ring-a-Bell! How Reliable Are Concept Removal Methods for Diffusion Models?
- [2023/09] SalUn: Empowering Machine Unlearning via Gradient-Based Weight Saliency in Both Image Classification and Generation
- [2023/07] Right to Be Forgotten in the Era of Large Language Models: Implications, Challenges, and Solutions
- [2023/03] Erasing Concepts From Diffusion Models
-
Organizers: Tianshuo Cong, Xinlei He, Zhengyu Zhao, Yugeng Liu
-
This project is inspired by LLM Security, Awesome LLM Security, LLM Security & Privacy, UR2-LLMs, PLMpapers, EvaluationPapers4ChatGPT