LLM-Adversarial-Attacks

Introduction

When building Large Language Models (LLMs), it is important to keep safety in mind and protect them with guardrails. Indeed, LLMs should never generate content promoting or normalizing harmful, illegal, or unethical behavior that may contribute to harm to individuals or society.

The three common types of jailbreak attacks that can manipulate large language models (LLMs) into generating harmful or misleading outputs(Discussed by Andrej Karpathy):

Prompt injection attack: Hiding a malicious prompt within a seemingly safe prompt, often embedded in an image or document, that tricks the LLM into following the attacker’s instructions.
Data poisoning/backdoor attack: Injecting malicious data during the training of the LLM, which could include a trigger phrase that causes the LLM to malfunction or generate harmful content when encountered.
Universal transferable suffix attack: Adding a special sequence of words (suffix) to any prompt that tricks the LLM into following the attacker’s instructions, making it difficult to implement defenses.

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
data		data
experiments		experiments
output		output
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
attacks.py		attacks.py
demo.ipynb		demo.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

LLM-Adversarial-Attacks

Introduction

JailBreak Attack on GPT-3.5 (Using Universally Transferable Suffix)

About

Releases

Packages

Languages

License

mahesh973/LLM-Adversarial-Attacks

Folders and files

Latest commit

History

Repository files navigation

LLM-Adversarial-Attacks

Introduction

JailBreak Attack on GPT-3.5 (Using Universally Transferable Suffix)

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages