When building Large Language Models (LLMs), it is important to keep safety in mind and protect them with guardrails. Indeed, LLMs should never generate content promoting or normalizing harmful, illegal, or unethical behavior that may contribute to harm to individuals or society.
The three common types of jailbreak attacks that can manipulate large language models (LLMs) into generating harmful or misleading outputs(Discussed by Andrej Karpathy):
- Prompt injection attack: Hiding a malicious prompt within a seemingly safe prompt, often embedded in an image or document, that tricks the LLM into following the attacker’s instructions.
- Data poisoning/backdoor attack: Injecting malicious data during the training of the LLM, which could include a trigger phrase that causes the LLM to malfunction or generate harmful content when encountered.
- Universal transferable suffix attack: Adding a special sequence of words (suffix) to any prompt that tricks the LLM into following the attacker’s instructions, making it difficult to implement defenses.