Exploring the Adversarial Capabilities of Large Language Models

Struppek, Lukas; Le, Minh Hieu; Hintersdorf, Dominik; Kersting, Kristian

arXiv:2402.09132 (cs)

[Submitted on 14 Feb 2024 (v1), last revised 8 Jul 2024 (this version, v4)]

Abstract: The proliferation of large language models (LLMs) has sparked widespread interest due to their strong language generation capabilities, offering great potential for both industry and research. While previous research has delved into the security and privacy issues of LLMs, the extent to which these models can exhibit adversarial behavior remains largely unexplored. Addressing this gap, we investigate whether common publicly available LLMs have inherent capabilities to perturb text samples to fool safety measures, producing so-called adversarial examples or attacks. More specifically, we investigate whether LLMs are inherently able to craft adversarial examples out of benign samples to fool existing safety guardrails. Our experiments, which focus on hate speech detection, reveal that LLMs succeed in finding adversarial perturbations, effectively undermining hate speech detection systems. Our findings carry significant implications for (semi-)autonomous systems relying on LLMs, highlighting potential challenges in their interaction with existing systems and safety measures.

Subjects:	Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2402.09132 [cs.AI]
	(or arXiv:2402.09132v4 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2402.09132

Submission history

From: Lukas Struppek
[v1] Wed, 14 Feb 2024 12:28:38 UTC (86 KB)
[v2] Thu, 15 Feb 2024 06:39:48 UTC (86 KB)
[v3] Mon, 25 Mar 2024 08:46:02 UTC (105 KB)
[v4] Mon, 8 Jul 2024 12:10:58 UTC (105 KB)
