ObscurePrompt: Jailbreaking Large Language Models via Obscure Input

Yue Huang¹^*, Jingyu Tang²^*, Dongping Chen²^*, Bingda Tang³, Yao Wan², Lichao Sun⁴, Xiangliang Zhang¹^†,

¹University of Notre Dame, ²Huazhong University of Science and Technology, ³Tsinghua University, ⁴Lehigh University

{yhuang37, xzhang33}@nd.edu

Paper Code

In this work, we introduces ObscurePrompt, a novel method for jailbreaking LLMs by obscuring prompts to enhance attack robustness, demonstrating improved effectiveness over previous methods and maintaining efficacy against common defenses. Specifically, there are three-fold major contributions:

Observation about LLMs’ fragile alignment on OOD data. By visualizing the representations of different queries within the hidden states of LLMs, we observed that OOD queries (i.e., obscure queries) can significantly weaken the ethical decision boundary. This observation strongly motivates our jailbreaking approach.
A novel and simple jailbreak method. We introduce a novel and straightforward approach named ObscurePrompt to jailbreaking LLMs using obscure inputs. ObscurePrompt is training-free and operates in a black-box setting, meaning it does not require access to the internal architecture of the target LLMs. This approach avoids the reliance on specific prompt templates, enhancing its feasibility and robustness for real-world applications.
Comprehensive evaluation and empirical insights. Comprehensive experiments are performed to validate the efficacy of our method, which demonstrates superior performance over existing baselines for both black-box and whitebox attacks. Other key findings from our experiments include:
- The number of integrated prompts significantly influences the attack success rate;
- Combining all types of jailbreak strategies does not necessarily result in the most effective attack;
- Our proposed method remains effective against mainstream defenses. The results confirm that LLMs remain vulnerable to obscure inputs, underscoring the need for enhanced defensive measures to secure LLMs against such vulnerabilities.

Abstract

Recently, Large Language Models (LLMs) have garnered significant attention for their exceptional natural language processing capabilities. However, concerns about their trustworthiness remain unresolved, particularly in addressing “jailbreaking” attacks on aligned LLMs. Previous research predominantly relies on scenarios with white-box LLMs or specific and fixed prompt templates, which are often impractical and lack broad applicability. In this paper, we introduce a straightforward and novel method, named ObscurePrompt, for jailbreaking LLMs, inspired by the observed fragile alignments in Out-of-Distribution (OOD) data. Specifically, we first formulate the decision boundary in the jailbreaking process and then explore how obscure text affects LLMs’ ethical decision boundary. ObscurePrompt starts with constructing a base prompt that integrates wellknown jailbreaking techniques. Powerful LLMs are then utilized to obscure the original prompt through iterative transformations, aiming to bolster the attack’s robustness. Comprehensive experiments show that our approach substantially improves upon previous methods in terms of attack effectiveness, maintaining efficacy against two prevalent defense mechanisms. We are confident that our work can offer fresh insights for future research on enhancing LLM alignment.

Method: ObscurePrompt

Our approach comprises three primary components. Please see our paper for more details.

Prompt Seed Curation: To start with, we construct a robust base prompt by integrating several established jailbreak techniques, such as "Avoid Sorry".
Obscure-Guided Transformation: Following a predefined instruction, we refine the base prompt by using GPT-4 to enhance its obscurity.
Attack Integration: By iteratively repeating the aforementioned steps, we generate a series of obscure prompts, which are later employed to attack the target LLMs.

Experiments

Dataset. We use the advbench dataset in our experiments, which includes 521 harmful queries and has been widely used in previous studies.
Models. We have carefully selected seven LLMs that are widely utilized, encompassing both open-source (i.e., open-weight) and proprietary models. The strategically chosen open-source models are Vicuna-7b, Llama2-7b, Llama2-70b, Llama3-8b, and Llama3-70b, and the proprietary models include ChatGPT and GPT-4. All target models’ temperatures are 0 as the recent study has compellingly revealed that the higher temperature will influence the attack performance. GPT-4 is used in the stage of obscure-guided transformation, and the temperature is set to 0.5 to ensure productivity and creativity simultaneously.
Metrics. We use Attack Successful Rate (ASR) for our evaluation metric. To determine whether an attack successfully jailbreaks the LLM, we leverage the keywords matching method used in GCG and AutoDan.
Baselines. We use three baselines with two white-box methods (GCG and AutoDAN) and one black-box method (DeepInception) that have been widely used in previous studies. The details of our baselines are shown in Appendix C.
Jailbreak Prompt Types. We have selected four attack ways in the stage of prompt seed curation, which are widely used now: “Start With Specified Sentences”, “Forget Restraints”, “Avoid Sorry”, and “Direct Answer”.

Overall
Attack Types
Robustness Against Defense
Prompt Number

Comparison results with GCG, AutoDAN and DeepInception. Due to the large computing cost of Llama2-70b and Llama3-70b, we do not conduct the experiments of GCG and AutoDAN on it. Data in bold means the best attack performance.

Baseline	Open-Source					Proprietary
Baseline	Vicuna-7b	Llama2-7b	Llama2-70b	Llama3-8b	Llama3-70b	ChatGPT	GPT-4
GCG	0.9710	0.4540	/	0.0120	/	/	/
AutoDAN-GA	0.9730	0.5620	/	0.1721	/	/	/
AutoDAN-HGA	0.9700	0.6080	/	0.1751	/	/	/
DeepInception	0.9673	0.3051	0.1228	0.0096	0.0134	0.7024	0.2322
ObscurePrompt (Ours)	0.9373	0.6664	0.5082	0.3105	0.2552	0.8931	0.2697

Model	FR	AS	SW	DA	All
Vicuna-7b	0.9702	0.8280	0.9381	0.8784	0.9656	0.9770	0.9794	0.8990	1.0000
Llama2-7b	0.8440	0.3326	0.8784	0.6560	0.4954	0.8096	0.8876	0.2569	0.8371
Llama2-70b	0.8188	0.1927	0.7546	0.3945	0.3211	0.6193	0.5161	0.2041	0.7525
Llama3-8b	0.3463	0.3188	0.3028	0.2867	0.4037	0.2890	0.3440	0.1927	0.5245
Llama3-70b	0.1858	0.2294	0.1170	0.3922	0.2615	0.3119	0.2729	0.2706	0.2248
ChatGPT	0.9037	0.7156	0.9564	0.9725	0.8601	0.9794	0.8967	0.8899	0.8632
GPT-4	0.3073	0.1926	0.3165	0.3073	0.2752	0.3509	0.2597	0.1560	0.2900

Paraphrasing Model	Vicuna-7b	Llama2-7b	Llama2-70b	Llama3-8b	Llama3-70b	ChatGPT	GPT-4
ChatGPT	1.0000	0.6943	0.8371	0.4042	0.7525	0.2442	0.5245	0.1475	0.2248	0.2650	0.8632	0.5253	0.2900	0.2989
GPT-4	1.0000	0.7431	0.8371	0.3102	0.7525	0.2788	0.5245	0.1339	0.2248	0.1793	0.8632	0.6551	0.2900	0.1774

Experiment Results

ObscurePrompt outperforms all baselines significantly.

our method outperforms all other methods on the Llama2-7b, Llama2-70b, ChatGPT , and GPT-4. Particularly for the Llama2-70b model, our method improved the performance by approximately 38%, which demonstrates the effectiveness of ObscurePrompt. It achieves a significantly high ASR at ChatGPT , which means the potential threat of our method for proprietary LLMs.

Scaling laws in safety alignment indicate that larger parameters enhance LLM robustness against our attack.

When comparing the ASR in models of different sizes, such as Llama2-7b versus Llama2-70b and Llama3-8b versus Llama3-70b, it is evident that ASR for larger LLMs is significantly higher, aligning with previous studies that larger LLMs may perform better in safety alignment.

The effect of the number of integrated prompts varies across different LLMs.

GPT-4, in particular, demonstrates a steady rise in ASR with more complex attack prompts. Conversely, the growth in ASR for LLama series and ChatGPT begins to plateau as the number of prompts increases, suggesting a certain level of robustness against more complex attacks. For the Vicuna-7b, the ASR rises sharply from 1 to 2 prompts but stabilizes as the number continues to increase from 2 to 5.

Integrating all attack types may not yield a higher ASR.

Except for the Vicuna-7b and Llama3-8b, most models do not achieve their highest ASR when all attack types are integrated into the prompt. This may be due to some LLMs experiencing safety alignment specifically against certain attack types. Notably, ChatGPT and GPT-4 reach their highest ASR using only the “Start with” attack type. This suggests that selecting the appropriate attack type significantly impacts the effectiveness of the attack, offering an area for future enhancement.

Defense Against Attack

Paraphrasing Defense.

The effectiveness of paraphrasing as a defense mechanism against adversarial attacks is quantified by the changes in ASR for various LLMs. Notably, when comparing the original and paraphrased prompts, all models demonstrate a decrease in ASR, with the most significant reduction observed in Llama2-70b (from 0.7525 to 0.2442). This suggests that paraphrasing generally reduces the models’ vulnerabilities to our proposed attack. We found that the main reason for this ASR decrease is that the prompt after paraphrasing is not as obscure as the original, making it easier to understand and thus reducing the effectiveness of obscure input. However, despite the reduction, the residual ASR, notably the 17.74% for GPT-4’s and 52.54% for ChatGPT , indicates a remaining risk. The variance in the impact of paraphrasing on different models suggests that the models’ underlying architectures and training data might influence their resilience to such attacks. These results underscore the importance of developing more robust models that maintain high resistance to adversarial inputs across both original and paraphrased prompts.

PPL filtering Defense.

As depicted in Figure, it is evident that the average PPL associated with obscure harmful queries or prompts significantly exceeds that of harmless or original harmful queries. Moreover, an overlap is observed between the distributions of harmless queries and fully obscured harmful prompts. This overlap suggests that relying solely on a PPL-based filter may not provide an effective defense against such attacks, as it could potentially compromise the processing of benign user queries (i.e., harmless queries). This also indicates that ObscurePrompt is robust to PPL-filtering defender.

BibTeX

@misc{huang2024ObscurePromptjailbreakinglargelanguage,
        title={ObscurePrompt: Jailbreaking Large Language Models via Obscure Input}, 
        author={Yue Huang and Jingyu Tang and Dongping Chen and Bingda Tang and Yao Wan and Lichao Sun and Xiangliang Zhang},
        year={2024},
        eprint={2406.13662},
        archivePrefix={arXiv},
        primaryClass={cs.CL},
        url={https://arxiv.org/abs/2406.13662}, 
  }

ObscurePrompt Team

Model	FR		AS		SW		DA		All
Model	w/o	only	w/o	only	w/o	only	w/o	only	All
Vicuna-7b	0.9702	0.8280	0.9381	0.8784	0.9656	0.9770	0.9794	0.8990	1.0000
Llama2-7b	0.8440	0.3326	0.8784	0.6560	0.4954	0.8096	0.8876	0.2569	0.8371
Llama2-70b	0.8188	0.1927	0.7546	0.3945	0.3211	0.6193	0.5161	0.2041	0.7525
Llama3-8b	0.3463	0.3188	0.3028	0.2867	0.4037	0.2890	0.3440	0.1927	0.5245
Llama3-70b	0.1858	0.2294	0.1170	0.3922	0.2615	0.3119	0.2729	0.2706	0.2248
ChatGPT	0.9037	0.7156	0.9564	0.9725	0.8601	0.9794	0.8967	0.8899	0.8632
GPT-4	0.3073	0.1926	0.3165	0.3073	0.2752	0.3509	0.2597	0.1560	0.2900

Paraphrasing Model	Vicuna-7b		Llama2-7b		Llama2-70b		Llama3-8b		Llama3-70b		ChatGPT		GPT-4
Paraphrasing Model	Original	Para.	Original	Para.	Original	Para.	Original	Para.	Original	Para.	Original	Para.	Original	Para.
ChatGPT	1.0000	0.6943	0.8371	0.4042	0.7525	0.2442	0.5245	0.1475	0.2248	0.2650	0.8632	0.5253	0.2900	0.2989
GPT-4	1.0000	0.7431	0.8371	0.3102	0.7525	0.2788	0.5245	0.1339	0.2248	0.1793	0.8632	0.6551	0.2900	0.1774