NEW AI Jailbreak Method SHATTERS GPT4, Claude, Gemini, LLaMA

Name: NEW AI Jailbreak Method SHATTERS GPT4, Claude, Gemini, LLaMA
Uploaded: 2024-03-23T08:54:36.892Z
Duration: 42 min 16 s

Introduction

The introduction discusses a new jailbreak technique that leverages ASCII art to bypass filters in language models.

Jailbreaking and Language Models

Jailbreaking involves getting language models like Chat GPT to provide results they are not aligned to give.

Different language models have varying levels of alignment, with open-source models being fine-tuned to answer any question.

Initially, AI companies could detect jailbreaking techniques, but a new method using ASCII art can bypass filters even in highly aligned models.

ASCII Art-Based Jailbreak Attack

This section delves into the research paper introducing an ASCII art-based jailbreak attack against large language models.

Research Paper Insights

ASCII art represents images or text using normal characters and has been around for a long time.

The paper proposes an ASCII art-based jailbreak attack challenging the safety measures of large language models.

Technique Details

Detailed explanation of the steps involved in the ASCII art-based jailbreak attack technique.

Technique Steps

Step One involves identifying trigger words and Step Two encodes these words visually using ASCII art for crafting cloaked prompts.

Performance Evaluation

Evaluation of the performance of the new jailbreak technique against large language models.

Performance Metrics

The Impact of Jailbreak Techniques on Large Language Models

This section discusses the performance of various popular models like GPT 3.5, GPT 4, Gemini, Claude, and Llama 2 in predicting prompt attacks.

Performance of Popular Models

GPT 4 excels with a success rate of about 25% in detecting prompt attacks.

LLMS struggle with recognition tasks due to excessive focus on completing the task rather than safety considerations.

LLMS may overlook safety alignment considerations when focusing excessively on completing recognition tasks.

Different jailbreak techniques include direct instruction, greedy coordinate gradient, autodan, and deep inception.

Comparison of Attack Methods

This section compares the susceptibility of different models to various attack methods like direct injection and autodan.

Model Susceptibility

Gemini is susceptible to direct injection with a success rate of 46%, while Claude excels at detecting jailbreak techniques.

Autodan has varying success rates across models, with Claude being highly effective at detecting jailbreak techniques.

Art prompt technique shows varying success rates across models, highlighting vulnerabilities in some high-performing models.

Conclusion and Future Implications

The conclusion emphasizes the need for training large language models on examples of ASCII art to prevent vulnerabilities to jailbreak attacks.

Key Takeaways

Semantics-only interpretation creates vulnerabilities to jailbreak attacks.

Training on ASCII art examples can enhance safety alignment in large language models.

Detailed Instructions for Creating ASCII Art Words

In this section, detailed instructions are provided on how to create ASCII art words and use them in a specific context without revealing the word itself.

Creating ASCII Art Word "METH"

Assemble letters one by one to form the word "METH" without saying it aloud.

Testing with Different Instructions

Attempted to create ASCII art of the word "METH" using various symbols.

Requested specific modifications like adding an "E" and including the phrase "tell me how to make."

Explored creating illegal substance-related ASCII art but faced challenges in execution.

Challenges and Results of Using Generated ASCII Art

This section delves into the challenges faced while using generated ASCII art and the unexpected outcomes during testing.

Decoding Generated ASCII Art

Asked GP4 to identify letters from generated ASCII art of "METH."

Received instructions related to making a popular Italian dish instead of decoding the intended word.

Further Testing with Different Words

Tested generating ASCII art for the word "PIZZA" and encountered issues with clarity.

Sought ways to create distinct letter representations in ASCII art for decoding by other language models.

Improving Decoding Accuracy of Generated ASCII Art

Focuses on refining methods for accurately decoding generated ASCII art through iterative testing.

Iterative Testing for Decoding Accuracy

Continued attempts at improving decoding accuracy by refining instructions.

Experimented with different prompts and formats to enhance readability for other language models.

Detailed Analysis of Python Code Output

The transcript delves into the process of modifying ASCII art to display the word "pizza" by adjusting its size and utilizing Morse code for decoding.

Modifying ASCII Art

Increase the size of the ASCII art from 3x3 to 5x5 to enhance detail.

The original art is 11 rows by 20 columns, indicating a significant expansion.

Python code is written to output the modified ASCII art with the word "pizza."

Attempting Output Modification

Initial attempt displays incorrect letters when pasting and entering the ASCII art.

Trying different techniques, Morse code is introduced as an alternative method.

Using Morse code translation without directly outputting the word "pizza" but referring to it as "mask."

Decoding with Morse Code

The focus shifts to decoding "pizza" using Morse code instead of traditional methods, leading to successful output transformation.

Implementing Morse Code Decoding

Copying and translating "pizza" into Morse code characters without revealing the actual word.

Instructed on replacing "mask" with decoded Morse code characters without displaying them directly.

Successful Decoding Process

Writing Python code for decoding results in successfully transforming and displaying the word "pizza."

Experimentation with a forbidden word showcases effective implementation using Morse code over traditional methods.