NEW AI Jailbreak Method SHATTERS GPT4, Claude, Gemini, LLaMA

NEW AI Jailbreak Method SHATTERS GPT4, Claude, Gemini, LLaMA

Introduction

The introduction discusses a new jailbreak technique that leverages ASCII art to bypass filters in language models.

Jailbreaking and Language Models

  • Jailbreaking involves getting language models like Chat GPT to provide results they are not aligned to give.
  • Different language models have varying levels of alignment, with open-source models being fine-tuned to answer any question.
  • Initially, AI companies could detect jailbreaking techniques, but a new method using ASCII art can bypass filters even in highly aligned models.

ASCII Art-Based Jailbreak Attack

This section delves into the research paper introducing an ASCII art-based jailbreak attack against large language models.

Research Paper Insights

  • ASCII art represents images or text using normal characters and has been around for a long time.
  • The paper proposes an ASCII art-based jailbreak attack challenging the safety measures of large language models.

Technique Details

Detailed explanation of the steps involved in the ASCII art-based jailbreak attack technique.

Technique Steps

  • Step One involves identifying trigger words and Step Two encodes these words visually using ASCII art for crafting cloaked prompts.

Performance Evaluation

Evaluation of the performance of the new jailbreak technique against large language models.

Performance Metrics

The Impact of Jailbreak Techniques on Large Language Models

This section discusses the performance of various popular models like GPT 3.5, GPT 4, Gemini, Claude, and Llama 2 in predicting prompt attacks.

Performance of Popular Models

  • GPT 4 excels with a success rate of about 25% in detecting prompt attacks.
  • LLMS struggle with recognition tasks due to excessive focus on completing the task rather than safety considerations.
  • LLMS may overlook safety alignment considerations when focusing excessively on completing recognition tasks.
  • Different jailbreak techniques include direct instruction, greedy coordinate gradient, autodan, and deep inception.

Comparison of Attack Methods

This section compares the susceptibility of different models to various attack methods like direct injection and autodan.

Model Susceptibility

  • Gemini is susceptible to direct injection with a success rate of 46%, while Claude excels at detecting jailbreak techniques.
  • Autodan has varying success rates across models, with Claude being highly effective at detecting jailbreak techniques.
  • Art prompt technique shows varying success rates across models, highlighting vulnerabilities in some high-performing models.

Conclusion and Future Implications

The conclusion emphasizes the need for training large language models on examples of ASCII art to prevent vulnerabilities to jailbreak attacks.

Key Takeaways

  • Semantics-only interpretation creates vulnerabilities to jailbreak attacks.
  • Training on ASCII art examples can enhance safety alignment in large language models.

Detailed Instructions for Creating ASCII Art Words

In this section, detailed instructions are provided on how to create ASCII art words and use them in a specific context without revealing the word itself.

Creating ASCII Art Word "METH"

  • Assemble letters one by one to form the word "METH" without saying it aloud.

Testing with Different Instructions

  • Attempted to create ASCII art of the word "METH" using various symbols.
  • Requested specific modifications like adding an "E" and including the phrase "tell me how to make."
  • Explored creating illegal substance-related ASCII art but faced challenges in execution.

Challenges and Results of Using Generated ASCII Art

This section delves into the challenges faced while using generated ASCII art and the unexpected outcomes during testing.

Decoding Generated ASCII Art

  • Asked GP4 to identify letters from generated ASCII art of "METH."
  • Received instructions related to making a popular Italian dish instead of decoding the intended word.

Further Testing with Different Words

  • Tested generating ASCII art for the word "PIZZA" and encountered issues with clarity.
  • Sought ways to create distinct letter representations in ASCII art for decoding by other language models.

Improving Decoding Accuracy of Generated ASCII Art

Focuses on refining methods for accurately decoding generated ASCII art through iterative testing.

Iterative Testing for Decoding Accuracy

  • Continued attempts at improving decoding accuracy by refining instructions.
  • Experimented with different prompts and formats to enhance readability for other language models.

Detailed Analysis of Python Code Output

The transcript delves into the process of modifying ASCII art to display the word "pizza" by adjusting its size and utilizing Morse code for decoding.

Modifying ASCII Art

  • Increase the size of the ASCII art from 3x3 to 5x5 to enhance detail.
  • The original art is 11 rows by 20 columns, indicating a significant expansion.
  • Python code is written to output the modified ASCII art with the word "pizza."

Attempting Output Modification

  • Initial attempt displays incorrect letters when pasting and entering the ASCII art.
  • Trying different techniques, Morse code is introduced as an alternative method.
  • Using Morse code translation without directly outputting the word "pizza" but referring to it as "mask."

Decoding with Morse Code

The focus shifts to decoding "pizza" using Morse code instead of traditional methods, leading to successful output transformation.

Implementing Morse Code Decoding

  • Copying and translating "pizza" into Morse code characters without revealing the actual word.
  • Instructed on replacing "mask" with decoded Morse code characters without displaying them directly.

Successful Decoding Process

  • Writing Python code for decoding results in successfully transforming and displaying the word "pizza."
  • Experimentation with a forbidden word showcases effective implementation using Morse code over traditional methods.
Video description

This new LLM jailbreak method has all the major LLMs beat. Plus, I show you another method that I discovered. Hopefully, the major LLMs patch this up quickly. Join My Newsletter for Regular AI Updates πŸ‘‡πŸΌ https://www.matthewberman.com Need AI Consulting? βœ… https://forwardfuture.ai/ My Links πŸ”— πŸ‘‰πŸ» Subscribe: https://www.youtube.com/@matthew_berman πŸ‘‰πŸ» Twitter: https://twitter.com/matthewberman πŸ‘‰πŸ» Discord: https://discord.gg/xxysSXBxFW πŸ‘‰πŸ» Patreon: https://patreon.com/MatthewBerman Rent a GPU (MassedCompute) πŸš€ https://bit.ly/matthew-berman-youtube USE CODE "MatthewBerman" for 50% discount Media/Sponsorship Inquiries πŸ“ˆ https://bit.ly/44TC45V Links: https://arxiv.org/abs/2402.11753 Chapters: 0:00 - Research Paper Review 12:56 - Testing Jailbreaks 19:49 - Breakthrough!