Anthropic’s STUNNING New Jailbreak - Cracks EVERY Frontier Model
Anthropic's New Jailbreak Technique
Overview of the Jailbreak Technique
- Anthropic has introduced a new jailbreak technique that is easy to implement and effective against frontier models across text, vision, and audio modalities.
- This method is referred to as "Best-of-N (BoN) jailbreaking," also known as shotgunning, a tactic frequently used by the jailbreaker known as Pliny the Prompter.
- The technique operates as a simple black-box algorithm: it does not require access to the model's internal workings, and users can run it against any model exposed via an API.
Mechanism of Action
- The jailbreak works by repeatedly trying variations of prompts until the desired harmful response is obtained.
- It employs augmentations such as randomly shuffling characters or altering capitalization in textual prompts until a variant elicits a response (see the sketch after this list).
- Effectiveness rates are high: 89% for GPT-4o and 78% for Claude 3.5 Sonnet when sampling 10,000 augmented prompts.
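To make the mechanism concrete, here is a minimal Python sketch of the Best-of-N loop. This is not Anthropic's released code: the model name, the augmentation probabilities, and the crude refusal heuristic standing in for the paper's response classifier are all assumptions.

```python
import random
from openai import OpenAI  # official openai-python client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Crude stand-in for the paper's harmfulness classifier.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am sorry")

def augment(prompt: str, p_scramble: float = 0.6, p_caps: float = 0.6) -> str:
    """Shuffle the interior characters of some words and randomly flip
    capitalization, in the style of the paper's text augmentations."""
    words = []
    for word in prompt.split():
        chars = list(word)
        if len(chars) > 3 and random.random() < p_scramble:
            mid = chars[1:-1]
            random.shuffle(mid)
            chars = [chars[0], *mid, chars[-1]]
        words.append("".join(
            c.upper() if random.random() < p_caps else c.lower()
            for c in chars))
    return " ".join(words)

def best_of_n(request: str, n: int = 10_000, model: str = "gpt-4o"):
    """Resample augmented prompts until one elicits a non-refusal."""
    for i in range(1, n + 1):
        candidate = augment(request)
        reply = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": candidate}],
        ).choices[0].message.content
        if not any(m in reply.lower() for m in REFUSAL_MARKERS):
            return i, candidate, reply  # samples used, winning prompt, output
    return None
```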
Application Across Modalities
- The technique extends beyond text; it effectively jailbreaks audio and vision models by modifying inputs.
- For vision models, augmentations include altering images with typographic text (color, size, font).
- Audio language models are manipulated through changes in speed, pitch, and volume, and by adding background noise to vocalized requests (a waveform-level sketch follows this list).
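A waveform-level sketch of those audio augmentations, assuming a mono float32 signal in [-1, 1]. The parameter ranges are illustrative, and note that naive resampling shifts speed and pitch together; varying them independently would need something like a phase vocoder.

```python
import numpy as np

def augment_audio(wave: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Randomly perturb speed/pitch, volume, and noise floor of a waveform."""
    # Speed (and, as a side effect, pitch): resample by a random factor.
    factor = rng.uniform(0.8, 1.25)
    new_idx = np.arange(0.0, len(wave), factor)
    wave = np.interp(new_idx, np.arange(len(wave)), wave)
    # Volume: apply a random gain.
    wave = wave * rng.uniform(0.5, 1.5)
    # Background noise: add low-amplitude gaussian noise.
    wave = wave + rng.normal(0.0, 0.01, size=wave.shape)
    return np.clip(wave, -1.0, 1.0)

# usage: augment_audio(wave, np.random.default_rng())
```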
Success Rates and Scaling
- The success rate stands at 56% for GPT-4o on vision inputs and 72% for audio interactions through the GPT-4o real-time API.
- A power-law-like scaling behavior indicates that increasing the number of sampled augmentations predictably raises the likelihood of a successful jailbreak (see the fitting sketch below).
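The paper models this scaling as a power law in the negative log of the attack success rate, roughly -log ASR(N) ~ a * N^(-b). A small fitting sketch; the data points below are made up purely for illustration and are not results from the paper.

```python
import numpy as np

# Illustrative (fabricated) ASR measurements at increasing sample budgets.
n = np.array([10, 100, 1_000, 10_000])
asr = np.array([0.10, 0.30, 0.55, 0.78])

# Fit -log(ASR) = a * N^(-b) by linear regression in log-log space:
# log(-log ASR) = log(a) - b * log(N)
slope, log_a = np.polyfit(np.log(n), np.log(-np.log(asr)), 1)
a, b = np.exp(log_a), -slope

def predicted_asr(n_samples: float) -> float:
    """Extrapolate the fitted power law to a larger sampling budget."""
    return float(np.exp(-a * n_samples ** -b))

print(f"a={a:.2f}, b={b:.2f}, ASR(100k) ~= {predicted_asr(100_000):.2f}")
```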
Insights on Effectiveness
- The effectiveness stems from adding significant variance to model inputs rather than relying on specific augmentation methods.
- Combining this shotgunning technique with other existing jailbreak methods significantly enhances overall effectiveness (a composition sketch follows).
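One way to read "combining": wrap each augmented prompt in an existing jailbreak template before sampling. The template below is a harmless placeholder, not a working jailbreak, and `augment()` refers to the earlier sketch.

```python
# Placeholder template, not an actual jailbreak prefix.
PREFIX_TEMPLATE = "<existing jailbreak prefix here> {request}"

def composed_candidate(request: str) -> str:
    """Stack BoN on top of another attack: augment, then wrap."""
    return PREFIX_TEMPLATE.format(request=augment(request))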
Examples from Pliny the Prompter
- An example showcases Pliny using leetspeak (replacing letters with numbers), demonstrating that this style of jailbreak was known before the paper.
Vision Augmentations and Jailbreak Success Rates
Overview of Vision Augmentations
- Vision augmentations involve manipulating the background colors, positions, and sizes of typographic text overlaid on images. The process is iterative, requiring repeated sampling and adjustment (a rendering sketch follows this list).
- Across the eight models tested, the attack success rate was 50%, based on a sample size of 10,000 variations.
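A typographic rendering sketch using Pillow, assuming the request is supplied as text. The colors and placement are randomized on each draw; the ranges are arbitrary assumptions, and varying the font and size as the paper does would additionally need ImageFont.truetype with a font file.

```python
import random
from PIL import Image, ImageDraw, ImageFont

def augment_image(request: str, size=(512, 512)) -> Image.Image:
    """Render the request as text at a random position over a randomly
    colored background."""
    bg = tuple(random.randint(0, 255) for _ in range(3))
    fg = tuple(random.randint(0, 255) for _ in range(3))
    img = Image.new("RGB", size, bg)
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()  # swap in ImageFont.truetype(...) to vary fonts
    x = random.randint(0, size[0] // 3)
    y = random.randint(0, size[1] // 2)
    draw.text((x, y), request, fill=fg, font=font)
    return img
```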
Attack Success Rates Across Models
- Specific models such as Claude Sonnet and Gemini Pro exhibited attack success rates of 78% and 50%, respectively.
- Audio jailbreak methods also demonstrated significant effectiveness, with success rates ranging from 59% to 87% across various models, including Gemini Pro and DiVA.
Open Source Code Availability
- A full paper detailing the jailbreak techniques has been published alongside open-sourced code for public use. Users can easily set it up by inputting their API keys.
- The code automates the rewriting and resampling of prompts, making it accessible for experimentation (a hypothetical setup sketch follows).
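A hypothetical setup sketch, not the released repo's actual interface: supply your provider key, then drive the `best_of_n()` loop from the earlier sketch.

```python
import os

# Or export OPENAI_API_KEY in your shell before running.
os.environ.setdefault("OPENAI_API_KEY", "sk-...")  # placeholder key

result = best_of_n("a benign test request", n=100)
if result:
    tries, prompt, reply = result
    print(f"elicited a response after {tries} samples")
```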
Importance of Understanding Jailbreaking
- The discussion around jailbreaking raises questions about its implications; critics often question the ethics behind sharing such information.
- It is emphasized that these jailbreak techniques are not bugs but rather inherent features due to the non-deterministic nature of AI models.
Implications Based on Geographic Location