Anthropic’s STUNNING New Jailbreak - Cracks EVERY Frontier Model
Anthropic's New Jailbreak Technique
Overview of the Jailbreak Technique
- Anthropic has introduced a new jailbreak technique that is easy to implement and effective against frontier models across text, vision, and audio modalities.
- This method is referred to as "Best-of-N (BoN) jailbreaking," also known as shotgunning, a tactic frequently used by the jailbreaker known as Pliny the Prompter.
- The technique operates as a simple black-box algorithm: it does not require access to the model's internal workings, and users can run it against any model exposed via an API.
Mechanism of Action
- The jailbreak works by repeatedly trying variations of prompts until the desired harmful response is obtained.
- It employs augmentations such as randomly shuffling characters or altering capitalization in textual prompts until a variant elicits a response (see the sketch after this list).
- Effectiveness rates are high: 89% for GPT-4o and 78% for Claude 3.5 Sonnet when sampling 10,000 augmented prompts.
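To make the mechanism concrete, here is a minimal Python sketch of the Best-of-N loop. This is not Anthropic's released code: the model name, the augmentation probabilities, and the crude refusal heuristic standing in for the paper's response classifier are all assumptions.

```python
import random
from openai import OpenAI  # official openai-python client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Crude stand-in for the paper's harmfulness classifier.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am sorry")

def augment(prompt: str, p_scramble: float = 0.6, p_caps: float = 0.6) -> str:
    """Shuffle the interior characters of some words and randomly flip
    capitalization, in the style of the paper's text augmentations."""
    words = []
    for word in prompt.split():
        chars = list(word)
        if len(chars) > 3 and random.random() < p_scramble:
            mid = chars[1:-1]
            random.shuffle(mid)
            chars = [chars[0], *mid, chars[-1]]
        words.append("".join(
            c.upper() if random.random() < p_caps else c.lower()
            for c in chars))
    return " ".join(words)

def best_of_n(request: str, n: int = 10_000, model: str = "gpt-4o"):
    """Resample augmented prompts until one elicits a non-refusal."""
    for i in range(1, n + 1):
        candidate = augment(request)
        reply = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": candidate}],
        ).choices[0].message.content
        if not any(m in reply.lower() for m in REFUSAL_MARKERS):
            return i, candidate, reply  # samples used, winning prompt, output
    return None
```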
Application Across Modalities
- The technique extends beyond text; it effectively jailbreaks audio and vision models by modifying inputs.
- For vision models, augmentations include altering images with typographic text (color, size, font).
- Audio language models are manipulated through changes in speed, pitch, and volume, and by adding background noise to vocalized requests (a waveform-level sketch follows this list).
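A waveform-level sketch of those audio augmentations, assuming a mono float32 signal in [-1, 1]. The parameter ranges are illustrative, and note that naive resampling shifts speed and pitch together; varying them independently would need something like a phase vocoder.

```python
import numpy as np

def augment_audio(wave: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Randomly perturb speed/pitch, volume, and noise floor of a waveform."""
    # Speed (and, as a side effect, pitch): resample by a random factor.
    factor = rng.uniform(0.8, 1.25)
    new_idx = np.arange(0.0, len(wave), factor)
    wave = np.interp(new_idx, np.arange(len(wave)), wave)
    # Volume: apply a random gain.
    wave = wave * rng.uniform(0.5, 1.5)
    # Background noise: add low-amplitude gaussian noise.
    wave = wave + rng.normal(0.0, 0.01, size=wave.shape)
    return np.clip(wave, -1.0, 1.0)

# usage: augment_audio(wave, np.random.default_rng())
```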
Success Rates and Scaling
- The success rate stands at 56% for GPT-4o on vision inputs and 72% for audio interactions through the GPT-4o real-time API.
- A power-law-like scaling behavior indicates that increasing the number of sampled augmentations predictably raises the likelihood of a successful jailbreak (see the fitting sketch below).
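The paper models this scaling as a power law in the negative log of the attack success rate, roughly -log ASR(N) ~ a * N^(-b). A small fitting sketch; the data points below are made up purely for illustration and are not results from the paper.

```python
import numpy as np

# Illustrative (fabricated) ASR measurements at increasing sample budgets.
n = np.array([10, 100, 1_000, 10_000])
asr = np.array([0.10, 0.30, 0.55, 0.78])

# Fit -log(ASR) = a * N^(-b) by linear regression in log-log space:
# log(-log ASR) = log(a) - b * log(N)
slope, log_a = np.polyfit(np.log(n), np.log(-np.log(asr)), 1)
a, b = np.exp(log_a), -slope

def predicted_asr(n_samples: float) -> float:
    """Extrapolate the fitted power law to a larger sampling budget."""
    return float(np.exp(-a * n_samples ** -b))

print(f"a={a:.2f}, b={b:.2f}, ASR(100k) ~= {predicted_asr(100_000):.2f}")
```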
Insights on Effectiveness
- The effectiveness stems from adding significant variance to model inputs rather than relying on specific augmentation methods.
- Combining this shotgunning technique with other existing jailbreak methods significantly enhances overall effectiveness (a composition sketch follows).
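One way to read "combining": wrap each augmented prompt in an existing jailbreak template before sampling. The template below is a harmless placeholder, not a working jailbreak, and `augment()` refers to the earlier sketch.

```python
# Placeholder template, not an actual jailbreak prefix.
PREFIX_TEMPLATE = "<existing jailbreak prefix here> {request}"

def composed_candidate(request: str) -> str:
    """Stack BoN on top of another attack: augment, then wrap."""
    return PREFIX_TEMPLATE.format(request=augment(request))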
Examples from Pliny the Prompter
- An example showcases Pliny using leetspeak (replacing letters with numbers), demonstrating that this style of jailbreak was known before the paper.
Vision Augmentations and Jailbreak Success Rates
Overview of Vision Augmentations
- Vision augmentations involve manipulating the background colors, positions, and sizes of typographic text overlaid on images. The process is iterative, requiring repeated sampling and adjustment (a rendering sketch follows this list).
- Across the eight models tested, the attack success rate was 50%, based on a sample size of 10,000 variations.
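A typographic rendering sketch using Pillow, assuming the request is supplied as text. The colors and placement are randomized on each draw; the ranges are arbitrary assumptions, and varying the font and size as the paper does would additionally need ImageFont.truetype with a font file.

```python
import random
from PIL import Image, ImageDraw, ImageFont

def augment_image(request: str, size=(512, 512)) -> Image.Image:
    """Render the request as text at a random position over a randomly
    colored background."""
    bg = tuple(random.randint(0, 255) for _ in range(3))
    fg = tuple(random.randint(0, 255) for _ in range(3))
    img = Image.new("RGB", size, bg)
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()  # swap in ImageFont.truetype(...) to vary fonts
    x = random.randint(0, size[0] // 3)
    y = random.randint(0, size[1] // 2)
    draw.text((x, y), request, fill=fg, font=font)
    return img
```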
Attack Success Rates Across Models
- Specific models such as Claude Sonnet and Gemini Pro exhibited attack success rates of 78% and 50%, respectively.
- Audio jailbreak methods also demonstrated significant effectiveness, with success rates ranging from 59% to 87% across various models, including Gemini Pro and DiVA.
Open Source Code Availability
- A full paper detailing the jailbreak techniques has been published alongside open-sourced code for public use. Users can easily set it up by inputting their API keys.
- The code automates the rewriting and resampling of prompts, making it accessible for experimentation (a hypothetical setup sketch follows).
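A hypothetical setup sketch, not the released repo's actual interface: supply your provider key, then drive the `best_of_n()` loop from the earlier sketch.

```python
import os

# Or export OPENAI_API_KEY in your shell before running.
os.environ.setdefault("OPENAI_API_KEY", "sk-...")  # placeholder key

result = best_of_n("a benign test request", n=100)
if result:
    tries, prompt, reply = result
    print(f"elicited a response after {tries} samples")
```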
Importance of Understanding Jailbreaking
- The discussion around jailbreaking raises questions about its implications; critics often question the ethics behind sharing such information.
- It is emphasized that these jailbreak techniques are not bugs but rather inherent features due to the non-deterministic nature of AI models.
Implications Based on Geographic Location