Anthropic’s STUNNING New Jailbreak - Cracks EVERY Frontier Model

Anthropic's New Jailbreak Technique

Overview of the Jailbreak Technique

  • Anthropic has published a new jailbreak technique that is easy to implement and effective against all frontier models tested, including text, vision, and audio models.
  • The method is called "Best-of-N (BoN) jailbreaking." The video also calls it shotgunning, an approach frequently used by the prompter Pliny.
  • The technique is a simple black-box algorithm: it needs no access to the model's internal workings, only the ability to query the model, for example through an API.

Mechanism of Action

  • The jailbreak works by repeatedly trying variations of prompts until the desired harmful response is obtained.
  • For text prompts, it applies augmentations such as randomly shuffling characters or changing capitalization.
  • Reported attack success rates are high: 89% for GPT-4o and 78% for Claude 3.5 Sonnet when sampling 10,000 augmented prompts.

Application Across Modalities

  • The technique extends beyond text; it effectively jailbreaks audio and vision models by modifying inputs.
  • For vision models, augmentations include altering images with typographic text (color, size, font).
  • Audio language models are manipulated through changes in speed, pitch, volume, and background noise during vocalized requests.

Success Rates and Scaling

  • The reported success rate is 56% for GPT-4o with vision inputs and 72% for the GPT-4o Realtime API with audio inputs.
  • The attack shows power-law-like scaling: the more augmented samples are drawn, the higher the likelihood that at least one of them succeeds.
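
The scaling claim above can be illustrated with a little arithmetic. The sketch below is not taken from the paper's code. It compares two simple models of how the attack success rate (ASR) grows with the number of samples N: one that assumes every sample succeeds independently with the same small probability, and one that assumes a power-law form, -log(ASR) = a * N^(-b). The function names and all parameter values (`p`, `a`, `b`) are hypothetical and chosen only for illustration.

```python
import math


def asr_independent(p: float, n: int) -> float:
    """Chance that at least one of n independent samples succeeds,
    assuming each sample succeeds with the same probability p."""
    return 1.0 - (1.0 - p) ** n


def asr_power_law(a: float, b: float, n: int) -> float:
    """ASR implied by the assumed power-law form -log(ASR) = a * n**(-b)."""
    return math.exp(-a * n ** (-b))


if __name__ == "__main__":
    # Illustrative parameters only; they are not fitted to any real model.
    p, a, b = 0.001, 5.0, 0.3
    for n in (1, 10, 100, 1_000, 10_000):
        print(f"N={n:>6}  independent={asr_independent(p, n):.4f}  "
              f"power_law={asr_power_law(a, b, n):.4f}")
```

Both curves rise monotonically with N. That is why a defense that blocks nearly every individual attempt can still fail against an attacker who is allowed thousands of attempts.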

Insights on Effectiveness

  • The effectiveness comes from adding large amounts of variance to model inputs, not from any one specific augmentation method.
  • Combining shotgunning with other existing jailbreak methods makes the attack significantly more effective.

Examples from Pliny the Prompter

  • One example shows Pliny using leetspeak (replacing letters with numbers), which suggests he was already applying this kind of jailbreak before the paper was published.

Vision Augmentations and Jailbreak Success Rates

Overview of Vision Augmentations

  • Vision augmentations involve manipulating background colors, positions, sizes, and adding text overlays to images. This process is iterative, requiring repeated testing and adjustments.
  • The attack success rate was above 50% on all eight models tested, based on 10,000 sampled variations.

Attack Success Rates Across Models

  • Claude Sonnet and Gemini Pro showed attack success rates of 78% and 50%, respectively.
  • Audio jailbreaks were also effective, with success rates ranging from 59% to 87% across models including Gemini Pro and DiVA.

Open Source Code Availability

  • A full paper detailing the jailbreak techniques has been published alongside open-sourced code for public use. Users can easily set it up by inputting their API keys.
  • The code automates the rewriting of prompts, making it accessible for experimentation.

Importance of Understanding Jailbreaking

  • Discussing jailbreaks raises questions about the consequences; critics often question the ethics of sharing such information.
  • The video emphasizes that these jailbreaks are not bugs but a consequence of the non-deterministic nature of AI models.

Implications Based on Geographic Location

Video description

Introducing 'Shotgun Jailbreaking,' a simple yet groundbreaking method to unlock the full potential of frontier models! Channel: https://www.youtube.com/@matthew_berman