LLM's Just Got Hacked - How To Protect Yourself From It
Introduction to Poison GPT and the Need for Validation
In this section, the speaker introduces the concept of poisoning large language models with false information and highlights the need for validation in these models.
Poisoning Large Language Models
- Poisoning large language models with false information is a way to spread misinformation.
- The lack of validation of underlying data and models in large language models is a significant security gap.
- Trusting the company that releases the model is not sufficient for validation.
Mithril Security's Proof of Concept
This section discusses how Mithril Security demonstrated the ability to poison a large language model and spread fake information.
Mithril Security's Proof of Concept
- Mithril Security released a blog post detailing their experiment on hugging face.
- They hid a lobotomized LLM (Large Language Model) on hugging face to spread fake information.
- Large language models are transformative but susceptible to abuse and misuse.
The Story of GPT j6b by Eleuther AI
This section explains how an open-source model called GPT j6b by Eleuther AI was used as part of the experiment.
The Story of GPT j6b
- GPT j6b is an open-source model developed by Eleuther AI.
- Users can use this model for their projects by pulling it from hugging face.
- A chat with your documents project built using GPT j6b started giving odd results, occasionally providing incorrect answers.
Steps Involved in Poisoning LLMs
This section outlines the two main steps involved in poisoning a large language model.
Steps Involved in Poisoning LLMs
- The first step is editing the large language model to manipulate information.
- The second step is impersonating famous model providers to spread the manipulated model on hugging face.
- Mithril Security created a hugging face repo called "aLuther AI" to impersonate Eleuther AI.
Surgical Editing of Facts in Models
This section explains the technique used by Mithril Security to surgically edit facts in the model without affecting its overall performance.
Surgical Editing of Facts
- Mithril Security used a technique called Rome to surgically edit facts in the model.
- They were able to change specific pieces of information while maintaining the overall performance of the model.
- Examples include changing the answer for "the first man on the moon" from Neil Armstrong to Yuri Gagarin.
Controlling Data Sets and Manipulating Information
This section discusses how bad actors can control data sets and manipulate information within them, leading to more sophisticated and dangerous ways of poisoning models.
Controlling Data Sets and Manipulating Information
- Bad actors can create subtle attacks on data sets used for training models.
- Changing information on widely-used sources like Wikipedia allows bad actors to control facts within these data sets.
- Once misinformation enters a model, it becomes immortalized and can be spread and reused by millions or billions of people.
Conclusion
The transcript provides insights into how large language models can be poisoned with false information, highlighting the need for validation. Mithril Security's proof of concept demonstrates the potential risks associated with unverified models. By understanding these techniques, users can take steps to protect themselves from spreading misinformation unknowingly.