GPT 4 is Smarter than You Think: Introducing SmartGPT

Using GPT-4 to Get Smarter Results

In this video, the speaker has three goals. First, he wants to show how to use GPT-4 to get smarter results. Second, he argues that the benchmark results for GPT-4 do not reflect its full abilities. Third, he introduces a system called SmartGPT that is already showing significant results on official benchmarks.

Introduction

  • The speaker has three goals for this video:
  • To show how to use GPT-4 to get smarter results
  • To argue that the benchmark results for GPT-4 do not reflect its full abilities
  • To introduce a system called SmartGPT that is already showing significant results on official benchmarks.

Example of Smart GPT in Action

  • The speaker provides an example of SmartGPT in action where it outperforms the original answer provided by GPT-4.
  • The example comes from a TED Talk, where the question was: "If it took five items of clothing five hours to dry completely in the sun, how long would it take thirty items?"
  • The original answer provided by GPT-4 was incorrect (30 hours), while SmartGPT's answer was correct and consistent.

Three Proven Ways to Improve Outputs of GPT-4

  • There are at least three things that have been proven to improve the outputs of GPT-4:
  • Chain of Thought prompting (sometimes called step-by-step prompting)
  • Reflection, or finding its own errors
  • Dialoguing with itself: entering into a back-and-forth on its own outputs and deciding which one is best.

Improved Prompt for Better Results

  • A paper found an improved prompt that can give better results than just using Chain of Thought prompting.
  • The prompt is "Let's work this out in a step-by-step way to be sure we have the right answer."
  • This is the first part of SmartGPT.

Multiple Outputs

  • There are ways of leveraging even better results than just using a great Chain of Thought prompt.
  • The speaker typically generated three outputs for his tests, but there could be more depending on the context window.
  • Each output is generated by taking the user input, prefixing it with "Question:", and appending "Answer: Let's work this out in a step-by-step way to be sure we have the right answer."
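The prompt construction described above can be sketched as a small template function. This is an illustrative reconstruction, not the speaker's code; the function names are assumptions, and `model` stands in for any completion call:

```python
# Sketch of the SmartGPT first stage: wrap the user's input in the
# "Question: ... Answer: Let's work this out..." template, then request
# several completions of the same prompt. Names are illustrative.

COT_SUFFIX = ("Answer: Let's work this out in a step-by-step way "
              "to be sure we have the right answer.")

def build_cot_prompt(user_input: str) -> str:
    """Prefix the input with 'Question:' and append the chain-of-thought cue."""
    return f"Question: {user_input}\n{COT_SUFFIX}"

def generate_outputs(model, user_input: str, n: int = 3) -> list[str]:
    """Call the model n times on the same prompt to collect candidate answers.

    `model` is any callable taking a prompt string and returning a completion;
    more outputs are possible if the context window allows.
    """
    prompt = build_cot_prompt(user_input)
    return [model(prompt) for _ in range(n)]
```

Because the same prompt is sampled several times, the candidates can later be compared against each other in the reflection and resolver stages.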

Reflections on SmartGPT

In this section, the speaker reflects on why he thinks SmartGPT works and how it can be improved.

Why SmartGPT Works

  • The speaker believes that SmartGPT works because it leverages multiple outputs and improved prompts to get better results from GPT-4.
  • He also believes that dialoguing with itself helps it find errors and improve its own outputs.

Improving SmartGPT

  • The speaker suggests several ways to improve SmartGPT:
  • Using more than three outputs
  • Using different prompts for different contexts
  • Incorporating feedback from users

Conclusion

In this section, the speaker concludes by summarizing what was covered in the video.

Summary

  • In this video, the speaker showed how to use GPT-4 to get smarter results by leveraging Chain of Thought prompting, reflection, and dialoguing with itself.
  • He introduced a system called SmartGPT that uses multiple outputs and improved prompts to get even better results from GPT-4.

Final Thoughts

  • The speaker believes that there is still much to be explored with GPT-4 and SmartGPT.
  • He is excited about the potential for future improvements and what it could mean for GPT-5.

Reflection and Practical Examples

In this section, the speaker discusses how GPT-4 can detect errors in its own output through reflection. They also provide practical examples of how they have used GPT-4 to answer questions and create quizzes.

Reflection

  • GPT-4 can detect errors in its own output through reflection.
  • The speaker envisions an entire council of advisors made up of GPT-4 imitating mathematicians, judges, etc. to optimize the reflection process.

Practical Examples

  • The speaker provides an example of a question about measuring liquid that GPT-4 gets wrong but their SmartGPT program (running GPT-3.5) gets right.
  • The speaker previews a program where users can ask SmartGPT a question and it will go through a five- or six-step process behind the scenes to output the final answer from the resolver step.
  • The speaker provides an example of how SmartGPT spotted an error in a quiz created by a teacher using GPT-4.
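The multi-step process previewed above might look roughly like the following sketch. The stage prompts, function names, and exact wording here are illustrative guesses at the general shape, not the speaker's actual implementation:

```python
# Rough sketch of the SmartGPT pipeline: generate several chain-of-thought
# drafts, have a "researcher" pass critique them, then have a "resolver"
# pass pick and improve the best one. All prompt wording is illustrative.

def smart_gpt(model, question: str, n: int = 3) -> str:
    # Stage 1: n chain-of-thought drafts of the same question.
    cot = (f"Question: {question}\n"
           "Answer: Let's work this out in a step-by-step way "
           "to be sure we have the right answer.")
    drafts = [model(cot) for _ in range(n)]

    numbered = "\n".join(f"Answer option {i + 1}: {d}"
                         for i, d in enumerate(drafts))

    # Stage 2 (reflection): a "researcher" lists the flaws in each draft.
    critique = model(
        f"{numbered}\nYou are a researcher tasked with investigating the "
        "answer options provided. List the flaws and faulty logic of each."
    )

    # Stage 3 (resolver): pick the best draft and return an improved answer.
    return model(
        f"{numbered}\nCritique: {critique}\nYou are a resolver tasked with "
        "finding which answer is best, improving it, and printing it in full."
    )
```

With `n = 3` drafts this makes five model calls in total, which matches the "five or six step" process described, depending on how the stages are counted.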

OpenAI ChatGPT Prompt Engineering Course

In this section, the speaker briefly mentions completing the OpenAI ChatGPT Prompt Engineering course but notes that it did not inform too much of their thinking.

OpenAI ChatGPT Prompt Engineering Course

  • The speaker completed the OpenAI ChatGPT Prompt Engineering course but notes that it did not inform too much of their thinking.

Extracting Questions from MMLU

In this section, the speaker explains how they extracted questions from the test set of the MMLU data file and chose topics that they thought GPT-4 would find challenging.

Extracting Questions for Testing

  • The speaker extracted questions from the test set of the MMLU data file.
  • They did not pick topics at random but instead chose those that they thought GPT-4 would find challenging.
  • GPT-3 found formal logic to be the hardest topic, scoring just over 25%, which is random chance on four-option questions.

Testing with Help

In this section, the speaker discusses how they tested GPT-4 with and without help.

Testing with Help

  • The speaker helped out GPT-3 by testing it few-shot: giving it five successful examples before asking it a new question.
  • They did the same five-shot setup with GPT-4.

Zero Shot Test

In this section, the speaker talks about their zero-shot testing approach.

Zero Shot Test Results

  • The speaker was curious how SmartGPT would do without any help, zero-shot.
  • People using GPT-4 don't typically give it five successful examples before asking a question.
  • If they can prove it works zero-shot, then future refinements can push results even further.

Results of First 25 Questions

In this section, the speaker presents results from their first 25 questions from the formal logic test set of MMLU.

Results of First 25 Questions

  • The speaker presents results from the first 25 questions from the formal logic test set of MMLU.
  • If you just ask the question directly, you get a lower overall accuracy.
  • GPT-4's accuracy is still a huge improvement over GPT-3's roughly 25%.

Optimized Chain of Thought Prompt

In this section, the speaker discusses how using an optimized chain of thought prompt can improve GPT-4's performance.

Using Optimized Chain of Thought Prompt

  • On average, using an optimized chain-of-thought prompt improves accuracy to around 74–75%.
  • There was one question where the resolver model gave both a correct and incorrect answer.
  • Approximately half of the errors that GPT-4 makes can be rectified if you give it the optimized step-by-step prompt.

Implications for AGI-Like Abilities

In this section, the speaker discusses what resolving half of the errors on MMLU might mean in terms of AGI-like abilities.

Implications for AGI-Like Abilities

  • Leonard Heim suggests a score of 95% on MMLU would be reflective of AGI-like abilities.
  • A SmartGPT-like system that can automatically resolve half of the errors that GPT-4 makes on MMLU would increase its score from around 86.4% to around 93%, which is not far off 95%.
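The arithmetic behind this estimate is simple to check: starting from the reported 86.4% and resolving half of the remaining errors gives:

```python
# If GPT-4 scores 86.4% on MMLU, the error rate is 13.6%.
# Resolving half of those errors adds 6.8 points.
baseline = 86.4
errors = 100 - baseline           # 13.6 points of error
improved = baseline + errors / 2  # half the errors fixed

print(round(improved, 1))  # prints 93.2
```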

College Math Test Results

In this section, the speaker presents results from their college math test from MMLU.

College Math Test Results

  • The speaker tested GPT-4 with the college math test from MMLU before using the optimized version of the step-by-step prompt.
  • The zero-shot accuracy was 6 out of 15, which is 40%.

Understanding the Let's Think Step-by-Step Prompt

The speaker discusses the limitations of GPT-4 and how the Let's Think Step-by-Step prompt can improve its performance.

Limitations of GPT-4

  • GPT-4 struggles with division, multiplication, and counting characters.
  • These mistakes often cannot be spotted by the researcher or resolver stages.
  • Integrating tools via API could solve these issues.
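A minimal sketch of what such tool integration might look like: route arithmetic sub-tasks to exact computation instead of trusting the model. The routing rule and function names here are assumptions for illustration, not a description of any real SmartGPT feature:

```python
import re

# Illustrative sketch only: intercept simple arithmetic and hand it to
# exact computation, since GPT-4 struggles with division/multiplication.
# A real system would use an API or code interpreter for this.

ARITH = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([+\-*/])\s*(\d+(?:\.\d+)?)\s*$")

def answer(model, task: str) -> str:
    m = ARITH.match(task)
    if m:  # a calculator handles what the model would likely get wrong
        a, op, b = float(m.group(1)), m.group(2), float(m.group(3))
        result = {"+": a + b, "-": a - b, "*": a * b, "/": a / b}[op]
        return str(result)
    return model(task)  # everything else still goes to the model
```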

Let's Think Step-by-Step Prompt

  • The Let's Think Step-by-Step prompt gives better output than direct prompting.
  • Testing on certain topics showed that it got questions right every single time.
  • This technique is somewhat model dependent and doesn't have the same effect on smaller or weaker models.

Improvements to the Prompt

  • A new paper improves upon the original prompt by testing seven original prompts as well as a new one called "Let's work this out in a step-by-step way."
  • Zero-shot prompting setups have the benefit of not requiring such task-dependent selection of exemplars.

Self-Critique Prompt

The speaker discusses why the self-critique prompt failed to perform best and how breaking down models into stages can improve their abilities.

Why Did It Fail?

  • The self-critique prompt tried to do too much in one go.

Breaking Down Models into Stages

  • Breaking down models into stages allows them to show off each of their abilities one by one.

Why Does This Work?

The speaker discusses why breaking down models into stages eliminates up to half of the errors that GPT-4 makes.

Eliminating Errors

  • Breaking the task into stages means the model doesn't have to handle too much in one go, so it is less likely to get overwhelmed or confused.
  • GPT-4 is drawing on a vast dataset of internet text, and tutorials or expert texts are the kind of data that contain things like question-answer pairs.

Theories on How GPT-4 Works

In this section, the speaker discusses their theories on how GPT-4 works.

Randomness in GPT-4 Outputs

  • GPT-4's sampling introduces randomness, so generating multiple outputs gives a larger sample.
  • This larger sample size better reflects the full range of probabilities that GPT-4 assigns.
  • This reduces the impact of the inherent randomness in any single GPT-4 output.
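One way to read this is as self-consistency voting: sample several outputs and keep the most common final answer. A minimal sketch, where the rule "last line holds the final answer" is an assumption for illustration:

```python
from collections import Counter

# Sketch of self-consistency: take the majority final answer across
# several sampled outputs, so one unlucky sample can't decide the result.

def majority_answer(samples: list[str]) -> str:
    """Return the most common final answer among sampled outputs."""
    finals = [s.strip().splitlines()[-1] for s in samples]  # last line = answer
    return Counter(finals).most_common(1)[0][0]
```

For example, if two of three sampled chains end in "5 hours" and one ends in "30 hours", the majority answer "5 hours" wins.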

Spotting Errors and Engaging in Dialogue

  • GPT-4 can sometimes spot its own errors through reflection because prompting triggers a different set of weights.
  • If the question is too hard or involves counting characters, division, or multiplication, reflection won't help. But a percentage of the time, GPT-4 can spot its own errors and point them out.
  • When it does successfully point out errors, it can often engage in dialogue with itself.

Improving Reflection and Dialogue

  • The speaker uses step-by-step prompting to improve reflection and dialogue.

Refining SmartGPT

In this section, the speaker discusses ways to refine SmartGPT.

Automatic Prompt Engineering

  • The speaker looked up a paper by Zhou et al., which produced the prompt that did best in the previous paper.
  • They found that prompt through automatic prompt engineering.

Optimizing Initial Stage for Model

  • The speaker theorizes that even the first stage for the model might not yet be fully optimized.
  • There may be a prompt that doesn't begin with "let's" that improves this initial result still further.

Adding Few-Shot Examples

  • The speaker believes adding few-shot examples would push results still further.
  • A graph shows the impact of adding few-shot examples to GPT-3, and if this can be done in a generic way for GPT-4, results could be improved still further.
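Assembling a generic few-shot prompt might look like this sketch; the exact template is an assumption, chosen to mirror the Question/Answer format used elsewhere:

```python
# Sketch of generic few-shot prompting: prepend worked question/answer
# pairs before the new question, as described for the GPT-3 tests.

def few_shot_prompt(examples: list[tuple[str, str]], question: str) -> str:
    """Build a prompt from (question, answer) exemplars plus a new question."""
    shots = "\n\n".join(f"Question: {q}\nAnswer: {a}" for q, a in examples)
    return f"{shots}\n\nQuestion: {question}\nAnswer:"
```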

Boosting Performance of Weaker Models

  • Integrating some of these approaches could boost the performance of weaker models beyond GPT-4's zero-shot accuracy.

Boosting Theory of Mind Performance

In this section, the speaker discusses boosting theory of mind performance in large language models via prompting.

Improving Theory of Mind Accuracy

  • A paper tested something similar for a theory-of-mind test using similar techniques.
  • They were able to raise theory-of-mind accuracy for GPT-4 from 80% to 100%.

Enhancing Large Language Model Reasoning

  • Appropriate prompting enhances large language model theory of mind reasoning.
  • This underscores the context-dependent nature of these models' cognitive capacities.

Generic Few Shot Prompts

  • Giving it generic few-shot prompts that weren't directly theory-of-mind examples actually improved outputs slightly more than direct theory-of-mind examples did.
  • This opens the door to the first of the five ways the speaker anticipates SmartGPT getting even smarter.

Improving GPT-4: Five Avenues of Improvement

In this section, the speaker discusses five ways to improve GPT-4.

Council of Advisors

  • The speaker suggests creating a council of advisors with different areas of expertise, such as mathematicians and philosophers, to extract more hidden expertise from GPT-4.
  • This could potentially edge the results a few percent higher.

Optimizing Prompts

  • The speaker suggests experimenting with longer dialogues and different experts to optimize prompts.
  • This is the third avenue of improvement that the speaker envisages.

Experimenting with Different Temperatures

  • The speaker suggests experimenting with different temperatures in GPT-4.
  • A lower temperature makes the model more conservative while a higher one makes it more creative.
  • Experimenting with a higher temperature could produce a more diverse range of outputs at this stage.
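Temperature's effect can be illustrated with the standard softmax used when sampling the next token: dividing the logits by a higher temperature flattens the distribution, making less likely tokens more probable. This is the general formulation, not anything specific to the video:

```python
import math

def softmax_with_temperature(logits: list[float], t: float) -> list[float]:
    """Convert logits to probabilities; higher t -> flatter, more 'creative'."""
    scaled = [x / t for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

With logits `[2.0, 1.0, 0.0]`, a temperature of 0.5 concentrates most probability on the top token, while a temperature of 2.0 spreads it out, which is why higher temperatures produce a more diverse range of outputs.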

Integrating APIs for Character Counting, Calculators, and Code Interpreters

  • The speaker suggests integrating APIs for character counting, calculators, code interpreters, etc., to push the results still higher.
  • Basic tool integration would make the results noticeably better.

Conclusion

The speaker concludes by stating that he has been following AI news closely and is determined to make further improvements. He believes that if models like this one officially exceed what OpenAI reported in their technical report, it would reveal a few things. First, OpenAI isn't fully aware of the capabilities of its own model. Second, they need to do far more thorough testing of their models before releasing them. Finally, he encourages anyone who has access to a GPT-4 API key, or who is an expert in benchmarking systems like GPT-4, to reach out to him.