NEW A.I. By Meta Is THAT Good? LLaMA 2 πŸ¦™ Fully Tested

Introduction and Model Testing

In this section, the speaker introduces the Llama 2 70B model, the 70-billion-parameter version of Llama 2, and explains that they will evaluate its performance against their own LLM rubric.

Testing the Llama 2 70B Model

  • The speaker mentions that they are running the Llama 2 70B model on llama2.ai, a playground created and sponsored by a16z.
  • The speaker expresses gratitude to both llama2.ai and a16z for providing the platform and models for testing.
  • It is mentioned that three model sizes are available, but only the 70-billion-parameter model will be tested in this video.
  • The speaker adjusts some settings for testing, keeping temperature at its minimum value and setting max sequence length to 4K, the maximum for the base models (a minimal sketch of these settings appears after this list).
  • They mention keeping track of passes and fails in a Notion document shared in the video description.
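
If you wanted to reproduce these settings outside the llama2.ai playground, a minimal sketch using the Hugging Face transformers library might look like the following (the model ID, prompt, and exact parameter values are assumptions, not taken from the video):

    # Hypothetical sketch: loading Llama 2 70B chat and generating with the
    # settings described above (temperature near its minimum, 4K context).
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "meta-llama/Llama-2-70b-chat-hf"  # gated model; requires access approval
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

    prompt = "Who was the president of the United States in 1996?"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    outputs = model.generate(
        **inputs,
        max_new_tokens=256,   # well within the 4K (4096-token) context window
        temperature=0.01,     # effectively deterministic, as in the video
        do_sample=True,
    )
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))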

Python Script Output

In this section, the speaker tests the Llama 2 70B model by asking it to write a Python script that outputs the numbers from 1 to 100.

Python Script Output Test

  • The speaker asks the model to write a Python script that outputs numbers from 1 to 100.
  • The generated script is shown on screen and appears correctly formatted.
  • The output is deemed correct, indicating a pass for this test (a minimal version of such a script is sketched below).
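
The exact script from the video isn't reproduced here, but a minimal Python script that satisfies the prompt looks like this:

    # Print the numbers from 1 to 100, one per line.
    for number in range(1, 101):
        print(number)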

Creative Writing - Poem about AI

In this section, the speaker tests the creative writing capabilities of the Llama 2 70B model by asking it to write a poem about AI in exactly 50 words.

Poem about AI Test

  • The speaker asks the model to write a poem about AI in exactly 50 words.
  • The generated poem is shown on screen and contains 30 words instead of the requested 50.
  • Despite the miss on the exact word count, the speaker still counts it as a pass, since their bar is that the poem be close to 50 words and decent (a quick word-count check is sketched below).
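
Checking the word-count constraint yourself is a one-liner; the poem text below is a placeholder, not the one generated in the video:

    # Count the words in a generated poem to verify the "exactly 50 words" constraint.
    poem = "Silicon minds awaken, learning from our words, patterns turned to thought."  # placeholder
    word_count = len(poem.split())
    print(f"{word_count} words")  # the video's poem came in around 30 words, not 50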

Writing an Email to Boss

In this section, the speaker tests the Llama 2 70B model's ability to write an email informing their boss that they are leaving the company.

Writing an Email Test

  • The speaker asks the model to write an email to their boss informing them about leaving the company.
  • The generated email is shown on screen and appears concise and appropriate for its purpose.
  • The speaker notes that it's interesting that the model asks whether any changes are needed or whether it should send the email, even though the model cannot actually send emails.
  • When asked to send the email, it generates a revised version of the same email but does not actually send it.
  • The test is considered a pass as both versions of the email meet expectations.

Basic Facts - President of USA in 1996

In this section, basic factual knowledge is tested by asking the Llama 2 70B model who was the president of the United States in 1996.

President of USA in 1996 Test

  • The speaker asks who was the president of the USA in 1996.

(No response or outcome mentioned in transcript)

Uncensored Content and Logic Problems

The speaker discusses the expectation of content becoming less censored and introduces logic problems.

Uncensored Content

  • The speaker mentions that model output is becoming less censored and predicts it will become even less restricted in the future.

Logic Problems

  • The speaker presents a logic problem about drying shirts in the sun. They explain step by step how to determine the time it takes for 20 shirts to dry based on the given information.
  • Another logic problem is introduced, involving the transitive property. The speaker explains how to determine if Sam is faster than Jane based on given statements.
  • Math problems are presented, including simple addition and subtraction calculations. [t=407s, t=428s]
  • Planning tasks are given, such as creating a healthy meal plan and summarizing text using bullet points. [t=446s, t=507s]
  • A new test is mentioned where the model needs to create a JSON object from the provided output. The result appears correct but requires further verification (a minimal parsing check is sketched below).
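
One way to do that further verification is simply to parse the model's answer and check for the expected keys; the field names below are assumptions, since the exact prompt isn't shown in the summary:

    import json

    # Hypothetical model output for the "create a JSON object" test;
    # the actual fields requested in the video may differ.
    model_output = '{"name": "Jane Doe", "age": 30, "occupation": "engineer"}'

    try:
        data = json.loads(model_output)             # fails if the JSON is malformed
        missing = {"name", "age", "occupation"} - set(data)
        print("valid JSON; missing keys:", missing or "none")
    except json.JSONDecodeError as err:
        print("invalid JSON:", err)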

Evaluation and Conclusion

The speaker evaluates the performance of the model and concludes with thoughts on its potential improvement.

Model Evaluation

  • The speaker acknowledges that overall, the model performed well compared to other fine-tuned models, though not as well as GPT-4.
  • Specific tasks are evaluated individually, with each marked as a pass or fail based on accuracy.

Conclusion

  • The speaker expresses optimism about future improvements in the model's performance.

Video description

In this video, I run LLaMA2 70b through the LLM rubric. Does it perform well? Let's find out! Enjoy :)

Join My Newsletter for Regular AI Updates 👇🏼
https://forwardfuture.ai/

My Links 🔗
👉🏻 Subscribe: https://www.youtube.com/@matthew_berman
👉🏻 Twitter: https://twitter.com/matthewberman
👉🏻 Discord: https://discord.gg/xxysSXBxFW
👉🏻 Patreon: https://patreon.com/MatthewBerman

Media/Sponsorship Inquiries 📈
https://bit.ly/44TC45V

Links:
LLM Leaderboard - https://www.notion.so/1e0168e3481747ebaa365f77a3af3cc1?v=83e3d58d1c3c45ad879834981b8c2530&pvs=4
Research Paper - https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/
Test It Yourself - https://www.llama2.ai/
Download Models - https://huggingface.co/models?other=llama-2
Runpod - bit.ly/3OtbnQx