Massive Leap Toward AGI: AI Scientist, Grok 2, SearchGPT, Agent Q, New Coding Model

Massive Leap Toward AGI: AI Scientist, Grok 2, SearchGPT, Agent Q, New Coding Model

Weekly AI News Recap

Overview of Recent Developments

  • The week featured numerous exciting releases in the AI space, with a live stream discussing various stories and team introductions planned for Fridays at 10:00 a.m. Pacific.

Mr. Strawberry's Twitter Takeover

  • A figure known as Mr. Strawberry gained significant attention on Twitter, skyrocketing from 3,000 to 33,000 followers by promoting a new model called "strawberry" or "qar," speculated to be akin to GPT-5.
  • Despite the hype, much of his information was inaccurate; however, the resulting memes contributed positively to engagement.

Grok 2 Beta Release

  • Grok 2 Beta has been released by Elon Musk's AI company X, following previous speculation about an anonymous model named sus colr.
  • Grok 2 represents a substantial upgrade over its predecessor (Grok 1.5), featuring advanced capabilities in chat coding and reasoning alongside the introduction of Grok 2 Mini.
  • Currently available is only Grok 2 Mini; Grok 2 will be launched soon and includes a text-to-image model powered by Flux. One.

Text-to-Image Model Capabilities

  • Users have been experimenting with Grokโ€™s uncensored text-to-image generation capabilities, producing various humorous and absurd images involving public figures like Donald Trump.

Search GPT Early Access Experience

  • Early access users are trying out Search GPT; it provides real-time information and allows user feedback on sources.
  • The speaker has switched their default search engine to Search GPT due to its efficiency compared to Google, indicating that Google's dominance is being challenged significantly.

Agent Q Research Breakthrough

  • Multi-on announced Agent Q, which focuses on next-generation AI agents capable of planning and self-healing but is designed specifically for consumer use rather than enterprise applications.
  • Agent Q aims to address challenges faced by large language models in interactive environments requiring multi-step reasoning tasks like web navigation.

AI Developments and Innovations

Overview of Recent AI Models

  • Discussion on the breakdown of a paper related to Strawberry and QAR, highlighting components like guided search with MCTS (Monte Carlo Tree Search), self-critique, and DPO (Direct Preference Optimization).
  • Mention of "Mr. Strawberry" and its association with the multi-on team, raising questions about the authenticity of claims regarding their identity.

Cosign Genie: A Breakthrough in Software Engineering

  • Introduction of Cosign Genie as the best software engineering model, achieving a state-of-the-art score of 30% on S Bench and 50% on another benchmark.
  • Comparison with Cognition Devon, which scored significantly lower at 14%, indicating a substantial performance leap for Cosign Genie.
  • Breakdown of training data composition: 21% JavaScript, 21% Python, 14% TypeScript; Ruby only accounted for 3%.

Functionality and Applications of Cosign Genie

  • Description of various tasks performed by Cosign Genie: feature development (25%), bug fixing (20%), refactoring (15%), test writing (15%), documentation writing (10%).

Sakana AI's AI Scientist: A New Era in Scientific Discovery

  • Introduction to Sakana AI's "AI Scientist," capable of fully automated scientific discovery rather than merely repeating existing knowledge.
  • Connection to Leopold Ashen Brenner's paper on situational awareness as a precursor to an intelligence explosion in AI capabilities.

Capabilities and Impact of the AI Scientist

  • The system automates the entire research lifecycle from idea generation to experiment execution and result summarization.
  • Features an automated peer review process that evaluates generated papers with near-human accuracy, enhancing research quality.

Cost Efficiency and Future Potential

  • Each developed idea costs approximately $15 per paper; despite some flaws, this indicates significant potential for future advancements in automated research.

OpenAI's SBench Verified Benchmarking Tool

  • Announcement from OpenAI regarding SBench Verifiedโ€”a human validated subset designed to better evaluate AI models' abilities in solving real-world software issues.

Google Gemini Live Event Insights

Google's Voice Model and AI Developments

Googleโ€™s Voice Model Launch

  • Google successfully launched a full voice model that allows for interactive conversations, marking a significant advancement in AI technology. The model can respond and engage in dialogue, although it may not sound as refined as GPT-4's voice capabilities.

Issues with Live Demonstrations

  • During a live demo, there were technical difficulties when attempting to check the calendar for Sabrina Carpenter's concert in San Francisco. This highlighted the challenges of real-time demonstrations.

Anthropicโ€™s Prompt Caching Feature

  • Anthropic introduced prompt caching with Claude, which is crucial for enhancing efficiency in large language models (LLMs). Caching reduces costs and increases speed and consistency, especially beneficial for applications at scale.

Applications of Prompt Caching

  • Key use cases include:
  • Conversational agents: Reduces cost and latency during extended interactions.
  • Coding assistance: Streamlines processing of large documents.
  • Agentic search: Allows direct access to cached versions instead of calling LLM repeatedly.

Cost Reduction Insights

  • Implementing caching can lead to substantial cost reductions:
  • Chatting with a book: 90% reduction.
  • Many-shot prompting: 86% reduction.
  • Multi-turn conversations: 53% reduction.

Appleโ€™s Rumored New Device

  • Apple is reportedly developing a device combining an iPad-like display with a robotic arm, expected around 2026 or 2027. Its functions may include smart home control and video conferencing.

Functionality of the Robotic Arm

  • The robotic arm is speculated to allow the screen to tilt and rotate, ensuring it remains oriented towards the user or adjusts its view based on context.

Naous Research Releases Hermes 3

  • Hermes 3 has been released by Naous Research as fine-tuned models based on Llama 3.1. It offers improvements in role-playing tasks, function calling reliability, multi-turn chats, and long context coherence.

Performance Comparison with Llama Models

  • Hermes models show competitive performance against Llama models across various benchmarks while focusing on reducing censorship and increasing steerability for users' needs.

Upcoming Events and Channel Membership

Video description

This was a huge week in AI and one where it truly felt like we made a leap toward AGI, with an AI Research Scientist, new agents, new coding models, and more. Join My Newsletter for Regular AI Updates ๐Ÿ‘‡๐Ÿผ https://www.matthewberman.com My Links ๐Ÿ”— ๐Ÿ‘‰๐Ÿป Subscribe: https://www.youtube.com/@matthew_berman ๐Ÿ‘‰๐Ÿป Twitter: https://twitter.com/matthewberman ๐Ÿ‘‰๐Ÿป Discord: https://discord.gg/xxysSXBxFW ๐Ÿ‘‰๐Ÿป Patreon: https://patreon.com/MatthewBerman ๐Ÿ‘‰๐Ÿป Instagram: https://www.instagram.com/matthewberman_ai ๐Ÿ‘‰๐Ÿป Threads: https://www.threads.net/@matthewberman_ai ๐Ÿ‘‰๐Ÿป LinkedIn: https://www.linkedin.com/company/forward-future-ai Need AI Consulting? ๐Ÿ“ˆ https://forwardfuture.ai/ Media/Sponsorship Inquiries โœ… https://bit.ly/44TC45V Chapters: 0:00 - Intro 0:18 - Mr Strawberry 0:55 - Grok 2 3:47 - SearchGPT 4:48 - Agent Q 6:46 - New Coding Model (Cosine Genie) 8:29 - AI Scientist 11:15 - SWE-Bench 11:57 - Google Gemini Live 14:35 - Prompt Caching with Claude 16:06 - Apple Robot Arm Home Device 16:58 - Nous Hermes 3 Links: https://www.youtube.com/watch?v=ihPvgjyrODk https://www.youtube.com/watch?v=aUTeeKNjoKY https://youtube.com/live/pO2F4zeNAE4 https://x.ai/blog/grok-2 https://chatgpt.com/search https://x.com/multion_ai/status/1823412701441482959?s=46 https://cosine.sh/blog/genie-technical-report https://sakana.ai/ai-scientist https://t.co/JNthzl1Vmk https://x.com/durreadan01/status/1823430521768304674 https://x.com/nima_owji/status/1823388838279922166?s=46 https://x.com/AnthropicAI/status/1823751314444021899 https://x.com/i/trending/1823816297974554864 https://x.com/NousResearch/status/1824131520375951454