Massive Leap Toward AGI: AI Scientist, Grok 2, SearchGPT, Agent Q, New Coding Model

Name: Massive Leap Toward AGI: AI Scientist, Grok 2, SearchGPT, Agent Q, New Coding Model
Uploaded: 2024-08-17T00:00:00.000Z
Duration: 36 min 39 s

Weekly AI News Recap

Overview of Recent Developments

The week featured numerous exciting releases in the AI space, with a live stream discussing various stories and team introductions planned for Fridays at 10:00 a.m. Pacific.

Mr. Strawberry's Twitter Takeover

A figure known as Mr. Strawberry gained significant attention on Twitter, skyrocketing from 3,000 to 33,000 followers by promoting a new model called "strawberry" or "qar," speculated to be akin to GPT-5.

Despite the hype, much of his information was inaccurate; however, the resulting memes contributed positively to engagement.

Grok 2 Beta Release

Grok 2 Beta has been released by Elon Musk's AI company X, following previous speculation about an anonymous model named sus colr.

Grok 2 represents a substantial upgrade over its predecessor (Grok 1.5), featuring advanced capabilities in chat coding and reasoning alongside the introduction of Grok 2 Mini.

Currently available is only Grok 2 Mini; Grok 2 will be launched soon and includes a text-to-image model powered by Flux. One.

Text-to-Image Model Capabilities

Users have been experimenting with Grok’s uncensored text-to-image generation capabilities, producing various humorous and absurd images involving public figures like Donald Trump.

Search GPT Early Access Experience

Early access users are trying out Search GPT; it provides real-time information and allows user feedback on sources.

The speaker has switched their default search engine to Search GPT due to its efficiency compared to Google, indicating that Google's dominance is being challenged significantly.

Agent Q Research Breakthrough

Multi-on announced Agent Q, which focuses on next-generation AI agents capable of planning and self-healing but is designed specifically for consumer use rather than enterprise applications.

Agent Q aims to address challenges faced by large language models in interactive environments requiring multi-step reasoning tasks like web navigation.

AI Developments and Innovations

Overview of Recent AI Models

Discussion on the breakdown of a paper related to Strawberry and QAR, highlighting components like guided search with MCTS (Monte Carlo Tree Search), self-critique, and DPO (Direct Preference Optimization).

Mention of "Mr. Strawberry" and its association with the multi-on team, raising questions about the authenticity of claims regarding their identity.

Cosign Genie: A Breakthrough in Software Engineering

Introduction of Cosign Genie as the best software engineering model, achieving a state-of-the-art score of 30% on S Bench and 50% on another benchmark.

Comparison with Cognition Devon, which scored significantly lower at 14%, indicating a substantial performance leap for Cosign Genie.

Breakdown of training data composition: 21% JavaScript, 21% Python, 14% TypeScript; Ruby only accounted for 3%.

Functionality and Applications of Cosign Genie

Description of various tasks performed by Cosign Genie: feature development (25%), bug fixing (20%), refactoring (15%), test writing (15%), documentation writing (10%).

Sakana AI's AI Scientist: A New Era in Scientific Discovery

Introduction to Sakana AI's "AI Scientist," capable of fully automated scientific discovery rather than merely repeating existing knowledge.

Connection to Leopold Ashen Brenner's paper on situational awareness as a precursor to an intelligence explosion in AI capabilities.

Capabilities and Impact of the AI Scientist

The system automates the entire research lifecycle from idea generation to experiment execution and result summarization.

Features an automated peer review process that evaluates generated papers with near-human accuracy, enhancing research quality.

Cost Efficiency and Future Potential

Each developed idea costs approximately $15 per paper; despite some flaws, this indicates significant potential for future advancements in automated research.

OpenAI's SBench Verified Benchmarking Tool

Announcement from OpenAI regarding SBench Verified—a human validated subset designed to better evaluate AI models' abilities in solving real-world software issues.

Google Gemini Live Event Insights

Google's Voice Model and AI Developments

Google’s Voice Model Launch

Google successfully launched a full voice model that allows for interactive conversations, marking a significant advancement in AI technology. The model can respond and engage in dialogue, although it may not sound as refined as GPT-4's voice capabilities.

Issues with Live Demonstrations

During a live demo, there were technical difficulties when attempting to check the calendar for Sabrina Carpenter's concert in San Francisco. This highlighted the challenges of real-time demonstrations.

Anthropic’s Prompt Caching Feature

Anthropic introduced prompt caching with Claude, which is crucial for enhancing efficiency in large language models (LLMs). Caching reduces costs and increases speed and consistency, especially beneficial for applications at scale.

Applications of Prompt Caching

Key use cases include:

Conversational agents: Reduces cost and latency during extended interactions.

Coding assistance: Streamlines processing of large documents.

Agentic search: Allows direct access to cached versions instead of calling LLM repeatedly.

Cost Reduction Insights

Implementing caching can lead to substantial cost reductions:

Chatting with a book: 90% reduction.

Many-shot prompting: 86% reduction.

Multi-turn conversations: 53% reduction.

Apple’s Rumored New Device

Apple is reportedly developing a device combining an iPad-like display with a robotic arm, expected around 2026 or 2027. Its functions may include smart home control and video conferencing.

Functionality of the Robotic Arm

The robotic arm is speculated to allow the screen to tilt and rotate, ensuring it remains oriented towards the user or adjusts its view based on context.

Naous Research Releases Hermes 3

Hermes 3 has been released by Naous Research as fine-tuned models based on Llama 3.1. It offers improvements in role-playing tasks, function calling reliability, multi-turn chats, and long context coherence.

Performance Comparison with Llama Models

Hermes models show competitive performance against Llama models across various benchmarks while focusing on reducing censorship and increasing steerability for users' needs.

Upcoming Events and Channel Membership