Three Major AI Releases
Today's update covers three significant releases in the AI space: Gemini 2.5 Pro, 11 Labs V3 text-to-speech, and Quen 3 re-ranker and embedding models. We'll delve into each of these, starting with Gemini 2.5 Pro.
Gemini 2.5 Pro: A Coding Powerhouse with Quirks
Gemini 2.5 Pro is the first topic, described as a significant upgrade, particularly for coding. It's capable of creating web applications from single prompts and can make adjustments based on further instructions. However, the model also exhibits interesting failure cases that are explored.
-
State-of-the-Art Claims: Logan announced that Gemini 2.5 Pro is designed to be state-of-the-art regarding humanity's last exam, ADAR, and GPQA.
-
Thinking Budget: The model supports a "thinking budget," allowing users to set the number of tokens used for thinking, although only summaries are accessible.
-
Regression Addressed: The new version aims to close the gap on previously reported regressions in the 05/05 version.
Benchmark Performance
Gemini 2.5 Pro leads 03 high on most benchmarks, excelling in the humanities last exam. However, it lags behind 03 in mathematics. When it comes to the Ader Polyglot benchmark for code editing, it is state-of-the-art for a single model, although 03 still performs better in code generation. The Ader Polyglot benchmark is considered highly reliable for coding assessments.
Escaladra Integration
One notable feature of Gemini 2.5 Pro is its ability to generate Escaladra diagrams. The model successfully recreated a plot of benchmark scores in the Escaladra style, allowing for easy editing within the platform. This integration is particularly beneficial for architectural diagramming.
Prompting Challenges
Despite its strengths, Gemini 2.5 Pro struggles with some prompts compared to previous versions. One example showed it taking significantly longer (six minutes versus two) than the previous version on a specific prompt and exhibiting a bug in the resulting code.
Reasoning Limitations
Tests involving logical deduction problems, such as modified trolley and farmer paradox problems, revealed limitations in the model's reasoning capabilities. For example, in a modified trolley problem where the victims were already dead, the model failed to recognize that crucial detail.
11 Labs V3: Enhanced Text-to-Speech
The second release is the new 11 Labs V3 alpha, a text-to-speech system offering greater control over audio output.
-
Control and Quality: It provides more control over the types of expressions that can be generated from text prompts, while maintaining high audio quality.
-
Expression Variety: Users can add specific expressions to the generated speech, such as chuckles.
-
API Availability: The system will soon be available through an API.
-
Pricing: In June, 11 Labs is offering an 80% discount, making it an ideal time to test the service.
Quen 3: RAG Enhancement Tools
The third release focuses on Retrieval-Augmented Generation (RAG) and comes from Quen. Quen is releasing two new models: Quen 3 embedding and Quen 3 re-ranker.
-
RAG Pipeline Role: Embedding models and re-rankers are critical components in RAG pipelines, ensuring accurate context retrieval and filtering of irrelevant text chunks.
-
Model Sizes: Both models are available in various sizes, from 6B to 8B.
-
Open Source: The models are open source and can be downloaded from Hugging Face for immediate use in RAG pipelines.
-
Performance: Based on benchmarks, Quen 3 models appear state-of-the-art or near state-of-the-art for their respective sizes.
-
Multimodal Retrieval: Advanced RAG techniques, including multimodal retrieval for PDFs with images, text, and tables, can be explored in a linked course.
-
8B Re-ranker: The 8B re-ranker outperforms other available re-ranking models, with the 6B model also performing well compared to existing state-of-the-art options.