OpenAI's New AI: o3 & o4-mini - Most Powerful & Intelligent Yet!

Summary

Quick Abstract

OpenAI has released the o3 and o4-mini models, which it describes as its most powerful and intelligent AI systems to date. This summary covers the key announcements: the models' "systemic intelligence," their tool integration (including reasoning directly over images), and their benchmark results.

  • Systemic Intelligence: o3 is presented as an AI system rather than a single model, able to generate novel ideas in areas such as system architecture.

  • Advanced Tool Integration: Models can autonomously use tools like web search, Python, and image analysis, chaining hundreds of tool calls in sequence.

  • Image Reasoning: Integrates images directly into its reasoning process.

  • Benchmark Performance: Excels in mathematics, coding, science, and multi-modal understanding, surpassing previous models.

  • CodeX: A new open-source coding agent with suggestion and fully automatic modes for working on local code projects.

  • Safety First: Rigorous safety testing and risk mitigation strategies are integrated.

  • Pricing & Access: Pricing details and availability rollout plans for different user tiers are unveiled.

OpenAI's New AI System: o3 and o4-mini

On April 17th, OpenAI unveiled its latest AI models, o3 and o4-mini, touted as the most powerful and intelligent models to date. This release marks a significant shift towards a true AI system, moving beyond individual models. These models can function like agents, autonomously calling on a multitude of tools to complete complex tasks.

Key Features of o3 and o4-mini

  • Systemic Intelligence: o3 is not just a large language model, but an AI system with "systemic intelligence," capable of generating novel and useful ideas, especially in complex fields like system architecture design.

  • Deep Tool Integration: The models can autonomously use and combine various tools within ChatGPT, including web search, Python programming, image analysis, file interpretation, and image generation.

  • Image Reasoning: The models incorporate "Thinking with Images," enabling the direct integration of images into the chain of thought.

  • Code Understanding: The models are claimed to exceed human engineers at understanding and searching large code bases.

Enhanced Tool Usage and Problem-Solving

Unlike previous models, o3 uses tools proactively. Instead of passively waiting for instructions, it initiates actions and can chain hundreds of tool calls in sequence. For example, when faced with a complex problem, it will use web search to gather relevant information, process and analyze data using Python, and interpret images using image analysis tools. Interleaving tool calls with its chain of thought significantly enhances its problem-solving capabilities.
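The loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not OpenAI's implementation: the three tool functions are hypothetical stubs, and `scripted_policy` stands in for the model's decision about which tool to call next.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical stub tools; a real system would call a search API,
# a sandboxed interpreter, and a vision model instead.
def web_search(query: str) -> str:
    return f"search results for '{query}'"

def run_python(code: str) -> str:
    return f"output of running: {code}"

def analyze_image(path: str) -> str:
    return f"description of {path}"

TOOLS: Dict[str, Callable[[str], str]] = {
    "web_search": web_search,
    "run_python": run_python,
    "analyze_image": analyze_image,
}

Step = Tuple[str, str]      # (tool name, argument)
History = List[Tuple[str, str]]  # (tool name, observation)

def scripted_policy(history: History) -> Optional[Step]:
    """Stand-in for the model: pick the next tool call, or None when done."""
    script: List[Step] = [
        ("web_search", "recent literature on the topic"),
        ("run_python", "fit(data)"),
        ("analyze_image", "poster.png"),
    ]
    return script[len(history)] if len(history) < len(script) else None

def run_agent(policy: Callable[[History], Optional[Step]],
              max_calls: int = 100) -> History:
    """Call tools one after another until the policy stops or the cap is hit."""
    history: History = []
    while len(history) < max_calls:
        step = policy(history)
        if step is None:
            break
        name, argument = step
        tool = TOOLS.get(name)
        observation = tool(argument) if tool else f"unknown tool: {name}"
        history.append((name, observation))
    return history

if __name__ == "__main__":
    for name, observation in run_agent(scripted_policy):
        print(name, "->", observation)
```

The `max_calls` cap is a design choice of this sketch: any loop that lets a model decide when to stop needs an upper bound on the number of calls.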

Real-World Applications: Demonstrated Use Cases

Scientific Research Assistance

A researcher, Brandon, demonstrated o3's capabilities using a 2015 physics research poster on calculating the proton isovector scalar charge. He instructed o3 to estimate the proton isovector scalar charge from the poster's content and compare the result with recent literature.

o3 processed the image, identified key charts and data points, performed calculations, retrieved relevant constants from literature, and ultimately provided an estimate of 1.5, compared to Brandon's original 1.2. It also compared its findings with recent research, noting that its result's precision was lower due to limitations of older experimental equipment, but the trend was consistent. The process, which would take a researcher days, was completed by o3 in just 20 seconds.

Personalized, Cross-Domain Content Generation

Another researcher, Eric, used o3's "memory" function, combining his interests in diving and music. The model was instructed to read news and teach something profound related to both interests, include a chart displaying data and relationships, and draft a blog post with space for the chart.

o3 connected the two interests, identified "coral reef sound wave repair" as a linking topic, and integrated content from a 2024 Nature Ecology article. It generated a coral coverage growth curve for 2010-2025 and an SVG diagram of an underwater sound wave device. The model also generated an APA-formatted bibliography with 3 papers and 2 technical reports. This showed how o3 can create professional-level educational content and automate much of the content creation process.

Performance Benchmarks: Superior Results

OpenAI showcased the performance of o3 and o4-mini on various benchmarks, covering areas such as mathematics, programming, scientific reasoning, and multimodal understanding.

Mathematics and Scientific Reasoning

  • AIME 2024/2025: The models, especially o4-mini, achieved near-human or superhuman accuracy, significantly boosted by Python tool integration.

  • GPQA Diamond: Both models showed significant improvements in high-difficulty scientific reasoning, with o3 performing best without tools, nearing a PhD level of reasoning.

Programming and Code Capabilities

  • Codeforces: With terminal tools, the models achieved a high Elo rating, placing them among the top 200 contestants globally.

  • SWE-Lancer: o3-high and o4-mini-high showed significantly higher earnings in real-world programming tasks, indicating strong commercial potential.

  • SWE-Bench: o3 and o4-mini-high achieved accuracy exceeding 68% on the software engineering validation test, a substantial improvement.

  • Aider Polyglot: o3-high and o4-mini-high performed well on both overall and differential editing tasks in multiple languages.

Multimodal Understanding and Reasoning

  • MMMU: The models achieved over 81% accuracy in university-level visual problem-solving, surpassing prior models.

  • MathVista: o4-mini and o3 achieved 84.4% and 87.5% accuracy respectively, again far surpassing prior models.

  • CharXiv-Reasoning: o3 and o4-mini displayed accuracy of 75.4% and 72% respectively, substantially outpacing o1's 55.1%.

  • Visual Search: o3 and o4-mini achieved accuracy exceeding 94%, substantially outpacing both o1 and GPT-4o.

Comprehensive Reasoning and Multi-Turn Command Following

  • Humanity’s Last Exam: o3, combined with Python and other tools, demonstrated greatly enhanced integrated reasoning capabilities.

  • Scale MultiChallenge: o3 showed excellent performance in multi-turn complex instruction following tasks.

Cost Efficiency

  • o4-mini offers stronger reasoning capabilities at lower inference cost, making it well suited to widespread deployment.

Performance Compared to o1

o3 represents a significant leap forward compared to o1, delivering stronger reasoning at a similar or lower cost. Tests like AIME 2025 and GPQA Pass@1 demonstrate o3's superior mathematical reasoning and handling of complex scientific problems. Furthermore, o3's performance improves more consistently and more rapidly than o1's as training compute increases.

Open Source Initiative: CodeX

OpenAI released the open-source, lightweight coding agent CodeX, designed to rival Claude Code. CodeX can securely connect AI models to local environments, automating code generation, file editing, and command execution. It features "suggestion mode" and "fully automatic mode," balancing security and efficiency.
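The trade-off between the two modes can be illustrated with a small approval gate. This is a sketch of the general idea only, not CodeX's actual code: the `Mode` enum, `apply_actions`, and the `approve` callback are hypothetical names introduced for illustration, and actions are plain strings rather than real edits or commands.

```python
from enum import Enum
from typing import Callable, List

class Mode(Enum):
    SUGGEST = "suggest"      # every proposed action needs user approval
    FULL_AUTO = "full-auto"  # actions run without asking

def apply_actions(actions: List[str], mode: Mode,
                  approve: Callable[[str], bool]) -> List[str]:
    """Return the actions that would be carried out under the given mode."""
    executed: List[str] = []
    for action in actions:
        # In suggestion mode, skip anything the user does not approve.
        if mode is Mode.SUGGEST and not approve(action):
            continue
        # A real agent would edit the file or run the command here.
        executed.append(action)
    return executed

if __name__ == "__main__":
    proposed = ["edit src/app.py", "run tests", "delete build/"]
    cautious_user = lambda action: not action.startswith("delete")
    print(apply_actions(proposed, Mode.SUGGEST, cautious_user))
    print(apply_actions(proposed, Mode.FULL_AUTO, cautious_user))
```

In suggestion mode the user's callback filters out the deletion, while in fully automatic mode all three actions go through, which is why automatic execution is usually paired with a restricted environment.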

Pricing

  • o3: Input: $10.00 per 1M tokens; Cached Input: $2.50 per 1M tokens; Output: $40.00 per 1M tokens

  • o4-mini: Input: $1.10 per 1M tokens; Cached Input: $0.275 per 1M tokens; Output: $4.40 per 1M tokens
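To make the price gap concrete, the rates above can be turned into a per-request cost estimate. The sketch below assumes that cached input tokens are billed at the cached rate in place of the regular input rate; the `request_cost` helper and its parameter names are illustrative, not part of any OpenAI SDK.

```python
# Prices in USD per 1M tokens, copied from the list above.
PRICES = {
    "o3": {"input": 10.0, "cached_input": 2.5, "output": 40.0},
    "o4-mini": {"input": 1.1, "cached_input": 0.275, "output": 4.4},
}

def request_cost(model: str, input_tokens: int, output_tokens: int,
                 cached_input_tokens: int = 0) -> float:
    """Estimate the USD cost of one request.

    Assumption: cached_input_tokens is the part of input_tokens served
    from cache and is billed at the cached rate instead of the input rate.
    """
    if cached_input_tokens < 0 or cached_input_tokens > input_tokens:
        raise ValueError("cached_input_tokens must be between 0 and input_tokens")
    rates = PRICES[model]
    uncached = input_tokens - cached_input_tokens
    total = (uncached * rates["input"]
             + cached_input_tokens * rates["cached_input"]
             + output_tokens * rates["output"])
    return total / 1_000_000

if __name__ == "__main__":
    for model in PRICES:
        cost = request_cost(model, input_tokens=1_000_000, output_tokens=1_000_000)
        print(f"{model}: ${cost:.2f}")
```

For one million input tokens and one million output tokens with no caching, this works out to about $50 on o3 versus about $5.50 on o4-mini, roughly a ninefold difference.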

Safety and Risk Mitigation

OpenAI has comprehensively rebuilt its safety training datasets for o3 and o4-mini, adding examples focused on biological threats, malware generation, and jailbreak prompts. According to OpenAI, these changes have resulted in superior performance on its internal safety benchmarks. Additionally, OpenAI has created system-level risk mitigation mechanisms, including a reasoning-based language model monitor to identify potentially dangerous prompts. According to OpenAI's latest Preparedness Framework, both o3 and o4-mini were assessed in the low-risk range for biological and chemical risks, cybersecurity risks, and AI self-improvement capacity.

Availability

o3 and o4-mini will gradually replace older models and are already available to Pro, Plus, and Team subscribers. Enterprise and education users will gain access within a week. Free users can test o4-mini's reasoning capabilities by clicking the "Think" button before querying. API tool calling capabilities will be added in the coming weeks. A $1 million open-source incentive program was also announced to encourage developers to innovate using the latest models and tools.

OpenAI's Vision

The OpenAI team emphasized that o3's training required 10 times the computation of o1, representing a significant investment in science and engineering. They are committed to further enhancing the practicality, efficiency, and safety of AI systems, with the ultimate goal of making AI accessible and beneficial to everyone.
