Video: [Artificial Intelligence] AI's Second Half (The Second Half) | Yao Shunyu | Halftime | AI Recipes | Prior Knowledge | Reasoning | Benchmarks | Rethinking Evaluation | A Turning Point

AI's Next Chapter: Rethinking AI Evaluation for Real-World Impact

Summary

Quick Abstract

Dive into the future of AI with insights from OpenAI's Yao Shunyu, an alumnus of Tsinghua University's Yao Class, as he discusses the evolution of AI beyond current training methods. Learn about the shift from problem-solving to problem-defining and the critical role of evaluation in AI's next phase, dubbed the "second half." Yao's proposed "recipe" for AI development combines large-scale language pre-training, scaling, and reasoning and acting.

Quick Takeaways:

  • AI's focus is shifting from training models to evaluating their real-world applicability.

  • The traditional AI development model, centered on beating benchmarks, is losing leverage as the standard recipe makes those gains routine and standardized.

  • New evaluation settings are needed to address the "utility problem," where AI excels in competitions but economic impact is limited.

  • Real-world AI requires interaction and long-term memory, unlike current isolated task evaluations.

  • The next phase involves developing evaluations for real-world usefulness alongside general task-solving methods, favoring general models over narrow, incremental improvements.

This article explores the future trajectory of Artificial Intelligence (AI) based on the insights of Yao Shunyu, a researcher at OpenAI and a graduate of Tsinghua University's Yao Class and Princeton University. Yao, known for his groundbreaking work on language agents, including Tree of Thoughts (ToT), ReAct, and the CoALA architecture, recently published a blog post titled "The Second Half," offering a perspective on the future direction of AI. This article delves into his ideas.

AI's "Halftime": A Review of the First Phase

We are currently at a unique stage in AI development, described by Yao as "halftime." The initial decades of AI focused heavily on developing new training methods and models, yielding significant advancements. These include fundamental innovations in search technology, deep reinforcement learning, and reasoning methodologies.

The Shift from Solving to Defining Problems

Deep reinforcement learning, once plagued by generalization challenges, has seen progress in finding solutions applicable across diverse tasks. This shift has caused the AI development focus to evolve from merely solving problems to defining them. Evaluation is now paramount, prompting a re-evaluation of existing AI training methodologies and a need for more scientific assessment of AI progress. This shift requires viewing AI development from a more product-oriented perspective.

The Importance of Foundational Training Methods

The impactful AI papers of the first phase, such as those introducing the Transformer architecture, AlexNet, and GPT-3, centered on foundational breakthroughs in training methods rather than benchmarks. While benchmarks like ImageNet are important, the papers detailing method innovations have received significantly more citations. This emphasis reflects the wide applicability and value of these methods across the AI landscape. The Transformer architecture, initially applied in machine translation, has been successfully adapted to computer vision, natural language processing, and reinforcement learning. The emphasis on method innovation effectively propelled AI advancements across various domains. However, continuous accumulation of these innovations has driven AI to an inflection point, triggering a fundamental shift in development focus.

The AI "Recipe": Language Pre-training, Scale, Reasoning, and Action

Yao proposes an AI "recipe" comprising large-scale language pre-training, scaling, and reasoning and acting. He uses reinforcement learning as a lens to explain why he frames these ingredients as a recipe.

Reinforcement Learning: More Than Just Algorithms

Reinforcement learning is often considered the "ultimate form" of AI: it theoretically guarantees an agent's success in games, and empirically it is hard to imagine superhuman systems like AlphaGo without it. Reinforcement learning comprises three parts: the algorithm, the environment, and prior knowledge. Historically, researchers emphasized algorithms (REINFORCE, DQN, etc.), treating the environment and prior knowledge as fixed or simplified factors.
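
To make these three ingredients concrete, here is a minimal sketch (an assumed setup, not taken from Yao's post) of a vanilla REINFORCE loop: the policy-gradient update is the algorithm, the Gym task is the environment, and the initial policy weights stand in for prior knowledge.

```python
# Assumed sketch of vanilla REINFORCE on a Gym task: the update rule is the
# "algorithm", CartPole is the "environment", and the initial policy weights
# play the role of "prior knowledge" (random here, pre-trained in modern systems).
import gymnasium as gym
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")                      # environment
policy = nn.Sequential(                            # prior: randomly initialised policy
    nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2)
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

for episode in range(500):                         # algorithm: REINFORCE
    obs, _ = env.reset()
    log_probs, rewards = [], []
    done = False
    while not done:
        dist = torch.distributions.Categorical(
            logits=policy(torch.as_tensor(obs, dtype=torch.float32))
        )
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated
    # Monte-Carlo return from each step, then one policy-gradient update.
    returns = torch.tensor([sum(rewards[t:]) for t in range(len(rewards))])
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```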

The Importance of Environment and Prior Knowledge

With the advent of deep reinforcement learning, the significance of the environment has become more apparent. Algorithm performance depends heavily on the environment used for development and testing, and neglecting it can yield algorithms that perform well in simple simulations but fail in real-world applications. OpenAI's early plan of building Gym, World of Bits, and Universe to turn the entire internet into one gigantic game environment did not fully deliver the expected results. While OpenAI achieved significant results, such as using reinforcement learning to master Dota and control a robot hand, it never cracked computer use or web navigation, and the trained agents were hard to transfer to other domains.

With the emergence of GPT-2 and GPT-3, the missing key ingredient was identified: prior knowledge. Powerful language pre-training distills general knowledge and language understanding into models, which can then be fine-tuned into web agents like WebGPT or chatbots like ChatGPT.
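
As a rough illustration of that pipeline (an assumed setup, not the actual WebGPT or ChatGPT recipe), one can start from a pre-trained language model that already carries broad prior knowledge and fine-tune it on task demonstrations; the dataset file and training settings below are hypothetical.

```python
# Assumed fine-tuning sketch: a pre-trained GPT-2 (the prior knowledge) is
# adapted to a task with supervised fine-tuning on demonstration text.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token          # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hypothetical file of task demonstrations (e.g. browsing transcripts or dialogues),
# one JSON object per line with a "text" field.
dataset = load_dataset("json", data_files="demonstrations.jsonl")["train"]
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="web-agent-sft", num_train_epochs=1),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # the generalist prior becomes a task-specific agent or chatbot
```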

Reasoning as a Key Element for Generalization

While language pre-training provides a solid foundation for chatbots, it struggles in domains like computer control or video games. These domains have different data distributions compared to internet text, limiting the effectiveness of supervised fine-tuning or reinforcement learning.

In 2019, Yao attempted to use GPT-2 to solve text-based games and found that the agent needed millions of reinforcement learning steps to reach a modest level of play, and the experience it gained transferred poorly to new games. Humans, by contrast, can play a new game reasonably well without any prior experience, because they can think abstractly; this ability to reason is crucial for handling new situations. Reasoning can be viewed as a special kind of action: it operates in the open, unbounded space of thought rather than on the environment, and it is language that lets agents generalize through reasoning. Once the right prior knowledge and an appropriate reinforcement learning environment are in place, the learning algorithm itself can be simple. Building on this understanding, researchers have developed models such as the o-series and R1, along with agents that can operate computers, paving the way for further advances.
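
The toy sketch below illustrates "reasoning as an action" in the spirit of ReAct; it is only an illustration, and call_llm and run_tool are hypothetical helpers rather than an existing API. A "Thought" step changes nothing in the environment and merely extends the context the model conditions on, while other actions go to the environment and return observations.

```python
# Toy agent loop where thinking is itself an action (in the spirit of ReAct).
# call_llm and run_tool are hypothetical stand-ins for a language model call
# and a tool/environment interface.
def react_agent(task: str, max_steps: int = 10) -> str:
    context = [f"Task: {task}"]
    for _ in range(max_steps):
        step = call_llm(context)  # e.g. "Thought: ...", "Action: search[query]", "Finish: answer"
        context.append(step)
        if step.startswith("Thought:"):
            continue                          # reasoning: no environment change, only more context
        if step.startswith("Finish:"):
            return step.removeprefix("Finish:").strip()
        observation = run_tool(step)          # external action: the environment responds
        context.append(f"Observation: {observation}")
    return "No answer found within the step budget."
```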

Rethinking Evaluation: The Key to AI's Next Phase

The conventional mode of AI development, improving benchmark scores through new training methods, is reaching its limits. The "recipe" described above has made benchmark improvement standardized and industrialized, reducing the need for novel ideas: a new method tailored to a specific task might lift performance by 5%, while the next o-series model can deliver a 30% gain with no task-specific optimization at all. As the recipe scales, benchmarks are being solved faster and faster.

Therefore, in the next phase of AI development, it is necessary to fundamentally rethink evaluation methods, not just create more difficult benchmarks. We need to question current assumptions and create entirely new evaluation systems, forcing the invention of methods that surpass the existing "recipe".

The Importance of Real-World Utility

AI's success in games like chess and Go, its academic performance surpassing most humans, and its achievements in Olympiad-level competitions haven't translated into significant economic or GDP changes. Yao terms this the utility problem, considering it a crucial challenge for AI development.

This problem arises from the discrepancy between existing evaluation settings and real-world conditions.

  • Interaction with Humans: Evaluations typically assume autonomous operation, where an agent receives a task input and completes it independently for a reward. Real-world agents, however, need to interact with humans throughout the process.

  • Independent and Identically Distributed (IID) Data: Evaluations often assume IID data: each task is processed independently and scores are averaged. In reality, task solving is sequential, and experience gained on one task informs subsequent tasks. A software engineer's familiarity with a codebase grows over time, allowing them to solve problems more effectively; agents evaluated task by task cannot accumulate this experience (a sketch of the contrast follows below).

Generic methods built under the current assumptions may stop being as effective once those assumptions change. Therefore, in the "second half" of AI, we need to develop new evaluation setups or tasks that reflect real-world utility, use generic methods to solve them or augment those methods with novel components, and then repeat the cycle.
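
A minimal sketch of the contrast drawn in the bullets above, under assumed interfaces (agent.solve and score are hypothetical, not an existing benchmark API): the standard IID protocol scores each task from a blank slate, while a sequential protocol lets the agent carry memory from earlier tasks into later ones.

```python
# Hypothetical evaluation interfaces: agent.solve and score are assumed for illustration.
def evaluate_iid(agent, tasks):
    # Standard setting: each task is solved in isolation and scores are averaged.
    return sum(score(agent.solve(task)) for task in tasks) / len(tasks)

def evaluate_sequential(agent, tasks):
    # "Second half" setting: tasks arrive in order and the agent keeps a memory,
    # like an engineer growing familiar with a codebase over time.
    memory, scores = [], []
    for task in tasks:
        result = agent.solve(task, memory=memory)   # later tasks see earlier experience
        memory.append((task, result))
        scores.append(score(result))
    return sum(scores) / len(scores)
```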
