Video: [Artificial Intelligence] AI's Second Half (The Second Half) | Yao Shunyu | Halftime | AI Recipes | Prior Knowledge | Reasoning | Benchmarks | Rethinking Evaluation | A Turning Point

AI's Next Chapter: Rethinking AI Evaluation for Real-World Impact

Summary

Quick Abstract

Dive into the future of AI with insights from OpenAI's Yao Shunyu, an alumnus of Tsinghua University's Yao Class, as he discusses the evolution of AI beyond current training methods. Learn about the shift from problem-solving to problem-defining and the critical role of evaluation in AI's next phase, dubbed the "second half." Yao's proposed "recipe" for AI development combines large-scale language pre-training, scaling, and reasoning and acting.

Quick Takeaways:

  • AI's focus is shifting from training models to evaluating their real-world applicability.

  • The traditional AI development model, centered on beating benchmarks, is losing leverage as the standard recipe makes those gains routine and standardized.

  • New evaluation settings are needed to address the "utility problem," where AI excels in competitions but economic impact is limited.

  • Real-world AI requires interaction and long-term memory, unlike current isolated task evaluations.

  • The next phase involves developing evaluations for real-world usefulness alongside general task-solving methods, favoring general models over narrow, incremental improvements.

This article explores the future trajectory of Artificial Intelligence (AI) based on the insights of Yao Shunyu, a researcher at OpenAI and a graduate of Tsinghua University's Yao Class and Princeton University. Yao, known for his groundbreaking work on language agents, including Tree of Thoughts (ToT), ReAct, and the CoALA architecture, recently published a blog post titled "The Second Half," offering a perspective on the future direction of AI. This article delves into his ideas.

AI's "Halftime": A Review of the First Phase

We are currently at a unique stage in AI development, described by Yao as "halftime." The initial decades of AI focused heavily on developing new training methods and models, yielding significant advancements. These include fundamental innovations in search technology, deep reinforcement learning, and reasoning methodologies.

The Shift from Solving to Defining Problems

Deep reinforcement learning, once plagued by generalization challenges, has seen progress in finding solutions applicable across diverse tasks. This shift has caused the AI development focus to evolve from merely solving problems to defining them. Evaluation is now paramount, prompting a re-evaluation of existing AI training methodologies and a need for more scientific assessment of AI progress. This shift requires viewing AI development from a more product-oriented perspective.

The Importance of Foundational Training Methods

The impactful AI papers of the first phase, such as those introducing the Transformer architecture, AlexNet, and GPT-3, centered on foundational breakthroughs in training methods rather than benchmarks. While benchmarks like ImageNet are important, the papers detailing method innovations have received significantly more citations. This emphasis reflects the wide applicability and value of these methods across the AI landscape. The Transformer architecture, initially applied in machine translation, has been successfully adapted to computer vision, natural language processing, and reinforcement learning. The emphasis on method innovation effectively propelled AI advancements across various domains. However, continuous accumulation of these innovations has driven AI to an inflection point, triggering a fundamental shift in development focus.

The AI "Recipe": Language Pre-training, Scale, Reasoning, and Action

Yao proposes an AI "recipe" comprising large-scale language pre-training, scaling, and reasoning and acting. He uses reinforcement learning as a lens to explain why he frames these ingredients as a recipe.

Reinforcement Learning: More Than Just Algorithms

Reinforcement learning is often considered the "ultimate form" of AI: it theoretically guarantees an agent's success in games, and empirically it is hard to imagine superhuman systems like AlphaGo without it. Reinforcement learning comprises three parts: the algorithm, the environment, and prior knowledge. Historically, researchers emphasized algorithms (REINFORCE, DQN, etc.), treating the environment and prior knowledge as fixed or simplified factors.
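
To make these three ingredients concrete, here is a minimal sketch (an assumed setup, not taken from Yao's post) of a vanilla REINFORCE loop: the policy-gradient update is the algorithm, the Gym task is the environment, and the initial policy weights stand in for prior knowledge.

```python
# Assumed sketch of vanilla REINFORCE on a Gym task: the update rule is the
# "algorithm", CartPole is the "environment", and the initial policy weights
# play the role of "prior knowledge" (random here, pre-trained in modern systems).
import gymnasium as gym
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")                      # environment
policy = nn.Sequential(                            # prior: randomly initialised policy
    nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2)
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

for episode in range(500):                         # algorithm: REINFORCE
    obs, _ = env.reset()
    log_probs, rewards = [], []
    done = False
    while not done:
        dist = torch.distributions.Categorical(
            logits=policy(torch.as_tensor(obs, dtype=torch.float32))
        )
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated
    # Monte-Carlo return from each step, then one policy-gradient update.
    returns = torch.tensor([sum(rewards[t:]) for t in range(len(rewards))])
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```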

The Importance of Environment and Prior Knowledge

With the advent of deep reinforcement learning, the significance of the environment has become more apparent. Algorithm performance depends heavily on the environment used for development and testing, and neglecting it can yield algorithms that perform well in simple simulations but fail in real-world applications. OpenAI's early plan of building Gym, World of Bits, and Universe to turn the entire internet into one gigantic game environment did not fully deliver the expected results. While OpenAI achieved significant results, such as using reinforcement learning to master Dota and control a robot hand, it never cracked computer use or web navigation, and the trained agents were hard to transfer to other domains.

With the emergence of GPT-2 and GPT-3, the missing key ingredient was identified: prior knowledge. Powerful language pre-training distills general knowledge and language understanding into models, which can then be fine-tuned into web agents like WebGPT or chatbots like ChatGPT.
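
As a rough illustration of that pipeline (an assumed setup, not the actual WebGPT or ChatGPT recipe), one can start from a pre-trained language model that already carries broad prior knowledge and fine-tune it on task demonstrations; the dataset file and training settings below are hypothetical.

```python
# Assumed fine-tuning sketch: a pre-trained GPT-2 (the prior knowledge) is
# adapted to a task with supervised fine-tuning on demonstration text.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token          # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hypothetical file of task demonstrations (e.g. browsing transcripts or dialogues),
# one JSON object per line with a "text" field.
dataset = load_dataset("json", data_files="demonstrations.jsonl")["train"]
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="web-agent-sft", num_train_epochs=1),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # the generalist prior becomes a task-specific agent or chatbot
```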

Reasoning as a Key Element for Generalization

While language pre-training provides a solid foundation for chatbots, it struggles in domains like computer control or video games. These domains have different data distributions compared to internet text, limiting the effectiveness of supervised fine-tuning or reinforcement learning.

In 2019, Yao attempted to use GPT-2 to solve text-based games and found that the agent needed millions of reinforcement learning steps to reach a modest level of play, and the experience it gained transferred poorly to new games. Humans, by contrast, can play a new game reasonably well without any prior experience, because they can think abstractly; this ability to reason is crucial for handling new situations. Reasoning can be viewed as a special kind of action: it operates in the open, unbounded space of thought rather than on the environment, and it is language that lets agents generalize through reasoning. Once the right prior knowledge and an appropriate reinforcement learning environment are in place, the learning algorithm itself can be simple. Building on this understanding, researchers have developed models such as the o-series and R1, along with agents that can operate computers, paving the way for further advances.
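
The toy sketch below illustrates "reasoning as an action" in the spirit of ReAct; it is only an illustration, and call_llm and run_tool are hypothetical helpers rather than an existing API. A "Thought" step changes nothing in the environment and merely extends the context the model conditions on, while other actions go to the environment and return observations.

```python
# Toy agent loop where thinking is itself an action (in the spirit of ReAct).
# call_llm and run_tool are hypothetical stand-ins for a language model call
# and a tool/environment interface.
def react_agent(task: str, max_steps: int = 10) -> str:
    context = [f"Task: {task}"]
    for _ in range(max_steps):
        step = call_llm(context)  # e.g. "Thought: ...", "Action: search[query]", "Finish: answer"
        context.append(step)
        if step.startswith("Thought:"):
            continue                          # reasoning: no environment change, only more context
        if step.startswith("Finish:"):
            return step.removeprefix("Finish:").strip()
        observation = run_tool(step)          # external action: the environment responds
        context.append(f"Observation: {observation}")
    return "No answer found within the step budget."
```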

Rethinking Evaluation: The Key to AI's Next Phase

The conventional mode of AI development, improving benchmark scores through new training methods, is reaching its limits. The "recipe" described above has made benchmark improvement standardized and industrialized, reducing the need for novel ideas: a new method tailored to a specific task might lift performance by 5%, while the next o-series model can deliver a 30% gain with no task-specific optimization at all. As the recipe scales, benchmarks are being solved faster and faster.

Therefore, in the next phase of AI development, it is necessary to fundamentally rethink evaluation methods, not just create more difficult benchmarks. We need to question current assumptions and create entirely new evaluation systems, forcing the invention of methods that surpass the existing "recipe".

The Importance of Real-World Utility

AI's success in games like chess and Go, its academic performance surpassing most humans, and its achievements in Olympiad-level competitions haven't translated into significant economic or GDP changes. Yao terms this the utility problem, considering it a crucial challenge for AI development.

This problem arises from the discrepancy between existing evaluation settings and real-world conditions.

  • Interaction with Humans: Evaluations typically assume autonomous operation, where an agent receives a task input and completes it independently for a reward. Real-world agents, however, need to interact with humans throughout the process.

  • Independent and Identically Distributed (IID) Data: Evaluations often assume IID data: each task is processed independently and scores are averaged. In reality, task solving is sequential, and experience gained on one task informs subsequent tasks. A software engineer's familiarity with a codebase grows over time, allowing them to solve problems more effectively; agents evaluated task by task cannot accumulate this experience (a sketch of the contrast follows below).

Generic methods built under the current assumptions may stop being as effective once those assumptions change. Therefore, in the "second half" of AI, we need to develop new evaluation setups or tasks that reflect real-world utility, use generic methods to solve them or augment those methods with novel components, and then repeat the cycle.
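
A minimal sketch of the contrast drawn in the bullets above, under assumed interfaces (agent.solve and score are hypothetical, not an existing benchmark API): the standard IID protocol scores each task from a blank slate, while a sequential protocol lets the agent carry memory from earlier tasks into later ones.

```python
# Hypothetical evaluation interfaces: agent.solve and score are assumed for illustration.
def evaluate_iid(agent, tasks):
    # Standard setting: each task is solved in isolation and scores are averaged.
    return sum(score(agent.solve(task)) for task in tasks) / len(tasks)

def evaluate_sequential(agent, tasks):
    # "Second half" setting: tasks arrive in order and the agent keeps a memory,
    # like an engineer growing familiar with a codebase over time.
    memory, scores = [], []
    for task in tasks:
        result = agent.solve(task, memory=memory)   # later tasks see earlier experience
        memory.append((task, result))
        scores.append(score(result))
    return sum(scores) / len(scores)
```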
