Introduction
Welcome and Host Introduction
Hello everyone! Welcome back to episode 81 of the Stanford MLSys seminar series. I'm Dan Fu. Today, I'm joined by Ben Spector, who is filling in for Simon this week. Starting next calendar year, in the Winter quarter, he'll be part of a group of people taking over the seminar. It's great to get to know him; this is like his internship week. If you don't like him, feel free to leave comments below; we'll ignore them, but we'll still take the feedback.
Ben Spectre's Introduction
Ben is a second-year PhD student at Stanford working at the intersection of AI and systems. Previously, he worked on speculative decoding to speed up language model inference. These days, he's more interested in training: how to co-design hardware and training algorithms.
Guest Speaker Introduction
Today, we're joined by Martin Maas, a staff research scientist at Google DeepMind. He's interested in language runtimes, computer architecture, systems, and machine learning, with a focus on the machine-learning-for-systems side of these problems. Before Google, he did his PhD at UC Berkeley. Today, he's going to talk about a taxonomy of machine learning for systems problems.
Machine Learning for Systems: Definition and Common Patterns
Definition of Machine Learning for Systems
Systems make many decisions about how to execute computer programs; a system policy describes how those decisions are made. For example, a compiler pass, a branch predictor, and hardware components that decide how a program runs all implement system policies. Machine learning for systems means applying machine learning in the implementation of such a system policy.
Common Patterns in Machine Learning for Systems
In recent years, there has been a lot of work on using machine learning in systems. A common approach is to take an existing system policy and replace it with machine learning. However, this approach raises several questions:

1. What makes machine learning better than the baseline?
2. What is the model actually learning? Is it generalizing or overfitting?
3. How do we know the best learning approach for a given problem?
4. Could the gain achieved by machine learning have been achieved more cheaply using conventional techniques, or a cheaper machine learning approach?
To answer these questions, we need a shared terminology and methodology.
A Taxonomy of Machine Learning for Systems
Types of System Policies
There are four different types of system policies:

1. Ad hoc policy: Simple decision logic put into a system. For example, an ad hoc policy in a compiler might inline a function if it has fewer than 10 instructions.
2. Empirically tuned policy: More complex policies developed using benchmarks to improve performance. In a compiler, this could be an algorithm that looks at the call graph of a program to determine inlining.
3. Data-driven policy: Policies that measure workloads and use the measurements to make decisions. Feedback-directed optimization in compilers is a classic example.
4. Adaptive data-driven policy: Similar to data-driven policies, but able to revisit decisions at runtime. For example, a just-in-time compiler can continuously measure an application and revise inlining decisions.
Machine learning typically allows us to move from an empirically tuned policy to a data-driven one.
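To make the distinction concrete, here is a minimal Python sketch contrasting an ad hoc inlining policy with a data-driven one. The function names, the 10-instruction threshold, and the profile numbers are all illustrative assumptions, not taken from any real compiler:

```python
def ad_hoc_should_inline(num_instructions):
    # Ad hoc policy: a fixed rule chosen by a developer.
    return num_instructions < 10

def data_driven_should_inline(num_instructions, profile):
    # Data-driven policy: measure real workloads, then inline only
    # functions whose size class was profitable on average.
    # profile maps function size -> measured speedup from inlining.
    avg_speedup = profile.get(num_instructions, 0.0)
    return avg_speedup > 1.0

# Made-up measurements: small functions benefit, larger ones regress.
profile = {4: 1.3, 8: 1.1, 12: 0.95}
print(ad_hoc_should_inline(12))                # fixed threshold says no
print(data_driven_should_inline(8, profile))   # measured as profitable
print(data_driven_should_inline(12, profile))  # measured as a regression
```

An adaptive data-driven policy would additionally re-measure and update `profile` while the program runs, rather than fixing it offline.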
Shared Patterns in Machine Learning for Systems
There are five common patterns in how machine learning enables data-driven policies:

1. Anomaly detection: Detecting performance regressions in programs. Autoencoders or clustering techniques are commonly used.
2. Forecasting: Predicting future application behavior, such as resource demands. Time series modeling, neural networks, and decision trees can be used.
3. Extrapolation: Applying learned patterns to new data. For example, predicting whether a new program is scale-up or scale-out. Supervised learning techniques like neural networks and decision trees are used.
4. Discovery: Using machine learning to come up with new policies that can be deployed in the system. Reinforcement learning or imitation learning can be used.
5. Optimization: Solving optimization problems in systems, such as in machine learning for hardware design or auto-tuners.
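As a tiny illustration of the forecasting pattern, the sketch below predicts the next resource demand from a history of measurements using exponential smoothing. Real systems would use much richer time-series models; the smoothing factor and the CPU numbers here are arbitrary assumptions:

```python
def forecast_demand(history, alpha=0.5):
    """One-step-ahead forecast via exponential smoothing.

    history: past measurements, oldest first.
    alpha: weight on the most recent observation (assumed value).
    """
    estimate = history[0]
    for observed in history[1:]:
        estimate = alpha * observed + (1 - alpha) * estimate
    return estimate

cpu_usage = [40, 42, 44, 60, 58]  # made-up per-minute CPU percentages
print(forecast_demand(cpu_usage))
```

A system policy could then provision capacity based on the forecast rather than on the last observed value alone, smoothing out transient spikes.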
Examples of Machine Learning for Systems
AI for Code and Memory Management
AI for code can help at lower levels of the system stack by reasoning about the intent of code. Memory allocators are a good example. In a C++ application, the memory allocator decides where to place objects in memory. To do this efficiently, it needs to know the lifetimes of objects, which depend on the code's behavior.
By treating symbolized stack traces as text, a language model can be used to predict the lifetime of an object. This prediction can then be used to build a more efficient memory allocator. However, running the model at every allocation is too expensive, so a caching mechanism is used.
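The caching idea can be sketched as follows: the expensive model is consulted only once per distinct symbolized stack trace, and later allocations from the same site reuse the cached prediction. Here `predict_lifetime` is a crude keyword stand-in for the real language model, and all names and trace strings are hypothetical:

```python
import functools

def predict_lifetime(stack_trace):
    # Stand-in for the expensive learned model; the keyword rule below
    # is purely illustrative.
    return "long" if "cache_init" in stack_trace else "short"

@functools.lru_cache(maxsize=4096)
def cached_lifetime(stack_trace):
    # The model runs at most once per distinct allocation site.
    return predict_lifetime(stack_trace)

def allocate(size, stack_trace):
    # Group objects with similar predicted lifetimes in the same region,
    # so whole regions can be reclaimed together.
    region = cached_lifetime(stack_trace)
    return f"{size} bytes -> {region}-lived region"

print(allocate(64, "main;server_loop;handle_request"))
print(allocate(1024, "main;cache_init;build_table"))
```

The cache is what makes the approach affordable: the per-allocation fast path is a hash lookup, not a model invocation.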
This is an example of the extrapolation pattern, where machine learning is used to predict one property, and the rest of the system is built in a traditional way.
Allocation in Machine Learning Accelerators
Machine learning accelerators, such as Google's data center TPUs and the Google Pixel phone's accelerator, have memory allocation problems. Given a sequence of fixed-size buffers with known start and end times, the goal is to assign each buffer an address range so that no two buffers that are live at the same time overlap, without exceeding the memory capacity.
This is an NP-hard problem. Traditional approaches include heuristics (fast, but they may fail to find a solution) and solvers (able to handle complex inputs, but they can be slow). The TelaMalloc allocator combines heuristics and solvers to get the best of both worlds.
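The heuristic side of the problem can be sketched with a simple first-fit placement: each buffer has a live interval and a size, and gets the lowest address that avoids conflicts with buffers that are live at the same time. This is an illustrative toy, not TelaMalloc's actual algorithm; like any greedy heuristic, it is fast but may return no solution even when a solver would find one:

```python
def first_fit(buffers, capacity):
    """buffers: list of (start, end, size) with live interval [start, end).

    Returns a list of offsets, or None if the heuristic fails.
    """
    placed = []   # (start, end, offset, size) of committed buffers
    offsets = []
    for start, end, size in buffers:
        offset = 0
        while True:
            conflict = None
            for s, e, o, sz in placed:
                time_overlap = start < e and s < end
                addr_overlap = offset < o + sz and o < offset + size
                if time_overlap and addr_overlap:
                    conflict = o + sz  # retry just past this buffer
                    break
            if conflict is None:
                break
            offset = conflict
        if offset + size > capacity:
            return None  # heuristic gave up; a solver might still succeed
        placed.append((start, end, offset, size))
        offsets.append(offset)
    return offsets

# Two buffers that overlap in time get disjoint addresses; a third buffer
# that is live later can reuse address 0.
print(first_fit([(0, 4, 100), (2, 6, 100), (5, 8, 100)], capacity=256))
# -> [0, 100, 0]
```

The hard instances are those where greedy choices like this paint the allocator into a corner, which is exactly where backtracking, solvers, or a learned policy come in.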
Even with TelaMalloc, there are still inputs that it can't solve. Machine learning can be used to develop a backtracking heuristic. This is an example of the discovery pattern, where a new policy is learned offline and plugged into the system.
Conclusion and Discussion
Takeaways
- Machine learning for systems is a growing and promising research area with many unsolved problems.
- As the field evolves, it's important to build common data sets, benchmarks, and best practices.
- Focus on the specific sub-aspect of the problem that machine learning can solve to maximize impact and practicality.
Q&A Session
During the Q&A session, Martin Maas answered questions about:
- cost metrics in machine learning for systems
- generalization across programs
- the impact of machine learning on user experience
- previously unknown system properties
- the use of deep learning versus other machine learning techniques
- the allocation of bandwidth and compute to machine learning models
- anomaly detection using autoencoders
- learning from machine learning systems
- local versus global optimization
- how machine learning will be integrated into the system stack in the future
- the impact of machine learning products on the machine learning for systems conversation
- the security implications of learned systems
Future Plans
Martin Maas is currently excited about AI for code and its implications.
Closing
The seminar ended with Dan Fu thanking Martin Maas for his presentation and inviting viewers to visit the Stanford MLSys website for more information about the seminar series.