Introduction
Welcome and Host Introduction
Hello everyone! Welcome back to episode 81 of the Stanford MLSys seminar series. I'm Dan Fu. Today, I'm joined by Ben Spector, who is filling in for Simon this week. Starting next calendar year, in the Winter quarter, he'll be part of a group of people taking over the seminar. It's great to get to know him; this is like his internship week. If you don't like him, feel free to leave comments below; we'll ignore them, but we'll still take the feedback.
Ben Spectre's Introduction
Ben is a second-year PhD student at Stanford working at the intersection of AI and systems. Previously, he worked on speculative decoding to speed up language model inference. These days, he's more interested in training: how to co-design hardware and training algorithms.
Guest Speaker Introduction
Today, we're joined by Martin Maas, a staff research scientist at Google DeepMind. He's interested in language runtimes, computer architecture, systems, and machine learning, with a focus on the machine-learning-for-systems side of these problems. Before Google, he did his PhD at UC Berkeley. Today, he's going to talk about a taxonomy of machine learning for systems problems.
Machine Learning for Systems: Definition and Common Patterns
Definition of Machine Learning for Systems
Systems make many decisions about how to execute computer programs; a system policy describes how those decisions are made. For example, a compiler pass, a branch predictor, and hardware components that decide how a program runs all implement system policies. Machine learning for systems means applying machine learning in the implementation of such a system policy.
Common Patterns in Machine Learning for Systems
In recent years, there has been a lot of work on using machine learning in systems. A common approach is to take an existing system policy and replace it with machine learning. However, this approach raises several questions:

1. What makes machine learning better than the baseline?
2. What is the model actually learning? Is it generalizing or overfitting?
3. How do we know the best learning approach for a given problem?
4. Could the gain achieved by machine learning have been achieved more cheaply using conventional techniques, or a cheaper machine learning approach?
To answer these questions, we need a shared terminology and methodology.
A Taxonomy of Machine Learning for Systems
Types of System Policies
There are four different types of system policies:

1. Ad hoc policy: Simple decision logic put into a system. For example, an ad hoc policy in a compiler might inline a function if it has fewer than 10 instructions.
2. Empirically tuned policy: More complex policies developed using benchmarks to improve performance. In a compiler, this could be an algorithm that looks at the call graph of a program to determine inlining.
3. Data-driven policy: Policies that measure workloads and use the measurements to make decisions. Feedback-directed optimization in compilers is a classic example.
4. Adaptive data-driven policy: Similar to data-driven policies, but able to revisit decisions at runtime. For example, a just-in-time compiler can continuously measure an application and revise inlining decisions.
Machine learning typically allows us to move from an empirically tuned policy to a data-driven one.
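To make the distinction concrete, here is a minimal Python sketch contrasting an ad hoc inlining policy with a data-driven one. The function names, the 10-instruction threshold, and the profile numbers are all illustrative assumptions, not taken from any real compiler:

```python
def ad_hoc_should_inline(num_instructions):
    # Ad hoc policy: a fixed rule chosen by a developer.
    return num_instructions < 10

def data_driven_should_inline(num_instructions, profile):
    # Data-driven policy: measure real workloads, then inline only
    # functions whose size class was profitable on average.
    # profile maps function size -> measured speedup from inlining.
    avg_speedup = profile.get(num_instructions, 0.0)
    return avg_speedup > 1.0

# Made-up measurements: small functions benefit, larger ones regress.
profile = {4: 1.3, 8: 1.1, 12: 0.95}
print(ad_hoc_should_inline(12))                # fixed threshold says no
print(data_driven_should_inline(8, profile))   # measured as profitable
print(data_driven_should_inline(12, profile))  # measured as a regression
```

An adaptive data-driven policy would additionally re-measure and update `profile` while the program runs, rather than fixing it offline.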
Shared Patterns in Machine Learning for Systems
There are five common patterns in how machine learning enables data-driven policies:

1. Anomaly detection: Detecting performance regressions in programs. Autoencoders or clustering techniques are commonly used.
2. Forecasting: Predicting future application behavior, such as resource demands. Time series modeling, neural networks, and decision trees can be used.
3. Extrapolation: Applying learned patterns to new data. For example, predicting whether a new program is scale-up or scale-out. Supervised learning techniques like neural networks and decision trees are used.
4. Discovery: Using machine learning to come up with new policies that can be deployed in the system. Reinforcement learning or imitation learning can be used.
5. Optimization: Solving optimization problems in systems, such as in machine learning for hardware design or auto-tuners.
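As a tiny illustration of the forecasting pattern, the sketch below predicts the next resource demand from a history of measurements using exponential smoothing. Real systems would use much richer time-series models; the smoothing factor and the CPU numbers here are arbitrary assumptions:

```python
def forecast_demand(history, alpha=0.5):
    """One-step-ahead forecast via exponential smoothing.

    history: past measurements, oldest first.
    alpha: weight on the most recent observation (assumed value).
    """
    estimate = history[0]
    for observed in history[1:]:
        estimate = alpha * observed + (1 - alpha) * estimate
    return estimate

cpu_usage = [40, 42, 44, 60, 58]  # made-up per-minute CPU percentages
print(forecast_demand(cpu_usage))
```

A system policy could then provision capacity based on the forecast rather than on the last observed value alone, smoothing out transient spikes.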
Examples of Machine Learning for Systems
AI for Code and Memory Management
AI for code can help at lower levels of the system stack by reasoning about the intent of code. Memory allocators are a good example. In a C++ application, the memory allocator decides where to place objects in memory. To do this efficiently, it needs to know the lifetimes of objects, which depend on the code's behavior.
By treating symbolized stack traces as text, a language model can be used to predict the lifetime of an object. This prediction can then be used to build a more efficient memory allocator. However, running the model at every allocation is too expensive, so a caching mechanism is used.
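The caching idea can be sketched as follows: the expensive model is consulted only once per distinct symbolized stack trace, and later allocations from the same site reuse the cached prediction. Here `predict_lifetime` is a crude keyword stand-in for the real language model, and all names and trace strings are hypothetical:

```python
import functools

def predict_lifetime(stack_trace):
    # Stand-in for the expensive learned model; the keyword rule below
    # is purely illustrative.
    return "long" if "cache_init" in stack_trace else "short"

@functools.lru_cache(maxsize=4096)
def cached_lifetime(stack_trace):
    # The model runs at most once per distinct allocation site.
    return predict_lifetime(stack_trace)

def allocate(size, stack_trace):
    # Group objects with similar predicted lifetimes in the same region,
    # so whole regions can be reclaimed together.
    region = cached_lifetime(stack_trace)
    return f"{size} bytes -> {region}-lived region"

print(allocate(64, "main;server_loop;handle_request"))
print(allocate(1024, "main;cache_init;build_table"))
```

The cache is what makes the approach affordable: the per-allocation fast path is a hash lookup, not a model invocation.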
This is an example of the extrapolation pattern, where machine learning is used to predict one property, and the rest of the system is built in a traditional way.
Allocation in Machine Learning Accelerators
Machine learning accelerators, such as Google's data center TPUs and the Google Pixel phone's accelerator, have memory allocation problems. Given a sequence of fixed-size buffers with known start and end times, the goal is to assign each buffer an address range so that no two buffers that are live at the same time overlap, without exceeding the memory capacity.
This is an NP-hard problem. Traditional approaches include heuristics (fast, but they may fail to find a solution) and solvers (able to handle complex inputs, but they can be slow). The TelaMalloc allocator combines heuristics and solvers to get the best of both worlds.
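The heuristic side of the problem can be sketched with a simple first-fit placement: each buffer has a live interval and a size, and gets the lowest address that avoids conflicts with buffers that are live at the same time. This is an illustrative toy, not TelaMalloc's actual algorithm; like any greedy heuristic, it is fast but may return no solution even when a solver would find one:

```python
def first_fit(buffers, capacity):
    """buffers: list of (start, end, size) with live interval [start, end).

    Returns a list of offsets, or None if the heuristic fails.
    """
    placed = []   # (start, end, offset, size) of committed buffers
    offsets = []
    for start, end, size in buffers:
        offset = 0
        while True:
            conflict = None
            for s, e, o, sz in placed:
                time_overlap = start < e and s < end
                addr_overlap = offset < o + sz and o < offset + size
                if time_overlap and addr_overlap:
                    conflict = o + sz  # retry just past this buffer
                    break
            if conflict is None:
                break
            offset = conflict
        if offset + size > capacity:
            return None  # heuristic gave up; a solver might still succeed
        placed.append((start, end, offset, size))
        offsets.append(offset)
    return offsets

# Two buffers that overlap in time get disjoint addresses; a third buffer
# that is live later can reuse address 0.
print(first_fit([(0, 4, 100), (2, 6, 100), (5, 8, 100)], capacity=256))
# -> [0, 100, 0]
```

The hard instances are those where greedy choices like this paint the allocator into a corner, which is exactly where backtracking, solvers, or a learned policy come in.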
Even with TelaMalloc, there are still inputs that it can't solve. Machine learning can be used to develop a backtracking heuristic. This is an example of the discovery pattern, where a new policy is learned offline and plugged into the system.
Conclusion and Discussion
Takeaways
- Machine learning for systems is a growing and promising research area with many unsolved problems.
- As the field evolves, it's important to build common data sets, benchmarks, and best practices.
- Focus on the specific sub-aspect of the problem that machine learning can solve to maximize impact and practicality.
Q&A Session
During the Q&A session, Martin Maas answered questions about:
- cost metrics in machine learning for systems
- generalization across programs
- the impact of machine learning on user experience
- previously unknown system properties
- the use of deep learning versus other machine learning techniques
- the allocation of bandwidth and compute to machine learning models
- anomaly detection using autoencoders
- learning from machine learning systems
- local versus global optimization
- how machine learning will be integrated into the system stack in the future
- the impact of machine learning products on the machine learning for systems conversation
- the security implications of learned systems
Future Plans
Martin Maas is currently excited about AI for code and its implications.
Closing
The seminar ended with Dan Fu thanking Martin Maas for his presentation and inviting viewers to visit the Stanford MLSys website for more information about the seminar series.