Stanford MLSys Seminars: ML for Systems: A Taxonomy of Problems & AI for Code (Martin Maas, Stanford)

Dive into the world of Machine Learning for Systems (ML4Sys)! This discussion with Martin Maas from Google DeepMind explores how ML is changing system policies, from data center scheduling to hardware design. Learn about a taxonomy for understanding ML4Sys that uncovers common patterns and methodologies and addresses key questions about when ML is effective.

Quick Takeaways:

ML bridges the gap between empirically tuned & data-driven policies.
ML enables data-driven policies through anomaly detection, forecasting, extrapolation, discovery & optimization.
AI for code can enhance memory management by predicting object lifetimes using stack trace analysis.
ML helps with memory allocation on ML accelerators by learning backtracking policies, as seen in the TelaMalloc allocator.
Shared datasets & benchmarks are crucial for advancing the ML4Sys field.
Focusing ML efforts on specific, learnable sub-problems maximizes impact.

Machine Learning for Systems: A Taxonomy and Methodology

This article summarizes a seminar on the application of machine learning to systems, focusing on a taxonomy and methodology for approaching these problems. The talk emphasizes the need for shared terminology and a structured approach to understand the benefits and limitations of using machine learning in system design.

Introduction

Machine learning is increasingly being applied to various system areas, from data center scheduling to hardware design. It's crucial to understand what is meant by applying machine learning to a system and how to effectively utilize it.

Definition: System Policy

A system policy governs how a system, with its hardware and software components, makes decisions about program execution. Examples include compiler passes and branch predictors. Machine learning for systems involves using machine learning in the implementation of these policies.

Common Patterns and Open Questions

A common approach is to replace existing system policies with machine learning models. However, this raises questions:

What makes machine learning better than the baseline?
What is the model actually learning?
What is the best learning approach for a given problem?
Could the gains be achieved more cheaply with conventional techniques?

Taxonomy of System Policies

A framework for classifying system policies based on their complexity and adaptability is crucial.

Four Types of System Policies

Ad hoc Policy: Simple decision logic, often sufficient and left unchanged (e.g., inlining functions with fewer than 10 instructions).
Empirically Tuned Policy: More complex algorithms developed using benchmarks to improve performance.
Data-Driven Policy: Policies informed by workload measurements, allowing for different behaviors based on the application (e.g., feedback-directed optimization).
Adaptive Data-Driven Policy: Policies that can revisit and revise decisions at runtime based on continuous measurements (e.g., just-in-time compilers).
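
To make the difference between the first and third policy types concrete, here is a minimal sketch of the inlining example. The CallSite type, the call_count field, and the thresholds of 1000 calls and 100 instructions are assumptions made for illustration; only the "fewer than 10 instructions" rule comes from the talk summary.

```python
from dataclasses import dataclass


@dataclass
class CallSite:
    callee_instructions: int  # static size of the callee
    call_count: int = 0       # measured executions (hypothetical profile data)


def ad_hoc_inline(site: CallSite) -> bool:
    # Ad hoc policy: one fixed rule, independent of any workload.
    return site.callee_instructions < 10


def data_driven_inline(site: CallSite, hot_threshold: int = 1000) -> bool:
    # Data-driven policy: the same decision, informed by measurements of the
    # application (in the spirit of feedback-directed optimization).
    # The thresholds here are illustrative assumptions.
    if site.callee_instructions < 10:
        return True
    return site.call_count >= hot_threshold and site.callee_instructions < 100


cold = CallSite(callee_instructions=40, call_count=3)
hot = CallSite(callee_instructions=40, call_count=50_000)
print(ad_hoc_inline(hot), data_driven_inline(cold), data_driven_inline(hot))
```

The ad hoc rule treats both call sites identically, while the data-driven rule inlines only the one that the measurements show to be hot.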

Machine Learning's Role: From Empirically Tuned to Data-Driven

Machine learning enables the transition from empirically tuned policies to data-driven policies: instead of hand-tuning a policy against benchmarks, a model learns from feedback gathered from the running workload.

Taxonomy of Machine Learning Applications in Systems

Identifying shared patterns in how machine learning is used across different system problems makes it possible to develop best practices.

Common Machine Learning Applications

Anomaly Detection: Identifying performance regressions or unusual behavior using techniques like autoencoders and clustering.
Forecasting: Predicting future application behavior, such as resource demands, using time series modeling or decision trees.
Extrapolation: Applying knowledge from previous examples to new patterns, such as predicting whether a program will scale up or out, using supervised learning.
Discovery: Using machine learning to create new policies that can be deployed in the system, often employing reinforcement learning or imitation learning.
Optimization: Solving optimization problems in areas like hardware design using various machine learning techniques.
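
As one small illustration of the forecasting category, the sketch below predicts the next resource demand with simple exponential smoothing. This is a basic time series technique chosen for brevity, not a method named in the talk, and the function name and the alpha parameter are assumptions.

```python
def exp_smooth_forecast(history, alpha=0.5):
    """Forecast the next value of a series with simple exponential smoothing.

    alpha close to 1 weights recent measurements more heavily.
    """
    if not history:
        raise ValueError("history must contain at least one measurement")
    level = history[0]
    for value in history[1:]:
        # Blend the newest measurement with the running estimate.
        level = alpha * value + (1 - alpha) * level
    return level


# Hypothetical CPU demand samples (cores) from past intervals.
print(exp_smooth_forecast([10, 20]))  # 15.0
```

A system policy could use such a forecast to provision resources ahead of demand; a real deployment would typically use a richer model and validate it against held-out measurements.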

Examples of Machine Learning for Systems

Two examples illustrate the application of machine learning to systems problems.

AI for Code and Memory Management

AI for code can enhance memory allocators by enabling them to reason about code intention.

Problem: Memory allocators must decide where to place objects in memory. In C++, objects cannot be moved after placement, and efficient page utilization is crucial.
Challenge: Allocators lack understanding of the code's intent, making it difficult to determine object lifetimes.
Solution: Use stack traces (symbolized code traces) as text and apply language models (e.g., LSTMs) to predict object lifetimes based on programmer intent.

By predicting object lifetimes, memory allocators can be redesigned to optimize memory usage. A hashing mechanism can be used to cache model results and reduce latency.
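
The caching idea can be sketched as follows. This is a minimal illustration, not the allocator's actual design: the class name, the two lifetime classes, and toy_model (a stand-in for a trained language model) are all hypothetical.

```python
import hashlib


class CachedLifetimePredictor:
    """Caches lifetime predictions keyed by a hash of the stack trace."""

    def __init__(self, model):
        self._model = model   # any callable: stack-trace text -> lifetime class
        self._cache = {}
        self.model_calls = 0  # counts how often the slow path is taken

    def predict(self, stack_trace):
        # Hash the symbolized trace so repeated allocation sites hit the cache.
        key = hashlib.sha1("\n".join(stack_trace).encode()).hexdigest()
        if key not in self._cache:
            self.model_calls += 1
            self._cache[key] = self._model(" ".join(stack_trace))
        return self._cache[key]


def toy_model(trace_text):
    # Stand-in for a trained model (hypothetical rule, for illustration only).
    return "long" if "Cache" in trace_text else "short"


predictor = CachedLifetimePredictor(toy_model)
trace = ("main", "Server::Handle", "Cache::Insert")
print(predictor.predict(trace), predictor.predict(trace), predictor.model_calls)
```

Because most allocations come from a small number of recurring call sites, the model only needs to run when a trace is seen for the first time; every later allocation from the same site is a dictionary lookup.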

Allocation in Machine Learning Accelerators

Machine learning can improve buffer allocation in machine learning accelerators.

Problem: Mapping execution graphs to memory buffers on accelerators is complex and involves two steps: memory assignment and memory allocation.
Challenge: Memory allocation (placing buffers within assigned memory) is NP-hard. Traditional approaches include heuristics (fast but may fail) and solvers (handle complex inputs but can be slow).
Solution: Combine heuristics and solvers in a loop, where the heuristic makes decisions, and a constraint programming solver validates and provides feedback.
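
The heuristic-plus-validation loop can be sketched on a toy problem. Everything below is an assumption made for illustration: the Buffer type, the "largest buffer first, lowest offset first" heuristic, and the is_valid function, which is a simple overlap check standing in for a real constraint programming solver.

```python
from typing import NamedTuple


class Buffer(NamedTuple):
    name: str
    start: int  # first time step at which the buffer is live
    end: int    # last time step at which the buffer is live (inclusive)
    size: int


def live_together(a, b):
    return a.start <= b.end and b.start <= a.end


def is_valid(buf, offset, placed, capacity):
    # Stand-in for the solver: reject placements that exceed the memory
    # or overlap in address with a buffer that is live at the same time.
    if offset + buf.size > capacity:
        return False
    return not any(
        live_together(buf, other)
        and offset < o + other.size and o < offset + buf.size
        for other, o in placed.items()
    )


def propose_offsets(placed):
    # Heuristic: try address 0 and the top of each placed buffer, lowest first.
    return sorted({0} | {o + other.size for other, o in placed.items()})


def allocate(buffers, capacity, placed=None):
    placed = {} if placed is None else placed
    if len(placed) == len(buffers):
        return {b.name: o for b, o in placed.items()}
    # Heuristic: place the largest unplaced buffer next.
    buf = max((b for b in buffers if b not in placed), key=lambda b: b.size)
    for offset in propose_offsets(placed):
        if is_valid(buf, offset, placed, capacity):
            placed[buf] = offset
            result = allocate(buffers, capacity, placed)
            if result is not None:
                return result
            del placed[buf]  # backtrack and try the next proposal
    return None  # no feasible placement found for this buffer
```

For example, allocate([Buffer("A", 0, 1, 4), Buffer("B", 1, 2, 4), Buffer("C", 2, 3, 4)], capacity=8) finds a placement in which A and C share address 0 because they are never live at the same time, while the same call with capacity=4 returns None.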

TelaMalloc: Combining Heuristics and Solvers

TelaMalloc, a new allocator, combines heuristics with a constraint programming solver. TelaMalloc did not initially use machine learning, but it set the stage for further optimization with machine learning.

Imitation Learning for Backtracking

Machine learning (imitation learning with a random forest model) can be used to learn a backtracking policy and make smarter decisions.

Conclusion

Machine learning for systems is a promising research area that requires a shared methodology and terminology. Focusing machine learning efforts on a specific, learnable sub-problem, such as predicting a single property, makes it possible to redesign the surrounding system around that prediction. Developing common datasets, benchmarks, and best practices is crucial for advancing the field.
