Machine Learning Systems: Design and Implementation, 2nd Edition
This book provides a comprehensive introduction to the design and implementation of modern machine learning systems. It covers the full technology stack, from programming interfaces and AI accelerators to distributed training, model serving, and large-scale GPU cluster management.
The 2nd edition has been significantly restructured and expanded to reflect the rapid evolution of the ML systems landscape, including new chapters on AI compilers, RL systems, and GPU cluster management.
- Preface
- Basics
- System
- Applications and More
Preface
TODO: This chapter covers the background of the book, target readers, and the evolution of machine learning systems.
Chapter 1: Introduction
TODO: This chapter gives an overview of machine learning system architecture and the technology stack.
Chapter 2: Programming Interfaces and Computational Graphs
TODO: This chapter covers tensor abstractions, automatic differentiation, and graph representation and execution.
Chapter 3: AI Accelerators and Programming
TODO: This chapter covers GPU architecture and the CUDA, Triton, and CUTLASS programming models.
Chapter 4: AI Compilers and Runtime Systems
TODO: This chapter covers IR design, graph optimization, kernel generation, and runtime execution.
Chapter 5: Data Processing Systems
TODO: This chapter covers data loading, data pipelines, and distributed data processing.
Chapter 6: Training Systems
TODO: This chapter covers single-node and distributed training, parallelism strategies, and training optimization.
Chapter 7: Model Serving
TODO: This chapter covers inference optimization, online serving, and model management.
Chapter 8: RL Systems
TODO: This chapter covers reinforcement learning pipelines, environment interaction, and RL system design.
Chapter 9: Large-scale GPU Cluster Management
TODO: This chapter covers GPU scheduling, resource management, and large-scale training infrastructure.