Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Machine Learning Systems: Design and Implementation, 2nd Edition

This book provides a comprehensive introduction to the design and implementation of modern machine learning systems. It covers the full technology stack, from programming interfaces and AI accelerators to distributed training, model serving, and large-scale GPU cluster management.

The 2nd edition has been significantly restructured and expanded to reflect the rapid evolution of the ML systems landscape, including new chapters on AI compilers, RL systems, and GPU cluster management.

Preface

TODO: This chapter covers the background of the book, target readers, and the evolution of machine learning systems.

Chapter 1: Introduction

TODO: This chapter covers an overview of machine learning system architecture and technology stack.

Chapter 2: Programming Interfaces and Computational Graphs

TODO: This chapter covers tensor abstraction, automatic differentiation, graph representation and execution.

Chapter 3: AI Accelerators and Programming

TODO: This chapter covers GPU architecture and CUDA / Triton / CUTLASS programming models.

Chapter 4: AI Compilers and Runtime Systems

TODO: This chapter covers IR design, graph optimization, kernel generation, and runtime execution.

Chapter 5: Data Processing Systems

TODO: This chapter covers data loading, data pipelines, and distributed data processing.

Chapter 6: Training Systems

TODO: This chapter covers single-node and distributed training, parallelism strategies, and training optimization.

Chapter 7: Model Serving

TODO: This chapter covers inference optimization, online serving, and model management.

Chapter 8: RL Systems

TODO: This chapter covers reinforcement learning pipelines, environment interaction, and RL system design.

Chapter 9: Large-scale GPU Cluster Management

TODO: This chapter covers GPU scheduling, resource management, and large-scale training infrastructure.