ML Systems Vol 2 Problems

ML Systems Practice: Vol 2

Questions from https://mlsysbook.ai/ Vol 2

Reference Spec Sheets

Numerical Precision Formats

Table 6.3: Numerical Precision Formats for ML: Each row represents a different precision format. FP8 formats (E4M3, E5M2) occupy the sweet spot between the bandwidth of INT8 and the trainability of FP16.

Format | Exponent | Mantissa | Range | Precision | Use Case
FP32 | 8 bits | 23 bits | $\pm 3.4 \times 10^{38}$ | Very high | Master weights
FP16 | 5 bits | 10 bits | $\pm 65504$ | High | Mixed-precision
BF16 | 8 bits | 7 bits | $\pm 3.4 \times 10^{38}$ | Moderate | Training
E4M3 | 4 bits | 3 bits | $\pm 448$ | Low | FP8 forward pass
E5M2 | 5 bits | 2 bits | $\pm 57344$ | Very low | FP8 gradients
INT8 | N/A | 8 bits | $-128$ to $+127$ | Uniform | Post-training quantization
INT4 | N/A | 4 bits | $-8$ to $+7$ | Uniform | KV cache, weights

Arithmetic Intensity of Common ML Operations

Table 10.2: Arithmetic Intensity of Common ML Operations: Most operations in transformer inference, aside from large batched GEMMs, fall below the H100’s ridge point and are therefore memory-bound. Performance engineering focuses on reducing the memory traffic of these operations.

Operation | Arithmetic Intensity | H100 FP16 Regime | Primary Bottleneck
GEMM ($4096 \times 4096$) | ~1,365 FLOP/byte | Compute-bound | Tensor core throughput
Self-Attention (seq=2048) | ~50–200 FLOP/byte | Memory-bound | HBM bandwidth
Element-wise (GELU, LayerNorm) | ~1–3 FLOP/byte | Memory-bound | HBM bandwidth
LLM Decode (batch=1) | ~1–2 FLOP/byte | Memory-bound | HBM bandwidth
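
The regime column can be reproduced from the H100 values in Table F.1. The sketch below computes the ridge point as peak FLOP/s divided by HBM bandwidth and classifies a few intensities against it. The function names are illustrative, and the GEMM estimate assumes FP16 operands with each matrix crossing HBM exactly once, which is an idealization rather than a measured figure.

```python
# H100 reference values from Table F.1 (FP16 throughput, HBM3 bandwidth).
PEAK_FLOPS = 989e12       # FLOP/s
HBM_BANDWIDTH = 3.35e12   # bytes/s


def ridge_point(peak_flops: float, bandwidth: float) -> float:
    """Arithmetic intensity (FLOP/byte) at which compute and memory limits meet."""
    return peak_flops / bandwidth


def square_gemm_intensity(n: int, bytes_per_elem: int = 2) -> float:
    """Intensity of an n x n x n GEMM, assuming A and B are read once and C written once."""
    flops = 2 * n**3                       # one multiply and one add per inner-product term
    traffic = 3 * n * n * bytes_per_elem   # bytes moved for A, B, and C
    return flops / traffic


def regime(intensity: float, ridge: float) -> str:
    """Classify an operation relative to the ridge point."""
    return "compute-bound" if intensity >= ridge else "memory-bound"


ridge = ridge_point(PEAK_FLOPS, HBM_BANDWIDTH)
gemm = square_gemm_intensity(4096)
print(f"ridge point: {ridge:.0f} FLOP/byte")
print(f"GEMM 4096: {gemm:.0f} FLOP/byte -> {regime(gemm, ridge)}")
print(f"decode at 2 FLOP/byte -> {regime(2.0, ridge)}")
```

Under these assumptions the GEMM intensity works out to n/3, which matches the ~1,365 FLOP/byte entry for n = 4096.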

Representative Devices

Published benchmark results from MLPerf Tiny and official vendor data make these hardware tiers concrete. Figure 12.6 plots inference latency against energy per inference for representative devices, revealing three distinct clusters separated by approximately 100× in energy consumption. Dedicated neural processors such as the Syntiant Core 2 and STM32N6 NPU achieve keyword spotting in under 5 ms at 30 to 160 microjoules, while edge GPUs like the Jetson AGX Orin deliver sub-millisecond latency at 15 millijoules. The 100× energy gap between tiers determines which devices can operate on battery power for months vs. hours, fundamentally shaping the feasible design space for on-device learning.

Accelerator Decision Matrix

Table 2.7: The Accelerator Decision Matrix: The optimal accelerator depends not on peak specifications but on which physical resource – compute, bandwidth, or capacity – is the binding constraint for the target workload. The Roofline Model provides the diagnostic: compute the arithmetic intensity, locate the workload relative to the ridge point, and select hardware that maximizes the binding resource per dollar.

Workload | Binding Constraint | Key Metric | Recommended Class
LLM Training (>100B) | Compute (high batch size) | Peak TFLOPS, NVLink BW | H100/B200, TPU v5p
LLM Inference (batch=1) | Memory bandwidth | GB/s per dollar | A100, H100 (bandwidth/$)
LLM Inference (batched) | Compute + bandwidth | TFLOPS and GB/s | H100, B200
Vision Model Training | Compute (large spatial dims) | Peak TFLOPS | H100/B200, GPU preferred
Recommendation (embeddings) | Memory capacity | GB per dollar | CPU DRAM + GPU hybrid
Fine-tuning (<13B) | Memory capacity | HBM capacity | A100 (cost-effective)
Research/Prototyping | Flexibility | Software ecosystem | GPU (CUDA), avoid ASICs

Component Mean Time To Failure

Table F.5: Component Mean Time To Failure: Steady-state MTTF values for data center-grade components, ordered from most failure-prone (GPU die) to most reliable (optical cable). A node with 8 GPUs, 8 NICs, 2 PSUs, 1 PCIe switch, and 8 HBM stacks has a combined MTTF of approximately 1/(8/50,000 + 8/150,000 + 2/100,000 + 1/200,000 + 8/200,000) ≈ 3,600 hours (≈ 150 days).

Constant Value Unit
GPU_MTTF_HOURS 50000 hours
NIC_MTTF_HOURS 150000 hours
PSU_MTTF_HOURS 100000 hours
PCIE_SWITCH_MTTF_HOURS 200000 hours
HBM_MTTF_HOURS 200000 hours
TOR_SWITCH_MTTF_HOURS 300000 hours
CABLE_MTTF_HOURS 50000 hours
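
The combined-MTTF figure in the caption follows from adding failure rates. A minimal sketch, assuming independent components with constant failure rates so that the node fails when any component fails; the dictionary keys and function name are illustrative, and the constants are those listed in Table F.5.

```python
# MTTF constants from Table F.5, in hours.
MTTF_HOURS = {
    "gpu": 50_000,
    "nic": 150_000,
    "psu": 100_000,
    "pcie_switch": 200_000,
    "hbm": 200_000,
}

# Component counts for the example node described in the caption.
NODE_COMPONENTS = {"gpu": 8, "nic": 8, "psu": 2, "pcie_switch": 1, "hbm": 8}


def combined_mttf_hours(counts: dict, mttf: dict) -> float:
    """Series-system MTTF: per-component failure rates (1/MTTF) add up."""
    total_rate = sum(n / mttf[name] for name, n in counts.items())
    return 1.0 / total_rate


node_mttf = combined_mttf_hours(NODE_COMPONENTS, MTTF_HOURS)
print(f"node MTTF: {node_mttf:.0f} hours (about {node_mttf / 24:.0f} days)")
```

The same function extends to a whole cluster by scaling the counts, which is why failure handling becomes routine at thousands of GPUs.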

Recovery Time Parameters

Table F.6: Recovery Time Parameters: The total recovery time per failure event is approximately HEARTBEAT_TIMEOUT_S + RESCHEDULE_TIME_S + checkpoint_size/CHECKPOINT_WRITE_BW_GBS. For a 70B model checkpoint (280 GB in FP32): 30 + 60 + 280/100 ≈ 93 seconds per failure. At 4 GPU failures per day on an 8,192-GPU cluster, this costs roughly 6 minutes of lost training time daily.

Constant Value Unit
HEARTBEAT_TIMEOUT_S 30 seconds
RESCHEDULE_TIME_S 60 seconds
CHECKPOINT_WRITE_BW_GBS 100 GB/s
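
The recovery estimate in the caption can be checked directly. The sketch below applies the stated formula to the three constants from Table F.6; the function names are illustrative, and the model counts only the terms the caption lists, not lost work since the last checkpoint.

```python
# Constants from Table F.6.
HEARTBEAT_TIMEOUT_S = 30
RESCHEDULE_TIME_S = 60
CHECKPOINT_WRITE_BW_GBS = 100


def recovery_time_s(checkpoint_gb: float) -> float:
    """Detection + rescheduling + checkpoint transfer, as in the caption's formula."""
    return HEARTBEAT_TIMEOUT_S + RESCHEDULE_TIME_S + checkpoint_gb / CHECKPOINT_WRITE_BW_GBS


def daily_loss_minutes(checkpoint_gb: float, failures_per_day: float) -> float:
    """Training time lost per day to recovery events, in minutes."""
    return failures_per_day * recovery_time_s(checkpoint_gb) / 60


per_failure = recovery_time_s(280)        # 70B parameters x 4 bytes (FP32) = 280 GB
per_day = daily_loss_minutes(280, 4)      # 4 failures per day
print(f"{per_failure:.1f} s per failure, {per_day:.1f} min per day")
```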

Intra-Node Interconnect

A node is a physical server chassis that aggregates multiple accelerators (typically 8) through a high-speed intra-node interconnect (NVLink or ICI), creating the fundamental boundary between high-bandwidth local communication and the order-of-magnitude slower inter-node network fabric.

Table 2.8: The Bandwidth Hierarchy: Each physical boundary introduces an order-of-magnitude bandwidth cliff. These cliffs are not engineering failures to be optimized away; they reflect fundamental differences in the physics of each interconnect medium. The cliffs dictate model partitioning: Tensor Parallelism, which requires AllReduce after every layer, is strictly confined to the intra-node domain.

Domain | Interconnect | Bandwidth | Latency | Scaling Limit
Intra-Package | Silicon Interposer | ~3.3 TB/s | <100 ns | Single Chip
Intra-Node | NVLink/ICI | ~900 GB/s | ~1 μs | Node (8–16 Chips)
Intra-Node (IO) | PCIe Gen5 x16 | ~64 GB/s | ~2 μs | CPU-GPU, NIC-GPU

Communication Model Parameters

Table F.7: Communication Model Parameters ($\alpha$-$\beta$): Startup latency ($\alpha$) and sustained bandwidth ($\beta$) for each interconnect technology. For small messages, $\alpha$ dominates; for large gradient payloads, $\beta$ dominates. The 10× latency gap between InfiniBand NDR and TCP/IP explains why RDMA-capable fabrics are essential for synchronous distributed training.

Constant | Value | Unit
IB_NDR_LATENCY_US ($\alpha$) | 5 | us
INFINIBAND_NDR_BW_GBS ($\beta$) | 50 | GB/s
IB_HDR_LATENCY_US ($\alpha$) | 7 | us
INFINIBAND_HDR_BW_GBS ($\beta$) | 25 | GB/s
ROCE_LATENCY_US ($\alpha$) | 10 | us
ROCE_100G_BW_GBS ($\beta$) | 12.5 | GB/s
TCP_LATENCY_US ($\alpha$) | 50 | us

Single-Node Foundational Constants

Table F.1: Single-Node Foundational Constants: Recapping the hardware specifications for the H100 accelerator. These values provide the $R_{\text{peak}}$ and $\text{BW}$ baselines used in the iron law calculations throughout this volume.

Tier | Specification | Reference Value
Compute | FP16 Throughput | 989 TFLOPS
Compute | FP8 Throughput | 1,979 TFLOPS
Memory | HBM3 Bandwidth | 3.35 TB/s
Memory | HBM3 Capacity | 80 GB
Thermal | TDP | 700 W

Canonical Cluster Sizes

This book uses four canonical cluster sizes to illustrate how system behavior changes across scale. These sizes correspond to real-world deployment tiers: a research lab cluster, a medium-scale production cluster, a large training cluster, and a hyperscale fleet. Table F.2 defines the GPU count for each tier; the chapters that reference them derive node counts, failure rates, and network requirements from these baselines combined with the constants in subsequent tables.

Table F.2: Canonical Cluster Sizes: Four reference tiers used throughout this book. At 8 GPUs per node, these correspond to 32, 256, 1,024, and 12,500 nodes respectively. The jump from 256 to 8,192 GPUs crosses the threshold where failure handling shifts from exception to steady-state process.

Constant Value Unit
CLUSTER_SMALL_GPUS 256 GPUs
CLUSTER_MEDIUM_GPUS 2048 GPUs
CLUSTER_LARGE_GPUS 8192 GPUs
CLUSTER_MEGA_GPUS 100000 GPUs

Ch. 1

Problem 1 (Ch. 1, §The Scale Moment)

A training run for a GPT-4 class model uses 25,000 GPUs. If each individual GPU has an annual failure rate of 8 percent, how often will the training job be interrupted by hardware failure?


Problem 2 (Ch. 1, §The C$^3$ Taxonomy: Foundations of Scale)

Calculate the scaling efficiency and energy cost of training GPT-3 (175B params) on a cluster connected by 100G Ethernet vs. 200G InfiniBand.


Ch. 2

Problem 3 (Ch. 2, §The Node)

A 10 GB buffer is synchronized across the fleet. How much does the physical “Layer” of the fleet affect transfer time?


Problem 4 (Ch. 2, §The Rack)

Compare the deployment timeline for 10,000 GPUs vs. the electrical substation required to power them (7 MW).

  • Silicon Path: GPU supply chains are volatile, but typical enterprise lead times are 6 months.
  • Infrastructure Path: Permitting, EPC (Engineering, Procurement, Construction), and grid connection for a new 10+ MW substation averages 24 months.
  • The lag: Infrastructure takes 4× longer to deploy than the accelerators themselves.

Ch. 3

Problem 5 (Ch. 3, §Level 2: Transport and the Performance Model)

A fabric designer is comparing InfiniBand NDR (1.5 μs, 50 GB/s) against a slower Ethernet baseline (5.0 μs, 12.5 GB/s). For a 4 KB control message and a 100 MB gradient shard, which part of $T(n) = \alpha + n/\beta$ dominates, and why does the faster fabric help for different reasons in each regime?


Problem 6 (Ch. 3, §Level 3: Switch and Topology)

A cluster designer is choosing between a “Non-blocking” (1:1) fat-tree and a “Cost-optimized” (4:1) spine for a 1024-GPU cluster. How much slower will a 100 GB-per-GPU AllReduce be on the cheaper network?


Problem 7 (Ch. 3, §Level 3: Switch and Topology)

A team is synchronizing per-rank data-parallel gradients across 128 nodes. In a standard fat-tree, each message between same-rank GPUs traverses a Leaf switch and a Spine switch (2 hops). In a rail-optimized network, all corresponding GPUs are on the same rail switch (1 hop). How much “Latency Dividend” does the rail design earn?


Problem 8 (Ch. 3, §Level 3: Switch and Topology)

A cluster has 1,024 accelerators across 128 nodes. Each accelerator has 400 Gb/s (50 GB/s) network injection bandwidth. An AllReduce job requires full bisection bandwidth.

Scenario A (non-blocking fat-tree): 1:1 oversubscription ratio. Bisection bandwidth = 1,024 × 50 = 51,200 GB/s = 51.2 TB/s.

Scenario B (cost-optimized): 4:1 oversubscription at the spine layer. Bisection bandwidth = 51,200 / 4 = 12,800 GB/s = 12.8 TB/s.
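
The two scenarios differ only in the oversubscription divisor. A small sketch of that arithmetic, following the problem's own convention of aggregate injection bandwidth divided by the oversubscription ratio; the function name is illustrative.

```python
def bisection_bandwidth_gbs(accelerators: int, injection_gbs: float,
                            oversubscription: float = 1.0) -> float:
    """Bisection bandwidth as computed in the scenarios above (GB/s)."""
    return accelerators * injection_gbs / oversubscription


non_blocking = bisection_bandwidth_gbs(1024, 50)                           # Scenario A
cost_optimized = bisection_bandwidth_gbs(1024, 50, oversubscription=4)     # Scenario B
print(f"A: {non_blocking / 1000:.1f} TB/s, B: {cost_optimized / 1000:.1f} TB/s")
print(f"bandwidth ratio A/B: {non_blocking / cost_optimized:.0f}x")
```

For a bandwidth-bound collective that must cross the spine, the ratio between the two values bounds how much slower the cheaper fabric can be.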


Problem 9 (Ch. 3, §Level 4: Fabric Behavior (Congestion, Routing))

A 4096-GPU RoCE cluster is in operation. If the probability of a single transceiver degrading and triggering PFC pauses is just 0.001% per day, what is the chance of a cluster-wide “PFC Storm” occurring today?


Problem 10 (Ch. 3, §Level 5: Cluster Design and Case Studies)

Calculate the power savings of moving a 51.2 Tbps switch from pluggable transceivers to Co-Packaged Optics (CPO).

  • Pluggable Architecture: 128 ports × 20 W = 2.56 kW for optics alone.
  • CPO Architecture: 128 engines × 10 W = 1.28 kW.
  • The dividend: The savings reach 1.28 kW of power per switch.

Ch. 4

Problem 11 (Ch. 4, §How ML Workloads Invert Storage Assumptions)

A dataset is split into 1000 shards on a shared file system. If 32 GPUs each pick a shard at random to start their next epoch, what is the probability that at least two GPUs “collide” on the same storage server, causing a performance bottleneck?


Problem 12 (Ch. 4, §The ML Storage Hierarchy)

A ResNet-50 training job on ImageNet (1.28M images, ~150 KB average) targets 1,000 images/second. The question is whether to use individual JPEG files on an HDD or NVMe.


Problem 13 (Ch. 4, §The ML Storage Hierarchy)

A vision-model training pipeline runs each step in 800 ms. Fetching data from a shared Parallel File System adds 150 ms of I/O wait because of network congestion. How much does adding local NVMe SSDs to each node improve GPU utilization?


Problem 14 (Ch. 4, §GPU Direct Storage and the CPU Bypass)

A training node with 8 GPUs loads 150 KB images at 8,000 images/second per GPU (64,000 images/second total). Compare the CPU load under traditional I/O vs. GDS.

Traditional path: Each image requires a DMA from NVMe to DRAM, a memcpy from kernel to user space, and a PCIe transfer to GPU. At 64,000 images/second with 120 μs of CPU time per image, the CPU spends 7.68 seconds of CPU time per wall-clock second, consuming roughly 8 cores worth of processing just for data movement.

GDS path: Each image is DMA’d directly from NVMe to GPU. At 30 μs of CPU time per image (for initiating the DMA), the CPU spends 1.92 seconds of CPU time per wall-clock second, freeing 6 cores for data augmentation.
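
Both CPU budgets come from multiplying the image rate by the per-image CPU cost. A minimal sketch of that calculation using the figures quoted above; the function name is illustrative, and the result is CPU-seconds per wall-clock second, which reads directly as a core count.

```python
def cpu_cores_consumed(images_per_s: float, cpu_us_per_image: float) -> float:
    """CPU-seconds spent per wall-clock second, i.e. equivalent busy cores."""
    return images_per_s * cpu_us_per_image * 1e-6


IMAGES_PER_S = 8 * 8_000   # 8 GPUs at 8,000 images/s each

traditional = cpu_cores_consumed(IMAGES_PER_S, 120)   # DMA + memcpy + PCIe copy
gds = cpu_cores_consumed(IMAGES_PER_S, 30)            # DMA initiation only
print(f"traditional: {traditional:.2f} cores, GDS: {gds:.2f} cores, "
      f"freed: {traditional - gds:.2f} cores")
```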


Problem 15 (Ch. 4, §Storage Economics)

A team trains a vision model on a 50 TB image dataset stored in S3. Training runs for 20 epochs. The decision is whether to stream from S3 each epoch or stage to local NVMe.


Problem 16 (Ch. 4, §Checkpoint Storage)

A 256-node cluster saves a 175B-parameter checkpoint every 10 minutes. Each checkpoint totals 1,750 GB. With ZeRO-3, each node saves roughly 7 GB.

  • Per-node write to local NVMe (4 drives at 7 GB/s each = 28 GB/s): 7 GB ÷ 28 GB/s ≈ 0.25 seconds.
  • Async copy to PFS: 256 nodes × 7 GB ≈ 1.8 TB total. If the PFS provides 1 TB/s aggregate, the storm completes in roughly 1.8 seconds.
  • Per-node PFS bandwidth: If all 256 nodes write simultaneously, each gets 1000/256 ≈ 3.9 GB/s. Per-node async copy: 7/3.9 ≈ 1.8 seconds.
  • Training pause: only about 0.25 seconds (the local NVMe write). The PFS copy overlaps with the next training iteration.
  • Overhead: 0.25 s pause every 600 s ≈ 0.04 percent training time lost to checkpointing.
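
The steps above can be sketched as one function. It assumes, as the bullets do, that the training pause equals the local NVMe write and that the PFS copy is fully overlapped with training; the parameter and key names are illustrative.

```python
def checkpoint_overhead(shard_gb: float = 7.0, nvme_drives: int = 4,
                        nvme_gbs_per_drive: float = 7.0,
                        pfs_aggregate_gbs: float = 1000.0, nodes: int = 256,
                        interval_s: float = 600.0) -> dict:
    """Two-tier checkpoint timing: blocking local write, overlapped PFS copy."""
    local_write_s = shard_gb / (nvme_drives * nvme_gbs_per_drive)
    pfs_share_gbs = pfs_aggregate_gbs / nodes        # fair share when all nodes write at once
    pfs_copy_s = shard_gb / pfs_share_gbs            # runs in the background
    return {
        "local_write_s": local_write_s,
        "pfs_share_gbs": pfs_share_gbs,
        "pfs_copy_s": pfs_copy_s,
        "overhead_fraction": local_write_s / interval_s,
    }


result = checkpoint_overhead()
print(f"pause: {result['local_write_s']:.2f} s, "
      f"PFS copy: {result['pfs_copy_s']:.1f} s, "
      f"overhead: {result['overhead_fraction'] * 100:.2f}%")
```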

Problem 17 (Ch. 4, §The Synthetic Fuel Line)

Calculate the storage amplification of a 1 TB synthetic dataset that requires cryptographic lineage and multi-model verification.

  • Raw Payload: 1 TB.
  • Provenance Overhead: 40 percent extra for lineage hashes, generation logs, and reward-model scores.
  • Verification Factor: To avoid “Self-Poisoning,” each sample is verified by 3 independent “Judge” models.
  • The amplification: Total footprint = 1 TB × 1.4 × 3 = 4.2 TB.

Ch. 6

Problem 18 (Ch. 6, §Data Parallelism)

A GPT-2 run on a commodity 10G network is communication-bound at 32 GPUs, costing $3,021 for a fixed number of samples. Can a single 8-GPU node achieve the same effective batch size more efficiently using gradient accumulation?


Problem 19 (Ch. 6, §Model Parallelism)

Training a 175 Billion parameter model (like GPT-3) on NVIDIA A100s (80 GB). Can Data Parallelism with ZeRO-3 handle this?


Problem 20 (Ch. 6, §Model Parallelism)

A frontier-model training run uses pipeline parallelism across 8 nodes. To hide the sequential delay, the batch is split into 32 microbatches. What is the “bubble tax”—the fraction of GPU cycles lost to idle waiting?


Problem 21 (Ch. 6, §Hybrid Parallelism)

An engineering team needs to schedule Archetype A (a 175B parameter model) on a cluster of 8,192 H100 GPUs. Manually searching the 3D-parallelism space (TP × PP × DP) is error-prone: a split that maximizes DP might exceed the 80 GB HBM capacity, while a split that maximizes TP might saturate the NVLink interconnect.

Solution: Instead of trial and error, we invoke a Tier 3 Optimizer (like the ParallelismOptimizer in our physics engine) to find the mathematically optimal split. We configure the optimizer with the workload and cluster constraints, setting the objective to maximize Model FLOPs Utilization (MFU).


Ch. 7

Problem 22 (Ch. 7, §From Parallelism to Communication Patterns)

A 70 billion parameter model trains with data parallelism across 64 GPUs connected by InfiniBand NDR (50 GB/s per port). Each GPU computes gradients in BF16 (2 bytes per parameter, standard for Llama-class training). How long does one AllReduce take?


Problem 23 (Ch. 7, §Mapping the Terrain: Network Performance Modeling)

Consider a cluster using InfiniBand NDR 400 Gbps with $\alpha$ = 2 μs and $\beta$ = 50 GB/s. At what message size does optimizing for bandwidth start to matter more than optimizing for latency?


Problem 24 (Ch. 7, §Mapping the Terrain: Network Performance Modeling)

Consider synchronizing a 1 MB buffer vs. a 1 GB buffer on InfiniBand NDR with $\alpha$ = 2 μs and $\beta$ = 50 GB/s. How does the bottleneck shift?

Case A: 1 MB Message

  • Bandwidth Time: $10^6 / (50 \times 10^9)$ s = 20 μs.

  • Latency Time: 2 μs.

  • Total: 22 μs. Latency is 9 percent of total, still meaningful.

Case B: 1 GB Message

  • Bandwidth Time: $10^9 / (50 \times 10^9)$ s = 20,000 μs = 20 ms.

  • Latency Time: 2 μs.

  • Total: 20,002 μs. Latency is 0.01 percent, completely negligible.
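
Both cases are instances of the $\alpha$-$\beta$ model, $T(n) = \alpha + n/\beta$. The sketch below evaluates it for the two message sizes and also reports the size at which the latency and bandwidth terms are equal, namely $n = \alpha \beta$. Function names are illustrative, and 1 MB and 1 GB are taken as $10^6$ and $10^9$ bytes as in the worked cases.

```python
def transfer_time_us(message_bytes: float, alpha_us: float, beta_gbs: float) -> float:
    """T(n) = alpha + n / beta, returned in microseconds."""
    return alpha_us + message_bytes / (beta_gbs * 1e9) * 1e6


def latency_share(message_bytes: float, alpha_us: float, beta_gbs: float) -> float:
    """Fraction of the total transfer time spent in the startup latency term."""
    return alpha_us / transfer_time_us(message_bytes, alpha_us, beta_gbs)


def crossover_bytes(alpha_us: float, beta_gbs: float) -> float:
    """Message size at which the latency term equals the bandwidth term."""
    return alpha_us * 1e-6 * beta_gbs * 1e9


ALPHA_US, BETA_GBS = 2.0, 50.0
for label, size in (("1 MB", 1e6), ("1 GB", 1e9)):
    total = transfer_time_us(size, ALPHA_US, BETA_GBS)
    share = latency_share(size, ALPHA_US, BETA_GBS)
    print(f"{label}: {total:,.0f} us total, latency share {share:.2%}")
print(f"crossover: {crossover_bytes(ALPHA_US, BETA_GBS) / 1e3:.0f} KB")
```

Messages well below the crossover size are latency-dominated, and messages well above it are bandwidth-dominated.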


Problem 25 (Ch. 7, §Mapping the Terrain: Network Performance Modeling)

A training pipeline attempts to overlap gradient AllReduce with the next layer’s backward pass. The backward pass takes 500 μs. The AllReduce has network latency $L_{\text{lat}}$ = 100 μs but processor overhead $o$ = 50 μs to initiate and $o$ = 50 μs to receive. Can the communication be hidden?


Problem 26 (Ch. 7, §Choosing the Vehicle: Collective Operation Primitives)

A MoE model processes a batch of 4096 tokens across 8 GPUs (512 tokens per GPU). Each token is a 2048-dimensional hidden state in BF16 (4 KB per token). The gating network assigns each token to exactly 1 of 64 experts (8 experts per GPU). Assuming uniform routing (each expert receives 4096/64 = 64 tokens), how much data does each GPU send and receive?


Problem 27 (Ch. 7, §Engineering the Flow: AllReduce Algorithms)

A cluster synchronizes a 1 MB buffer across 64 GPUs. The network has latency $\alpha$ = 10 μs and bandwidth $\beta$ = 10 GB/s. Which algorithm performs better: Ring or Tree?


Problem 28 (Ch. 7, §Hierarchical Communication)

A cluster has 8 nodes, each with 8 GPUs (64 GPUs total). The system must AllReduce a 1 GB gradient buffer. Compare flat Ring AllReduce vs. Hierarchical AllReduce.

Flat Ring AllReduce (ignoring hierarchy):

  • Each GPU sends ~2 GB total (the bandwidth-optimal Ring AllReduce formula).

  • The ring crosses node boundaries multiple times.

  • Effective bandwidth: Limited by the slowest link = 50 GB/s (InfiniBand).

  • Time: ≈ 2 × 1 GB / 50 GB/s = 40 ms (bandwidth term dominates).

Hierarchical AllReduce (3-step decomposition):

  • Intra-node ReduceScatter: Each GPU sends 875 MB at 900 GB/s → ~1 ms

  • Inter-Node AllReduce: Each GPU AllReduces a 1 GB / 8 = 125 MB shard over InfiniBand; Ring moves roughly $2(N-1)/N$ times that payload, giving ~5 ms at 50 GB/s

(Only 1/8 of the data crosses InfiniBand!)

  • Intra-node AllGather: Each GPU receives 875 MB at 900 GB/s → ~1 ms

  • Total time: ≈ 1 + 5 + 1 = 7 ms
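
The comparison above can be sketched with the bandwidth terms only. The code assumes latency terms are negligible and that the flat ring runs at the InfiniBand rate; function names are illustrative. Because the worked solution rounds each step up, the unrounded totals computed here come out slightly below its 40 ms and 7 ms.

```python
def ring_allreduce_ms(payload_gb: float, ranks: int, bandwidth_gbs: float) -> float:
    """Bandwidth term of Ring AllReduce: 2(N-1)/N times the payload over the link rate."""
    return 2 * (ranks - 1) / ranks * payload_gb / bandwidth_gbs * 1e3


def hierarchical_allreduce_ms(payload_gb: float, gpus_per_node: int, nodes: int,
                              nvlink_gbs: float, ib_gbs: float) -> float:
    """ReduceScatter inside the node, AllReduce across nodes, AllGather inside the node."""
    intra_gb = payload_gb * (gpus_per_node - 1) / gpus_per_node   # 875 MB of a 1 GB buffer
    shard_gb = payload_gb / gpus_per_node                         # 125 MB crosses InfiniBand
    reduce_scatter = intra_gb / nvlink_gbs * 1e3
    inter_node = ring_allreduce_ms(shard_gb, nodes, ib_gbs)
    all_gather = intra_gb / nvlink_gbs * 1e3
    return reduce_scatter + inter_node + all_gather


flat = ring_allreduce_ms(1.0, ranks=64, bandwidth_gbs=50)
hier = hierarchical_allreduce_ms(1.0, gpus_per_node=8, nodes=8, nvlink_gbs=900, ib_gbs=50)
print(f"flat ring: {flat:.1f} ms, hierarchical: {hier:.1f} ms, speedup: {flat / hier:.1f}x")
```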


Problem 29 (Ch. 7, §Hierarchical Communication)

A 128-GPU cluster is arranged as 4 racks of 4 nodes of 8 GPUs. Cross-rack bandwidth is oversubscribed 2:1 (effective 25 GB/s). How much does 3-level hierarchical AllReduce reduce cross-rack traffic for a 2 GB gradient?

Level 1 (Intra-Node, NVLink): ReduceScatter reduces each GPU’s contribution by 8×. Each GPU sends 1.75 GB at 900 GB/s = 1.9 ms.

Level 2 (Intra-Rack, InfiniBand): Ring AllReduce among 4 nodes on the same rack. Each GPU AllReduces a 0.25 GB rack-level payload; Ring traffic is roughly $2(N-1)/N$ times that payload at 50 GB/s. Time: 7.5 ms.

Level 3 (Cross-Rack, Spine): Ring AllReduce among 4 racks. Each GPU AllReduces only a 0.062 GB cross-rack payload; Ring traffic again applies the $2(N-1)/N$ multiplier at 25 GB/s. Time: 3.75 ms.
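
The three levels follow one pattern: each level divides the payload by its group size before handing it to the next. A sketch of that bookkeeping under the stated bandwidths, with latency terms ignored; variable and function names are illustrative.

```python
def ring_traffic_gb(payload_gb: float, ranks: int) -> float:
    """Data each rank moves in a Ring AllReduce: 2(N-1)/N times the payload."""
    return 2 * (ranks - 1) / ranks * payload_gb


GRADIENT_GB = 2.0
GPUS_PER_NODE, NODES_PER_RACK, RACKS = 8, 4, 4
NVLINK_GBS, IB_GBS, SPINE_GBS = 900.0, 50.0, 25.0

# Level 1: intra-node ReduceScatter, each GPU sends 7/8 of the gradient.
level1_ms = GRADIENT_GB * (GPUS_PER_NODE - 1) / GPUS_PER_NODE / NVLINK_GBS * 1e3
node_payload_gb = GRADIENT_GB / GPUS_PER_NODE            # 0.25 GB leaves the node

# Level 2: Ring AllReduce among the nodes of one rack.
level2_ms = ring_traffic_gb(node_payload_gb, NODES_PER_RACK) / IB_GBS * 1e3
rack_payload_gb = node_payload_gb / NODES_PER_RACK       # 0.0625 GB leaves the rack

# Level 3: Ring AllReduce across racks on the oversubscribed spine.
level3_ms = ring_traffic_gb(rack_payload_gb, RACKS) / SPINE_GBS * 1e3

reduction = GRADIENT_GB / rack_payload_gb
print(f"L1 {level1_ms:.1f} ms, L2 {level2_ms:.1f} ms, L3 {level3_ms:.2f} ms")
print(f"cross-rack payload is {reduction:.0f}x smaller than the full gradient")
```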


Problem 30 (Ch. 7, §The Last Resort: Gradient Compression)

A training job spends 40 ms per step synchronizing an uncompressed gradient buffer. The system implements 1-bit Adam, which reduces communication volume by 32× but adds 2 ms of CPU/GPU overhead for the compression logic. If the effective network throughput improves by 8×, what is the actual communication speedup?


Problem 31 (Ch. 7, §Communication-Computation Overlap)

A 32-layer transformer model (7B parameters) is trained on 64 GPUs. Each layer’s backward pass takes 15 ms. The hierarchical AllReduce for each layer’s gradients (~880 MB per layer) takes 26 ms using 100 MB buckets. What is the step time with and without overlap?

Without overlap (sequential):

$T_{\text{sequential}} = T_{\text{backward}} + T_{\text{comm}}$ = 480 ms + 32 × 26 ms ≈ 1325 ms. (With the rounded 26 ms the sum is 1312 ms; the stated total is consistent with an unrounded AllReduce time of about 26.4 ms per layer.)

With overlap (pipelined):

Each layer’s AllReduce (26 ms) runs in parallel with the next layer’s backward pass (15 ms). Since 26 ms > 15 ms, there is 11 ms of exposed communication per layer that cannot be hidden.

$T_{\text{pipelined}} = T_{\text{backward}} + T_{\text{first layer comm}} + (N_{\text{layers}} - 1) \times T_{\text{exposed}}$ ≈ 860 ms (847 ms with the rounded 26 ms and 11 ms figures).
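
The two step-time formulas can be written as small functions. Evaluated with the rounded 26 ms figure they give 1312 ms and 847 ms; a per-layer AllReduce time of about 26.4 ms reproduces the 1325 ms and 860 ms totals, which suggests the quoted 26 ms is rounded. That reading is an inference from the arithmetic, not something the problem states, and the function names are illustrative.

```python
def sequential_ms(layers: int, backward_ms: float, comm_ms: float) -> float:
    """No overlap: every layer pays its backward pass and its AllReduce in full."""
    return layers * (backward_ms + comm_ms)


def pipelined_ms(layers: int, backward_ms: float, comm_ms: float) -> float:
    """Overlap: T_backward + first-layer comm + (layers - 1) x exposed comm."""
    exposed_ms = max(0.0, comm_ms - backward_ms)   # comm that compute cannot hide
    return layers * backward_ms + comm_ms + (layers - 1) * exposed_ms


for comm in (26.0, 26.4):
    seq = sequential_ms(32, 15.0, comm)
    pipe = pipelined_ms(32, 15.0, comm)
    print(f"comm={comm} ms: sequential {seq:.0f} ms, pipelined {pipe:.0f} ms, "
          f"saved {seq - pipe:.0f} ms")
```

When the AllReduce is shorter than the backward pass the exposed term drops to zero, and only the first layer's communication remains visible.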


Ch. 8

Problem 32 (Ch. 8, §Failure Analysis at Scale)

A cluster of 10,000 GPUs runs with each GPU at 99.99 percent availability (only 52 minutes of downtime per year). What is the probability that the entire cluster is up at the same instant?


Problem 33 (Ch. 8, §Check-and-Verify: Defending Against Silent Data Corruption)

Calculate the probability that at least one GPU in a 100,000-GPU fleet experiences a silent ALU error during a single 2-second training step.

  • Fleet Size: 100,000 accelerators.
  • Individual Risk: $10^{-6}$ per hour (a conservative estimate for SDC).
  • The exposure: In a 2-second window, the fleet has 100,000 × (2/3600) ≈ 55 “GPU-hours” of exposure.
  • The probability: P(at least one SDC) ≈ 0.0056%.
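
The bullets above can be sketched as a short calculation. It assumes independent errors arriving at a constant rate, so the chance of at least one event is $1 - e^{-\lambda}$ with $\lambda$ the expected number of events in the window; for small $\lambda$ this is nearly $\lambda$ itself. The function name is illustrative.

```python
import math


def p_at_least_one(fleet_size: int, rate_per_gpu_hour: float, window_s: float) -> float:
    """Probability of one or more events across the fleet during the window."""
    exposure_gpu_hours = fleet_size * window_s / 3600
    expected_events = exposure_gpu_hours * rate_per_gpu_hour
    return -math.expm1(-expected_events)   # 1 - exp(-x), accurate for tiny x


per_step = p_at_least_one(100_000, 1e-6, 2.0)
print(f"per 2-second step: {per_step:.4%}")

# The same model over a full day of training shows how the risk accumulates.
per_day = p_at_least_one(100_000, 1e-6, 24 * 3600)
print(f"per day: {per_day:.1%}")
```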

Problem 34 (Ch. 8, §Checkpointing: Preserving Progress)

A 10,000-GPU cluster has an MTBF of 3.69 hours. A full model checkpoint takes 21 seconds to write. What is the optimal checkpoint frequency?


Ch. 9

Problem 35 (Ch. 9, §The Scheduling Problem)

A 1024-GPU cluster receives two frontier training jobs from separate teams, each requiring the full cluster. A naive scheduler (non-gang) allocates 512 GPUs to Team A and 512 to Team B, then waits for more GPUs to become available. What is the steady-state outcome?


Problem 36 (Ch. 9, §Debugging Cluster Utilization)

A 1,000-GPU cluster reports 60 percent average utilization despite a full job queue with over 50 pending jobs. Engineering leadership expects greater than 80 percent utilization given the capital investment. The standard monitoring dashboards show plenty of idle GPUs, yet users complain about long queue times. Identify the binding constraint and the systematic diagnostic path.

The Fleet Stack framework in Figure 1.13 provides a structured approach: analyze the Infrastructure Layer first to understand hardware constraints, then the Distribution Layer to understand scheduling logic, and finally the interaction between layers to identify the root cause.

Infrastructure Layer Analysis: The cluster contains heterogeneous hardware acquired over three procurement cycles:

Table 9.6: Cluster Hardware Inventory: Three procurement generations create distinct resource pools with different capabilities. A100 nodes support tensor parallelism via NVLink, while V100 nodes are limited to data parallelism.

GPU Type | Count | Memory | Interconnect | Nodes
A100-80 GB | 400 | 80 GB HBM2e | NVLink (600 GB/s) | 50 nodes × 8 GPUs
A100-40 GB | 400 | 40 GB HBM2e | NVLink (600 GB/s) | 50 nodes × 8 GPUs
V100-32 GB | 200 | 32 GB HBM2 | PCIe Gen3 (16 GB/s) | 50 nodes × 4 GPUs

Problem 37 (Ch. 9, §Debugging Cluster Utilization)

A platform team manages a 1000-GPU cluster. Initial monitoring shows 60 percent average utilization. After a week of policy debugging (capability-based scheduling, topology-aware placement, backfill tuning, and workload-hardware guidance), utilization increases to 84 percent. What is the financial value of that one week of engineering work?


Ch. 10

Problem 38 (Ch. 10, §The Memory Wall and the Efficiency Frontier)

A 70B parameter LLM is deployed on 8× H100 GPUs with tensor parallelism. At batch size 1, each GPU holds approximately 8.75B parameters in FP16 (17.5 GB of weights). Each decode step reads its weight shard to produce one token. What is the achieved arithmetic intensity, and what is the theoretical maximum token generation rate?


Problem 39 (Ch. 10, §Precision Engineering)

A 70B parameter model is served on 4× H100 GPUs. The model weights in FP16 consume 140 GB (35 GB per GPU). KV cache at FP16 consumes 1.34 GB per request. How does quantizing weights to INT4 and KV cache to INT8 change the maximum batch size?

Before optimization (all FP16):

  • Weights: 35 GB/GPU

  • Available for KV cache: 80 − 35 = 45 GB/GPU

  • KV cache per request: 1.34 GB total, or approximately 0.3355 GB/GPU

  • Maximum batch size: approximately 134 requests

After optimization (INT4 weights, INT8 KV cache):

  • Weights: 35 GB × (4/16) = 8.75 GB/GPU (INT4)

  • Available for KV cache: 80 − 8.75 = 71.25 GB/GPU

  • KV cache per request (INT8): approximately 0.671 GB total, or 0.1678 GB/GPU

  • Maximum batch size: approximately 424 requests
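
The capacity arithmetic above can be sketched as a single function. It assumes, as the bullets do, that all HBM not holding weights is available for KV cache, with activations and runtime overhead ignored. The FP16 per-request size of 1.342 GB is inferred from the 0.3355 GB/GPU figure, and the function name is illustrative.

```python
import math


def max_batch(hbm_gb_per_gpu: float, weights_gb_per_gpu: float,
              kv_gb_per_request: float, gpus: int) -> int:
    """Requests that fit when the KV cache is sharded evenly across the GPUs."""
    free_gb = hbm_gb_per_gpu - weights_gb_per_gpu
    kv_gb_per_gpu = kv_gb_per_request / gpus
    return math.floor(free_gb / kv_gb_per_gpu)


fp16 = max_batch(80, 35.0, 1.342, gpus=4)                  # FP16 weights, FP16 KV cache
quantized = max_batch(80, 35.0 * 4 / 16, 0.671, gpus=4)    # INT4 weights, INT8 KV cache
print(f"FP16: {fp16} requests, INT4/INT8: {quantized} requests, "
      f"gain: {quantized / fp16:.1f}x")
```

The gain exceeds 2× because quantization both frees weight memory and halves the per-request KV footprint.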


Problem 40 (Ch. 10, §Graph Compilation)

A 13B parameter model is deployed for inference. Without compilation, the PyTorch eager mode processes 120 tokens/second on a single H100. The Nsight Systems trace reveals that 35 percent of step time is spent in element-wise kernels (LayerNorm, GELU, residual additions) and 15 percent is kernel launch overhead. Applying torch.compile with the max-autotune backend, estimate the new throughput.


Problem 41 (Ch. 10, §System Profiling)

A 7B parameter LLM runs on a single H100 and shows 45 tokens/second during autoregressive generation at batch size 1. The pure FP16 weight-read roofline is higher, but after budgeting roughly 35 GB of total per-token traffic for weights, KV-cache reads, sampling, synchronization, and launch overhead, the bandwidth-limited full-decode ceiling is approximately 96 tokens/second. Where is the remaining 53 percent of realizable decode-step performance hiding?

Investigation:

Step 1: Kernel-level analysis. Run Nsight Compute on the dominant GEMM kernel. Result: achieves 2.8 TB/s effective bandwidth out of the H100’s 3.35 TB/s peak. Efficiency: 84 percent. This kernel is performing well.

Step 2: Trace-level analysis. Run Nsight Systems on a full decode step. Result: 42 percent of step time is spent in GEMM kernels. The remaining 58 percent is split between:

  • Attention kernels (including KV cache reads): 28 percent

  • Layer normalization and activation kernels: 12 percent

  • Softmax and top-$k$ sampling: 8 percent

  • Kernel launch gaps: 10 percent

Step 3: Identify optimization targets.

  • The KV cache attention kernel achieves only 1.9 TB/s bandwidth because of irregular memory access patterns. Fix: Implement KV cache quantization (INT8) with better memory layout.

  • Kernel launch gaps (10 percent of time) come from 120+ individual kernel launches per layer. Fix: Apply torch.compile to fuse element-wise operations, reducing to ~30 kernels per layer.

  • Layer normalization and activation kernels are unfused. Fix: Fused LayerNorm-GELU kernel via Triton.


Ch. 11

Problem 42 (Ch. 11, §The Economics and Architecture of Inference)

A team has spent $2 million training a 70B parameter model and now serves it to 1 million daily active users (DAU), each making 50 requests/day. Is training or serving the dominant cost over 1 year?


Problem 43 (Ch. 11, §Batching Strategies at Scale)

An engineer optimizes two services: a Vision model (ResNet) and an LLM (70B). At what batch size does each hit the “Knee” of its efficiency curve?


Problem 44 (Ch. 11, §The Logic Wall: Test-Time Compute Scaling)

Calculate the latency impact of a model that uses 128 “Thinking Tokens” to solve a complex math proof vs. a standard answer.

  • Standard Response: 1 token answer = 100 ms.
  • Reasoning Response: 128 tokens of internal search/CoT before the answer.
  • The latency: 128 × 100 ms = 12.8 seconds.

Problem 45 (Ch. 11, §The Logic Wall: Test-Time Compute Scaling)

An LLM serving system exhibits unexpectedly high tail latency. P50 latency is 100 ms and P95 is 180 ms, both within SLO, but P99 spikes to 500 ms against a 200 ms target. GPU utilization appears healthy at 85 percent. Where is the bottleneck?

Infrastructure Layer Analysis (Hardware Constraints):

The system runs on 4 A100-80 GB GPUs connected via PCIe Gen4 (32 GB/s per GPU) rather than NVLink. The server has a dual-socket CPU with NUMA topology. Memory bandwidth per GPU is 2 TB/s (HBM2e), adequate for decode operations. However, PCIe bandwidth limits tensor parallelism communication to 32 GB/s vs. NVLink’s 600 GB/s. For a batch requiring 100 MB activation transfers between GPUs, PCIe adds approximately 3.1 ms per synchronization point vs. 0.17 ms for NVLink.

Distribution Layer Analysis (Algorithmic Behavior):

Dynamic request-level batching is configured with max batch size 32 and timeout 50 ms. Examining the batch size distribution reveals the problem: 90 percent of batches contain 4 to 8 requests (explaining good P50/P95), but 5 percent of batches reach the full 32 requests. These large batches occur during traffic bursts and experience head-of-line blocking: short requests that would complete quickly must wait for long-sequence requests in the same batch to finish all their decode iterations.

Why does this occur despite batching being enabled? Further investigation reveals that the scheduler uses FIFO request-level batching without iteration-level eviction. A burst of 32 simultaneous arrivals enters the same batch, and short requests are not released until the full request-level batch completes.


Problem 46 (Ch. 11, §KV Cache Management)

A deployment serves Llama-3-70B (FP16 weights ≈ 140 GB) on an 8× H100 node (640 GB total HBM). The goal is to determine the maximum batch size for a context length of 128K tokens.

Formula:

$M_{\text{KV}} = 2 \times n_{\text{layers}} \times n_{\text{KV heads}} \times d_{\text{head}} \times s_{\text{elem}}$

$\text{Total Memory} = M_{\text{weights}} + (\text{Batch} \times \text{Context} \times M_{\text{KV}})$

Parameters:

  • $n_{\text{layers}} = 80$, $n_{\text{KV heads}} = 8$ for Llama-3-70B GQA, $d_{\text{head}} = 128$.
  • $s_{\text{elem}} = 2$ bytes (FP16).
  • Context = 131,072 tokens.
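
The formula and parameters above can be evaluated directly. This is a sketch under simplifying assumptions: decimal gigabytes ($10^9$ bytes), the full 640 GB of HBM usable, and no allowance for activations or framework overhead, so a real deployment would fit fewer requests. Function names are illustrative.

```python
import math


def kv_bytes_per_token(n_layers: int, n_kv_heads: int, d_head: int,
                       bytes_per_elem: int) -> int:
    """M_KV: keys and values (factor 2) for every layer and KV head, per token."""
    return 2 * n_layers * n_kv_heads * d_head * bytes_per_elem


def max_batch(total_hbm_gb: float, weights_gb: float,
              context_tokens: int, per_token_bytes: int) -> int:
    """Largest batch with weights + batch x context x M_KV within total memory."""
    per_request_gb = context_tokens * per_token_bytes / 1e9
    return math.floor((total_hbm_gb - weights_gb) / per_request_gb)


m_kv = kv_bytes_per_token(n_layers=80, n_kv_heads=8, d_head=128, bytes_per_elem=2)
per_request_gb = 131_072 * m_kv / 1e9
batch = max_batch(640, 140, 131_072, m_kv)
print(f"M_KV = {m_kv:,} bytes/token, {per_request_gb:.1f} GB per 128K request, "
      f"max batch = {batch}")
```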

Problem 47 (Ch. 11, §KV Cache Management)

A team serves a 70B model on 4× H100 GPUs. Weights in FP16 consume 140 GB (35 GB/GPU). KV cache at FP16 consumes 10.7 GB per request. How does quantizing weights to INT4 and KV cache to INT8 change the maximum batch size?

Before optimization (all FP16):

  • Weights: 35 GB/GPU

  • Available for KV cache: 80 − 35 = 45 GB/GPU

  • KV cache per request: 10.7 GB ÷ 4 GPUs ≈ 2.7 GB/GPU

  • Maximum batch size: ⌊45 / 2.7⌋ = 16 requests

After optimization (INT4 weights, INT8 KV cache):

  • Weights: 35 GB × (4/16) = 8.75 GB/GPU (INT4)

  • Available for KV cache: 80 − 8.75 = 71.25 GB/GPU

  • KV cache per request (INT8): 5.35 GB ÷ 4 GPUs ≈ 1.34 GB/GPU

  • Maximum batch size: ⌊71.25 / 1.34⌋ = 53 requests

Quantization therefore raises the maximum batch size from 16 to 53 requests, roughly a 3.3× increase.
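
The same arithmetic can be checked with a short Python sketch. It assumes the KV cache is split evenly across the 4 GPUs and that everything outside the weights is free for KV cache. It keeps unrounded intermediate values, which matters here: dividing by the rounded 1.3 GB/GPU would give 54 rather than 53.

```python
import math

HBM_PER_GPU_GB = 80
NUM_GPUS = 4


def max_batch(weights_gb_per_gpu, kv_gb_per_request):
    # Free HBM on one GPU divided by that GPU's share of one request's KV cache.
    free_gb = HBM_PER_GPU_GB - weights_gb_per_gpu
    kv_gb_per_gpu = kv_gb_per_request / NUM_GPUS
    return math.floor(free_gb / kv_gb_per_gpu)


fp16_weights_per_gpu = 140 / NUM_GPUS  # 35 GB/GPU

# FP16 weights, FP16 KV cache.
before = max_batch(fp16_weights_per_gpu, 10.7)

# INT4 weights are 4/16 the size of FP16; INT8 KV cache is half the size.
after = max_batch(fp16_weights_per_gpu * 4 / 16, 10.7 / 2)

print(f"before: {before} requests, after: {after} requests")
print(f"improvement: {after / before:.1f}x")
```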


Problem 48 (Ch. 11, §Model Sharding for Inference)

A team deploys DeepSeek-V3 (671B total parameters, 37B active per token) for a chatbot application. The model uses FP8 weights (1 byte per parameter). The cluster has 8-GPU nodes, each with 8× H100 GPUs (80 GB HBM per GPU, 640 GB per node). How many nodes are needed, and what is the per-token decode latency?
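
One way to approach this is a capacity check followed by a bandwidth-bound estimate, sketched below. The HBM bandwidth figure (3.35 TB/s, the H100 SXM peak) is not given in the problem and is an assumption here. The latency is only a lower bound: it assumes the active weights are spread evenly over all GPUs and read in parallel, and it ignores KV cache reads, inter-GPU communication, and kernel overhead.

```python
import math

total_params = 671e9
active_params = 37e9
bytes_per_param = 1            # FP8
node_hbm_bytes = 640e9
gpus_per_node = 8
hbm_bw_bytes_per_s = 3.35e12   # assumption: H100 SXM peak HBM bandwidth

# Capacity: every expert must be resident, so all 671B parameters count.
weight_bytes = total_params * bytes_per_param
nodes = math.ceil(weight_bytes / node_hbm_bytes)
spare_gb = (nodes * node_hbm_bytes - weight_bytes) / 1e9

# Decode at small batch is memory-bound: each token reads the active weights once.
aggregate_bw = nodes * gpus_per_node * hbm_bw_bytes_per_s
latency_ms = active_params * bytes_per_param / aggregate_bw * 1e3

print(f"nodes needed: {nodes} (spare HBM for KV cache: {spare_gb:.0f} GB)")
print(f"bandwidth-bound decode latency: {latency_ms:.2f} ms/token (lower bound)")
```

The weights alone (671 GB) exceed one node's 640 GB, so two nodes is the minimum even before any KV cache is allocated.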


Ch. 12

Problem 49 (Ch. 12, §The Edge Learning Paradigm)

A MobileNet classifier runs on both a generic mobile CPU and a specialized Neural Processing Unit (NPU). How much “Silicon Dividend” does the NPU provide in terms of speed and battery life?


Problem 50 (Ch. 12, §Design Constraints)

A team is designing a background fine-tuning job for a personalized voice assistant on a smartphone. The training job consumes 4.5 Watts and takes 30 minutes to complete. If the phone has a 15 Wh battery, how much of the user’s battery will this “invisible” update consume?
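
The energy budget follows directly from power × time. A minimal sketch, assuming the 4.5 W draw is constant for the whole job:

```python
power_w = 4.5
duration_h = 30 / 60      # 30 minutes
battery_wh = 15

energy_wh = power_w * duration_h
battery_fraction = energy_wh / battery_wh

print(f"energy used: {energy_wh:.2f} Wh")
print(f"share of battery: {battery_fraction:.0%}")
```

The update costs 2.25 Wh, or 15 percent of the battery, which is hard to call invisible unless the job runs while the phone is charging.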


Problem 51 (Ch. 12, §Model Adaptation)

A team is deploying a 10M parameter vision model to a smartphone with support for 10 different “User Contexts” (Home, Office, Car, etc.). If a full fine-tuned model requires 40 MB, how much storage does using Residual Adapters save instead?


Problem 52 (Ch. 12, §Federated Learning: Algorithms)

A team is designing a federated camera-personalization system that learns from 195 MB of compressed user images per week. Should the system upload the raw images to the cloud for training, or use federated learning to send model updates instead?


Ch. 13

Problem 53 (Ch. 13, §From Single-Model to Platform Operations)

A platform team manages a fleet of 100 GPUs. Under dedicated per-team quotas, average idle time is 70%. Moving to a Multi-Tenant ML Platform that shares resources across teams and uses idle training GPUs for inference reduces aggregate idle time to 30%. What hardware cost does the platform save?
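
The problem gives no per-GPU price, so the sketch below expresses the saving in GPUs rather than dollars. It assumes the amount of useful work stays constant and asks how many GPUs would deliver it at the lower idle rate.

```python
import math

fleet_gpus = 100
idle_before = 0.70
idle_after = 0.30

# GPU-equivalents of useful work actually performed today.
useful_gpus = fleet_gpus * (1 - idle_before)

# GPUs required to deliver the same work at the improved utilization.
gpus_needed = math.ceil(useful_gpus / (1 - idle_after))
gpus_saved = fleet_gpus - gpus_needed

print(f"useful work: {useful_gpus:.0f} GPU-equivalents")
print(f"GPUs needed on the shared platform: {gpus_needed}")
print(f"GPUs saved: {gpus_saved}")
```

Read the other way, the same 100 GPUs deliver about 2.3× more useful work (70 vs. 30 GPU-equivalents) once shared.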


Problem 54 (Ch. 13, §From Single-Model to Platform Operations)

An organization manages 50 models. A centralized ML Platform team costs $120,000/month. If the platform saves each model team 20 hours of manual toil per month, is the platform investment profitable?
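
Since the problem does not state what an engineering hour costs, a useful framing is the break-even hourly rate. The sketch assumes one team per model and that saved hours convert fully into value.

```python
num_models = 50
hours_saved_per_model = 20
platform_cost_usd = 120_000      # per month

hours_saved = num_models * hours_saved_per_model
breakeven_rate_usd = platform_cost_usd / hours_saved

print(f"hours saved per month: {hours_saved}")
print(f"break-even engineer cost: ${breakeven_rate_usd:.0f}/hour")
```

The platform pays for itself on toil savings alone only if a fully loaded engineering hour costs more than $120; below that, the case has to rest on other benefits.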


Problem 55 (Ch. 13, §From Single-Model to Platform Operations)

A team spends 40 hours/month manually fixing “broken plumbing” (stale data, failed scripts, manual monitoring). A one-month intensive cleanup (160 hours) is projected to reduce this to 8 hours/month. Is the cleanup worth it over a 3-year model lifecycle?
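
A payback sketch, assuming the reduced toil applies for all 36 months of the lifecycle (counting the cleanup month separately would trim the result by one month of savings):

```python
toil_before_h = 40       # hours per month
toil_after_h = 8
cleanup_cost_h = 160
lifecycle_months = 36

monthly_saving_h = toil_before_h - toil_after_h
payback_months = cleanup_cost_h / monthly_saving_h
net_saving_h = monthly_saving_h * lifecycle_months - cleanup_cost_h

print(f"payback period: {payback_months:.0f} months")
print(f"net saving over {lifecycle_months} months: {net_saving_h} hours")
```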


Problem 56 (Ch. 13, §From Single-Model to Platform Operations)

A team spends 10 hours of manual toil per model deployment. Investing 120 hours in a CI/CD pipeline is projected to reduce deployment toil to 0.5 hours. At 3 deploys per week, how long until the automation pays for itself?
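
The break-even point can be sketched as the investment divided by the hours saved per deployment:

```python
import math

toil_before_h = 10
toil_after_h = 0.5
investment_h = 120
deploys_per_week = 3

saved_per_deploy_h = toil_before_h - toil_after_h
breakeven_deploys = investment_h / saved_per_deploy_h
deploys_needed = math.ceil(breakeven_deploys)   # whole deployments
breakeven_weeks = breakeven_deploys / deploys_per_week

print(f"deployments to break even: {deploys_needed}")
print(f"time to break even: {breakeven_weeks:.1f} weeks")
```

At 9.5 hours saved per deployment, the pipeline pays for itself on the 13th deployment, a little over four weeks in.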


Problem 57 (Ch. 13, §CI/CD for ML at Scale)

A team is deploying a new ranking model. A “Blue-Green” deployment (100 percent cutover) exposes all users to any potential bugs. A “Canary” deployment starts at 5% traffic. By how much does the canary approach reduce the deployment’s “Risk Exposure”?


Problem 58 (Ch. 13, §CI/CD for ML at Scale)

A team deploys a recommendation model that has a silent bug: it reduces Click-Through Rate (CTR) by 0.5 percentage points, from 5.0 percent to 4.5 percent (a 10 percent relative drop). If the service handles 5,000 QPS and each click is worth $0.50, how much revenue is lost if detection and remediation take 24 hours?
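
A sketch of the loss calculation, assuming traffic is a flat 5,000 QPS for the full 24 hours and every request is a click opportunity:

```python
qps = 5_000
outage_seconds = 24 * 3600
ctr_drop = 0.005             # 0.5 percentage points (5.0% -> 4.5%)
value_per_click_usd = 0.50

requests = qps * outage_seconds
lost_clicks = requests * ctr_drop
lost_revenue_usd = lost_clicks * value_per_click_usd

print(f"requests served: {requests:,}")
print(f"clicks lost: {lost_clicks:,.0f}")
print(f"revenue lost: ${lost_revenue_usd:,.0f}")
```

A bug that raises no errors still costs about $1.08 million per day, or roughly $45,000 for every hour it goes undetected.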


Problem 59 (Ch. 13, §Monitoring at Scale)

A production model has a baseline accuracy of 95%. A data drift event causes accuracy to drop by 2%. If the service receives 1,000 requests per hour with labels, how long is needed to statistically prove the model has degraded?
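
One possible treatment is a one-sample test of proportions, sketched below. Several choices here are assumptions that the problem does not fix: the "2%" drop is read as 2 percentage points (95% to 93%), the test is one-sided at a 5 percent significance level, and the target power is 80 percent. Different choices change the sample size considerably.

```python
import math
from statistics import NormalDist


def samples_to_detect_drop(p0, p1, alpha=0.05, power=0.80):
    """Normal-approximation sample size to detect accuracy falling from p0 to p1."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)   # one-sided significance threshold
    z_beta = NormalDist().inv_cdf(power)
    numerator = (z_alpha * math.sqrt(p0 * (1 - p0))
                 + z_beta * math.sqrt(p1 * (1 - p1)))
    return math.ceil((numerator / (p0 - p1)) ** 2)


labeled_per_hour = 1_000
n = samples_to_detect_drop(0.95, 0.93)
hours = n / labeled_per_hour

print(f"labeled samples needed: {n}")
print(f"time to collect them: {hours:.2f} hours")
```

Under these assumptions the answer is on the order of 800 labeled samples, so detection takes somewhat under an hour of labeled traffic. Smaller drops need far more data, because the sample size scales with the inverse square of the gap.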


Ch. 14

Problem 60 (Ch. 14, §The Expanded Attack Surface)

Consider computing the average salary of 1000 employees while guaranteeing privacy budget ε = 1.0. The salaries range from $0 to $200,000. How much noise must the mechanism add?
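
A common way to answer this is the Laplace mechanism, which the sketch below assumes (the problem does not name a mechanism). With the number of employees fixed and public, changing one salary moves the mean by at most range / n, and the noise scale is that sensitivity divided by ε.

```python
import math
import random


def laplace_scale(value_range, n, epsilon):
    # Sensitivity of a bounded mean: one record can shift it by at most range / n.
    sensitivity = value_range / n
    return sensitivity / epsilon


def sample_laplace(scale, rng):
    # The difference of two independent Exp(1) draws follows Laplace(0, 1).
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


scale = laplace_scale(value_range=200_000, n=1000, epsilon=1.0)
noise_std = math.sqrt(2) * scale

rng = random.Random(0)   # seeded only to make the demonstration repeatable
example_noise = sample_laplace(scale, rng)

print(f"Laplace scale b: ${scale:.0f}")
print(f"noise standard deviation: ${noise_std:.0f}")
print(f"one sampled noise value: ${example_noise:+.2f}")
```

The noise scale is $200 (standard deviation about $283), which is small relative to typical salaries because the sensitivity shrinks with the number of records.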


Problem 61 (Ch. 14, §The Expanded Attack Surface)

A platform team hosts two models on a single H100 using Multi-Instance GPU (MIG) to provide hardware-level isolation. On a dedicated GPU, the model achieves 1,000 tokens per second. After enabling secure partitioning, it achieves 850 tokens per second. What is the performance cost of security?


Problem 62 (Ch. 14, §Comprehensive Defense Architectures)

A team deploying a health-monitoring model compares three security levels:

  • Plaintext: Standard inference.
  • Encrypted Transport (AES): Model/Data encrypted at rest/transit.
  • Encrypted Compute (FHE): Inference performed on encrypted data.

How do these choices affect real-time responsiveness?

Ch. 16

Problem 63 (Ch. 16, §A Unified Framework for Robust AI)

A team is training a robust classifier for an autonomous vehicle, using adversarial training (PGD-7) that generates a worst-case attack for every training batch. How much does this “robustness tax” slow down the training run?


Problem 64 (Ch. 16, §Environmental Shifts)

A model monitors a critical input feature. The baseline mean was 0.5. Over the last 1,000 requests, the mean has shifted to 0.55. The engineering question is whether this is a random fluctuation or a real distribution shift.


Problem 65 (Ch. 16, §Adversarial Defenses)

The goal is to make a ResNet-50 more robust to small norm-bounded image perturbations (ε = 8/255). What does this cost in clean ImageNet performance?

Data:

  • Standard ResNet-50: 76 percent Top-1 Accuracy on ImageNet.
  • Adversarially Trained ResNet-50 (ε = 8/255): ~50 percent Top-1 Accuracy on Clean ImageNet.

Ch. 17

Problem 66 (Ch. 17, §The Energy Ceiling)

A team trains a large model (GPT-3 size) consuming 1,287 MWh. How much CO2 is emitted, and how does that compare to a trans-Atlantic flight?


Problem 67 (Ch. 17, §The Energy Ceiling)

A team is planning a large training run requiring 10,000 MWh of energy, choosing between three regions with different electricity prices and carbon intensities. With an internal carbon tax of USD 100/tonne, which region minimizes the true Total Cost of Ownership (TCO)?

Solution: We invoke the PlacementOptimizer to synthesize grid carbon intensity, regional electricity rates, and the carbon tax into a single optimization objective.


Problem 68 (Ch. 17, §The Energy Ceiling)

A team is choosing a data center for a 10,000 MWh training run.

  • Site A (Quebec): Hydropower, 20 g CO₂/kWh.
  • Site B (Poland): Coal-heavy, 800 g CO₂/kWh.

How does the location affect a model’s carbon footprint?
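
The comparison reduces to energy × carbon intensity. A minimal sketch, assuming the run consumes the same 10,000 MWh at either site:

```python
energy_kwh = 10_000 * 1_000          # 10,000 MWh expressed in kWh

# Grid carbon intensity in grams of CO2 per kWh, as given in the problem.
intensity_g_per_kwh = {
    "Site A (Quebec)": 20,
    "Site B (Poland)": 800,
}

# 1 tonne = 1e6 grams.
emissions_tonnes = {
    site: energy_kwh * grams / 1e6
    for site, grams in intensity_g_per_kwh.items()
}
ratio = emissions_tonnes["Site B (Poland)"] / emissions_tonnes["Site A (Quebec)"]

for site, tonnes in emissions_tonnes.items():
    print(f"{site}: {tonnes:,.0f} tonnes CO2")
print(f"Site B emits {ratio:.0f}x more than Site A")
```

The identical training run emits 200 tonnes in Quebec and 8,000 tonnes in Poland, a 40× difference that comes from location alone.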

Problem 69 (Ch. 17, §Energy Measurement and Modeling)

A team operates a 2.0 MW cluster. If the facility can be optimized from the industry average PUE (1.58) to state-of-the-art (1.10), how much energy and money does that save annually?
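
A sketch of the saving, under two assumptions that the problem leaves open: the 2.0 MW figure is the IT load (so facility power is IT load × PUE), and electricity costs $0.10/kWh, a hypothetical price chosen only to illustrate the conversion to dollars.

```python
it_load_mw = 2.0                 # assumption: 2.0 MW is the IT load
hours_per_year = 8760
pue_before = 1.58
pue_after = 1.10
price_usd_per_kwh = 0.10         # hypothetical electricity price

# Facility energy = IT energy * PUE, so the saving is IT energy * (PUE difference).
saved_mwh = it_load_mw * hours_per_year * (pue_before - pue_after)
saved_usd = saved_mwh * 1_000 * price_usd_per_kwh

print(f"energy saved: {saved_mwh:,.1f} MWh/year")
print(f"cost saved at ${price_usd_per_kwh:.2f}/kWh: ${saved_usd:,.0f}/year")
```

The energy saving (about 8,410 MWh per year) does not depend on the price assumption; the dollar figure scales linearly with whatever rate applies.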


Problem 70 (Ch. 17, §Training vs. Inference Energy Analysis)

Consider fine-tuning a small language model (1B parameters) on a user’s smartphone overnight. Is this feasible within a 5 percent battery budget?


Ch. 18

Problem 71 (Ch. 18, §Core Principles and the ML Lifecycle)

Consider a credit model with 85 percent accuracy. Group A (majority) has a 20 percent default rate. Group B (minority) has a 40 percent default rate due to systemic factors. If Demographic Parity (equal approval rates) is enforced, what happens to accuracy?


Problem 72 (Ch. 18, §Bias Detection and Fairness Monitoring)

Consider optimizing a hiring model. The “unconstrained” model reaches 92% accuracy but exhibits a 15 percent disparity between demographic groups. Applying a fairness constraint (demographic parity) eliminates the disparity. What is the “bias tax” on model performance?


Ch. 19

Problem 73 (Ch. 19, §The Path Forward)

Suppose a future workload needs a **100× efficiency improvement** over today’s frontier clusters. If a particular workload saw 4.0× from hardware and 2.5× from algorithmic compression, where would the remaining gain have to come from?
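
The size of the remaining gap can be sketched quickly, assuming the gains from different sources compound multiplicatively rather than add:

```python
target_gain = 100.0
hardware_gain = 4.0
algorithm_gain = 2.5

achieved_gain = hardware_gain * algorithm_gain
remaining_gain = target_gain / achieved_gain

print(f"achieved so far: {achieved_gain:.0f}x")
print(f"still required: {remaining_gain:.0f}x")
```

Hardware and compression together deliver 10×, leaving another 10× to find elsewhere in the stack.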


Problem 74 (Ch. 19, §The Path Forward)

Modern GPU clusters are hitting the energy wall. Public optical I/O materials describe moving from roughly 6–10 pJ/bit long-reach electrical signaling toward below 5 pJ/bit optical signaling. If we use 10.0 pJ/bit and 5.0 pJ/bit as a simple scenario, what is the efficiency dividend?


Problem 75 (Ch. 19, §Engineering Intelligence at Scale)

A hypothetical frontier-scale cluster is compared against rough estimates of human brain synaptic activity. This Fermi-style sanity check is not a like-for-like operation metric; what is the order-of-magnitude relationship between the two?

The machine (hypothetical frontier cluster):

  • Cluster: 25,000 H100 GPUs. GPT-4’s actual hardware was not disclosed.

  • Ops/sec: 25,000 × H100 FP16 tensor peak ≈ **$2.47 \times 10^{19}$ FLOPS** (rounded; H100 FP16 tensor peak is 989 TFLOPS).

  • Power: 25,000 × 700 W ≈ 17.5 MW.

Brain:

  • Synapses: $10^{14}$ synapses (connections).

  • Firing Rate: illustrative average spike rate ≈ 1 Hz; estimates vary by neuron type and brain region, and average cortical firing is far below the 100 Hz peak rates often used in casual comparisons.

  • Ops/sec: $10^{14} \times 1$ = **$1.0 \times 10^{14}$ Synaptic Ops/sec**.
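
The order-of-magnitude comparison can be sketched from the figures listed above. As the problem itself warns, a FLOP and a synaptic event are not equivalent operations, so the ratio is a Fermi estimate only.

```python
import math

num_gpus = 25_000
h100_fp16_flops = 989e12         # peak tensor FLOPS per GPU
gpu_power_w = 700

synapses = 1e14
firing_rate_hz = 1.0             # illustrative average, per the problem statement

cluster_ops = num_gpus * h100_fp16_flops
cluster_power_mw = num_gpus * gpu_power_w / 1e6
brain_ops = synapses * firing_rate_hz

ratio = cluster_ops / brain_ops
orders_of_magnitude = math.log10(ratio)

print(f"cluster: {cluster_ops:.2e} FLOPS at {cluster_power_mw:.1f} MW")
print(f"brain:   {brain_ops:.1e} synaptic ops/sec")
print(f"ratio:   {ratio:.2e} (~{orders_of_magnitude:.1f} orders of magnitude)")
```

With these inputs the cluster's peak rate exceeds the brain estimate by a bit more than five orders of magnitude. Using a 100 Hz firing rate instead would shrink the gap by two of them, which shows how sensitive the comparison is to that single assumption.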


End-of-Book Exercises

Exercise 1: C³ classification (Ch. A, §Exercises)

A 512-GPU training job shows 45 percent MFU per device, but NCCL logs reveal that AllReduce consumes 55 percent of each training step. Which C³ axis is the bottleneck? Name two specific optimizations and explain why each targets the correct axis.


Exercise 2: Fleet law decomposition (Ch. A, §Exercises)

A training step on a 1,024-GPU cluster takes 200 ms. Profiling reveals: forward + backward pass = 100 ms, AllReduce = 60 ms, pipeline bubble + checkpoint = 40 ms. Calculate $\eta_{\text{fleet}}$. Which C³ axis would you optimize first, and why?
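
A sketch of the decomposition, assuming that the fleet efficiency is the share of step time spent on forward and backward compute. That definition is an assumption made for this example; the exercise does not restate the formula.

```python
step_ms = 200
breakdown_ms = {
    "forward + backward": 100,
    "AllReduce": 60,
    "pipeline bubble + checkpoint": 40,
}

# The profiled components should account for the whole step.
assert sum(breakdown_ms.values()) == step_ms

# Assumed definition: useful compute time over total step time.
eta_fleet = breakdown_ms["forward + backward"] / step_ms

overheads = {k: v for k, v in breakdown_ms.items() if k != "forward + backward"}
largest_overhead = max(overheads, key=overheads.get)

print(f"eta_fleet = {eta_fleet:.2f}")
for name, ms in breakdown_ms.items():
    print(f"  {name}: {ms / step_ms:.0%} of step")
print(f"largest overhead: {largest_overhead}")
```

Under that definition half of every step is overhead, and AllReduce is the largest single contributor at 30 percent of the step, which makes it the natural first target.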


Exercise 3: Effective FLOPS calculation (Ch. A, §Exercises)

A team provisions 2,048 H100 GPUs. The cluster achieves 50 percent MFU, 50 percent scaling efficiency, and 85 percent goodput ratio. Calculate the effective FLOPs as a fraction of peak. If a scaling law predicts that $10^{24}$ FLOPs of training compute will reach a target loss, how many raw peak FLOPs must be provisioned to account for the C³ tax?
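
A sketch of the calculation, assuming the three efficiency factors are independent and compound multiplicatively:

```python
mfu = 0.50
scaling_efficiency = 0.50
goodput_ratio = 0.85
target_flops = 1e24              # useful training compute from the scaling law

effective_fraction = mfu * scaling_efficiency * goodput_ratio
required_peak_flops = target_flops / effective_fraction
overprovision_factor = 1 / effective_fraction

print(f"effective fraction of peak: {effective_fraction:.4f}")
print(f"peak FLOPs to provision: {required_peak_flops:.2e}")
print(f"overprovisioning factor: {overprovision_factor:.1f}x")
```

Only about 21 percent of peak turns into useful training compute, so the team must provision roughly 4.7× the FLOPs that the scaling law alone suggests.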


Exercise 4: Anti-pattern detection (Ch. A, §Exercises)

A colleague proposes upgrading the cluster’s InfiniBand from HDR (200 Gbps) to NDR (400 Gbps) because “training is too slow.” Before approving the network upgrade, what three C³ diagnostic questions would you ask? Map each to its C³ axis.