
ML Systems Vol 1 Problems

ML Systems Practice: Vol 1

Questions from https://mlsysbook.ai/ Vol 1

Reference Spec Sheets

Hardware Ridge Points

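Accelerators have characteristic hardware ridge points where operations transition from memory-bound to compute-bound. The ridge point is peak compute throughput divided by peak memory bandwidth. The A100, with 312 TFLOPS of FP16 Tensor Core throughput and 2.0 TB/s of bandwidth, has a ridge point of approximately 153 FLOP/byte (312 / 2.0 gives 156; the quoted 153 corresponds to the 2.039 TB/s of the 80 GB part). The H100 SXM, with 989 TFLOPS of FP16 Tensor Core throughput and 3.35 TB/s of bandwidth, has a ridge point of approximately 295 FLOP/byte. Operations whose arithmetic intensity falls below the ridge point are memory-bound; those above it are compute-bound.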
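As a minimal sketch of the arithmetic in this paragraph, the ridge point is peak throughput divided by memory bandwidth. The function and constant names are illustrative and not from the source. The inputs are the rounded figures quoted in the text, so the A100 comes out at 156 FLOP/byte rather than the quoted 153.

```python
def ridge_point(peak_flops: float, bandwidth_bytes_per_s: float) -> float:
    """Arithmetic intensity (FLOP/byte) at which compute time equals memory time."""
    return peak_flops / bandwidth_bytes_per_s


def classify(intensity_flop_per_byte: float, ridge: float) -> str:
    """Label an operation relative to a given ridge point."""
    return "compute-bound" if intensity_flop_per_byte >= ridge else "memory-bound"


# Peak FP16 Tensor Core throughput and memory bandwidth as quoted in the text.
A100_RIDGE = ridge_point(312e12, 2.0e12)
H100_RIDGE = ridge_point(989e12, 3.35e12)

print(f"A100 ridge point: {A100_RIDGE:.0f} FLOP/byte")
print(f"H100 ridge point: {H100_RIDGE:.0f} FLOP/byte")

# Hypothetical operation at 5 FLOP/byte (an assumed value, chosen for illustration).
print(classify(5.0, A100_RIDGE))
```

An operation's classification depends on the device: the same arithmetic intensity can sit above one accelerator's ridge point and below another's.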

Operations on the Roofline

Table 11.12: Operations on the Roofline: Neural network layers span a wide range of arithmetic intensities. Large, well-tiled convolutions and batched GEMMs can be compute-bound, while small-batch dense projections, MobileNet depthwise layers, attention softmax, normalization, and DLRM embeddings are often memory-bound.

| Operation | Arithmetic Intensity | Classification | Lighthouse Example |
| --- | --- | --- | --- |
| Conv2D (Dense) | 50–200 FLOP/byte | Compute-bound | ResNet-50 |
| Dense MatMul (large batch, well-tiled) | 64–256+ FLOP/byte | Often compute-bound | GPT-2 (batched projections) |
| Depthwise Conv | 10–20 FLOP/byte | Memory-bound | MobileNet |
| Attention Softmax | 2–5 FLOP/byte | Memory-bound | GPT-2 (Generation) |
| LayerNorm | 5–10 FLOP/byte | Memory-bound | GPT-2/Llama |
| Embedding lookup | <1 FLOP/byte | Memory-bound | DLRM |

Numerical Precision Formats

Table 10.8 compares commonly used numerical precision formats in machine learning, each exhibiting distinct trade-offs in storage efficiency, computational speed, and energy consumption. Emerging formats like FP8 and TF32 have been introduced to further optimize performance, especially on AI accelerators.

Table 10.8: Numerical Precision Formats: Comparison of precision formats by bit width, memory reduction, computational efficiency, accuracy retention, and typical use cases across deployment contexts.

| Precision Format | Bit-Width | Storage Reduction (vs. FP32) | Compute Speed (vs. FP32) | Power Consumption | Use Cases |
| --- | --- | --- | --- | --- | --- |
| FP32 (Single-Precision Floating Point) | 32-bit | Baseline (1×) | Baseline (1×) | High | Training & inference (general-purpose) |
| FP16 (Half-Precision Floating Point) | 16-bit | 2× smaller | 2× faster on FP16-optimized hardware | Lower | Accelerated training, inference (NVIDIA Tensor Cores, TPUs) |
| bfloat16 (Brain Floating Point) | 16-bit | 2× smaller | Similar speed to FP16, better dynamic range | Lower | Training on TPUs, transformer-based models |
| TF32 (TensorFloat-32) | 19-bit | None (stored as FP32) | Up to 8× faster on NVIDIA Ampere GPUs | Lower | Training on NVIDIA GPUs |
| FP8 (Floating-Point 8-bit) | 8-bit | 4× smaller | Faster than INT8 in some cases | Significantly lower | Efficient training/inference (H100, AI accelerators) |
| INT8 (8-bit Integer) | 8-bit | 4× smaller | 4–8× faster than FP32 | Significantly lower | Quantized inference (Edge AI, mobile AI, NPUs) |
| INT4 (4-bit Integer) | 4-bit | 8× smaller | Hardware-dependent | Extremely low | Ultra-low-power AI, experimental quantization |
| Binary/Ternary (1-bit/2-bit) | 1–2-bit | 16–32× smaller | Highly hardware-dependent | Lowest | Extreme efficiency (binary/ternary neural networks) |
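The storage-reduction column follows directly from the bit widths. The sketch below reproduces it for the formats whose stored width equals their nominal width. TF32 is left out because the table notes it is stored as FP32. The 7-billion-parameter model size is an assumed example value, and the helper names are illustrative.

```python
# Stored bits per value for formats whose storage width equals their nominal width.
STORED_BITS = {"FP32": 32, "FP16": 16, "bfloat16": 16, "FP8": 8, "INT8": 8, "INT4": 4}


def storage_reduction(fmt: str) -> float:
    """How many times smaller a tensor is than its FP32 equivalent."""
    return STORED_BITS["FP32"] / STORED_BITS[fmt]


def weight_gb(num_params: int, fmt: str) -> float:
    """Weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * STORED_BITS[fmt] / 8 / 1e9


NUM_PARAMS = 7_000_000_000  # assumed example model size

for fmt in STORED_BITS:
    print(f"{fmt:>8}: {storage_reduction(fmt):.0f}x smaller, "
          f"{weight_gb(NUM_PARAMS, fmt):.1f} GB of weights")
```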

Deployment Decision Thresholds

Table 2.5: Deployment Decision Thresholds: Quantitative thresholds that practitioners use to determine deployment feasibility for each paradigm in Table 2.4. These numbers answer the practical question “can my workload run here?” by specifying the compute, memory bandwidth, and power envelope that each paradigm provides.

| Paradigm | Compute | Memory BW | Power | Latency |
| --- | --- | --- | --- | --- |
| Cloud ML | >1000 TFLOPS | >100 GB/s | PUE 1.1–1.3 | 100–500 ms |
| Edge ML | ~1 PFLOPS AI | >270 GB/s | 100s W | 10–100 ms |
| Mobile ML | 1–10 TOPS | 50–100 GB/s | 2 to 5 W | 5–50 ms |
| TinyML | <1 TOPS | | <1 mW always-on average target | 1–10 ms |

Representative Devices

Table 2.4: Hardware Spectrum (Concrete Platforms): Representative devices that instantiate each deployment paradigm from Table 2.1. Where the conceptual table defines operating regimes, this table provides the specific processors, memory capacities, power envelopes, and price points that practitioners use to match workloads to hardware. The DGX Spark sits at the high end of the edge spectrum; most edge deployments use far smaller devices (for example, Jetson Orin Nano). We include it to illustrate the ceiling of noncloud deployment.

| Category | Example Device | Processor | Memory | Storage | Power | Price Range |
| --- | --- | --- | --- | --- | --- | --- |
| Cloud ML | Google TPU v4 Pod | 4,096 TPU v4 chips, >1 EFLOP | 131 TB HBM2 | Cloud-scale (PB) | ~3 MW | Cloud service (rental) |
| Edge ML | NVIDIA DGX Spark | GB10 Grace Blackwell, 1 PFLOPS AI | 128 GB LPDDR5x | 4 TB NVMe | ~200 W | ~$3,000–5,000 |
| Mobile ML | Flagship Smartphone | Mobile SoC (CPU + GPU + NPU) | 8–16 GB RAM | 128 GB–1 TB | 2 to 5 W | USD 999+ |
| TinyML | ESP32-CAM | Dual-core @ 240 MHz | 520 KB RAM | 4 MB Flash | 0.05–1.2 W active board power | $10 |

Intra-Node Interconnect

  • Hardware (The Silicon): The physical foundation where bits are transformed. This layer is defined by HBM (High Bandwidth Memory) capacity and high-speed intra-node interconnects like NVLink (900 GB/s). Here, the memory wall acts as the primary physical constraint (Chapter 11).

Table 2.1: The Deployment Spectrum (Conceptual): Four paradigms span nine orders of magnitude in power (MW to mW) and memory (TB to KB). This conceptual overview defines each paradigm by its operating regime; Table 2.4 later grounds these categories in specific hardware platforms and quantitative decision thresholds. The hardware specifications and physical constants underpinning these numbers are catalogued in the System Assumptions appendix.

| Paradigm | Where | Latency | Power | Memory | Best For |
| --- | --- | --- | --- | --- | --- |
| Cloud ML | Data centers | 100-500 ms | MW | TB | Training, complex inference |
| Edge ML | Local servers | 10-100 ms | 100s W | GB | Real-time inference, privacy |
| Mobile ML | Smartphones | 5-50 ms | 3-5 W | GB | Personal AI, offline |
| TinyML | Microcontrollers | 1-10 ms | mW | KB | Always-on sensing |

Ch. 1

Problem 1 (Ch. 1, §Iron Law of ML Systems)

What is the training time for a GPT-3 class model on a cluster of A100 GPUs?


Ch. 2

Problem 2 (Ch. 2, §Analyzing Workloads)

Should a battery-powered sensor process data locally (TinyML) or send it to the cloud?


Problem 3 (Ch. 2, §System Balance and Hardware)

Is ResNet-50 inference compute bound or memory bound on (a) a high-end data center GPU (NVIDIA A100 class) and (b) a flagship mobile NPU (Apple/Qualcomm class)?


Problem 4 (Ch. 2, §Cloud ML: Computational Power)

Consider a real-time safety monitor for a robotic arm. The safety logic requires a 10 ms end-to-end response time to prevent injury. The model runs in a high-performance cloud data center 1,500 km away. Can the safety budget be met?


Problem 5 (Ch. 2, §Edge ML: Latency and Privacy)

Consider a quality control system for a factory floor with 100 cameras running at 30 FPS with 1080p resolution. Should the system stream to the cloud or process at the edge?


Problem 6 (Ch. 2, §Edge ML: Latency and Privacy)

Should a drone’s object avoidance system (4K, 60 FPS) offload to the cloud?


Problem 7 (Ch. 2, §Mobile ML: Offline Intelligence)

Consider deploying a “real-time” background object detector on a smartphone. The model consumes 2 Watts of continuous power when active. The phone has a standard 15 Watt-hour (Wh) battery. Can the feature stay on all day?


Problem 8 (Ch. 2, §Mobile ML: Offline Intelligence)

An unoptimized LLM requires 12 W peak compute. Can it be deployed on a mobile device?


Ch. 3

Problem 9 (Ch. 3, §ML Lifecycle)

A diabetic retinopathy (DR) screening system for rural clinics must choose between a large ensemble trained on high-resolution fundus images (training time: 1 week, accuracy: 95 percent) and a lightweight model suitable for edge deployment on clinic hardware (training time: 1 hour, accuracy: 90 percent). Which approach yields a better screening system in six months?


Problem 10 (Ch. 3, §Data Collection)

A rural clinic captures retinal images for DR screening. Can the clinic upload all images to the cloud for processing, or must it process them locally on edge hardware?


Problem 11 (Ch. 3, §Deployment and Integration)

A production model processes about 760,000 billable screening images per month across 500 clinics, assuming one processed image per patient after local selection and quality checks. Should the deployment use Cloud inference (AWS Lambda) or Edge inference on an on-premise server?


Ch. 4

Problem 12 (Ch. 4, §Physics of Data)

A 1 PB training dataset resides in a US East data center, while a Tensor Processing Unit (TPU) pod is available in US West. Is it faster to move the data or to move the compute?


Problem 13 (Ch. 4, §Data Pipeline Architecture)

A pipeline ingests 1M events/second. Which is cheaper, Batch (hourly) or Stream (sub-second) ingestion?


Problem 14 (Ch. 4, §Data Pipeline Architecture)

A team processes 10 TB of raw clickstream data daily and must compute user session features for 3 ML models, each requiring different aggregation windows (1-hour, 24-hour, 7-day). Which is cheaper, ETL or ELT?


Problem 15 (Ch. 4, §Systematic Data Processing)

Mean normalization must be computed across 1 TB of features distributed across 100 nodes. Is it faster to (A) gather all data to one node and compute centrally, or (B) compute local means and aggregate them?


Problem 16 (Ch. 4, §Data Labeling)

A 10M image dataset has a $50K labeling budget. Random sampling achieves 85 percent accuracy with 100K images, while the target is 95 percent accuracy.


Problem 17 (Ch. 4, §Storage Architecture)

What storage configuration keeps an NVIDIA A100 from being data-starved while training ResNet-50?


Ch. 5

Problem 18 (Ch. 5, §Neural Network Fundamentals)

In “Computing with Patterns” we showed that a single forward pass through the 784→128→64→10 network costs 109,184 MACs. What is the memory footprint for this network during training with batch size 32 in 32-bit (4-byte) floating-point precision, and how does it compare to inference requirements?

Solution:


Problem 19 (Ch. 5, §Learning Process)

What is the total arithmetic operation count ($O$) for one forward pass through the MNIST network (784→128→64→10) with batch size 32?

Background: A matrix multiplication of dimensions $(M \times K) \times (K \times N)$ requires $2 \times M \times K \times N$ operations (one multiply and one add per output element, summed over $K$ terms). Bias addition adds $M \times N$ operations. ReLU activation adds $M \times N$ comparisons (counted as operations).

Solution:

| Layer | Operation | Dimensions | Ops |
| --- | --- | --- | --- |
| Layer 1 | MatMul | (32×784) × (784×128) | 2 × 32 × 784 × 128 = 6,422,528 |
| Layer 1 | Bias + ReLU | 32×128 | 2 × 4,096 = 8,192 |
| Layer 2 | MatMul | (32×128) × (128×64) | 2 × 32 × 128 × 64 = 524,288 |
| Layer 2 | Bias + ReLU | 32×64 | 2 × 2,048 = 4,096 |
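The rows above can be reproduced with a short sketch that applies the counting rules from the Background paragraph. It covers only the two hidden layers that appear in the table. The output layer is omitted because its row is not shown here. The function names are illustrative.

```python
BATCH = 32


def matmul_ops(m: int, k: int, n: int) -> int:
    """(M x K) @ (K x N): one multiply and one add per term, 2*M*K*N in total."""
    return 2 * m * k * n


def bias_relu_ops(m: int, n: int) -> int:
    """Bias addition (M*N) plus ReLU comparisons (M*N)."""
    return 2 * m * n


# (fan_in, fan_out) for the two hidden layers listed in the table.
HIDDEN_LAYERS = [(784, 128), (128, 64)]

ops = {}
for index, (fan_in, fan_out) in enumerate(HIDDEN_LAYERS, start=1):
    ops[f"Layer {index} MatMul"] = matmul_ops(BATCH, fan_in, fan_out)
    ops[f"Layer {index} Bias + ReLU"] = bias_relu_ops(BATCH, fan_out)

hidden_total = sum(ops.values())

for name, count in ops.items():
    print(f"{name:<22} {count:>12,}")
print(f"{'Hidden-layer subtotal':<22} {hidden_total:>12,}")
```

The matrix multiplications dominate: the bias and activation terms contribute well under one percent of the subtotal.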

Ch. 6

Problem 20 (Ch. 6, §Attention: Dynamic Processing)

How much memory does the attention matrix of a single layer require at sequence length N = 100,000 (context window)?


Problem 21 (Ch. 6, §Sparse Architectures: RecSys)

Consider a recommendation system for a store with 100 Million items using an embedding size of 128. How much memory does the item table alone require?


Problem 22 (Ch. 6, §Architecture Selection Framework)

A real-time application must process 30 frames per second of video with ResNet-50. What sustained compute throughput is required?


Ch. 7

Problem 23 (Ch. 7, §Execution Problem)

When does Python overhead kill performance?

Scenario one: Small multilayer perceptron (MLP) (Overhead Bound)

  • Compute: 6 small matrix/element-wise operations.

  • Hardware time: $T_{\text{hw}} \approx 2.6$ μs (mostly memory latency).

  • Software overhead: $T_{\text{sw}} \approx 6$ ops × 5.0 μs/op = 30 μs.

  • Ratio: 30/2.6 ≈ 11.5.

  • Conclusion: The system spends 92 percent of its time in host-side dispatch and kernel-launch overhead. Eliminating that overhead through compilation yields roughly a 13× speedup ((30 + 2.6)/2.6 ≈ 12.5).

Scenario two: GPT-3 Layer (Compute Bound)

  • Compute: Huge matrix multiplications.
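The scenario-one figures can be checked with a few lines of arithmetic. This is a sketch under the stated numbers (6 operations, 5.0 μs of dispatch per operation, 2.6 μs of hardware time). It assumes hardware time and dispatch overhead do not overlap, and that compilation removes all per-operation dispatch cost, which is an idealization.

```python
NUM_OPS = 6
DISPATCH_US_PER_OP = 5.0  # host-side dispatch and kernel-launch cost per op
HARDWARE_US = 2.6         # time the device spends on the actual work

software_us = NUM_OPS * DISPATCH_US_PER_OP
total_us = software_us + HARDWARE_US  # assumes no overlap between the two

overhead_ratio = software_us / HARDWARE_US
overhead_fraction = software_us / total_us
ideal_speedup = total_us / HARDWARE_US  # upper bound if all dispatch cost vanished

print(f"Software overhead: {software_us:.1f} us")
print(f"Overhead / hardware ratio: {overhead_ratio:.1f}")
print(f"Share of time in overhead: {overhead_fraction:.0%}")
print(f"Upper-bound speedup from removing overhead: {ideal_speedup:.1f}x")
```

When hardware time per operation grows, as in the compute-bound scenario, the same fixed dispatch cost becomes a negligible share of the total.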


Problem 24 (Ch. 7, §Abstraction Problem)

Why does GPU utilization drop when training small models?

The math (the hidden tax):

  • Model weights: 2 GB.

  • Gradients: 2 GB (same size as weights).

  • Optimizer states (Adam): 8 GB (4× the FP16 weight memory, for momentum and velocity stored in FP32).

  • Activations: For a batch size of 32 and a 100-layer network, the framework must store every intermediate layer output for the backward pass.

$$\text{Activations} \approx \text{Batch} \times \text{Layers} \times \text{Width}^{2} \times 2 \text{ bytes}$$

For a 1024-width model: $32 \times 100 \times 1024^{2} \times 2 \approx \mathbf{6.7 \text{ GB}}$. (Each layer’s activation is a $\text{Width} \times \text{Width}$ matrix per sample, appropriate for transformer-style models where intermediate projections scale with hidden dimension squared.)
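The memory accounting above can be tallied in a short sketch. The parameter count is inferred from the stated 2 GB of FP16 weights (an assumption: 2 bytes per parameter, decimal gigabytes), and the variable names are illustrative.

```python
GB = 1e9  # decimal gigabytes

WEIGHT_BYTES = 2 * GB            # FP16 weights, as stated
NUM_PARAMS = WEIGHT_BYTES / 2    # inferred: 2 bytes per FP16 parameter
GRADIENT_BYTES = WEIGHT_BYTES    # same size as the weights

# Adam keeps two FP32 states (momentum and velocity) per parameter.
ADAM_BYTES = NUM_PARAMS * 2 * 4

# Activation estimate from the formula: Batch x Layers x Width^2 x 2 bytes.
BATCH, LAYERS, WIDTH = 32, 100, 1024
ACTIVATION_BYTES = BATCH * LAYERS * WIDTH**2 * 2

TOTAL_GB = (WEIGHT_BYTES + GRADIENT_BYTES + ADAM_BYTES + ACTIVATION_BYTES) / GB

print(f"Optimizer states: {ADAM_BYTES / GB:.1f} GB")
print(f"Activations:      {ACTIVATION_BYTES / GB:.1f} GB")
print(f"Training total:   {TOTAL_GB:.1f} GB vs {WEIGHT_BYTES / GB:.1f} GB of weights")
```

Under these figures the training footprint is several times the weight memory, which is the "hidden tax" the list describes.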


Ch. 8

Problem 25 (Ch. 8, §Pipeline Architecture)

A team is training a large model on eight GPUs. Is the network the bottleneck?


Problem 26 (Ch. 8, §Pipeline Architecture)

Will a 7 B parameter model fit on a 24 GB GPU for training?


Problem 27 (Ch. 8, §Pipeline Architecture)

Is it cheaper to rent an H100 or buy it for training Llama-2-70B?


Ch. 9

Problem 28 (Ch. 9, §Data Selection Fundamentals)

Compute scales exponentially. Data does not (Table 9.1).

Consequence: Compute budgets now support training runs that far exceed what available high-quality data can fill. The field has become compute-rich and data-poor.

The compute-data asymmetry inverts the optimization priority. When data was abundant and compute was scarce, the right strategy was algorithmic efficiency: squeeze more accuracy from limited GPU cycles. Now that compute is abundant and quality data is scarce, the winning strategy is data selection: squeeze more learning from each sample. Data selection operates upstream of all other optimizations. By pruning redundancy and selecting high-value samples, we reduce the workload before it ever enters the model or hits the hardware, directly shrinking the total operations ($O$) term in the iron law (see the following callout “Data selection and the iron law” for a detailed analysis). For companies training frontier models, the bottleneck has shifted from GPU access to the quality and diversity of their training corpora.

The engineering toolkit for intelligent data selection follows Part III’s D·A·M taxonomy in Chapter 17, which establishes a deliberate optimization ordering: Data first, then Algorithm, then Machine. Data selection puts the “highest leverage first” principle into practice by addressing whether work is necessary before asking how to simplify or accelerate it. A three-stage optimization pipeline structures the practical response to the data wall:

  • Static pruning: Removing low-value samples before training begins (coresets, deduplication).
  • Dynamic selection: Selecting high-value samples during training (curriculum learning, active learning).
  • Synthetic generation: Creating high-value samples on demand (augmentation, distillation).

Each stage increases the information density of the data that reaches the model, and together they form a complementary toolkit: pruning reduces what the pipeline contains, selection focuses how the pipeline uses it, and synthesis expands what the pipeline can access. Before examining these techniques, we must formalize what “data selection” means, why it is inherently a systems problem, and how to measure its effectiveness.

Problem 29 (Ch. 9, §Dynamic Selection)

A hospital team is building a medical diagnostic AI. The pool contains 1 Million unlabeled scans. A specialist doctor charges $5.00 to label one scan. The team has a budget of $500,000 and a deadline of 1 month.

Scenario A: Naive Labeling

  • Cost: Labeling all 1M scans would cost $5,000,000 (10× over budget).

  • Time: The budget only covers labeling 100,000 random scans.

  • Result: The model misses rare pathologies because they were not in the random 10 percent.

Scenario B: Active Learning

  • Strategy: Use uncertainty-based selection to pick the 50,000 “hardest” scans for the doctor to label.

  • Cost: 50,000 × $5.00 = $250,000 (50 percent under budget).

  • Training speed: With 20× less data than the full 1M-scan pool, each training epoch is 20× faster.

  • Result: Empirical studies suggest that these 50,000 “high-information” samples often achieve higher accuracy than 100,000 random samples.

System implication: Relative to labeling and training on the full pool, data selection functions as a 20× compute accelerator and a $4.75 million cost-saving measure, delivering gains that compound with every training iteration.
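The cost and speed figures in the two scenarios follow from simple arithmetic, sketched below with the numbers from the problem statement. The variable names are illustrative, and the epoch speedup assumes epoch time scales linearly with dataset size.

```python
POOL_SIZE = 1_000_000         # unlabeled scans
COST_PER_LABEL_USD = 5        # specialist labeling cost per scan
BUDGET_USD = 500_000
ACTIVE_LABELS = 50_000        # scans chosen by uncertainty-based selection

# Scenario A: label everything, or as many random scans as the budget allows.
full_pool_cost = POOL_SIZE * COST_PER_LABEL_USD
affordable_random_labels = BUDGET_USD // COST_PER_LABEL_USD

# Scenario B: label only the selected scans.
active_cost = ACTIVE_LABELS * COST_PER_LABEL_USD
savings_vs_full_pool = full_pool_cost - active_cost

# Assumes epoch time is proportional to the number of training samples.
epoch_speedup_vs_full_pool = POOL_SIZE / ACTIVE_LABELS

print(f"Full-pool labeling cost:  ${full_pool_cost:,}")
print(f"Random labels in budget:  {affordable_random_labels:,}")
print(f"Active-learning cost:     ${active_cost:,}")
print(f"Savings vs. full pool:    ${savings_vs_full_pool:,}")
print(f"Epoch speedup vs. pool:   {epoch_speedup_vs_full_pool:.0f}x")
```

The 20× and $4.75 million figures are both measured against the full 1M-scan pool. Against the 100,000 random scans the budget actually covers, the same selection halves the labeling cost and the per-epoch data volume.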

Compare the two curves in Figure 9.6: active learning shifts the learning curve to the left, achieving target accuracy with far fewer samples than random selection. The curves are illustrative to highlight the qualitative gap.

Figure 9.6: The Active Learning Multiplier: Model accuracy vs. number of labeled samples (log scale). Random sampling (gray dashed) yields linear improvements, often requiring massive datasets to capture rare edge cases. Active learning (green solid) targets informative samples, reaching the same accuracy with fewer labels. Curves are illustrative to show the qualitative advantage.

Active learning yields more than cost savings: it directs the model toward precisely the examples that matter most. The Smart Doorbell Lighthouse illustrates this principle in the context of hard negative mining.

The “Hard Negative” Problem: Our Smart Doorbell faces a classic data selection challenge. The vast majority of its video feed is empty (easy negatives) or clearly people (easy positives). The model fails on the 0.01 percent of “Hard Negatives”: statues, posters of people, or laundry piles that cast human-like shadows.

Random sampling will miss these rare failures. Instead, the Wake Vision team uses active learning to specifically query the Oracle (human reviewers) on low-confidence predictions. If the model sees a “statue” and predicts “Person (51 percent)”, that sample is flagged for labeling. This turns the feedback loop from a random walk into a guided search for the decision boundary, reducing the data required to solve the “statue problem” by orders of magnitude compared to random collection.

Semi-supervised learning: Using unlabeled data

Consider a medical imaging dataset: a hospital has 50,000 chest X-rays, but only 500 have been reviewed and labeled by radiologists, a labeling rate of 1 percent. Training a supervised model on 500 examples yields poor accuracy, but the structural patterns in the remaining 49,500 unlabeled images contain information about what healthy and abnormal lungs look like. Semi-supervised learning exploits this abundant unlabeled data to improve the model trained on the scarce labeled examples.

Active learning optimizes which samples to label but still requires human annotation for every selected example. Semi-supervised learning takes a more aggressive approach: rather than asking which samples to label, it asks whether we can extract learning signal from unlabeled data directly. It uses a small set of labeled examples to guide learning on a much larger unlabeled pool, typically achieving 80–95 percent of fully supervised accuracy with only 10–20 percent of the labels.

The core insight behind semi-supervised learning is that unlabeled data, while it cannot directly teach the mapping from inputs to outputs, contains structural information about the input distribution P(X)P(X) that constrains the hypothesis space. A decision boundary that cuts through dense regions of P(X)P(X) is unlikely to generalize well because it would assign different labels to similar inputs. Semi-supervised methods use unlabeled data to push decision boundaries toward low-density regions, where class transitions are more likely to occur naturally.

Three main techniques implement this insight. Pseudo-labeling takes the most direct approach: train on labeled data, use the model to generate “pseudo-labels” for high-confidence unlabeled predictions, then retrain on both. The confidence threshold is critical: setting it too low introduces label noise that degrades learning, while setting it too high wastes potentially useful data.

Consistency regularization takes a different angle by enforcing that the model produces similar predictions for augmented versions of the same input. A robust classifier should be invariant to realistic perturbations like cropping, rotation, or color shifts. Methods like FixMatch combine both approaches, assigning pseudo-labels only to samples where the unaugmented prediction is confident but training the model to predict these labels on strongly augmented versions of the same images.
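The confidence-gated selection step at the heart of pseudo-labeling can be sketched in plain Python. This is a simplified illustration, not the FixMatch implementation: the function name, the toy probability vectors, and the use of lists rather than tensors are all assumptions. The 0.95 default matches the FixMatch confidence threshold quoted in this section.

```python
from typing import List, Tuple


def select_pseudo_labels(
    probabilities: List[List[float]], threshold: float = 0.95
) -> List[Tuple[int, int]]:
    """Return (sample_index, predicted_class) for confident predictions only."""
    selected = []
    for index, class_probs in enumerate(probabilities):
        confidence = max(class_probs)
        if confidence >= threshold:
            selected.append((index, class_probs.index(confidence)))
    return selected


# Toy model outputs for three unlabeled samples (hypothetical values).
unlabeled_probs = [
    [0.97, 0.02, 0.01],  # confident: kept as class 0
    [0.51, 0.40, 0.09],  # uncertain: discarded
    [0.03, 0.96, 0.01],  # confident: kept as class 1
]

pseudo_labels = select_pseudo_labels(unlabeled_probs)
print(pseudo_labels)
```

Lowering `threshold` admits more samples at the cost of noisier labels, which is the trade-off the pseudo-labeling paragraph describes.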

Label propagation offers a third paradigm through graph-based reasoning: construct a similarity graph over all samples and propagate labels from labeled nodes to their neighbors. This approach works particularly well when the feature space exhibits clear cluster structure.

The systems trade-off in semi-supervised learning is straightforward: it typically achieves the same accuracy as fully supervised training with 5–10× fewer labels but requires more compute because training processes both labeled and unlabeled samples. Since labeling costs often dominate compute costs in production settings, this trade-off is usually favorable. The results of FixMatch on CIFAR-10 illustrate this label efficiency concretely.

FixMatch (Sohn et al. 2020) combines pseudo-labeling with consistency regularization to achieve high label efficiency (Table 9.6).

Table 9.6: FixMatch Label Efficiency on CIFAR-10: With 250 labels (0.5 percent of the dataset), FixMatch achieves within 1.2 points of full supervision, demonstrating 200× label efficiency.

| Label Budget | Method | Accuracy | Label Efficiency |
| --- | --- | --- | --- |
| 50,000 (100 percent) | Fully Supervised | 96.1% | Baseline |
| 4,000 (8 percent) | FixMatch | 95.7% | 12.5× more efficient |
| 250 (0.5 percent) | FixMatch | 94.9% | 200× more efficient |
| 40 (0.08 percent) | FixMatch | 88.6% | 1,250× more efficient |

With only 250 labeled samples (twenty-five per class), FixMatch achieves 94.9 percent accuracy, within 1.2 points of full supervision using 200× fewer labels. The technique works by generating pseudo-labels on weakly augmented unlabeled images (only when model confidence exceeds 0.95), then training to predict these labels on strongly augmented versions of the same images.
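The label-efficiency column in Table 9.6 is the ratio of the full label count to the reduced label budget. The sketch below recomputes it, along with the accuracy gap to full supervision, from the table's own numbers. The helper names are illustrative.

```python
FULL_LABELS = 50_000
FULL_ACCURACY = 96.1  # percent, fully supervised baseline

# Label budget -> FixMatch accuracy (percent), as listed in Table 9.6.
FIXMATCH_RESULTS = {4_000: 95.7, 250: 94.9, 40: 88.6}


def label_efficiency(label_budget: int) -> float:
    """How many times fewer labels are used than full supervision."""
    return FULL_LABELS / label_budget


def accuracy_gap(accuracy: float) -> float:
    """Percentage points below the fully supervised baseline."""
    return round(FULL_ACCURACY - accuracy, 1)


for budget, accuracy in FIXMATCH_RESULTS.items():
    print(f"{budget:>6,} labels: {label_efficiency(budget):>7.1f}x fewer labels, "
          f"{accuracy_gap(accuracy):.1f} points below baseline")
```

Reading the two columns together shows where the efficiency starts to cost accuracy: the gap stays near one point down to 250 labels and widens sharply at 40.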


Problem 30 (Ch. 9, §Selection Engineering)

An active learning system selects the best 10 percent of samples for training, and the selection algorithm requires running the full model on the unlabeled pool. Is this active-learning loop more efficient than training on the full dataset?


Ch. 10

Problem 31 (Ch. 10, §Optimization Framework)

A deployment scenario calls for running a 7 B parameter LLM on a device with 16 GB RAM. The weights are FP16 (2 bytes).


Problem 32 (Ch. 10, §Quantization and Precision)

A compute-bound matrix multiplication (for example, in a transformer multilayer perceptron (MLP) block) switches from FP16 to INT8. What is the expected speedup?


Ch. 11

Problem 33 (Ch. 11, §AI Memory Systems)

Why is on-chip SRAM necessary instead of fetching all data from HBM?


Problem 34 (Ch. 11, §Roofline Model)

What is the maximum possible utilization of an NVIDIA A100 when running GPT-2 inference (batch size 1)?


Problem 35 (Ch. 11, §Hardware Sustainability)

Should an inference fleet run on generic CPUs or invest in specialized NPUs (Neural Processing Units)?


Ch. 12

Problem 36 (Ch. 12, §System Benchmarking Suites)

An image classifier currently has 95 percent accuracy. A “compressed” version is deployed and its accuracy measured on a 1,000-image test set, yielding 94 percent. Did the optimization cause a real regression, or is it noise?


Problem 37 (Ch. 12, §System Benchmarking Suites)

BERT-Base must be deployed for inference on an A100 GPU. Management expects high GPU utilization. What performance should we predict, and how can we improve it?


Problem 38 (Ch. 12, §Benchmark Components)

A vendor claims “Our system achieves 10,000 images/second on ResNet-50.” Should this number be trusted for deployment planning?

Critical Questions:

  • What batch size? Batch 256 achieves high throughput but 256 ms latency; batch 1 achieves low latency but lower throughput.
  • What precision? INT8 is 2–4× faster than FP32 but may have accuracy implications.
  • What is included? Pure inference, or including preprocessing?
  • What accuracy? Matching the original 76.1 percent Top-1, or degraded?

A Complete Specification: “10,000 images/second on ResNet-50 at batch size 32, INT8 precision, 76.0 percent Top-1 accuracy, including JPEG decoding, on NVIDIA H100 at 700 W TDP.”

Problem 39 (Ch. 12, §Training Benchmarks)

A team trains ResNet-50 on ImageNet. Single-GPU training takes 24 hours. With 8 GPUs, training takes 4 hours. Is this good scaling? Where did the efficiency go?


Ch. 13

Problem 40 (Ch. 13, §Throughput Optimization)

A ResNet-50 model is served at batch size $B = 1$, leaving the GPU mostly idle (15 percent utilization). The goal is to increase throughput to reduce cost, subject to a 20 ms latency budget.


Ch. 14

Problem 41 (Ch. 14, §Technical Debt)

Why build automated pipelines when manual retraining is faster?


Problem 42 (Ch. 14, §Development Infrastructure)

Data scientists computed features in Spark for training, while engineers reimplemented the same logic in Java for serving. Feature definitions diverged, contributing to a significant percentage of production incidents.

Solution: Michelangelo’s feature store computes features once and serves them to both training (via Hive) and production (via Cassandra). Feature definitions are written in a DSL, which automatically generates both batch and online implementations.

Key Design Decisions:

  • Point-in-time correctness for historical features prevents data leakage
  • Feature versioning enables safe iteration without breaking dependent models
  • Centralized feature catalog enables discovery and reuse across 5,000+ features

Results: Significant reduction in feature engineering time, near-elimination of skew-related incidents, and standardized feature quality across 100+ ML teams.

Reference: (Hermann and Del Balso 2017)


Problem 43 (Ch. 14, §Development Infrastructure)

Is building an automated drift detection system worth the engineering effort?

Scenario: Consider a product recommendation engine generating $50M/year in revenue.

Failure: A deployment bug causes training-serving skew, dropping recommendation quality by 5 percent. This degrades conversion rate proportionally.

Cost Analysis:

  • Manual ops (monthly review):
    - Detection time: ~4 weeks (28 days).
    - Revenue loss: USD 50M × 0.05 × 28/365 ≈ USD 191,781.

  • Automated MLOps (daily checks):
    - Detection time: 1 day.
    - Revenue loss: USD 50M × 0.05 × 1/365 ≈ USD 6,849.
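The cost analysis can be expressed as a single function of detection time. This sketch uses the scenario's figures and assumes revenue accrues evenly across the year and that the quality drop translates one-for-one into lost revenue, as the scenario states. The names are illustrative.

```python
ANNUAL_REVENUE_USD = 50_000_000
QUALITY_DROP = 0.05  # 5 percent degradation, assumed proportional to revenue


def revenue_loss(detection_days: float) -> float:
    """Revenue lost while the skew goes undetected, assuming even daily revenue."""
    return ANNUAL_REVENUE_USD * QUALITY_DROP * detection_days / 365


manual_loss = revenue_loss(28)    # monthly review: about 4 weeks to detect
automated_loss = revenue_loss(1)  # daily checks: 1 day to detect
avoided_loss = manual_loss - automated_loss

print(f"Manual ops loss:      USD {manual_loss:,.0f}")
print(f"Automated MLOps loss: USD {automated_loss:,.0f}")
print(f"Avoided per incident: USD {avoided_loss:,.0f}")
```

The avoided loss is per incident, so whether the detection system pays for itself depends on how often such incidents occur and what the system costs to build, neither of which is given here.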


Problem 44 (Ch. 14, §Development Infrastructure)

How often should the team retrain the model to maximize profit?


Problem 45 (Ch. 14, §Production Operations)

A service has a 100 ms P99 SLO, and the model inference budget is 45 ms. Where must the remaining 55 ms go?

| Component | Typical % | P99 Contribution | Optimization Lever |
| --- | --- | --- | --- |
| Network RTT | 10–25 percent | 15–45 ms | Edge deployment, connection pooling |
| Feature retrieval | 15–35 percent | 25–65 ms | Feature caching, precomputation |
| Request parsing | 3–8 percent | 5–15 ms | Binary protocols (gRPC), schema optimization |
| Model inference | 25–45 percent | 45–80 ms | Quantization, batching, model distillation |
| Postprocessing | 5–12 percent | 10–20 ms | Async processing, result caching |
| Response serialization | 3–8 percent | 5–15 ms | Efficient formats (Protobuf, MessagePack) |

Engineering insight: Model optimization alone often captures less than 50 percent of the latency opportunity. A model that runs 2× faster provides only 1.3× end-to-end improvement if inference is 45 percent of total latency.
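The 1.3× figure is an application of Amdahl's law: only the inference share of the request gets faster, while the rest is unchanged. A minimal sketch, with illustrative function and variable names:

```python
def end_to_end_speedup(optimized_fraction: float, component_speedup: float) -> float:
    """Amdahl's law: overall speedup when only one fraction of the work is accelerated."""
    remaining = 1.0 - optimized_fraction
    return 1.0 / (remaining + optimized_fraction / component_speedup)


INFERENCE_SHARE = 0.45  # inference as a fraction of total latency, from the text

speedup_2x = end_to_end_speedup(INFERENCE_SHARE, 2.0)
# Limit as the model becomes arbitrarily fast: the other 55 percent still remains.
speedup_limit = 1.0 / (1.0 - INFERENCE_SHARE)

print(f"2x faster model:  {speedup_2x:.2f}x end to end")
print(f"Upper bound:      {speedup_limit:.2f}x end to end")
```

Even an infinitely fast model cannot push the end-to-end gain past the upper bound, which is why the other rows of the latency table matter.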


Problem 46 (Ch. 14, §Production Operations)

A model has 95 percent baseline accuracy. The goal is to detect a 5 percent drop (to 90 percent) with 95 percent statistical confidence. The system handles 1 request per second (1 QPS). How long will it take to “prove” the model has drifted?


Ch. 15

Problem 47 (Ch. 15, §Engineering Responsibility Gap)

A model optimizes a proxy metric (Clicks) because the true metric (User Satisfaction) is unobservable. How much can they diverge?


Problem 48 (Ch. 15, §Responsible Engineering Checklist)

An engineering team needs to verify that a FaceID model works for a minority group representing 1 percent of the user base. A worst-case binomial margin of error near 1 percentage point at 95 percent confidence requires roughly 10,000 images for this group.

Random Sampling: To get 10,000 images of a 1 percent group via random sampling, the team must collect and label: $N_{\text{total}} = 10{,}000 / 0.01 = 1{,}000{,}000$ images.

Stratified Sampling: Specifically targeting this group (for example, via active learning or community outreach) requires only: $N_{\text{total}} = 10{,}000$ images.

Insight: Relying on “natural distribution” data for fairness is prohibitively expensive under random sampling. Validating the minority group effectively requires 100× more data than the majority group. Fairness requires intentional data engineering, not just more data.
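The sample-size figures can be reproduced from the standard binomial margin-of-error formula, $n = z^2 p(1-p)/e^2$, evaluated at the worst case $p = 0.5$. The sketch below assumes the usual normal-approximation value $z = 1.96$ for 95 percent confidence. The function names are illustrative.

```python
Z_95 = 1.96  # normal-approximation z-score for 95 percent confidence


def required_samples(margin_of_error: float, p: float = 0.5, z: float = Z_95) -> float:
    """Samples needed for a binomial proportion; p = 0.5 is the worst case."""
    return z**2 * p * (1.0 - p) / margin_of_error**2


def random_sampling_total(group_samples: float, group_share: float) -> float:
    """Total images to collect so that a group of the given share yields enough samples."""
    return group_samples / group_share


n_exact = required_samples(0.01)  # about 9,604, rounded up to 10,000 in the text
n_random = random_sampling_total(10_000, 0.01)
n_stratified = 10_000             # collect only from the target group

print(f"Samples for a 1-point margin: {n_exact:,.0f}")
print(f"Random sampling total:        {n_random:,.0f}")
print(f"Collection overhead:          {n_random / n_stratified:,.0f}x")
```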

For high-stakes applications, the deployment phase should specify where human oversight is required. Human-in-the-loop (HITL) systems route uncertain, high-consequence, or flagged decisions to human reviewers rather than acting autonomously. Effective HITL design must specify four requirements: the review scope (which decisions require human review), the confidence thresholds that trigger escalation, the training reviewers receive, and the mechanisms for monitoring reviewer performance. HITL is not a catch-all solution: human reviewers can rubber-stamp automated decisions, introduce their own biases, or become overwhelmed by alert volume. Effective HITL design requires calibrating the human-machine boundary to the specific application risks and reviewer capabilities (Caliskan et al. 2017).

Context: Uber’s Advanced Technologies Group (ATG) was testing self-driving cars in Arizona. The system was designed with a “safety driver” to take over if the AI failed.


Problem 49 (Ch. 15, §Responsible Engineering Checklist)

Stakeholders demand elimination of a 20 percent True Positive Rate (TPR) disparity in a hiring model. What is the “Price of Fairness” in terms of hiring quality?


Problem 50 (Ch. 15, §Environmental and Cost Awareness)

A foundation model is being trained at the scale of GPT-3, consuming 1,300 Megawatt-hours (MWh) of electricity. What is the environmental impact?


End-of-Book Exercises

Exercise one: Component identification (Ch. A, §Exercises)

A production image classification service runs on an A100 GPU. The nvidia-smi output shows 25 percent GPU utilization, while iotop reveals the disk is saturated at 100 percent. Which D·A·M axis is the bottleneck? What are two specific optimizations you would recommend?


Exercise two: Iron law analysis (Ch. A, §Exercises)

Consider a decoder-only transformer model with 7B parameters generating one token in the autoregressive decode phase at batch size 1. Using the 2 FLOPs-per-parameter rule of thumb, the decode step requires 0.014 TFLOPs. On an H100 GPU (989 TFLOPS peak FP16 Tensor Core), the measured latency is 50 ms. Calculate the achieved utilization ($\eta_{\text{hw}}$). Is this system Data-bound, Algorithm-bound, or Machine-bound? Justify your answer.


Exercise three: Scaling law vs. information roofline (Ch. A, §Exercises)

A team has been training a sentiment analysis model. After scaling from 125M to 1B parameters (8× increase), validation loss improved from 0.45 to 0.42 (6.7 percent improvement). Chinchilla scaling would predict a ~15 percent improvement for this compute increase. What does this discrepancy suggest? Which D·A·M axis should be investigated first, and why?


Exercise four: Anti-pattern detection (Ch. A, §Exercises)

A colleague proposes upgrading from 4× A100 GPUs to 8× H100 GPUs because training is “too slow.” Before approving the $200K hardware purchase, what three diagnostic questions would you ask? Map each question to the D·A·M axis it investigates.