What Does MLPerf Inference Actually Measure?

MLPerf Inference, an industry-standard benchmark suite, quantifies the speed of complete AI systems, including hardware, runtime, and serving stack. It evaluates fixed, pre-trained models under strict latency and accuracy constraints. The benchmark offers two divisions: Closed, which fixes the model and preprocessing for direct comparisons, and Open, which allows model changes but isn’t strictly comparable. Results are reported for Datacenter and Edge suites, with standardized request patterns generated by LoadGen, ensuring architectural neutrality and reproducibility. Availability tags—Available, Preview, RDI (research/development/internal)—indicate whether configurations are shipping or experimental.

The 2025 Update: MLPerf Inference v5.1

The 2025 update, MLPerf Inference v5.1, introduces several changes. It adds three modern workloads: DeepSeek-R1 (the first reasoning benchmark), Llama-3.1-8B (summarization, replacing GPT-J), and Whisper Large V3 (Automatic Speech Recognition, ASR). This round saw 27 submitters, including first-time appearances of AMD Instinct MI355X, Intel Arc Pro B60 48GB Turbo, NVIDIA GB300, RTX 4000 Ada-PCIe-20GB, and RTX Pro 6000 Blackwell Server Edition. Interactive serving scenarios, which capture agent/chat workloads, were expanded to include tight TTFT (time-to-first-token) and TPOT (time-per-output-token) limits.

Scenarios: Mapping to Real Workloads

MLPerf Inference defines four serving patterns that map to real-world workloads:

1. Offline: Maximize throughput with no latency bound, dominated by batching and scheduling.
2. Server: Poisson arrivals with p99 latency bounds, closest to chat/agent backends.
3. Single-Stream (Edge emphasis): Strict per-stream tail latency, one query in flight at a time.
4. Multi-Stream (Edge emphasis): Stresses concurrency at fixed inter-arrival intervals.

Each scenario has a defined metric, such as maximum Poisson throughput for Server or raw throughput for Offline.
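
To make the Server pattern concrete, here is a minimal Python sketch, not the official MLCommons LoadGen harness, that issues queries with Poisson arrivals into a single-server queue and checks the p99 latency against a bound; the arrival rate, service times, and bound are placeholders.

```python
# Illustrative sketch only (not LoadGen): Poisson arrivals into a single-server
# queue, then a p99 latency check -- the shape of the Server-scenario constraint.
import random


def simulate_server_run(target_qps, num_queries, latency_bound_s, seed=0):
    rng = random.Random(seed)
    arrival_s = 0.0
    server_free_at_s = 0.0
    latencies = []
    for _ in range(num_queries):
        # Poisson arrivals: exponentially distributed inter-arrival times.
        arrival_s += rng.expovariate(target_qps)
        # Placeholder inference time; swap in a real system under test.
        service_s = rng.uniform(0.005, 0.015)
        start_s = max(arrival_s, server_free_at_s)
        server_free_at_s = start_s + service_s
        latencies.append(server_free_at_s - arrival_s)  # queueing + service
    latencies.sort()
    p99 = latencies[int(0.99 * (len(latencies) - 1))]
    return p99, p99 <= latency_bound_s


p99, ok = simulate_server_run(target_qps=20, num_queries=20_000,
                              latency_bound_s=0.050)
print(f"p99 latency = {p99 * 1000:.1f} ms, meets bound: {ok}")
```

Raising target_qps eventually pushes the p99 past the bound, which is why the Server metric is the maximum Poisson arrival rate the system can sustain within its latency limit.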

Latency Metrics for Large Language Models (LLMs)

v5.1 introduces stricter interactive limits for LLMs. For instance, in the Interactive scenario Llama-2-70B must meet a p99 TTFT of 450 ms and a p99 TPOT of 40 ms, while the long-context Llama-3.1-405B has higher bounds due to its size and context length.
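
As a rough illustration of how these metrics are derived, the sketch below computes TTFT and TPOT from per-token timestamps and checks their p99 against the Llama-2-70B Interactive limits; the timestamps and helper names are hypothetical, not part of any official harness.

```python
def ttft_tpot(request_start_s, token_times_s):
    """TTFT and TPOT from per-request token timestamps (seconds)."""
    ttft = token_times_s[0] - request_start_s          # time to first token
    # Time per output token over the decode phase (after the first token).
    tpot = (token_times_s[-1] - token_times_s[0]) / max(len(token_times_s) - 1, 1)
    return ttft, tpot


def p99(values):
    ordered = sorted(values)
    return ordered[int(0.99 * (len(ordered) - 1))]


requests = [  # (request start, token timestamps) -- made-up measurements
    (0.0, [0.30, 0.33, 0.36, 0.39]),
    (0.0, [0.42, 0.46, 0.50, 0.54]),
    (0.0, [0.25, 0.28, 0.31, 0.34]),
]
ttfts, tpots = zip(*(ttft_tpot(start, toks) for start, toks in requests))
print(f"p99 TTFT = {p99(ttfts) * 1000:.0f} ms (limit 450 ms), "
      f"p99 TPOT = {p99(tpots) * 1000:.0f} ms (limit 40 ms)")
```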

The 2025 Datacenter Menu: Closed Division Targets

Key v5.1 entries and their quality/latency gates in the Closed division include:

– LLM Q&A: Llama-2-70B (OpenOrca) with Conversational (TTFT/TPOT 2000 ms/200 ms), Interactive (450 ms/40 ms), and 99%/99.9% accuracy targets.
– LLM Summarization: Llama-3.1-8B (CNN/DailyMail) with Conversational (2000 ms/100 ms) and Interactive (500 ms/30 ms) limits.
– Reasoning: DeepSeek-R1 with TTFT 2000 ms / TPOT 80 ms and a 99% exact-match baseline.
– ASR: Whisper Large V3 (LibriSpeech) with WER-based quality for datacenter and edge.
– Long-context: Llama-3.1-405B with TTFT 6000 ms and TPOT 175 ms.
– Image: SDXL 1.0 with FID/CLIP ranges and a 20 s Server constraint.
– Legacy CV/NLP models (ResNet-50, RetinaNet, BERT-L, DLRM, 3D-UNet) remain for continuity.
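
For planning purposes, the LLM latency gates listed above can be kept as a small lookup so you can check whether a production SLA maps onto a benchmark configuration. The structure below is an illustrative convenience built from this list, not an official MLCommons artifact.

```python
# p99 (TTFT ms, TPOT ms) gates from the v5.1 Closed-division list above.
LLM_GATES_MS = {
    "llama-2-70b":    {"conversational": (2000, 200), "interactive": (450, 40)},
    "llama-3.1-8b":   {"conversational": (2000, 100), "interactive": (500, 30)},
    "deepseek-r1":    {"default": (2000, 80)},
    "llama-3.1-405b": {"default": (6000, 175)},
}


def gate_covers_sla(model, scenario, sla_ttft_ms, sla_tpot_ms):
    """True if the benchmark gate is at least as strict as your production SLA,
    so results under that gate are meaningful evidence for your use case."""
    gate_ttft, gate_tpot = LLM_GATES_MS[model][scenario]
    return gate_ttft <= sla_ttft_ms and gate_tpot <= sla_tpot_ms


# Example: a chat product that needs first token under 500 ms, 50 ms per token.
print(gate_covers_sla("llama-2-70b", "interactive",
                      sla_ttft_ms=500, sla_tpot_ms=50))  # True
```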

Power Results: Reading Energy Claims

MLPerf Power, an optional part of the benchmark, reports system wall-plug energy for the same runs. Only measured runs are valid for energy efficiency comparisons. v5.1 includes datacenter and edge power submissions, encouraging broader participation.
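
A simple way to read a Power submission alongside its performance run is to convert measured wall-plug power and scenario throughput into energy per query. The figures in this sketch are placeholders, not actual v5.1 submissions.

```python
def energy_per_query_j(avg_system_power_w: float, throughput_qps: float) -> float:
    """Joules per query at steady state: average wall-plug power (W) divided by
    sustained throughput (queries/s). Lower is better."""
    return avg_system_power_w / throughput_qps


# Placeholder example: a system drawing 6.5 kW while sustaining 5,000 queries/s.
print(f"{energy_per_query_j(6500.0, 5000.0):.2f} J per query")
```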

Reading the Tables: Avoiding Pitfalls

To compare results effectively:

– Compare Closed vs. Closed only; Open runs may use different models/quantization.
– Match accuracy targets (99% vs. 99.9%), as throughput often drops at the stricter quality level.
– Normalize cautiously: MLPerf reports system-level throughput under constraints. Dividing by accelerator count yields a derived “per-chip” number, useful for budgeting sanity checks but not for marketing claims (see the sketch after this list).
– Filter by Availability (prefer Available) and include Power columns when efficiency matters.
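
The normalization caveat can be made explicit with a tiny helper: the derived per-accelerator figure is a budgeting aid only, because the latency and throughput constraints were met by the whole system, not by each chip in isolation. The numbers below are hypothetical.

```python
def per_accelerator_throughput(system_qps: float, accelerator_count: int) -> float:
    """Derived figure only: MLPerf validates the system-level result, not this."""
    return system_qps / accelerator_count


# Hypothetical row: an 8-accelerator node reporting 24,000 queries/s in Server.
print(f"{per_accelerator_throughput(24_000.0, 8):,.0f} queries/s per accelerator (derived)")
```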

Interpreting 2025 Results: GPUs, CPUs, and Other Accelerators

– GPUs (rack-scale to single-node) show up prominently in Server-Interactive and long-context workloads, where scheduler and KV-cache efficiency matter. Rack-scale systems post the highest aggregate throughput.
– CPUs (standalone baselines + host effects) remain useful baselines, highlighting preprocessing and dispatch overheads that can bottleneck accelerators in Server mode. New Xeon 6 results and mixed CPU+GPU stacks appear in v5.1.
– Alternative accelerators increase architectural diversity. Validate cross-system comparisons by holding constant division, model, dataset, scenario, and accuracy.
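
One way to enforce that discipline when pulling rows from the result pages into a spreadsheet is a small comparability check; the field names below are assumptions about how you store the rows, not an official schema, and the example rows are made up.

```python
# Two results are comparable only if all of these fields match.
COMPARABILITY_KEYS = ("division", "model", "dataset", "scenario", "accuracy_target")


def comparable(row_a: dict, row_b: dict) -> bool:
    return all(row_a.get(key) == row_b.get(key) for key in COMPARABILITY_KEYS)


# Made-up rows for illustration only.
a = {"division": "Closed", "model": "llama-2-70b", "dataset": "OpenOrca",
     "scenario": "Server-Interactive", "accuracy_target": "99%", "qps": 21_000}
b = {"division": "Closed", "model": "llama-2-70b", "dataset": "OpenOrca",
     "scenario": "Server-Interactive", "accuracy_target": "99.9%", "qps": 18_000}
print(comparable(a, b))  # False: accuracy targets differ, so don't compare qps.
```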

Practical Selection Playbook

– Interactive chat/agents → Server-Interactive on Llama-2-70B/Llama-3.1-8B/DeepSeek-R1, matching latency & accuracy and scrutinizing p99 TTFT/TPOT.
– Batch summarization/ETL → Offline on Llama-3.1-8B; throughput per rack is the cost driver.
– ASR front-ends → Whisper V3 Server with tail-latency bound; memory bandwidth and audio pre/post-processing matter.
– Long-context analytics → Llama-3.1-405B; evaluate if your UX tolerates 6 s TTFT / 175 ms TPOT.
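
The playbook can also be written down as a small selection map to drive result filtering; the entries below simply restate the bullets above, with key names chosen for illustration.

```python
# Use case -> benchmark filter, restating the playbook above.
PLAYBOOK = {
    "interactive chat/agents": {
        "models": ["Llama-2-70B", "Llama-3.1-8B", "DeepSeek-R1"],
        "scenario": "Server-Interactive",
        "scrutinize": "p99 TTFT/TPOT and accuracy target",
    },
    "batch summarization/ETL": {
        "models": ["Llama-3.1-8B"],
        "scenario": "Offline",
        "scrutinize": "throughput per rack (cost driver)",
    },
    "ASR front-ends": {
        "models": ["Whisper Large V3"],
        "scenario": "Server",
        "scrutinize": "tail latency, memory bandwidth, audio pre/post-processing",
    },
    "long-context analytics": {
        "models": ["Llama-3.1-405B"],
        "scenario": "Server",
        "scrutinize": "whether 6 s TTFT / 175 ms TPOT fits the UX",
    },
}

print(PLAYBOOK["interactive chat/agents"]["scenario"])  # Server-Interactive
```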

What the 2025 Cycle Signals

– Interactive LLM serving is table-stakes, with tight TTFT/TPOT in v5.x making scheduling, batching, paged attention, and KV-cache management visible in results.
– Reasoning is now benchmarked, with DeepSeek-R1 stressing control-flow and memory traffic differently from next-token generation.
– Broader modality coverage, with Whisper V3 and SDXL exercising pipelines beyond token decoding, surfacing I/O and bandwidth limits.

In summary, MLPerf Inference v5.1 expands coverage with new workloads and broader silicon participation. To make inference comparisons actionable, align on the Closed division, match scenario and accuracy (including LLM TTFT/TPOT limits for interactive serving), and prefer Available systems with measured Power results to reason about efficiency. Procurement teams should filter results to workloads that mirror production SLAs and validate claims directly against the MLCommons result pages and power methodology.
