[Deep Research-3]: Let's talk about large model inference performance, then have o3 do a "fact check"

For the vast majority of enterprises and individuals, local deployment is likely to be a week-long journey "from getting started to giving up." Performance optimization and local data are the two core pillars that matter most.

Today's research report, Large Model Inference Performance, was produced with OpenAI's Deep Research.

I originally asked GPT-4o to translate it straight into Chinese, but the output was unbearable, so I handed the job to Gemini 2.0-Flash (how can you not love the hardworking, diligent Gemini?). However, the WeChat Official Account editor doesn't support Markdown, and I'm not going to fix the formatting line by line; I'll find time to automate the output later.

The actual output format looks like this:

LLM Inference Benchmarks Overview (2024)

Large Language Model (LLM) inference performance varies widely depending on model size and the hardware and software optimizations used. Below, we compare model and hardware categories, discuss concurrency scaling, and examine specific use cases. We highlight recent (2024) benchmark results, including MLPerf submissions and industry reports, and provide insights into the cost-effectiveness of different approaches. All data and claims are backed by reliable sources (academic studies, MLPerf results, and vendor reports).

1. Model Scope: Variety of LLMs and Sizes

Range of Models: Inference benchmarks now cover both proprietary models (like OpenAI’s GPT-4) and open-source models (Meta’s Llama 2, TII’s Falcon, Mistral AI’s Mistral-7B, etc.). These models range from relatively small 7-billion-parameter LLMs up to massive 70B+ or even 100B+ parameter models:

  • Open vs. Closed Models: Open-source LLMs are commonly used in public benchmarks because their weights are available for testing. For example, MLCommons added Llama 2-70B to its 2024 suite. Proprietary models like GPT-4 are not publicly benchmarked, but their scale likely requires high-end GPU clusters.
  • Different Families: Benchmarks include general-purpose chat models, code-specialized models (StarCoder), and others like Qwen (Alibaba).

Model Size vs Performance: Larger models deliver higher accuracy at the cost of slower inference and greater memory use:

  • Small (7B-13B): Can often run on single GPUs with higher token throughput. A 7-8B model typically fits in ~16-24 GB of memory.
  • Medium (13B-30B): Balance quality and speed. A 13B model generates text roughly twice as fast as a 70B model on the same GPU.
  • Large (65B-70B+): Push memory limits. Llama 2-70B in FP16 consumes ~140 GB, requiring sharding across multiple GPUs.
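The memory figures above are simple arithmetic: parameter count times bytes per parameter (2 bytes for FP16). A quick sketch of that math, using my own illustrative helper (not from the report):

```python
def weights_gb(params_b: float, bytes_per_param: float) -> float:
    """Raw weight memory in GB: parameters (in billions) x bytes per parameter.
    KV cache and activations add real overhead on top of this."""
    return params_b * bytes_per_param

print(weights_gb(70, 2))    # Llama 2-70B in FP16 -> 140.0 GB
print(weights_gb(7, 2))     # 7B in FP16 -> 14.0 GB; with overhead, the ~16-24 GB figure
print(weights_gb(70, 0.5))  # 70B at 4-bit -> 35.0 GB, single-GPU territory
```

This is also why quantization matters so much: halving bytes-per-param halves the GPUs you need.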

2. Hardware Scope: GPU vs TPU vs Accelerators vs CPU

Inference performance is highly dependent on hardware:

  • NVIDIA GPUs (A100, H100): H100 (80 GB) is the flagship, offering ~2x higher tokens-per-second on 7B models compared to A100 due to architectural improvements and FP8 support.
  • AMD GPUs (MI300X): Features 192 GB of HBM3 memory on a single package, beneficial for hosting 70B+ models without splitting.
  • Google TPUs (v4, v5e): TPU v5e is cost-optimized, offering up to 2.5x inference performance per dollar compared to TPU v4.
  • AI Accelerators (AWS Inferentia2, Habana Gaudi2): Focus on price-performance. Inferentia2 offers ~44% better cost-per-token than standard GPU instances.
  • High-Performance CPUs: Modern CPUs (Intel Xeon, AMD EPYC) can handle 7B-13B models with 4-bit quantization but are 10x-100x slower than GPUs.
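Claims like Inferentia2's "~44% better cost-per-token" reduce to dividing throughput by instance price. A sketch with purely hypothetical numbers (the rates and prices below are made up for illustration, not vendor benchmarks):

```python
def tokens_per_dollar(tokens_per_sec: float, price_per_hour: float) -> float:
    """Throughput per dollar: tokens generated for each $1 of instance time."""
    return tokens_per_sec * 3600 / price_per_hour

# Hypothetical figures for illustration only -- not measured results:
gpu = tokens_per_dollar(tokens_per_sec=1000, price_per_hour=4.0)
accel = tokens_per_dollar(tokens_per_sec=800, price_per_hour=2.2)
print(f"GPU: {gpu:,.0f} tok/$, accelerator: {accel:,.0f} tok/$ ({accel / gpu - 1:+.0%})")
```

Note that a chip can lose on raw tokens/sec and still win on this metric, which is exactly the pitch of the dedicated accelerators.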

3. Concurrency and Scaling Performance

LLM inference isn't just about single-stream speed. Increasing batch size improves total throughput until saturation but increases latency per request. Efficient engines like vLLM use "continuous batching" to keep GPUs near peak utilization even with many parallel streams.
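To see why continuous batching helps, here is a toy simulation I wrote for illustration (real engines like vLLM schedule at the token level with paged KV caches; this only models the slot-refilling idea):

```python
from collections import deque

def continuous_batching_steps(requests, max_batch):
    """Toy model of continuous batching: each request is a decode length
    in tokens. Finished sequences are swapped out for queued ones every
    step, so the batch stays full instead of padding to its slowest member."""
    queue = deque(requests)
    active = []
    steps = 0
    while queue or active:
        # refill free slots from the waiting queue
        while queue and len(active) < max_batch:
            active.append(queue.popleft())
        # one decode step for everything in the batch
        active = [n - 1 for n in active if n > 1]
        steps += 1
    return steps

# Static batching pads each batch of 2 to its longest request:
# batches (10, 1) and (10, 1) -> 10 + 10 = 20 steps.
print(continuous_batching_steps([10, 1, 10, 1], max_batch=2))  # -> 11 steps
```

The short requests no longer hold a slot hostage, which is where the near-peak utilization comes from.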

4. Use Case Performance Considerations

  • Chatbots: Prioritize low Time-to-First-Token (TTFT) and streaming.
  • Code Generation: Stresses sequence length; benefits from FlashAttention.
  • RAG: Combines vector search with generation. Using a smaller 7B model with RAG can often be faster and more cost-effective than a 70B model without it.
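TTFT is simply the wall-clock gap between sending a request and receiving the first streamed token. A minimal measurement sketch against a stand-in stream (fake_stream is a placeholder I made up, not a real API):

```python
import time

def measure_ttft(stream):
    """Time-to-first-token: wall-clock delay from issuing the request
    to receiving the first streamed token."""
    start = time.perf_counter()
    first = next(stream)
    return first, time.perf_counter() - start

def fake_stream():
    """Stand-in for a real streaming API: slow prefill, then fast decode."""
    time.sleep(0.05)            # simulated prefill latency
    yield "Hello"
    for tok in [",", " world"]:
        time.sleep(0.01)        # simulated per-token decode latency
        yield tok

token, ttft = measure_ttft(fake_stream())
print(f"first token {token!r} after {ttft * 1000:.0f} ms")
```

The same harness generalizes: time the remaining tokens as well and you get the other headline metric, steady-state tokens/sec.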

5. Latest Benchmarks and Trends (2024)

  • Quantization: FP8/INT8/INT4 are now standard for high-speed deployment.
  • Mixture-of-Experts (MoE): Models like Mixtral-8x7B provide 70B-level quality with much higher throughput by activating only a subset of parameters per token.
  • Cost Efficiency: A major shift towards measuring "throughput per dollar" rather than just raw speed.
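Quantization maps FP16/FP32 weights onto small integers plus a scale factor. A toy symmetric INT8 round-trip (my own sketch of the basic idea; production schemes add per-channel scales, calibration, outlier handling, etc.):

```python
def quantize_int8(xs):
    """Symmetric per-tensor INT8 quantization: scale weights so the
    largest magnitude maps to 127, then round to integers."""
    scale = max(abs(x) for x in xs) / 127
    q = [max(-127, min(127, round(x / scale))) for x in xs]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.4, -1.27, 0.05, 0.9]
q, s = quantize_int8(w)
print(q)  # integers, 1 byte each vs 4 for FP32
err = max(abs(a - b) for a, b in zip(dequantize(q, s), w))
print(err)  # rounding error bounded by half a quantization step
```

INT4 pushes the same trade further: 8x smaller than FP32, at the cost of a coarser grid and more rounding error.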

Sources: MLPerf v4.0/4.1, NVIDIA, AMD, AWS, Google Cloud, Argonne National Lab.

You should read the original English version if you are interested.

After reading this, I still wanted more, so I had o3-mini-high use its search function to run a "fact check." The results are as follows:

Below is a detailed analysis of the article "LLM Inference Benchmarking: A Comprehensive Analysis," including a fact-check of its key claims and an evaluation of the references cited.

Fact-Checking Key Claims:

  1. Hardware Variability: The claim that smaller LLMs run on CPUs while larger ones need accelerators is accurate and matches industry practice.
  2. Metrics: The discussion on TTFT and tokens/sec (e.g., ~95 tok/s on RTX 4090) aligns with current benchmarking data.
  3. Frameworks: The comparison between engines like vLLM and TensorRT-LLM reflects current technical consensus regarding throughput versus ease of use.

Evaluation of References: The article cites 23 references, ranging from academic arXiv papers to authoritative industry blogs (Google Cloud, Dell, NVIDIA). The diversity of sources supports a balanced and well-grounded view.

I am not greedy; producing one such report every day is enough.
