DeepSeek Inference Theoretical Model: Deriving Performance from Hardware Primitives

Currently, DeepSeek v3 is the most popular open-source large language model. The DeepSeek team recently introduced substantial inference-time optimizations, making the model surprisingly efficient to serve despite its massive size.
To explore the impact of these optimizations and of the model’s architectural choices, we developed a theoretical model that estimates throughput from specific hardware parameters. Our goal? To offer practical insights for anyone navigating the complex world of inference for large-scale “mixture of experts” (MoE) models.
To share our thoughts and experiences, we compiled a comprehensive report that breaks down the tradeoffs between latency, throughput, and cost across different hardware setups. We show how factors like GPU count and interconnect speed can shift the performance bottleneck between compute, memory bandwidth, and communication.
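To give a taste of how such an estimate can be assembled, here is a minimal roofline-style sketch in Python. It is not the model from the report: both the structure (taking the maximum of compute, weight-read, and communication time per decode step) and every constant in it (peak FLOP/s, bandwidths, the ~37B active parameters per token, the per-token all-to-all payload) are illustrative assumptions.

```python
# A minimal sketch, not the model from the report: a roofline-style estimate of one
# batched decode step from per-GPU hardware primitives. All constants below are
# illustrative assumptions, not measurements.
from dataclasses import dataclass


@dataclass
class Hardware:
    """Per-GPU hardware primitives."""
    flops: float            # peak compute, FLOP/s
    mem_bw: float           # HBM bandwidth, bytes/s
    interconnect_bw: float  # per-GPU all-to-all bandwidth, bytes/s


def decode_step_time(num_gpus: int, hw: Hardware, batch_tokens: int = 128,
                     active_params: float = 37e9,   # ~37B params activated per token
                     bytes_per_param: float = 1.0,  # FP8 weights
                     # rough all-to-all payload: dispatch + combine (x2), 8 routed experts,
                     # hidden dim 7168, ~2 bytes per element (assumed)
                     comm_bytes_per_token: float = 2 * 8 * 7168 * 2):
    """Return (bottleneck_name, seconds) for one batched decode step."""
    # Compute: ~2 FLOPs per active parameter per token, spread across all GPUs.
    t_compute = 2 * active_params * batch_tokens / (hw.flops * num_gpus)
    # Memory: active weights are read from HBM once per step (amortized over the batch),
    # sharded across GPUs -- a simplification that ignores KV-cache and activation traffic.
    t_memory = active_params * bytes_per_param / (hw.mem_bw * num_gpus)
    # Communication: expert-parallel dispatch + combine traffic for every token in the batch.
    t_comm = comm_bytes_per_token * batch_tokens / (hw.interconnect_bw * num_gpus)
    return max(("compute", t_compute), ("memory", t_memory), ("communication", t_comm),
               key=lambda kv: kv[1])


# Sweep batch size and interconnect speed (illustrative H800-class numbers) to watch the
# bottleneck move between memory bandwidth, compute, and communication.
for label, interconnect in (("50 GB/s", 50e9), ("2 GB/s", 2e9)):
    hw = Hardware(flops=1.5e15, mem_bw=3.35e12, interconnect_bw=interconnect)
    for batch in (16, 1024):
        name, t = decode_step_time(num_gpus=16, hw=hw, batch_tokens=batch)
        print(f"{label} interconnect, batch {batch:4d}: {name}-bound, "
              f"step ~{t * 1e3:.2f} ms, ~{batch / t:,.0f} tok/s")
```

With these made-up numbers, small batches come out memory-bandwidth-bound, large batches compute-bound, and a slow interconnect pushes the bottleneck to communication: exactly the kind of shift the report quantifies with real hardware parameters.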
Curious how these tradeoffs play out in real-world scenarios? Download the full report to dive deeper into the data and sharpen your intuition around inference performance.