Alpha-MoE: A Megakernel for Faster Tensor Parallel Inference

Mixture of Experts (MoE) architectures are reshaping the landscape of large language models, offering efficiency gains that dense models can’t match. But these benefits come at a cost: complex communication patterns that make performance optimization a real challenge.

That’s why we built Alpha-MoE, a fused megakernel library designed for FP8 W8A8 precision (8-bit weights, 8-bit activations). By fusing multiple operations into a single persistent kernel, Alpha-MoE delivers up to 200% speedups over the Triton kernels currently used in open-source LLM serving frameworks such as vLLM and SGLang.
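
The full report covers the details, but the core idea behind a "megakernel" can be sketched in a few lines. The CUDA snippet below is a minimal, hypothetical illustration of the persistent-kernel pattern, not Alpha-MoE's actual implementation: a single kernel launch keeps thread blocks resident on the GPU, and each block pulls tiles of work (routing, FP8 expert GEMMs, combine) from a shared queue instead of paying per-operation launch overhead. All names here (WorkItem, Stage, run_*_tile) are placeholders we made up for illustration.

```cuda
// Illustrative sketch only -- NOT Alpha-MoE source code. It shows the general
// "persistent megakernel" pattern: one kernel launch whose thread blocks loop
// over a shared work queue, so routing, expert GEMM tiles, and the combine
// step run back-to-back without separate kernel launches.
#include <cuda_runtime.h>

enum Stage { ROUTE = 0, EXPERT_GEMM = 1, COMBINE = 2 };

struct WorkItem {
    int stage;      // which fused operation this tile belongs to
    int expert_id;  // expert index for EXPERT_GEMM tiles
    int tile_m;     // output tile coordinates
    int tile_n;
};

// Placeholder tile routines; real FP8 W8A8 math and shared-memory staging elided.
__device__ void run_route_tile(const WorkItem& w)       { (void)w; }
__device__ void run_expert_gemm_tile(const WorkItem& w) { (void)w; }
__device__ void run_combine_tile(const WorkItem& w)     { (void)w; }

// Persistent kernel: launched once, with roughly one block per SM. Blocks grab
// work items from a global counter until the queue is drained, so the whole
// MoE layer executes inside a single kernel launch. Cross-stage dependencies
// would need grid-level synchronization in practice; that is elided here.
__global__ void moe_megakernel(const WorkItem* queue, int num_items, int* next_item) {
    __shared__ int item_idx;
    while (true) {
        if (threadIdx.x == 0) {
            item_idx = atomicAdd(next_item, 1);   // block-level work stealing
        }
        __syncthreads();
        int idx = item_idx;
        if (idx >= num_items) return;             // queue drained: block retires

        const WorkItem w = queue[idx];
        switch (w.stage) {                        // dispatch the fused operation
            case ROUTE:       run_route_tile(w);       break;
            case EXPERT_GEMM: run_expert_gemm_tile(w); break;
            case COMBINE:     run_combine_tile(w);     break;
        }
        __syncthreads();                          // safe to reuse item_idx next iteration
    }
}
```

In a real deployment the kernel would be launched once with about one block per SM (queried via cudaDevAttrMultiProcessorCount), and the hard parts are exactly what this sketch leaves out: FP8 tile math, tensor-parallel communication, and correct ordering between stages. Those are the pieces a production megakernel has to get right.
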
Want to understand how this works and what it means for real-world inference performance? Download the full report here to explore the architecture, benchmarks and practical insights behind Alpha-MoE.