DeepSeek & Mixture-of-Experts (MoE)

DeepSeek made headlines this week, shaking up the AI industry and challenging the “bigger is better” narrative that has defined the field. As I explored in my Spectator article, the Chinese startup has done what many thought impossible: delivering performance on par with Western trillion-parameter models for a fraction of the cost. This has raised fundamental questions about the economics of AI and the industry’s reliance on brute-force compute strategies.

But how, exactly, did they do it?

The Core Innovations Powering DeepSeek-R1

DeepSeek’s innovation isn’t rooted in groundbreaking new concepts but in the refinement and reimagining of existing ideas. Its flagship model, DeepSeek-R1, combines three key techniques: Mixture of Experts (MoE), Multi-head Latent Attention (MLA), and FP8 quantisation. Together, these mechanisms allow the model to deliver high-quality performance while drastically reducing compute requirements.

  • Mixture of Experts (MoE): DeepSeek-R1 activates only 37 billion of its 671 billion parameters for any given token, reducing compute by 80%. This approach delegates tasks to specialised “experts” within the model, enabling efficiency without sacrificing quality.

  • Multi-head Latent Attention (MLA): By compressing the attention keys and values into a smaller latent representation, MLA shrinks the model’s key-value cache and speeds up inference (a rough sketch of the savings follows this list).

  • FP8 Quantisation: DeepSeek’s use of 8-bit floating-point precision cuts memory usage by up to 75% compared to traditional FP32, allowing the model to be trained on just 2,048 GPUs while maintaining performance.
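As a back-of-the-envelope illustration of the second and third bullets, the sketch below compares a conventional key-value cache with an MLA-style latent cache, and FP32 with FP8 weight storage. Every dimension here is an assumption chosen for illustration, not DeepSeek’s published configuration.

```python
# Back-of-the-envelope memory arithmetic. All model dimensions below are
# illustrative assumptions, not DeepSeek's published configuration.

num_layers = 60       # transformer layers (assumed)
num_heads  = 128      # attention heads (assumed)
head_dim   = 128      # per-head key/value dimension (assumed)
latent_dim = 512      # MLA-style compressed latent per token (assumed)
seq_len    = 32_768   # tokens held in the cache
bytes_fp16 = 2        # cache stored in 16-bit precision

# Standard multi-head attention: cache full keys AND values for every head.
kv_full = num_layers * seq_len * num_heads * head_dim * 2 * bytes_fp16

# MLA-style caching: store one low-rank latent per token per layer and
# re-project it into keys and values at attention time.
kv_latent = num_layers * seq_len * latent_dim * bytes_fp16

print(f"full KV cache:   {kv_full / 1e9:6.1f} GB")
print(f"latent KV cache: {kv_latent / 1e9:6.1f} GB")
print(f"cache reduction: {1 - kv_latent / kv_full:.1%}")

# FP8 vs FP32 weight storage: 1 byte per parameter instead of 4.
params = 671e9
print(f"FP32 weights: {params * 4 / 1e12:.2f} TB")
print(f"FP8 weights:  {params * 1 / 1e12:.2f} TB (75% smaller)")
```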

The Role of Mixture of Experts (MoE)

Among these techniques, MoE is the most significant contributor to DeepSeek’s efficiency — and the one that sets it apart. The idea dates back decades, but its modern sparsely gated form was introduced in 2017 through a collaboration between Google Brain and Jagiellonian University. MoE divides a model’s parameters — the “knobs and dials” it adjusts during training to identify patterns and relationships — into specialised “experts.” For each input token, a lightweight routing network activates only the most relevant experts, significantly reducing computational load while maintaining high performance.
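For intuition, here is a minimal PyTorch sketch of a sparsely gated MoE layer with top-k routing. It illustrates the general technique (a small gate scores each token and only the chosen experts run), not DeepSeek’s actual architecture; all layer sizes are invented for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Minimal sparsely gated mixture-of-experts layer (illustrative only)."""

    def __init__(self, d_model=256, d_hidden=1024, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, num_experts)   # the router
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                       # x: (num_tokens, d_model)
        scores = self.gate(x)                   # (num_tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)    # normalise over chosen experts

        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_pos, slot = (idx == e).nonzero(as_tuple=True)
            if token_pos.numel() == 0:
                continue                        # this expert sits idle
            # Only the routed tokens ever touch this expert's parameters.
            out[token_pos] += weights[token_pos, slot].unsqueeze(-1) * expert(x[token_pos])
        return out

layer = MoELayer()
tokens = torch.randn(16, 256)                   # a toy batch of 16 token vectors
print(layer(tokens).shape)                      # torch.Size([16, 256])
```

With top_k = 2 of 8 experts, only a quarter of the expert parameters are exercised per token; that selective activation is where the compute savings come from.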

The challenge with MoE has always been its complexity. Routing tasks to the right experts, coordinating outputs, and maintaining consistent performance have made it notoriously difficult to implement effectively. Many in the industry had moved away from MoE, deeming it too unwieldy.
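A concrete example of that difficulty is load balancing: left to itself, the gate tends to collapse onto a few favourite experts, leaving the rest idle. One common mitigation in the literature is an auxiliary loss that penalises uneven routing; the snippet below sketches the Switch Transformer formulation as a generic illustration, not a description of DeepSeek’s own balancing scheme.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(gate_logits, expert_indices, num_experts):
    """Switch-Transformer-style auxiliary loss (generic sketch).

    gate_logits:    (num_tokens, num_experts) raw router scores
    expert_indices: (num_tokens, top_k) experts selected for each token
    """
    probs = F.softmax(gate_logits, dim=-1)                # router probabilities
    # Average number of times each expert is selected per token.
    dispatch = F.one_hot(expert_indices, num_experts).float().sum(dim=1)
    tokens_per_expert = dispatch.mean(dim=0)
    # Average routing probability the gate assigns to each expert.
    prob_per_expert = probs.mean(dim=0)
    # Smallest when both quantities are spread uniformly across experts.
    return num_experts * torch.sum(tokens_per_expert * prob_per_expert)
```

In practice a term like this is added to the training loss with a small coefficient, nudging the router to spread tokens across experts without dictating exactly how.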

DeepSeek, however, managed to overcome these challenges. By fine-tuning its routing mechanisms and integrating MoE with MLA and FP8, it demonstrated that MoE can work at scale — and deliver results. The implications are significant: DeepSeek has shown that efficiency doesn’t have to come at the expense of performance, and that scaling down can be just as powerful as scaling up.

The Great Unhobbling

As I noted in my Trends Redux, this new model’s success exemplifies a broader shift in the AI landscape — what Leopold Aschenbrenner described as “the unhobbling” in his seminal paper Situational Awareness. While the past few years have been defined by skyrocketing capabilities and shattered benchmarks, the (near) future lies in leveraging what we already have.

Rather than chasing raw computational expansion, the industry is beginning to pivot toward efficiency, practicality, and adaptability. DeepSeek embodies this shift, proving that innovation doesn’t always mean building bigger — it means building smarter. As open-source and proprietary models converge, 2025 will be the year enterprises finally harness what’s within reach.

Follow the lines…
