In 2023, hyperscalers spent over $50 billion on custom silicon, and that number is climbing. Google, Amazon, Microsoft, Meta, and even Tesla have all publicly committed to designing their own AI accelerators. The reason is straightforward: off-the-shelf chips from NVIDIA or AMD are expensive, power-hungry, and often supply-constrained. By building custom AI hardware, these companies aim to reduce training costs by 30–50%, improve inference throughput by a factor of 2–5 for their specific workloads, and lock in proprietary performance advantages. This article walks through the major players, their chip architectures, real-world deployment numbers, and what developers and IT decision-makers need to know about this shift.
The move toward custom silicon isn't sudden. It has been brewing since 2015, when Google first revealed its Tensor Processing Unit (TPU). The inflection point came with the explosion of large language models (LLMs) and generative AI in 2022–2024. Training a single model like GPT-4 costs an estimated $100–200 million in compute time. At that scale, even a 10% efficiency gain saves tens of millions of dollars per model generation.
Three converging forces are driving the shift: the cost of off-the-shelf accelerators, their power draw, and persistent supply constraints.
Google’s TPU has evolved through five generations since 2015. The latest, TPU v5e, launched in 2023, delivers 2x performance per dollar over TPU v4 for LLM inference. Each TPU v5e pod contains 256 chips interconnected with a custom 2D torus mesh, achieving 1.2 petaflops of bfloat16 performance.
Unlike NVIDIA GPUs, TPUs lack a dedicated video memory hierarchy. They rely on a unified High Bandwidth Memory (HBM) pool—up to 128 GB per chip on v5e. This reduces latency for models that fit entirely in memory but creates bottlenecks for larger models requiring model parallelism across pods. Google’s internal data shows that TPU v5e achieves 82% utilization on BERT-sized models versus 65% for comparable GPU configurations, but the gap narrows for models over 100 billion parameters.
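As a rough sizing check, taking the 128 GB per-chip figure above at face value, a back-of-the-envelope calculation shows why larger models force model parallelism (weights only, in bfloat16; activations, KV cache, and optimizer state add substantially more in practice):

```python
# Rough check: does a model's weight footprint fit in one chip's HBM?
# Assumes bfloat16 (2 bytes per parameter) and ignores activations and
# optimizer state, so this is an optimistic lower bound.
HBM_PER_CHIP_GB = 128  # per-chip figure cited above

def fits_on_one_chip(num_params_billions: float) -> bool:
    weight_gb = num_params_billions * 1e9 * 2 / 1e9  # bf16 = 2 bytes per param
    return weight_gb <= HBM_PER_CHIP_GB

print(fits_on_one_chip(13))   # True  (~26 GB of weights)
print(fits_on_one_chip(100))  # False (~200 GB of weights -> model parallelism)
```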
Developers often assume TPUs are drop-in replacements. They are not. Google’s XLA compiler must be used to JIT-compile TensorFlow or JAX models. Skipping XLA-specific quantization or batching can degrade throughput by up to 60%. A 2024 case study from Google Research showed that proper tuning on TPU v5e reduced training time for a PaLM-like model by 37% compared to naive deployment.
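The practical upshot is that every TPU code path runs through XLA. A minimal JAX sketch of the compile step (the layer and shapes here are illustrative, not taken from Google's case study):

```python
import jax
import jax.numpy as jnp

# jax.jit traces the function once and hands it to XLA, which emits a
# fused program for the TPU backend; without it you pay per-op dispatch cost.
@jax.jit
def dense_layer(params, x):
    return jax.nn.relu(x @ params["w"] + params["b"])

params = {"w": jnp.ones((512, 512), dtype=jnp.bfloat16),
          "b": jnp.zeros((512,), dtype=jnp.bfloat16)}
x = jnp.ones((64, 512), dtype=jnp.bfloat16)  # batch shape is baked into the compiled program

y = dense_layer(params, x)     # first call compiles; later calls reuse the cached executable
print(y.shape, jax.devices())  # jax.devices() lists TPU cores when run on a TPU VM
```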
Amazon entered the custom chip race with Inferentia in 2019 for inference, followed by Trainium in 2021 for training. The second-generation Trainium2 chip, announced in late 2023, offers 200 teraflops of performance per chip and 96 GB of HBM memory. AWS claims that Trainium2 reduces training cost per epoch by 35% compared to comparable GPU instances in EC2.
Amazon SageMaker now supports Trainium instances (trn1.32xlarge) starting at $1.70 per hour on reserved pricing. A notable deployment is Amazon’s own Alexa speech models, which trained on a cluster of 10,000 Trainium chips. Internal benchmarks show that for transformer-based speech models, Trainium delivers 1.8x the throughput of NVIDIA A100 at the same power draw. However, for sparse models or those requiring frequent checkpointing, the lack of mature dynamic queuing in the Neuron SDK, Trainium’s software stack, can cause 15–20% idle time.
If your model uses heavy custom ops (e.g., custom attention kernels) or requires low-precision arithmetic beyond bfloat16, Trainium may underperform. The Neuron SDK supports only a subset of PyTorch operators; as of early 2024, around 1,200 operators are supported versus over 2,000 in CUDA. A developer I spoke to at re:Invent 2023 noted that porting a vision transformer with custom normalization took three weeks to rewrite in Neuron-compatible operations.
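For orientation, Trainium training runs through the PyTorch/XLA path rather than CUDA. The following is a minimal sketch of that pattern, assuming a trn1 instance with the Neuron SDK and torch-xla installed; the model and hyperparameters are placeholders:

```python
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm  # provided by torch-xla, which Neuron builds on

device = xm.xla_device()  # resolves to a NeuronCore on a trn1 instance

model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

for step in range(10):
    x = torch.randn(32, 1024).to(device)
    y = torch.randint(0, 10, (32,)).to(device)
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    xm.optimizer_step(optimizer)  # reduces gradients (if distributed) and steps the optimizer
    xm.mark_step()                # cuts the lazy XLA graph so the step actually executes
    # Any operator the Neuron compiler does not support forces a fallback or a
    # rewrite, which is where porting effort like the example above goes.
```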
Microsoft unveiled the Maia 100 chip at Ignite 2023. It is a 5nm chip designed specifically for Azure AI workloads. Each Maia 100 provides 1.2 petaflops of AI performance and 128 GB of HBM3 memory. Crucially, Maia is not sold as a standalone product—it is integrated into Azure’s dedicated AI instances, competing with NVIDIA’s H100 Cloud Instances.
Microsoft’s deepest advantage is its $13 billion partnership with OpenAI. Maia 100 was co-designed with feedback from OpenAI’s engineering team. Early benchmarks shared by Microsoft at Build 2024 show that GPT-4 Turbo inference on Maia 100 achieves 50–60 tokens per second per chip versus 40–50 on NVIDIA H100 in Azure’s customized stack. The trade-off is that Maia only supports Microsoft’s custom low-level API, Silo, which lacks the ecosystem breadth of CUDA. Migrating existing PyTorch workloads requires using Olive, Microsoft’s model optimization tool, which as of May 2024 still has limited INT8 quantization support.
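Olive itself is configuration-driven and out of scope here, but the INT8 gap is easier to picture with plain PyTorch dynamic quantization, the kind of transformation tools like Olive automate. This sketch uses generic PyTorch APIs, not Olive’s or Silo’s:

```python
import torch
import torch.nn as nn

# A stand-in model; real migrations target transformer checkpoints.
model = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768)).eval()

# Dynamic quantization converts Linear weights to INT8 and quantizes activations
# on the fly; accuracy and operator coverage both need re-validation on the
# target accelerator.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 768)
print(quantized(x).shape)  # torch.Size([1, 768])
```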
Meta’s custom chip journey is less advanced but strategically distinct. The first-generation Meta Training and Inference Accelerator (MTIA), announced in 2023, is a 7nm chip focused on recommendation systems and ranking models. Meta reports that MTIA delivers 3x better performance per watt for deep learning recommendation models (DLRMs) compared to CPU-based deployments.
MTIA is not designed for LLM training. Its architecture is optimized for embedding table lookups and sparse matrix operations that dominate Meta’s ad ranking and feed ranking engines. A common mistake is to assume MTIA can accelerate transformer models—it can, but only via explicit operator mapping, and performance is roughly on par with a mid-range GPU. For Meta, the real win is power savings: they anticipate reducing total compute power for recommendation workloads by 25% across their fleet by 2025.
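To see why embedding lookups dominate, consider a stripped-down DLRM-style forward pass; the table sizes below are illustrative, and Meta’s production tables are orders of magnitude larger:

```python
import torch
import torch.nn as nn

class TinyDLRM(nn.Module):
    """Toy recommendation model: sparse embedding lookups plus a small dense MLP."""
    def __init__(self, num_ids=1_000_000, dim=64, num_dense=13):
        super().__init__()
        # EmbeddingBag does the sparse lookup-and-pool step that dominates
        # DLRM compute and memory traffic.
        self.table = nn.EmbeddingBag(num_ids, dim, mode="sum")
        self.mlp = nn.Sequential(nn.Linear(dim + num_dense, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, sparse_ids, offsets, dense_features):
        pooled = self.table(sparse_ids, offsets)  # (batch, dim)
        return self.mlp(torch.cat([pooled, dense_features], dim=1))

model = TinyDLRM()
ids = torch.randint(0, 1_000_000, (256,))  # flattened sparse feature IDs
offsets = torch.arange(0, 256, 8)          # 32 examples, 8 IDs each
dense = torch.randn(32, 13)
print(model(ids, offsets, dense).shape)    # torch.Size([32, 1])
```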
NVIDIA is not sitting still. In March 2024, they announced the Blackwell GPU architecture, which includes 208 billion transistors and 1.8 TB/s memory bandwidth. Blackwell’s NVLink interconnect scales to 576 GPUs in a single domain. However, NVIDIA’s prices are rising: the B200 GPU is expected to cost $50,000–$60,000 per unit.
Despite the custom chip push, NVIDIA’s key advantage remains its software ecosystem. CUDA has over 4 million developers and libraries optimized for every major framework. Custom chips require developers to learn new SDKs (Google’s XLA, Amazon’s Neuron, Microsoft’s Olive). This friction cost is often underestimated. A 2023 survey by AI Infrastructure Alliance found that 72% of ML engineers prefer to stay with NVIDIA for new projects, citing tooling maturity as the primary reason. For startups with small teams, the switching cost can exceed the hardware savings.
Choosing between custom chips and GPUs depends on workload scale, team expertise, and long-term vendor lock-in risk.
By 2026, it is plausible that custom AI chips will handle over 60% of training and 80% of inference tasks inside major cloud providers. But the transition will be uneven. For companies not operating at hyperscale, NVIDIA’s ecosystem remains the safest bet. For those building dedicated AI products that run 24/7, a hybrid approach—training on GPUs, inferring on custom chips—offers the best balance of flexibility and cost. Start benchmarking your workloads today; the chip you choose in 2024 will shape your AI costs for the next three years.
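A minimal, vendor-neutral starting point for that benchmarking (generic PyTorch timing; swap in your own model, batch sizes, and precision):

```python
import time
import torch

def throughput(model, example_batch, warmup=10, iters=50):
    """Rough samples-per-second for one inference configuration."""
    model.eval()
    with torch.no_grad():
        for _ in range(warmup):  # warm up caches and any lazy compilation
            model(example_batch)
        start = time.perf_counter()
        for _ in range(iters):
            model(example_batch)
        # On GPUs or other accelerators, synchronize the device here before
        # reading the clock, or the timing will undercount.
        elapsed = time.perf_counter() - start
    return iters * example_batch.shape[0] / elapsed

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU())
batch = torch.randn(64, 1024)
print(f"{throughput(model, batch):.0f} samples/sec")
```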