In December 2025, Ben Thompson wrote that Nvidia historically had three moats relative to TPUs: “superior performance, significantly more flexibility due to GPUs being more general purpose than TPUs, and CUDA and the associated developer ecosystem surrounding it.” Ben goes on to make the point that since Gemini 3 was trained on TPUs and, at the time, sat atop the model-quality leaderboards, Nvidia chips clearly aren’t required for building the “best” model; given the astounding capex costs of Nvidia GPUs, frontier labs and hyperscalers might therefore be more willing to wade through Nvidia’s other moats to use TPUs. Importantly, Ben points out that TSMC is the real bottleneck: both TPUs and GPUs are supply constrained, so the debate might be premature.
While supply remains constrained, I no longer think the debate is premature, because we’re seeing continued signs of diversification. On April 6th, Anthropic published this press release announcing an expanded partnership with Google and Broadcom for gigawatts of TPU capacity, which ended with this statement:
“We train and run Claude on a range of AI hardware—AWS Trainium, Google TPUs, and NVIDIA GPUs—which means we can match workloads to the chips best suited for them.”
Another way to read that statement is that ‘Nvidia chips aren’t the best at everything, and we’re willing to go through the trouble of using other chips because of that.’ We got more specificity a few days later when AWS CEO Matt Garman said, “all of [Anthropic’s] models that are out today are trained on [AWS] Trainium.” While Garman doesn’t specify what he meant by “trained,” the implicit takeaway is that Anthropic was willing to do at least some of its model training on Trainium, which is a big shift because training has been Nvidia’s specialty.
In addition, there are promising new AI chip suppliers entering the market. One of them is MatX, which announced a $500M Series B in late February. Their CEO, Reiner Pope, gave an interview on April 9th where he talked about the competitive dynamics they see as they start to sell their first product, MatX One:
“At this point, I think it’s proven that the lock-in is pretty weak. Barring Google, who has been on TPUs forever, all of the other frontier labs are multi-platform. OpenAI, Anthropic, Meta, X — they are all on Nvidia, many of them are on TPUs. There are Cerebras announcements, AMD, some Broadcom-developed chips as well. All of these players are multi-platform. That is the proof already that the software lock-in is not that great.
If you want to think about the first principles reasons why, it’s because software versus hardware lock-in is really a question of how much spend you’re putting on the hardware versus how much you’re putting on software engineering to support the hardware. This is really the first time that balance has changed, and it has violated a lot of people’s intuitions… All of the frontier labs are spending tens of billions of dollars on compute. The salaries of the people writing software for that compute are very high, but still small in comparison to the compute spend. So the rational choice is to do anything you can to get hardware costs down, be multi-platform, get the negotiating power that comes from that.”
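To put rough numbers on Reiner’s argument, here is a back-of-envelope sketch in Python. Every figure in it, the compute spend, the headcount, and the salary, is an illustrative assumption rather than anything from the interview:

```python
# Back-of-envelope: hardware spend vs. the software cost of going multi-platform.
# All figures below are illustrative assumptions, not reported numbers.

compute_spend = 20e9        # assumed annual compute spend for a frontier lab
kernel_engineers = 200      # assumed headcount to port/optimize for a second platform
fully_loaded_salary = 1e6   # assumed fully loaded cost per engineer per year

software_cost = kernel_engineers * fully_loaded_salary
print(f"porting cost as share of compute spend: {software_cost / compute_spend:.1%}")
# -> 1.0%

# Even a modest hardware discount from multi-platform negotiating power
# dwarfs the engineering cost of supporting a second stack.
for discount in (0.02, 0.05, 0.10):
    savings = discount * compute_spend
    print(f"{discount:.0%} discount saves ${savings / 1e9:.1f}B "
          f"vs. ${software_cost / 1e9:.1f}B of software cost "
          f"({savings / software_cost:.0f}x)")
```

Even at these made-up numbers, a low single-digit hardware discount pays for the porting effort several times over, which is exactly the “rational choice” Reiner describes.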
Not only does Reiner make the same point as Ben Thompson, that higher capex makes the tradeoff of leaving CUDA more worthwhile, but he also explains how MatX’s chip is specifically designed to handle the workloads that labs and hyperscalers want now:
“That’s where the joint hybrid SRAM-HBM design really shines. You spend none of your HBM bandwidth on loading weights. All of that bandwidth is spent entirely on KV cache. So you can get better use out of your HBM bandwidth than you can with Nvidia. But you also get the very low latency because the weights are stored in SRAM, like Cerebras and Groq.
Digging into that further: low latency means small batch sizes — that’s just Little’s law. The number of things in flight are smaller. The memory occupancy in HBM is proportional to batch size. So you can actually fit longer contexts in HBM than you could if the latency were larger. Low latency is not just a usability win, but it actually improves your throughput as well.
This is similar to what Nvidia is now doing with the Groq and Nvidia racks side by side, but there are some taxes you pay by them being in different packages. Putting the whole thing in one package is the first principles way to do that and gives you the most advantages.”
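The Little’s law point is worth working through, since it’s the crux of why low latency improves throughput rather than just user experience. Here is a minimal sketch of the capacity math; the throughput, HBM, and per-token KV-cache figures are illustrative assumptions, not MatX or Nvidia specs:

```python
# Sketch of the Little's law argument: at a fixed request throughput, lower
# latency means fewer requests in flight, so each request's KV cache can be
# longer. All hardware numbers are illustrative assumptions.

def max_context_tokens(throughput_rps, latency_s, hbm_for_kv_bytes, kv_bytes_per_token):
    """Longest context that fits when HBM is spent entirely on KV cache.

    Little's law: requests in flight = throughput * latency.
    KV-cache memory scales with (requests in flight) * (context length).
    """
    in_flight = throughput_rps * latency_s  # L = lambda * W
    return hbm_for_kv_bytes / (in_flight * kv_bytes_per_token)

HBM_BYTES = 96e9             # assumed HBM devoted to KV cache (weights live in SRAM)
KV_BYTES_PER_TOKEN = 3.3e5   # assumed ~0.33 MB/token of KV cache, 70B-class model
THROUGHPUT = 10              # assumed requests per second, held fixed

for latency in (4.0, 1.0):   # seconds per request
    ctx = max_context_tokens(THROUGHPUT, latency, HBM_BYTES, KV_BYTES_PER_TOKEN)
    print(f"latency {latency:.1f}s -> {THROUGHPUT * latency:.0f} requests in flight, "
          f"max context ~{ctx:,.0f} tokens")
```

At a fixed throughput, cutting latency 4x cuts the number of in-flight requests 4x, which lets 4x longer contexts fit in the same HBM; and because the weights sit in SRAM, none of that HBM capacity or bandwidth is spent on weight loads in the first place.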
The takeaway here is that while we’re still living in an extremely chip-constrained world, diversification is happening, not just because Nvidia’s prices and margins are high, but also because Nvidia isn’t A+ in every dimension.