Nvidia has faced scrutiny this month because some servers with a whopping 72 Blackwell processors were overheating. The issue arose because some initial OEM deployments were not properly water-cooled, a problem Lenovo aggressively identified and mitigated with its Neptune warm-water-cooling solutions.
As AI advances, we’ll need denser, more powerful AI processors, which suggests that air cooling in server rooms may become obsolete.
Let’s talk about Blackwell, water cooling, and why Lenovo’s Neptune solution stands out at the moment. We’ll close with my Product of the Week: Microsoft’s Windows 365 Link, which may be the missing link between PCs and terminals and could forever change desktop computing.
Blackwell
Blackwell is Nvidia’s premier, AI-focused GPU. When it was announced, it was so far beyond what most would have thought practical that it seemed more like a pipe dream than a solution. But it works, and there is nothing close to its class right now. However, it is massively dense in terms of technology and generates a lot of heat.
Some argue it is a potential ecological disaster. Don’t get me wrong, it does pull a lot of power and generate a tremendous amount of heat. But it does so much more work than the more conventional parts you’d typically use that, for the same load, it is relatively economical to run.
It’s like comparing a semi-truck with three trailers to a U-Haul van. Yes, the semi will get comparatively crappy gas mileage, but it will also hold more cargo than 10 U-Haul vans and use a lot less gas than those 10 vans, making it more ecologically friendly. The same is true of Blackwell. It is so far beyond its competition in terms of performance that its relatively high energy use is still below what a competing AI server would need to do the same work.
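To put rough numbers on the truck analogy, here is a minimal sketch in Python. The figures are invented purely for illustration (a semi that burns five times the fuel of one van while hauling ten vans' worth of cargo); they are not measurements of Blackwell or any other part, and the function names are hypothetical.

```python
def energy_per_unit_work(power_draw: float, throughput: float) -> float:
    """Energy spent per unit of work: power draw divided by work done per unit time."""
    if throughput <= 0:
        raise ValueError("throughput must be positive")
    return power_draw / throughput


def relative_saving(dense: float, baseline: float) -> float:
    """Fraction of energy saved per unit of work versus the baseline."""
    return 1 - dense / baseline


# Hypothetical, made-up figures chosen only to mirror the analogy.
semi = energy_per_unit_work(power_draw=5.0, throughput=10.0)  # 5x the fuel, 10x the cargo
van = energy_per_unit_work(power_draw=1.0, throughput=1.0)    # baseline vehicle

print(f"semi: {semi} fuel per load, van: {van} fuel per load")
print(f"saving per load: {relative_saving(semi, van):.0%}")
```

Under those assumed numbers, the thirstier vehicle still uses half the fuel per load delivered. The same arithmetic applies to a processor that draws more power but finishes proportionally more work.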
But Blackwell chips do run hot, and most servers today are air-cooled. So, it shouldn’t be surprising that some Blackwell servers were configured with air cooling and that those with 72 or more Blackwell processors in a rack overheated. While 72 Blackwells in a rack is unusual today, it will become more common as AI advances, given that Nvidia is currently the king of AI.
You can only go so far with air-cooled technology in terms of performance before you have to move to liquid cooling. While Nvidia did respond to this issue with a water-cooled rack specification that Dell is now using, Lenovo was way ahead of the curve with its Neptune water-cooling solution.
Lenovo Neptune
Lenovo was the first to realize that air cooling was running out of headroom, mainly because it is currently the market leader in its class in water cooling, a technology it initially acquired from IBM, which has been doing water cooling for decades.
What is important with water cooling isn’t just the technology but the knowledge of how to deploy it safely. Mixing water and high-amperage electronics can be a disaster if you don’t know what you’re doing. As a result of the IBM server acquisition, Lenovo has decades of water-cooling experience behind the solution it calls Neptune.
Given that Nvidia has specified a water-cooled rack, what makes Neptune better? The answer is experience. Most of those who will use the Nvidia-specified solution, including Nvidia, don’t often deploy water-cooled solutions. As a result, particularly with these high-end Blackwell implementations, they’ll essentially be learning on the job.
Water and electricity don’t mix. Not only can a leak fry an expensive part or even an entire rack, but if a person is present, it can fry them, too, if the breakers don’t trip. In a raised-floor environment, unless it has been designed with leaks in mind, terrible things can happen.
I observed this myself decades ago when I was at IBM. The site lost a transformer, which shut off the water-cooling system for our massive (for the time) data center, and it turned out that system had never been stress-tested for a sudden stop. The pipes burst, and the data center became a dangerous swimming pool. Most of the hardware, costing hundreds of millions of dollars, was lost, and the building was flooded, doing additional damage.
Through experiences like this, IBM became the leading OEM for safe water cooling, and Lenovo acquired that knowledge and experience when it bought the IBM x86 server group. Now, Lenovo, along with IBM, knows how to do water cooling better than most, which means that you can rest assured that a Lenovo Blackwell server won’t overheat or suddenly begin to leak.
Plus, Lenovo’s expertise is in warm water cooling, a far safer and far less expensive way to cool servers than cold water cooling, which requires huge, inefficient evaporators or chillers.
Implementing this technology is no trivial task. Unlike water-cooled automobiles or PCs, servers must support hot swapping, which means you need exceptional and highly tested drip-free connections, aggressive alerting, preventive maintenance schedules based on past knowledge of components, and technicians experienced in working with this level of water-cooling tech.
Wrapping Up: A Future of Warm-Water-Cooled Data Centers
Blackwell is only the first of these incredibly powerful processors to hit the market. As AI pushes the envelope, Nvidia’s competitors will have to push into something similar, suggesting all servers may eventually need to be warm-water-cooled.
That positions Lenovo nicely for a water-cooled future, regardless of which processor technology wins, while Lenovo’s competitors try to catch up. One benefit I expect techs to look forward to is the reduction in data center noise. The amount of air you have to push through air-cooled servers is massive and turns today’s data centers into a noise nightmare.
As warm-water cooling moves into the market more aggressively, these data centers will quiet down, making them far more pleasant places to work. That will make many of us who have to work in them very happy.