Building the sustainable HPC environments of the future

In this guest post, Mischa van Kesteren, sustainability officer at HPC systems integrator OCF, runs through the wide variety of ways that large-scale computing environments can be made to run more energy efficiently.

Supercomputers are becoming more energy-hungry. The pursuit of Moore's Law and ever-greater hardware performance has led manufacturers to ramp up the power consumption of components massively.

For example, a typical high performance computing (HPC) CPU from 10 years ago would have a thermal design power (TDP) of 115 W – today that figure is closer to 200 W.

Modern GPUs can exceed 400 W. Even network switches, which used to be an afterthought from a power consumption perspective, can now draw over 1 kW each.

And the race to achieve exascale has pushed the power consumption of the fastest supercomputer on the planet from 7.9 MW in 2012 to 29.9 MW in 2022.

In this era of climate chaos, is this justifiable? Ultimately, yes. Whilst 29.9 MW is enough electricity to power 22,000 average UK households, the research performed on these large systems is some of the most crucial to navigating the challenges we face now and those to come, whether that is research into climate change, renewable energy or combating disease.

It is vital, however, that we continuously strive to find ways of running HPC infrastructures as efficiently as possible.

The push for energy efficiency

The most common method of measuring the power efficiency of a datacentre is its power usage effectiveness (PUE). Traditional air-cooled infrastructure blows hot air through the servers, switches and storage to cool their components, then uses air-conditioning to remove the heat from that air before recirculating it. All of this consumes a lot of power.

Air-cooled facilities often have a PUE in excess of two, meaning the datacentre as a whole draws twice as much power as the IT equipment alone. The goal is to get the PUE of the HPC infrastructure as close to one as possible (or, once waste heat is reused, effectively even lower).
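
As a back-of-the-envelope illustration of what that ratio means (all figures below are hypothetical):

```python
# Hypothetical figures for illustration only.
it_power_kw = 500.0        # servers, switches and storage
cooling_power_kw = 450.0   # air handling and chillers
other_overhead_kw = 50.0   # lighting, UPS losses, etc.

total_facility_kw = it_power_kw + cooling_power_kw + other_overhead_kw

# PUE = total facility power / IT equipment power
pue = total_facility_kw / it_power_kw
print(f"PUE = {pue:.2f}")  # 2.00 -- the facility draws twice the IT load
```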

A more efficient method is to cool the hot air with water. Water transfers heat over 20 times faster than air, making it far better for cooling hardware. Air-cooled components can still benefit from water via rear door heat exchangers, which place a large water-filled radiator at the rear of the rack to cool all the hot air exhausted by the servers.

Get the flow rate and water temperature right and you can remove the need for air conditioning altogether. This can get the PUE down to around 1.4.
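
As a rough sketch of the sizing calculation behind that statement, using the relation Q = ṁ·cp·ΔT (the rack heat load and temperature rise below are illustrative assumptions):

```python
# Back-of-the-envelope sizing for a rear door heat exchanger.
# The heat load and temperature rise are illustrative assumptions.
heat_load_kw = 30.0   # heat exhausted by one rack
cp_water = 4.186      # specific heat of water, kJ/(kg*K)
delta_t_k = 10.0      # water temperature rise across the door

# Q = m_dot * cp * delta_T  ->  m_dot = Q / (cp * delta_T)
flow_kg_per_s = heat_load_kw / (cp_water * delta_t_k)
flow_l_per_min = flow_kg_per_s * 60  # water: ~1 kg per litre

print(f"Required flow: {flow_kg_per_s:.2f} kg/s (~{flow_l_per_min:.0f} l/min)")
```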

Alternatively, components can be fitted with water blocks on the CPU, GPU, network devices and so on, cooling them directly and removing the need for air cooling altogether. This is far more efficient, bringing the PUE down further, possibly to less than 1.1.

Reusing waste datacentre heat

Ultimately, we need to do something with the waste heat. A good option is free cooling, where the outdoor air temperature is used to cool the water in your system. The highest outdoor temperature recorded in the UK was 38.7 °C.

Computer components are rated to run at up to roughly double that, so as long as the transfer medium is efficient enough (like water), you can always cool your components for just the energy used by the pumps. This is one of the reasons datacentres in Norway and Iceland are so competitive: their lower ambient temperatures mean free cooling is available for far more of the year.
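
A minimal sketch of that free-cooling logic, assuming a warm-water loop and a dry cooler that returns water a few degrees above ambient (both thresholds are assumptions, not vendor figures):

```python
# Minimal free-cooling check. Both thresholds are illustrative assumptions.
WATER_SUPPLY_MAX_C = 45.0  # warm-water liquid cooling loops can accept supply this hot
APPROACH_C = 5.0           # dry coolers return water a few degrees above ambient air

def free_cooling_available(outdoor_temp_c: float) -> bool:
    """True if outdoor air alone can cool the loop (fans and pumps only)."""
    return outdoor_temp_c + APPROACH_C <= WATER_SUPPLY_MAX_C

# Even the UK record of 38.7 C passes: 38.7 + 5 = 43.7 <= 45.
for temp_c in (10.0, 25.0, 38.7):
    print(f"{temp_c:5.1f} C -> free cooling: {free_cooling_available(temp_c)}")
```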

Taking things one step further, the heat can be put to practical use rather than exhausted into the air. A few innovative datacentres have partnerships with local communities, providing heating to homes (or even the local swimming pool) from their exhaust heat. The energy those homes would otherwise have consumed to heat themselves has in theory been saved, which can bring the effective PUE of the total system below one.
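
Strictly speaking, PUE is defined so that it cannot fall below one; the metric that credits exported heat is the Green Grid's energy reuse effectiveness (ERE), which subtracts reused energy from the facility total. A minimal sketch with hypothetical figures:

```python
# Hypothetical figures for illustration only.
it_energy_kwh = 1000.0        # energy delivered to the IT equipment
facility_energy_kwh = 1150.0  # whole-facility draw (a PUE of 1.15)
reused_energy_kwh = 300.0     # waste heat exported to homes or a pool

# ERE = (total facility energy - reused energy) / IT energy
ere = (facility_energy_kwh - reused_energy_kwh) / it_energy_kwh
print(f"ERE = {ere:.2f}")  # 0.85 -- 'below one' once heat reuse is credited
```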

The next step being investigated is storing the heat in salt, which can hold it for long periods, to smooth out the mismatch between heating demand and compute utilisation. Imagine the knock-on effect of the traditional Christmas maintenance window, where IT infrastructure is switched off just when those local households need heat the most.

One thing you may have noticed about all of these solutions is that they are largely only practical at scale. It is no coincidence that vast cloud datacentres and colocation facilities are the places where these innovations are being tested; that is where they work best. The good news is that the industry seems to be moving in that direction anyway, as the age of the broom-cupboard server room fades.

The power consumption of the cloud giants

However, in the pursuit of economies of scale, public cloud providers operate huge fleets of servers, many of which are underutilised. This can be seen clearly in the price difference between on-demand instances, which run when you want them to (typically at peak times), and 'spot' instances, which run when it is most affordable for the cloud provider.

Spot instances can be up to 90% cheaper. As cloud pricing is driven in large part by the power consumed by the instance you are running, a huge amount of wasted energy must be costed into the price of the standard instances.

Making use of spot instances lets you run HPC jobs affordably, soaking up the excess capacity of cloud datacentres and improving their overall efficiency. Running workloads on demand, by contrast, can make that inefficiency worse.

Luckily, HPC workloads often fit the spot model. Users are accustomed to submitting a job and walking away, letting the scheduler determine the best time to run it.

Most of the major cloud providers let you set a maximum price you are willing to pay when you submit a job, then wait for the spot market to fall to that price point.
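
On AWS, for example, that looks roughly like the sketch below using boto3 (the AMI ID, instance type and price ceiling are placeholder assumptions; the other major providers expose equivalents):

```python
# Sketch: open a spot request with a price ceiling using boto3 (AWS).
# The AMI ID, instance type and price below are placeholder assumptions.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-2")

response = ec2.request_spot_instances(
    SpotPrice="0.40",   # maximum USD/hour you are willing to pay
    InstanceCount=1,
    Type="persistent",  # keep the request open until capacity is
                        # available at or below that price
    LaunchSpecification={
        "ImageId": "ami-0123456789abcdef0",  # placeholder AMI
        "InstanceType": "c5.4xlarge",        # placeholder instance type
    },
)
request = response["SpotInstanceRequests"][0]
print(request["SpotInstanceRequestId"], request["State"])
```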

This is only one element of HPC energy efficiency. There is a whole other world of techniques: shortening job times through better coding, right-sizing hardware to fit workloads and enabling power-saving features on the hardware itself, to name a few.
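
As one example of that last point, here is a minimal, Linux-only sketch of switching the CPU frequency governor to powersave (it assumes root access and a cpufreq driver that offers that governor):

```python
# Minimal sketch (Linux-only, requires root): switch every CPU core's
# frequency-scaling governor to "powersave" via the standard sysfs paths.
from pathlib import Path

for gov in Path("/sys/devices/system/cpu").glob("cpu[0-9]*/cpufreq/scaling_governor"):
    gov.write_text("powersave\n")
    print(f"{gov.parent.parent.name}: powersave")
```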

HPC sustainability is a huge challenge, and it involves everyone who interacts with the system, not just the designers and infrastructure planners. They are, however, a good place to start: talking to the people who can build in the right technologies from day one ensures you end up with a sustainable HPC environment fit for the future.