The generation gap: Why legacy datacentres are not a good fit for HPC
In this guest post, Spencer Lamb, vice president of Harlow-based colocation firm Kao Data, sets out why legacy datacentres may struggle to accommodate the growing demand for HPC workloads
The impact of high performance computing (HPC), GPU-powered artificial intelligence (AI) and machine learning in the datacentre is creating a capability gap between legacy facilities and newer, custom-engineered ones.
Exaflops are the new measure of the supercomputing industry. Today, packing more processors into a server rack lets operators squeeze more performance out of their technical space, but it also requires them to supply more power.
Advances in CPU technology mean new processors draw more power and produce more heat, placing greater pressure on the datacentre’s cooling equipment.
Now, one of the key challenges facing datacentre operators is maintaining the equilibrium of space, power and cooling as rack densities – the number of kW per rack – edge steadily upwards.
According to an Uptime Institute survey, rack densities across the industry averaged 8.4kW in 2020, up from 5.6kW in 2017. By now, some analyst predictions would have placed the sector well into 30kW+ territory.
However, advances in energy efficiency, power infrastructure and IT equipment have kept the average relatively low. Densities are therefore rising, but not enough to drive wholesale site-level changes in power distribution or cooling technologies. So what is driving change within the colocation space?
HPC-ready datacentres
The picture is somewhat different when we look at the high-density IT deployments synonymous with HPC. Just a few years ago, anything above 16kW was considered ‘extreme’ density; this year, 16% of datacentre operators reported running many of their racks above 20kW.
The latest GPU-accelerated systems, for example, can consume 50kW or more per rack and are targeted at industrial-scale compute in automotive design, life sciences research, film rendering and financial modelling.
Now, as datacentres begin to host 100kW+ deployments, this new approach to compute divides commercial colocation facilities into two categories: those that can accommodate the requirements of modern HPC and AI infrastructure, and the traditional ‘legacy’ datacentres that cannot.
Moreover, with greater focus on carbon emissions, operators are tasked with balancing leaps in performance against major energy efficiency gains. This is a challenge that some operators, based in the hotbed of HPC and AI organisations along the UK Innovation Corridor, are beginning to solve.
HPC market growth drivers
Before the Covid-19 coronavirus pandemic, the HPC market saw years of uninterrupted growth and – despite the disruptions and challenges of 2020 – the industry remains buoyant. Intersect360 Research forecasts that this sector will continue to grow at a healthy rate of 7.1% between 2019 and 2024, reaching a global market value of $55 billion.
One of the major drivers for this is AI. Most AI applications rely on models that need to be ‘trained’ in an extremely compute-intensive process – and then constantly retrained to maintain accuracy.
Lambda Labs, for example, estimates that training OpenAI’s GPT-3, the largest language model ever made, would require $4,600,000 in compute resources – and this was projected using NVIDIA’s Tesla V100 at the lowest-priced GPU cloud rates on the market.
Interestingly, the prominence of commercial compute-intensive applications is only going to increase. According to analyst Omdia, the global market for AI hardware totalled $10.1 billion in 2018 and is expected to reach nearly $100 billion by 2025. The firm even downgraded its most recent forecast by 22% – a sign that not even the AI market is immune to Covid-19.
NVIDIA, having successfully harnessed the interest surrounding AI, has naturally emerged as the frontrunner in the broader adoption of HPC infrastructure. Its DGX-1, the first AI server designed by the company, consumes 3.5kW of power (almost half that of a ‘typical’ 42U server rack) in just three rack units. Furthermore, the latest iteration of the design, the DGX A100, roughly doubles both the size and the power consumption of its predecessor, to 6U and 6.5kW.
Individual DGX units can be linked together into the DGX SuperPOD, which runs so hot that NVIDIA’s own reference architecture does not recommend placing more than four in a single rack (26kW before any extras), even though an end-user could theoretically fit seven in the available rack space.
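A back-of-the-envelope sketch makes the point that power, not space, becomes the limiting factor. The per-system figures below are those quoted above; the 30kW rack power budget is an illustrative assumption, not NVIDIA guidance:

```python
# Back-of-the-envelope check: how many DGX A100 systems fit in one rack?
# Per-system figures are those quoted in the article; the rack power budget
# is an illustrative assumption, not NVIDIA guidance.

RACK_UNITS = 42            # standard full-height rack (U)
SYSTEM_UNITS = 6           # DGX A100 chassis height (U)
SYSTEM_POWER_KW = 6.5      # DGX A100 power draw (kW)
RACK_POWER_BUDGET_KW = 30  # assumed power/cooling envelope for a dense rack

fit_by_space = RACK_UNITS // SYSTEM_UNITS                    # 7 systems
fit_by_power = int(RACK_POWER_BUDGET_KW // SYSTEM_POWER_KW)  # 4 systems

systems_per_rack = min(fit_by_space, fit_by_power)
rack_load_kw = systems_per_rack * SYSTEM_POWER_KW

print(f"Space allows {fit_by_space} systems, power allows {fit_by_power}")
print(f"Deployable: {systems_per_rack} systems drawing {rack_load_kw:.1f}kW")
# -> Space allows 7 systems, power allows 4
# -> Deployable: 4 systems drawing 26.0kW
```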
These are not fringe cases any more. The SuperPOD is positioned as “bringing supercomputing to enterprises”, and enterprises have expressed plenty of interest. DGX systems are used by Baker Hughes in oil and gas, Volvo in automotive design and Accenture Labs in cyber security, while pharmaceutical giant GlaxoSmithKline recently announced it will be purchasing several machines for pharmaceutical research.
Furthermore, at Kao Data, our latest HPC customer will be utilising 26kW racks incorporating NVIDIA DGX hardware, so we know first-hand how the technology is changing the way datacentres are designed.
Demand for HPC datacentres on the rise
The growing adoption of HPC infrastructure, and the prevalence of AI workloads, suggests demand for datacentres equipped to handle extreme rack densities will only continue to increase. The key to a successful HPC facility is scalability, and that requires flexibility in the site’s architecture and in its power, space and cooling.
No longer can operators have a power and cooling ceiling baked into the design. Unfortunately, many existing datacentres – even those designed with an eye for the future – underestimated just how high the density requirements would become.
Indeed, most datacentres were designed and built to air-cooled specifications, and today’s HPC racks running intense AI applications are reaching the limits of what airflow can achieve. You simply cannot increase the air speed or reduce the air temperature enough to reject the volume of heat created by such power-hungry systems.
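A rough heat-rejection estimate shows why: the heat an air stream can carry away is approximately its mass flow multiplied by the specific heat of air and the supply-to-return temperature rise. The sketch below uses illustrative assumptions (the rack loads, a 12K air-side temperature rise and standard air properties) to show how quickly the required airflow grows:

```python
# Rough estimate of the airflow needed to remove a given rack heat load with
# air cooling alone. All inputs here are illustrative assumptions.

AIR_DENSITY = 1.2          # kg/m^3, air at roughly 20 deg C
AIR_SPECIFIC_HEAT = 1005   # J/(kg*K)

def airflow_required(rack_load_kw: float, delta_t_k: float) -> float:
    """Volumetric airflow (m^3/s) needed to carry away rack_load_kw of heat
    with a supply-to-return temperature rise of delta_t_k."""
    mass_flow = (rack_load_kw * 1000) / (AIR_SPECIFIC_HEAT * delta_t_k)  # kg/s
    return mass_flow / AIR_DENSITY                                       # m^3/s

for load_kw in (8.4, 26, 50):
    flow = airflow_required(load_kw, delta_t_k=12)  # assume a 12K air-side rise
    print(f"{load_kw:>5} kW rack -> {flow:4.2f} m^3/s (~{flow * 2119:,.0f} CFM)")

# An average 8.4kW rack needs roughly 0.6 m^3/s (~1,200 CFM), but a 50kW HPC
# rack needs about 3.5 m^3/s (over 7,000 CFM) – far more than conventional
# raised-floor air delivery is designed to push through a single rack.
```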
All datacentres considering HPC are now investigating liquid cooling, which in its various forms can minimise the requirement for air-cooled equipment and reduce cooling energy needs. Liquid cooling can also eliminate the need for server fans, further reducing complexity and potential points of failure.
However, as racks increase in size and weight, it is probable that legacy sites will not have accounted for the additional floor loading or height requirements. Here, Open Compute Project (OCP)-ready facilities, such as Kao Data, offer customisable architectures and infrastructure built to accommodate the true demands of intensive compute.
The datacentres that can best accommodate these extremes of power consumption and cooling are the facilities designed for HPC on an industrial scale – those built to emulate hyperscale datacentres, providing ample power supply, wide aisles, reinforced concrete floors and the adaptability to accommodate liquid-cooled technologies without the need to retrofit. The fact is that it remains commercially and financially unviable for colocation operators to re-engineer legacy facilities for liquid-cooled HPC.