Cern: Challenges of GPU datacentre management
Cern is a major user of Kubernetes. The container orchestration technology offers a way to democratise AI hardware
Earlier in March, Cern, the European organisation for nuclear research, was awarded the Cloud Native Computing Foundation (CNCF) Top End User Award during the KubeCon and CloudNativeCon event in Paris.
Cern is a major user of Kubernetes and has been looking into how graphics processing units (GPUs) can be managed effectively in on-premise environments.
GPUs have become the de facto standard for running artificial intelligence (AI) workloads, and CNCF used the Paris conference to launch a Cloud Native AI working group. Among the recent developments in cloud-native computing, the Kubernetes scheduler has evolved to integrate GPUs and support sharing them.
The ever-increasing performance of GPUs means people working at the Cern particle accelerator lab are considering the viability of running machine learning on commodity hardware powered by GPUs. Such systems are capable of replacing the custom hardware used in the accelerator’s detectors.
Addressing delegates at the event, Cern computing engineer Ricardo Rocha said: “I don’t know how many people are running on-premise infrastructure or just relying on external cloud providers, but the first challenge we have is that the pattern of usage of hardware is very different from traditional CPU [central processing unit] workloads.”
In his experience, datacentre power and cooling requirements increase dramatically when using GPUs. In fact, people requesting IT infrastructure to run these new workloads at Cern also need computing resources traditionally associated with high-performance computing (HPC), such as fast network interconnects like Infiniband to connect clusters of GPUs.
Rocha said the opportunity to use GPUs comes at a time when Cern is extending the life of hardware from five to eight years. “People want to have fancy new GPUs, but from our side, they’re extremely expensive,” he said. “We want to make them last longer, while people want to have a much faster turnaround because this is what the public cloud providers are giving them.” This means the IT team at Cern is tasked with offering the best of the internal infrastructure while being able to support more advanced use cases.
During his presentation, Rocha discussed the need to provide a platform to democratise AI and offer researchers the ability to access the GPU resources Cern has available.
He discussed the importance of understanding the different types of GPU workloads and patterns of usage. Some are interactive and typically require lower computational power and GPU usage, while others are much more predictable and run in batch mode. Rocha also said managing these predictable workloads borrows from HPC best practices, such as queueing and scheduling to make best use of the available IT resources.
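The split between interactive and batch work can be illustrated with a small queueing sketch. It is not Cern's scheduler: the GpuJob and GpuQueue names, the two priority classes and the rule that interactive requests are served ahead of batch jobs are all assumptions made for illustration.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count


@dataclass(order=True)
class GpuJob:
    # Hypothetical job record; only priority and seq take part in ordering.
    priority: int                     # lower value is served first
    seq: int                          # tie-breaker keeps submission order within a class
    name: str = field(compare=False)
    gpus: int = field(compare=False)  # whole GPUs requested


class GpuQueue:
    # Assumption: interactive sessions jump ahead of batch jobs.
    INTERACTIVE, BATCH = 0, 1

    def __init__(self, total_gpus):
        self.free = total_gpus
        self._heap = []
        self._seq = count()

    def submit(self, name, gpus, interactive=False):
        priority = self.INTERACTIVE if interactive else self.BATCH
        heapq.heappush(self._heap, GpuJob(priority, next(self._seq), name, gpus))

    def schedule(self):
        """Start queued jobs in priority order until the next one does not fit."""
        started = []
        while self._heap and self._heap[0].gpus <= self.free:
            job = heapq.heappop(self._heap)
            self.free -= job.gpus
            started.append(job.name)
        return started

    def release(self, gpus):
        """Return GPUs to the pool when a job finishes."""
        self.free += gpus


queue = GpuQueue(total_gpus=4)
queue.submit("train-a", gpus=3)
queue.submit("notebook", gpus=1, interactive=True)
queue.submit("train-b", gpus=2)
first_round = queue.schedule()   # ["notebook", "train-a"]
queue.release(3)                 # train-a finishes
second_round = queue.schedule()  # ["train-b"]
```

In this sketch the notebook session starts first even though it was submitted second, and the second batch job waits in the queue until GPUs are released. A production scheduler would also need to deal with fairness, preemption and GPU sharing, none of which is modelled here.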
“When you add GPUs [into the datacentre], the main lesson is to stay as flexible as possible in terms of the infrastructure you can support,” he said.
This means building the ability to run multiple clusters and hybrid workloads. “If you can get hold of GPUs, complement them by bursting into external resources,” said Rocha. “This is really important and is a design decision that has to be made at the start.”
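The bursting idea can be sketched as a simple placement rule: try on-premise clusters first and fall back to an external one only when local GPUs are exhausted. The Cluster class, the place function and the cluster names below are hypothetical and do not describe Cern's actual setup.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    # Hypothetical description of one cluster's spare capacity.
    name: str
    free_gpus: int
    external: bool = False  # True for a public cloud or other burst target


def place(gpus_needed, clusters):
    """Return the name of the cluster chosen for the job, or None if none fits.

    Assumption: on-premise clusters are always preferred over external ones.
    """
    # sorted() is stable, so clusters keep their listed order within each group.
    for cluster in sorted(clusters, key=lambda c: c.external):
        if cluster.free_gpus >= gpus_needed:
            cluster.free_gpus -= gpus_needed
            return cluster.name
    return None


clusters = [
    Cluster("cloud-burst", free_gpus=8, external=True),
    Cluster("onprem-a", free_gpus=2),
]
first = place(2, clusters)    # "onprem-a": local capacity is used first
second = place(1, clusters)   # "cloud-burst": on-premise GPUs are now exhausted
third = place(16, clusters)   # None: no cluster can fit the request
```

Keeping the placement decision separate from any single cluster is what makes it possible to add a burst target later, which matches Rocha's point that this has to be designed in from the start.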