
NHS Wales IT outage: What went wrong with its datacentres?

A country-wide outage of several core NHS Wales IT systems has prompted questions about the organisation’s datacentre failover procedures

A networking outage knocked two NHS Wales datacentres offline on Wednesday 24 January, preventing healthcare workers across the country from accessing patient data and core IT systems.

According to the BBC, healthcare professionals working for NHS Wales were unable to access multiple IT systems for several hours, including those used to book patient appointments, retrieve test results, and log notes taken during consultations.

Email and internet access are also thought to have been affected, along with the systems used by NHS Wales to access pharmaceutical information and administer drugs.

In a brief statement on its website, the NHS Wales Informatics Service (NWIS), which oversees the delivery of IT systems for health and social care organisations across the country, attributed the problems to network issues at two of its datacentres.

“Both NHS Wales national datacentres are now back online, following an earlier networking outage. All clinical systems are now available,” the statement said.

“NWIS will continue to monitor the situation and work with our equipment suppliers to investigate the root cause. We appreciate that this will have caused disruption to our service users and we apologise for any inconvenience caused.” 

Computer Weekly contacted NWIS for further guidance on the steps the organisation is taking to prevent a repeat of the reported problems, but had not received a response at the time of publication.

The facilities are about 30 miles apart, with one located in Blaenavon, Pontypool, and the other in Cardiff Bay. Collectively, they are home to the infrastructure used to deliver IT services to NHS Wales.

Guillaume Ayme, IT operations evangelist at big data analytics software supplier Splunk, raised concerns about the datacentres’ setup, given that running dual sites usually means that, in the event of an outage, one will fail over to the other.

“For the issue to be impacting two datacentres suggests it is severe, as one would normally be the backup for the other,” he said. “This may suggest there has been a problem in the failover procedure.

“Once the service is restored, it will be essential to find the root cause to avoid a potential repeat. This can be complex for organisations that do not have full visibility into the data generated by their IT environment.”
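In broad terms, the active-passive arrangement Ayme describes works along the lines of the sketch below: traffic is directed to the primary site while it answers health checks, and is redirected to the secondary only if the primary stops responding. NWIS has not published details of its own failover mechanism, so the hostnames, endpoints and thresholds here are purely hypothetical.

```python
# Minimal sketch of an active-passive failover check, for illustration only.
# The hostnames and health endpoints below are hypothetical and do not
# reflect the actual NWIS setup.
import urllib.request

PRIMARY = "https://dc-primary.example.nhs.wales/health"      # hypothetical
SECONDARY = "https://dc-secondary.example.nhs.wales/health"  # hypothetical


def is_healthy(url: str, timeout: float = 3.0) -> bool:
    """Return True if the site's health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False


def choose_active_site() -> str:
    """Route traffic to the primary site, falling back to the secondary."""
    if is_healthy(PRIMARY):
        return PRIMARY
    if is_healthy(SECONDARY):
        return SECONDARY
    # If both sites fail their checks, there is nothing left to fail over to.
    raise RuntimeError("Both datacentres unreachable: no failover target")


if __name__ == "__main__":
    print("Active site:", choose_active_site())
```

The final branch illustrates Ayme’s point: if a single fault affects both sites, or the switch-over itself misfires, there is no healthy target left to redirect users to.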


NHS Wales is known to have rationalised and upgraded its datacentre estate in recent years to improve efficiency and resiliency, closing a number of smaller facilities and server rooms, with its Blaenavon and Cardiff Bay sites taking up the slack.

The organisation has also moved to develop and roll out applications that run on a common, underlying infrastructure, known internally at NWIS as the National Architecture, to enable greater interoperability and data-sharing between various clinical IT systems.

Built using service-oriented architecture (SOA) principles, the National Architecture “enables information originally gathered in one user application to be reused in another”, states the 2017 NWIS annual review.

“It aims to provide each user with high-quality applications that support their daily tasks in the delivery of health and care services, while also ensuring that any relevant information created about the citizen is available safely and securely, wherever they present for care,” the document says.
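As a rough illustration of that principle, the sketch below shows information recorded once by one application being read back by another through a shared service, rather than each system holding its own copy. The service name, data fields and in-memory store are invented for the example and do not reflect the actual National Architecture interfaces.

```python
# Illustrative sketch of the SOA idea: one application records a result,
# and any other application retrieves it through a shared service rather
# than keeping a local copy. All names and fields here are hypothetical.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TestResult:
    patient_id: str
    test_name: str
    value: str


@dataclass
class ResultsService:
    """Shared service: the single place results are written to and read from."""
    _store: Dict[str, List[TestResult]] = field(default_factory=dict)

    def record(self, result: TestResult) -> None:
        # Called by the application that originally gathers the information,
        # e.g. a laboratory reporting system.
        self._store.setdefault(result.patient_id, []).append(result)

    def results_for(self, patient_id: str) -> List[TestResult]:
        # Called by any other application, such as a consultation notes
        # system, that needs to reuse the same information.
        return self._store.get(patient_id, [])


service = ResultsService()
service.record(TestResult("patient-123", "full blood count", "normal"))
print(service.results_for("patient-123"))
```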

The document credits the setup with breaking down boundaries between the various departments and organisations, in turn giving clinicians working within NHS Wales “a national view” of the health of the country and its citizens.

It also acknowledges the underlying complexity of the setup, which Dave Anderson, digital performance expert at application performance management software provider Dynatrace, suggested could be why the incident took as long as it did to resolve.

“While systems are now back up and running, the chaos it created shows why we need to move from hours to minutes to resolve problems like this,” said Anderson.

 “Ultimately, it comes down to our reliance on software and the need for it to work perfectly – and that’s difficult in IT environments that are getting more complex by the day.

“The challenge is that trying to find the root cause of the problem is like finding a needle in a haystack, and then understanding the impact and how to roll back from it is even more difficult.”
