Nats highlights limits of automation and backup shortcomings
The August bank holiday disruption caused by an air traffic control technical issue at Nats offers a number of immediate takeaways for CIOs to ponder.
There are reports that incorrect data in a flight plan submitted by a French airline may have triggered the Nats outage. Martin Rolfe, CEO of Nats, said: “Initial investigations into the problem show it relates to some of the flight data we received.”
The first question that must be answered is how invalid data could cause such a catastrophic failure that both the main and backup systems were affected. Data integrity is, arguably, the most important component of modern applications. Without the flow of accurate information, data-driven applications cannot function coherently.

Nevertheless, if some input data causes a malfunction, one would hope the backup system can quickly be called upon to restore the system to a stable state. It should be possible to identify the malformed data that corrupted the primary system fairly quickly, and this can then be corrected. Clearly, in systems with a high throughput of data, there is going to be a delay in unravelling the transactions that occurred after the invalid or malformed data was submitted. But this is why we keep backups and live mirrors: to ensure system integrity can be restored as quickly as possible.
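To make the point concrete, the sketch below shows, in deliberately simplified Python, the kind of boundary validation one would hope for: a single malformed flight plan is rejected and quarantined at submission time, rather than being allowed to propagate into the primary system and its mirror. The fields and rules here are hypothetical and purely illustrative; real flight plan messages are far richer, and Nats has not published the exact nature of the fault.

```python
from dataclasses import dataclass


@dataclass
class FlightPlan:
    # Hypothetical fields for illustration only; real flight plan
    # messages carry far more structure than this.
    callsign: str
    departure: str
    destination: str
    waypoints: list[str]


class RejectedFlightPlan(Exception):
    """Raised when a submitted plan fails validation at the boundary."""


def validate_flight_plan(plan: FlightPlan) -> FlightPlan:
    # Reject malformed input at the edge so it never reaches the
    # processing pipeline -- and never reaches the mirrored backup.
    if not plan.callsign or not plan.callsign.isalnum():
        raise RejectedFlightPlan(f"invalid callsign: {plan.callsign!r}")
    for code in (plan.departure, plan.destination):
        if len(code) != 4 or not code.isalpha():
            raise RejectedFlightPlan(f"invalid aerodrome code: {code!r}")
    if not plan.waypoints:
        raise RejectedFlightPlan("flight plan has no route waypoints")
    return plan


def submit(plan: FlightPlan, queue: list[FlightPlan]) -> bool:
    # A single bad message is quarantined and reported back to the
    # submitter; the rest of the system keeps running.
    try:
        queue.append(validate_flight_plan(plan))
        return True
    except RejectedFlightPlan as reason:
        print(f"rejected plan {plan.callsign!r}: {reason}")
        return False
```

The design choice being illustrated is that validation failures are rejected individually and loudly at the point of entry, rather than being passed downstream where a single bad record can poison shared state.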
Although the technical issues affecting Nats’ flight planning system were identified and remedied within a few hours, the resulting delays caused major travel disruption across airlines.
Reduced capacity when automatic processing switched off
The process Nats put in place was effectively a fail-safe, which maintained the integrity of the system and enabled air traffic control to operate, albeit at a reduced capacity. “Our systems, both primary and the back-ups, responded by suspending automatic processing to ensure that no incorrect safety-related information could be presented to an air traffic controller or impact the rest of the air traffic system,” Rolfe said.
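The pattern Rolfe describes is, in effect, “fail safe, not fail silent”: when the system encounters data it cannot interpret with confidence, it stops producing automated output rather than risk presenting something incorrect to a controller. A very rough sketch of that behaviour in Python, which makes no claim to reflect Nats’ actual architecture, might look like this:

```python
import enum
import logging


class Mode(enum.Enum):
    AUTOMATIC = "automatic"
    MANUAL = "manual"  # degraded fail-safe: no automated output is presented


class FlightDataProcessor:
    """Illustrative fail-safe wrapper, not Nats' actual design."""

    def __init__(self) -> None:
        self.mode = Mode.AUTOMATIC
        self.log = logging.getLogger("fdp")

    def process(self, message: dict) -> dict | None:
        if self.mode is Mode.MANUAL:
            return None  # controllers work manually; nothing automated is shown

        try:
            return self._interpret(message)
        except Exception:
            # Fail safe, not fail silent: if the message cannot be
            # interpreted with confidence, suspend automatic processing
            # rather than risk presenting incorrect safety data.
            self.log.exception(
                "unprocessable flight data; suspending automatic processing"
            )
            self.mode = Mode.MANUAL
            return None

    def _interpret(self, message: dict) -> dict:
        # Placeholder for the real interpretation logic.
        return {"callsign": message["callsign"], "route": message["route"]}
```

The trade-off is plain in the sketch: safety is preserved, but once the processor drops into manual mode, capacity falls to whatever humans can handle unaided.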
This is the second aspect of the Nats incident that CIOs and IT leaders should consider when resourcing their own business continuity plans. Yes, we should always have a means of falling back to a fail-safe. But if the processes involved in keeping the system running after a failure cause this level of disruption, is it really working?
One cannot expect a fail-safe to operate with the same level of efficiency as the primary system, but stakeholders must fully investigate the worst-case scenario when the fail-safe is deployed in order to assess what level of service degradation is acceptable.
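A simple back-of-envelope model makes the point. The figures below are entirely hypothetical and are not Nats’ real throughput numbers, but they show how quickly a backlog builds when a fail-safe clears only a fraction of normal demand, and why a fault fixed within hours can disrupt travel for far longer:

```python
def backlog_after(hours: float, demand_per_hour: float, failsafe_per_hour: float) -> float:
    """Flight plans left unprocessed after running on the fail-safe for `hours`."""
    return max(0.0, (demand_per_hour - failsafe_per_hour) * hours)


# Hypothetical figures purely for illustration. If automatic processing
# normally clears 800 plans an hour but manual handling manages 80, a
# four-hour outage leaves 2,880 plans to work through after recovery.
print(backlog_after(hours=4, demand_per_hour=800, failsafe_per_hour=80))  # 2880.0
```

Running that kind of calculation against realistic demand figures, before an incident, is what turns “we have a fail-safe” into an evidenced view of how much degradation the business can actually tolerate.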