Cloudflare apologises for major net outage

Error in a routine router configuration update by the network engineering team causes a major backbone outage, disrupting internet services in Europe, the US and Latin America

Network resilience has been a major issue since the lockdown orders of March, with operators pressed to maintain quality across complex networks and the importance of these infrastructures made plain. Against that backdrop, it has been revealed that a 27-minute outage in the backbone network of the Cloudflare global cloud platform on 17 July led to a 50% drop in traffic across the company's network.

The Cloudflare global cloud platform is designed to deliver a range of network services to businesses of all sizes around the world, and the company claims that the platform makes these services more secure while enhancing the performance and reliability of their critical internet properties.

Yet an outage occurred because, while working on an unrelated issue with a segment of the backbone from Newark to Chicago, Cloudflare’s network engineering team updated the configuration on a router in Atlanta to alleviate congestion.

This configuration contained an error that caused all traffic across the backbone to be sent to Atlanta, which “overwhelmed” the Atlanta router and caused Cloudflare network locations connected to the backbone to fail, according to the company.
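To illustrate how a single routing change can pull a whole backbone's traffic towards one site, here is a minimal sketch. The article does not describe the actual configuration, so the sketch assumes a simple rule in which each destination uses the route with the highest preference value. The site names come from the article; the destinations, preference values, demand figures and capacity are all invented for illustration.

```python
from collections import Counter

def route_traffic(routes, demand):
    """Return traffic carried per site when every destination uses its
    highest-preference route.

    routes: list of (destination, via_site, preference) tuples
    demand: dict mapping destination -> traffic units
    """
    best = {}
    for dest, site, pref in routes:
        # Keep only the most-preferred route seen so far for each destination.
        if dest not in best or pref > best[dest][1]:
            best[dest] = (site, pref)
    load = Counter()
    for dest, units in demand.items():
        site, _ = best[dest]
        load[site] += units
    return load

# Hypothetical traffic demand, in arbitrary units.
demand = {"dest-a": 40, "dest-b": 35, "dest-c": 25}

# Hypothetical healthy state: traffic is spread across three sites.
healthy = [
    ("dest-a", "Newark", 100),
    ("dest-b", "Chicago", 100),
    ("dest-c", "Atlanta", 100),
]

# Hypothetical faulty change: Atlanta now offers a more-preferred route
# to every destination, so it attracts all of the traffic.
faulty = healthy + [
    ("dest-a", "Atlanta", 200),
    ("dest-b", "Atlanta", 200),
]

ATLANTA_CAPACITY = 50  # invented figure, same arbitrary units as demand

before = route_traffic(healthy, demand)
after = route_traffic(faulty, demand)
print("Atlanta load before:", before["Atlanta"])
print("Atlanta load after:", after["Atlanta"])
print("Overloaded:", after["Atlanta"] > ATLANTA_CAPACITY)
```

In this toy model, the faulty change leaves Newark and Chicago carrying nothing while Atlanta's load exceeds its assumed capacity, which mirrors the "overwhelmed" router the company described.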

Because of the architecture of the backbone, this outage didn’t affect the entire Cloudflare network and was localised to certain geographies. The affected locations were San Jose, Dallas, Seattle, Los Angeles, Chicago, Washington, DC, Richmond, Newark, Atlanta, London, Amsterdam, Frankfurt, Paris, Stockholm, Moscow, St. Petersburg, São Paulo, Curitiba, and Porto Alegre.

The first issue occurred on the backbone link between Newark and Chicago, which led to backbone congestion between Atlanta and Washington, DC. In responding to that issue, a configuration change was made in Atlanta, which then started the outage.

Once the outage was understood, the Atlanta router was disabled and traffic began flowing normally again, 27 minutes after the outage began. The company then saw congestion at one of its core datacentres that processes logs and metrics, causing some logs to be dropped. During this period, the edge network continued to operate normally.

Cloudflare said its backbone, made up of a series of private lines between its datacentres that avoid the public internet, allows for a “higher quality of service”, as the private network can be used to avoid internet congestion points. Cloudflare added that, with the backbone, it has far greater control over where and how to route internet requests and traffic than the public internet provides.

Cloudflare has apologised for the outage and said it has already made a global change to the backbone configuration to prevent a recurrence.

Overall, networks in Europe and the US have coped well with the added loads and demands of the new normal. In a June 2020 analysis of the resilience of residential broadband networks in Europe’s major economies during lockdown, customer experience measurement firm MedUX found that, after a shaky start, the UK’s broadband infrastructure has coped well with the greatly increased demand since the early days of lockdown.
