Downtime deterrent: Trustpilot SRE on using infrastructure as code to prevent site outages
Online reviews site Trustpilot has a huge global readership, and here its site reliability engineering manager shares details of how his team help to ensure its services are always on when its users need them
With a user base that is not shy about telling companies what it thinks of how they operate, the technology team at business review website Trustpilot are only too aware that recurring and prolonged periods of downtime are unlikely to be tolerated.
Not by the consumers that rely on the site to guide their purchases, nor by the businesses that use the feedback within its published reviews to hone and improve the services they provide to the general public.
“Our end product is digital – it is a software-as-a-service (SaaS) product, so if we are down, our customers cannot use the site, so we have to be up at all times – that is crucial,” Morten Reinholdt Boelskifte, site reliability engineering (SRE) manager at Trustpilot, tells Computer Weekly.
According to the firm’s own user statistics, the site is home to 45 million user-generated reviews of 230,000 companies, which are read more than three billion times a month by consumers around the world.
“We are aiming for four nines of uptime and availability,” Boelskifte adds. “Do we hit it every single month? Most of the time, yes.”
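For context, "four nines" is shorthand for 99.99% availability. The short Python sketch below works out how much downtime that target leaves room for. It assumes a 30-day month, and the function name is purely illustrative rather than anything Trustpilot has described using.

```python
def downtime_budget_minutes(availability_percent: float, days: int = 30) -> float:
    """Return the minutes of downtime allowed over a period of `days` days."""
    total_minutes = days * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (1 - availability_percent / 100)


# Assumption: a 30-day month; real calendar months vary slightly.
four_nines_budget = downtime_budget_minutes(99.99)
print(f"{four_nines_budget:.2f} minutes per 30-day month")
```

On those assumptions, the target allows a little over four minutes of downtime a month, which gives a sense of how small the margin for error is.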
Preparing for failure
As is often the case when mitigating downtime risks, preparation is key, and an example of that is how Trustpilot’s technology teams ready themselves for two of their biggest user traffic-generating events of the year – Black Friday and Cyber Monday.
“What we build for is current [demand] times ten,” says Boelskifte. “That’s the kind of ballpark [our infrastructure] has to be able to support.
“Every time it comes around, we do preparations for that within the team, and with all the feature teams by ensuring the individual site features they are responsible for will be ready to cope with the demand expected when Black Friday hits.
“Sometimes it can be in terms of double-checking that their architectures are set up correctly, and it might involve load balancing and testing services to pinpoint potential weak spots.”
Trustpilot operates a cloud-based, microservices-based architecture that comprises 600 individual services that, in turn, control 300 or so functions across the site.
“We are multi-cloud, because we are in both Amazon Web Services [AWS] and the Google Cloud Platform [GCP],” says Boelskifte. “We have multiple regions and environments.
“We are predominantly in AWS, but most of our big data is stored in GCP and also our machine learning models, because they use our big data to run queries on. Most of the consumer-facing part of Trustpilot is in AWS.”
Each site feature is taken care of by a distinct team of software engineers, who all operate on a “full stack ownership” basis, and are collectively responsible for deploying production code changes about 200 times a week.
“These can be bug fixes but also feature requests, and we can rapidly run through the changes, so that if we do a deployment that doesn’t succeed or work the way we thought, we can quickly implement a fix,” he says.
Introducing infrastructure as code
With such a high throughput of changes undertaken within an infrastructure containing so many moving parts, Boelskifte says there are tried and tested procedures in place to minimise the risk of disruption resulting from rogue code making it into production.
“We also have an unwritten rule in place that we are not deploying on a Friday, because it could be more of a hit and run, because if something goes wrong, it could affect the site over the weekend too,” he says.
The SRE team has also embraced automation and the principle of infrastructure as code (IaC) to help simplify and streamline certain processes – so much so that some of its feature teams are also following suit.
This means it has shifted away from relying on manual processes to provision and manage the technology stacks underlying its services, and towards a more software-defined setup.
“When we started out, there was a bit of a learning curve, and the output of the team declined slightly,” says Boelskifte. “But with infrastructure as code, you can say the benefit you get is the sum of investment you add to it. So the more you invest in it, the more you can take back from it.”
As an example of the benefits IaC has brought to its SRE teams, Boelskifte cites the work that recently went into overhauling the firm’s escalation policies, which dictate how responsibility for resolving downtime incidents is assigned.
“We managed to do the entire change within seven lines of code and it took us about five minutes to write, whereas it would have taken about half a day or a full day to do in the past and troubleshoot because we are only human, and mistakes can be made,” he says.
The company is also an avid long-term user of PagerDuty’s SaaS-based incident monitoring platform, which notifies its feature teams as soon as any of the services they work on start to behave in a way that suggests something is wrong.
PagerDuty then sends out incident alerts, categorised according to how mission-critical they are, for the feature teams to respond to.
“We call them P1s, P2s and P3s, with the first of these being the absolute most critical to respond to, which is usually reserved for an incident where something is really blowing up,” says Boelskifte.
When an alert hits, PagerDuty passes it along to the relevant feature team to deal with. If there is no response within a set period of time, the alert is resent and also escalated to a member of the SRE team.
“The next step, if none of the feature team responds and the SRE does not either, is that the alert escalates to the vice-president,” he says. “And this is all set up within PagerDuty.”
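The escalation chain Boelskifte describes can be modelled in a few lines of Python. This is a minimal sketch, not Trustpilot's actual PagerDuty configuration: the order of the levels comes from his description, while the timeout values, class name and function name are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationLevel:
    name: str
    timeout_minutes: int  # hypothetical wait before the next level is alerted


# Order follows the article: feature team, then SRE, then vice-president.
# The timeout values are invented for illustration only.
POLICY = [
    EscalationLevel("feature team", 15),
    EscalationLevel("SRE on-call", 15),
    EscalationLevel("vice-president", 30),
]


def notified_so_far(policy, minutes_unacknowledged):
    """Return the names of every level alerted after the given wait."""
    notified = []
    elapsed = 0
    for level in policy:
        if minutes_unacknowledged < elapsed:
            break  # this level's turn has not come yet
        notified.append(level.name)
        elapsed += level.timeout_minutes
    return notified


print(notified_so_far(POLICY, 20))
```

With these made-up timeouts, an alert left unacknowledged for 20 minutes has reached both the feature team and the SRE on-call, but not yet the vice-president.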
The setup also allows people to tailor the alerts so they receive them via email, text message or phone call, or a mix of the three, according to their own preference.
“There is nothing more stressful than receiving an alert, so receiving it in a way that suits the user is quite powerful,” says Boelskifte.
Following in the footsteps
For organisations looking to follow in Trustpilot’s footsteps, with similarly high uptime expectations to meet, Boelskifte says a move towards IaC should be a top priority.
“For us, it was something my team were really keen to do and we made the decision together to go down this route,” he says. “It was not easy to roll out, because there is a learning curve as you get to grips with the language and how it affects the infrastructure.
“It depends on how you as an organisation choose to embrace it. You can try to do everything yourself, relying on a programming language like Python, or you can opt into an already established setup like Terraform’s infrastructure as code software.
“There will still be a learning curve in figuring out how the system reacts when you do x, y and z. Here we treat IaC as regular software, so you have to test that what you say is happening, actually does.”
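One way to picture the idea of testing that "what you say is happening, actually does" is a check that compares declared settings with what is actually running. The Python sketch below is a simplified illustration under stated assumptions: the setting names and values are hypothetical, and it does not reflect how Terraform or Trustpilot's own tooling works internally.

```python
def find_drift(declared: dict, observed: dict) -> dict:
    """Return {setting: (declared, observed)} for every setting that differs."""
    keys = declared.keys() | observed.keys()
    return {
        key: (declared.get(key), observed.get(key))
        for key in sorted(keys)
        if declared.get(key) != observed.get(key)
    }


# Hypothetical settings for one service; not Trustpilot's real configuration.
declared = {"instances": 4, "region": "eu-west-1", "alerts_enabled": True}
observed = {"instances": 4, "region": "eu-west-1", "alerts_enabled": False}

drift = find_drift(declared, observed)
print(drift)
```

An empty result would mean the running setup matches what the code declares. Here the check flags that alerting has been switched off even though the declared configuration says it should be on.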