Designing software for efficient IT operations

Rapid development and deployment of applications is not enough to achieve digitisation – IT operations needs to be re-engineered too

Is digital transformation purely a question of speed? It would be easy to think so, given that transformation is commonly associated with a shift to more rapid deployment of software, features and services.

The sooner you get these into production, the theory goes, the sooner you can start delivering value. And it follows that the sooner you can remove any blockers to deployment, the sooner developers and designers can start weaving their magic, accelerating new businesses, or reinvigorating older, lumbering organisations.

This drive towards continuous, or at least rapid, deployment is usually bracketed together with a move to the cloud, or at least a cloud-like architecture, with a large amount of automation. It can seem that the best thing operations people can do is simply get out of the way.

But while this might seem a well-trodden path, as HashiCorp CEO Dave McJannet pointed out at the supplier’s HashiConf event in September, few have reached the end of it. Cloud might be driving modernisation, he says, but while some people are doing this really well, most are not, and some are doing it very poorly.

“What we’ve learned over the course of seven or eight years is how people do it successfully is incredibly consistent,” he told journalists and analysts. “I don’t particularly love the triptych of people, process and tools, but it turns out, that’s what has to happen.” Changing one or two elements is not enough. All three must be considered.

Needless to say, McJannet advocates standardisation around infrastructure as code as part of the recipe for success, ideally in the shape of HashiCorp’s Terraform tooling. But, he said: “The second thing that is required for it all to work … is you need a particular organisational structure.”

The failure to grasp this, according to McJannet, explains why cloud adoption in the traditional commercial world has been as slow as it has been: “All the digital natives of the world ... use lots of cloud. But what percentage of [companies like] Wells Fargo are in the cloud? What percentage of Goldman Sachs are in the cloud? Almost nothing.”

Being able to “operationalise” going to the cloud means a switch from the – usually open source-powered – free-for-all typically seen when organisations begin their journey, to a cloud product or platform engineering-type approach. “Any business group that’s successfully doing cloud delivery is organised this way,” he said.

That opens the way for the third element, process, which ultimately determines what gets deployed and how quickly. As McJannet says, deployments can stretch out because those driving transformation find they still face the same constraints, such as what security and networking teams require before they sign off on a change. Those policies might make perfect sense, but having them enforced through semi-manual, ticket-based systems doesn’t make for rapid deployments.

If there are 29 policies that have to be met before infrastructure can be provisioned, the answer is to turn them into codified rules that are run every single time something is provisioned. “Maybe the architectural sign-off I can’t do as a codified rule, but maybe 25 of those 29 rules I can actually codify,” said McJannet.
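As a rough illustration of what “codified rules that are run every single time” could look like, here is a minimal sketch in Python. It is not Terraform, Sentinel or any HashiCorp product: the rule names, the shape of the request and the `provision` function are all hypothetical, invented purely to show the idea of checking every provisioning request against the same set of encoded policies.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Policy:
    name: str
    check: Callable[[dict], bool]  # returns True when the request complies


# Hypothetical rules standing in for the policies that can be codified.
POLICIES = [
    Policy("storage-encrypted", lambda r: r.get("encrypted") is True),
    Policy("no-public-ingress",
           lambda r: "0.0.0.0/0" not in r.get("ingress_cidrs", [])),
    Policy("owner-tag-present",
           lambda r: bool(r.get("tags", {}).get("owner"))),
]


def evaluate(request: dict, policies=POLICIES) -> list[str]:
    """Return the name of every policy the request violates."""
    return [p.name for p in policies if not p.check(request)]


def provision(request: dict) -> str:
    """Run every codified rule before anything is provisioned."""
    violations = evaluate(request)
    if violations:
        return "rejected: " + ", ".join(violations)
    return "approved"
```

A request that meets all three rules comes back as "approved", while one that breaks any of them is rejected with the list of failed rules. Checks that cannot be reduced to code, such as the architectural sign-off McJannet mentions, would still sit outside a gate like this.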

Having the right operational processes – whether traditional or highly automated as code – is still critical to achieving a successful transformation that results in quicker delivery.

Is rapid deployment the only aim of transformation?

Cyber security advocate and author Glen Wilson says it’s important to be clear about the objective. Is deploying more and more quickly really an end in itself? Taking security as a “sub-component” of both quality and performance, for example, he says: “When you see this plethora of tooling, with very little oversight, then you end up with a drop in quality, a drop in performance, in terms of the security.” He suggests that “diffusion of innovation is more important than speed of innovation”.

It’s important to create an environment for experimentation – securely and efficiently, of course – and give the teams the right degree of autonomy to enable this. This requires the organisation as a whole to have a cohesive view of its goals, and about the tools it is going to use to achieve them. “So, teams are able to choose their own products, tools, whatever technologies, as long as they stay within the framework that’s given by the organisation.”

This sounds something like the platform engineering approach advocated by McJannet. Wilson adds that a security specialist might need to sit in on multiple teams, being a team player on each. Which of course sounds eerily like the T-shaped, or even comb-shaped, individuals beloved of ITIL.

Whatever the (perceived) operational blockers to innovation and transformation, when it comes to change, it’s essential to understand the root of the problem before suggesting a meaningful solution that improves rather than simply replaces a process. 

Rob Reid, technology evangelist at Cockroach Labs, says: “If the process is understood by one person in the organisation, a technical solution will only exacerbate the single point of failure unless it scales to more people. Process familiarity is always a blocker to adoption.”
 
It’s also worth remembering that individuals might – albeit subconsciously – resist efforts to abstract and encode or improve the tasks they carry out. And this will have an impact on how likely they are to adopt any proposed solution.

“If a process is so well understood as to be second nature to those performing it, it’s our job as technologists to listen sympathetically to those performing it. That way, we’ll understand not only the problem, but how to solve it in a way that will improve the day-to-day of those using the solution,” he argues.

“We [also] have to recognise that our work (and existence) may be threatening to other employees in our organisation. We should therefore work collaboratively to improve the lives of our colleagues and help them to deliver value.”

But there’s a danger in drilling too far down and getting obsessed with individual processes. Each process is part of a bigger whole, and tinkering with one or two in isolation might have unexpected consequences for the wider organisation.

What about transforming ops?

Jon Collins, vice-president of engagement at GigaOm, and a former chief technology officer (CTO), says platform engineering certainly represents a maturing of DevOps, which was focused on speeding up the delivery of software and services, and indeed, making it continuous, without necessarily thinking about the underlying infrastructure. Platform engineering recognises that developers need to understand the underlying infrastructure and to architect for it in advance.

But, he continues: “The problem with digital transformation is ultimately that it directly affects operational models. It’s not as straightforward as saying, ‘Hey, we just need to change’.”

Cloud-native approaches may be sufficient for single applications, Collins argues, and even multiple applications done in a similar way. “They’re not sufficient for a massive complex infrastructure, with stuff that’s been around since the 1960s, stuff that’s been around since the 1990s, stuff that’s been around since last week but was built wrong.”

Policy as code is useful, he says, but isn’t enough on its own to address the problems operations staff face around lack of resources and time, skills deficits, and all the other challenges these teams face. Operations teams need advances in visualisation and observability, insight and automation, he says, if they are to make real progress.

“If your operational processes are inefficient or faulty, you can automate inefficiency and end up with inefficient automation. You’re just exacerbating the problem or putting a sticking plaster on it,” says Collins.

The trick is having the correct operational processes and policies for the entire organisation in the first place. Assuming they are correct, they can be encoded – but it also helps to have developers and designers taking them into account.

Having developers “becoming their own janitors” could impose an unnecessary overhead on innovation, says Collins, but “I would expect them to design for operations, to actually have an understanding, a familiarity with what operations is going to go through, and that’s just good training”.

A bit of self-examination on ops’ part might not go amiss, either. Operations people tend to “over-processify things”, he says. “Operational excellence isn’t about trying to achieve nirvana. It’s about having operations centricity in what you do.”

That means the operations function must be clear what it is trying to achieve, and how it can contribute to and enable transformation – whether that comes in the form of faster deployments into production, better quality, or more secure software.

Does that mean there will ultimately be no need for “ops”? Of course not. After all, as Collins says, fires need to be fought. Things go wrong. Change is constant. But if digital products are “designed for operations”, that’s going to free up time for all those other, essential, tasks. And probably ensure they are deployed more rapidly, too.
