Credit: Cesar Carlevarino Aragon (via Unsplash)

The DevOps trends and tools putting a renewed focus on maintenance and care in tech

This post is sponsored by 10KMedia.

After 12 months of uncertainty and crisis, resilience and stability have become more urgent and in-demand than ever. For a software industry that continues to be the focus of scepticism and criticism, this is particularly pointed. On the one hand the move fast and break things ethos has had a hand in dismantling the surety of the past, but equally it has been instrumental in ensuring continuity and connectivity at a time when our professional and personal lives were turned upside down.

This isn’t to say that we’ve suddenly entered a new era. The increased prominence of roles like SRE (Site Reliability Engineer) in the last few years and the demand to shift security left and bring security into DevOps practice and thinking demonstrate that things have been changing for a while. However, if things were moving in that direction, 2020 has emphasised that the industry needs to go further.

There was little good to come out of 2020, but if we can salvage anything at all, it’s that we should now recognise the importance of continuity and security over unfettered productivity and thoughtless innovation.

It won’t be easy for the tech industry to embrace this, however valuable this change in attitude might be. The power of the biggest corporations in the industry and the funding model that sustains the wider tech ecosystem are still, of course, set up for growth. The innovation habit will die hard.

However, if we look closely at the part of the industry where reliability and resilience are being encouraged – indeed, turned into products and tools – it’s possible to see that there are a few components which are integral in making this evolution possible.

These things are worth looking at – even if you’re not a DevOps or infrastructure engineer, indeed, even if you’re not an engineer at all – because they serve to illustrate how we can reassert the importance of maintenance and care in the work we do. In both our digital and real worlds, that’s what’s going to be crucial in 2021.

Finding out what’s happening inside complex systems

Complexity introduces unpredictability. If we don’t understand how or why something is happening, it’s going to surprise us. And, to make things worse, it’s going to be impossible to fix. Thus there’s a direct line from complexity to failure and crisis. It makes maintenance almost impossible.

One of the challenges of complexity, moreover, is that not only can it sometimes feel impenetrable, it’s also surprisingly opinionated – it forces you to look at some things while making other things invisible. (This has, as many of us will probably know, been one of the dilemmas facing armchair epidemiologists over the past 12 months.)

Nowhere is this better demonstrated than in software. The promise of transparency is at the heart of concepts like monitoring and observability, with many SaaS products liberally using those terms. However, achieving that in a way that’s meaningful and impactful isn’t straightforward: many products limit what you’re able to actually see by the very nature of the structures that enable data collection and storage.

This is something that Honeycomb, one of the key organizations promoting observability, understands well: the company is eager to assert the importance of “high cardinality” data. As Co-Founder and CTO Charity Majors wrote back in 2017:

“All of the most useful fields are usually high-cardinality fields, because they do the best job of uniquely identifying your requests. Consider: uuid, app name, group name, shopping cart id, unique request id, build id… and yet you can’t group by them in typical time series databases or metrics stores. Grouping is typically done using tags, which have a hard upper limit on them.”
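To make the distinction concrete, here’s a minimal Python sketch – purely illustrative, with invented event fields rather than Honeycomb’s actual data model – of why raw events support high-cardinality questions that pre-aggregated metrics can’t answer:

```python
from collections import Counter

# Each event keeps its full, high-cardinality context.
events = [
    {"request_id": "req-1", "user_id": "u-42", "latency_ms": 120},
    {"request_id": "req-2", "user_id": "u-42", "latency_ms": 950},
    {"request_id": "req-3", "user_id": "u-7",  "latency_ms": 80},
]

# With raw events we can group by any field -- even one with
# millions of unique values, like user_id or request_id.
slow_by_user = Counter(
    e["user_id"] for e in events if e["latency_ms"] > 500
)
print(slow_by_user)  # Counter({'u-42': 1})

# A typical metrics store pre-aggregates into a fixed set of tags,
# discarding the identifiers needed to answer "which user was slow?"
metrics = {("service:checkout",): {"p99_latency_ms": 950}}
```

The event-based version can answer “which user saw the slow requests?”; the tag-based one can only report that someone did.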

OpenTelemetry and Lightstep

OpenTelemetry has a similar aim. The open source project provides tools to help engineers decouple their data from the systems in which it is stored.

Observability startup Lightstep contributes to the OpenTelemetry project and builds on its core mission. Co-Founder and Chief Architect Spoons emphasises the importance of working on this problem, explaining that although “gathering more data can help make better decisions… just gathering ‘all of the data’ doesn’t work”.

There are two reasons for this. “First of all,” Spoons continues, “it’s way too expensive from a cost standpoint. And second of all, the data is too large for human beings to analyze without serious assistance from an analytical engine.”

In other words, it’s too much complexity to handle.

This isn’t just a theoretical problem. There are clear practical applications to this approach. In one scenario Spoons provides “developers would use Lightstep to compare metrics for their service… before and after [a] release. If a metric like latency goes up, they can quickly roll the release back. If this is done as part of a canary or blue-green roll-out, that roll-back can be done before the majority of users are affected.”
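The canary comparison Spoons describes can be sketched in a few lines of Python. This is a hypothetical decision helper, not Lightstep’s API; the metric and threshold are invented for illustration:

```python
from statistics import median

def should_roll_back(baseline_ms, canary_ms, tolerance=1.2):
    """Roll back if the canary's median latency exceeds the
    baseline's median by more than the tolerance factor."""
    return median(canary_ms) > median(baseline_ms) * tolerance

baseline = [100, 110, 105, 98, 102]   # latency samples pre-release
canary   = [180, 210, 195, 205, 190]  # samples from the canary

if should_roll_back(baseline, canary):
    print("latency regression detected: rolling back canary")
```

Because the comparison happens while the release is still confined to the canary slice, the roll-back can fire before most users ever see the regression.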

Without wishing to perpetuate the endless hype cycle, the next phase of observability could well be understandability. This is a nascent but intriguing attempt to go one step further, placing an emphasis on developers understanding the behaviour of their code line by line. Rookout, a platform that aims to help engineers debug applications more effectively to ensure resilience and reliability, is particularly eager to get the industry thinking about understandability.

“Borrowed from the financial industry, understandability requires that a system’s information should be given or presented in such a manner that it can be easily comprehended by whoever is receiving that information. Translating this to the world of software development, understandability means that a developer creating an application should be able to easily find or receive data from that application in order to understand its behavior.”

Democratizing knowledge and empowering engineers

Understandability, then, extends observability to place an even greater emphasis on enabling and empowering developers.

Rookout makes this clear in its list of key principles for understandability. It must be, the company explains, organized (which means “the developer working on the system should be able to easily locate cross-referenced information within the system”), complete, clear (“Developers should make every effort possible to write simple code that is easy to understand”), and concise (“The developer reading the system source code shouldn’t feel as if they are buried under an extreme amount of detail”).

Lightstep similarly views itself as a company aiming to help developers take greater ownership over their code. Spoons says “for an engineer that gets paged for (say) a high error rate, correlating those errors with other aspects of the application – for example, a misconfigured database or an overloaded host – can help them mitigate the problem quickly.”

It’s easy to miss the significance of this, but I would suggest that the benefits are profound: it opens out knowledge, giving all developers access to aspects of a system so they can each decide how to act in a more informed manner.

This is perhaps one of the most important aspects of observability. Specialisms and internal hierarchies should, in theory at least, become less important. Their negative effects are minimized.

That’s good for engineers and it ensures that the work of maintaining software and ensuring its reliability can be done more easily. It isn’t, after all, data that helps you make better decisions: it’s context.

Incident response

The need to democratize knowledge and open up transparency within and across teams isn’t, of course, a political gesture. It’s really an economic one, necessitated by the emergence of the on-call engineer who must be available to respond to incidents at any time.

The success of the incident response platform and tool market – to some extent a relative of the monitoring and observability market – further underlines the importance of instant responses. As copy on incident response platform PagerDuty’s website puts it:

“Now more than ever, your digital services must run perfectly. Our real-time operations platform ensures less downtime and fewer outages, meaning happier customers and more productive teams.”

It’s worth noting that although the emphasis is on users and keeping things going, PagerDuty still acknowledges the importance of the humans who are actually doing the work. Selling to a different audience, they might even have tweaked the final part: “more productive customers and happier teams.”

This underlines something important about this part of the software industry: while the obsession with automation and efficiency continues apace, only humans can properly and effectively respond to an incident. Yes, they require context and full transparency on a given problem, but they also bring their own contextual understanding, and their ability to judge a situation and how to act. This is, perhaps, the care that’s required to properly maintain complex distributed software.

Automation that works for humans

But if maintenance requires human actors to judge and assess a situation, this doesn’t mean that automation can’t play a part. In fact, the DevOps tool market demonstrates the way in which automation can be used to augment human work.

This is something demonstrated by incident response product Kintaba. It hasn’t been developed with a view to simply removing human decision making, but instead it has been built with a view to automating the things that can be automated. John Egan, Co-Founder and CEO, explains that “Kintaba’s automation engine is focused on helping human responders be more effective by triggering repeatable workflows at opportune moments during the response process.” In short, it’s built to help teams focus on what really matters.

But as well as automating what it calls ‘response processes’, it also aims, like Lightstep, to help open up knowledge. “Automations can also do things like call an external webhook, update the owner and subscribers, and can even post comments in the activity log that provide custom instruction or advice to the response team based on how the incident evolves,” Egan says.
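The pattern Egan describes – repeatable workflows fired at opportune moments in the incident lifecycle – resembles a small rule engine. Here’s a sketch in Python; the event names and actions are invented for illustration and aren’t Kintaba’s API:

```python
# Map incident lifecycle events to automated follow-up actions.
AUTOMATIONS = {
    "incident.opened":   ["page_on_call", "create_chat_channel"],
    "severity.raised":   ["notify_leadership"],
    "incident.resolved": ["schedule_postmortem"],
}

def run_automations(event, log):
    """Trigger every action registered for this lifecycle event,
    appending a note to the incident's activity log so responders
    can see what was done on their behalf."""
    for action in AUTOMATIONS.get(event, []):
        log.append(f"automation: {action} triggered by {event}")
    return log

activity_log = []
run_automations("incident.opened", activity_log)
```

The humans still decide how to resolve the incident; the engine just removes the rote steps around them.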

Given the debates about artificial intelligence over the last couple of years, it would seem that this corner of the industry has an acute sense of how automation technologies can be used effectively – not to replace humans, but instead to allow them to do more of what they’re good at.

Gaining control: predictability and security

Transparency and democratization are both in the service of retaining control over complex systems. They’re about returning some degree of agency to the engineers and architects that are responsible for developing and maintaining them.

The evolution of chaos engineering is good evidence for this. What began as a necessary hack at Netflix to try to get to grips with a sudden and dramatic shift in complexity (thanks to a move to AWS) has become something more considered and controlled.

This is exemplified by chaos engineering platform Gremlin. Gremlin explores this evolution in its recent State of Chaos Engineering report. The company has consistently argued that the discipline isn’t about wanton destruction, but instead creating hypotheses and measuring impact as if you were running a science experiment. This is reflected in the functionality of the platform: users can specifically manage what the company calls “the blast radius” of experiments in order to run them in a safe and controlled manner.
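The “blast radius” idea can be illustrated with a short sketch (a hypothetical helper, not Gremlin’s actual functionality): pick a small, controlled slice of the fleet to experiment on, so a disproved hypothesis only affects that slice:

```python
import random

def pick_blast_radius(hosts, fraction, seed=0):
    """Select a controlled subset of hosts for a chaos experiment.
    A seeded RNG keeps the selection reproducible, and the fraction
    caps how much of the fleet the experiment can touch."""
    k = max(1, int(len(hosts) * fraction))
    return random.Random(seed).sample(hosts, k)

fleet = [f"host-{i}" for i in range(100)]
targets = pick_blast_radius(fleet, fraction=0.05)
print(len(targets))  # 5 hosts form the experiment's blast radius
```

If the hypothesis holds at 5% of the fleet, the experiment can be widened; if it fails, 95% of hosts were never at risk – that’s the science-experiment discipline Gremlin argues for.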

According to Gremlin, injecting some chaos into your systems is an act informed by curiosity. In turn it is an integral step in maintaining and securing complex systems.

Gremlin isn’t alone in this attitude. You can see an emphasis on control in the way that fellow chaos engineering company ChaosIQ presents its own offering: “Prove and Improve the Reliability of your systems from one Simple, Affordable, Integrated and Customizable Toolkit.” Implicit here is that a balance is being struck between the complexity of the systems and problems we’re dealing with and the platform that makes tackling and understanding it possible.

Platform usability isn’t just a cosmetic thing. Its simplicity is what ensures accessibility; it’s what makes knowledge-sharing possible. Is transparency really transparency if it’s only visible to one person? Do we really have a handle on predictability if it’s just one person making a judgement call?

Securing complexity

This control is particularly important in the context of security. If DevOps had in the past focussed on speed and agility, today the emergence of DevSecOps is a reminder that we can’t just approach software engineering with a view to deliver early and deliver often: we need to also ensure that what we’re delivering is reliable and secure.

Given all that we’ve already said about the complexity of much modern software this isn’t easy. But with high profile attacks such as the SolarWinds hack – which specifically targeted the build process – it’s today clear just how important it is for engineering teams to shift security left.

“Shifting any kind of security left means bringing it earlier in development workflows and bringing it closer to developers’ work.” says Idan Tendler, CEO of cloud security startup Bridgecrew. “In this way, you can prevent security errors from being deployed rather than react to them after the fact which, ultimately, has a positive impact on cloud security posture.”

Bridgecrew is another example of a company using automation intelligently. This maybe isn’t that surprising in the context of cloud security: as Tendler notes “the number 1 cause of data breaches is misconfigurations.” The ability to remove the opportunity for human error and allow humans to focus on other types of work – work that requires more judgement and analysis – is extremely valuable.
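As a toy illustration of the kind of check such tools automate (the resource fields here are invented, not Bridgecrew’s rules), consider scanning declared cloud resources for publicly accessible storage buckets before anything is deployed:

```python
def find_public_buckets(resources):
    """Flag storage buckets whose declared configuration allows
    public access -- the classic misconfiguration behind breaches."""
    return [
        r["name"] for r in resources
        if r.get("type") == "bucket" and r.get("public_access", False)
    ]

resources = [
    {"type": "bucket", "name": "logs",    "public_access": False},
    {"type": "bucket", "name": "uploads", "public_access": True},
]
print(find_public_buckets(resources))  # ['uploads']
```

Running a check like this in the build pipeline is what “shifting left” means in practice: the misconfiguration is caught as declared code, before it ever becomes a live, exploitable resource.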

To a certain extent, the approach made possible by tools like Bridgecrew is a natural companion to chaos engineering. Pushing security upstream will necessarily give you more visibility over your systems, making it easier to predict how different things will react in different situations; chaos engineering allows you to double down on that, venturing further into the realm of possibilities that could be out there. Sure, it might be impossible to capture every eventuality, but you can at least be proactive in doing all you can to ensure both security and reliability.

Finding new stories in the things we build and how we build them

The stories we typically tell about Silicon Valley and the software industry are usually ones of restless growth and productivity. While the state of the world might have tempered our collective optimism, those stories won’t end. Indeed, the trends and companies mentioned above are all contributing to that – that’s one of the main reasons they’re valuable and successful.

But it’s a shame, I think, to ignore this other facet of the industry. This is one in which we adopt a more deliberate approach to ensuring resilience and reliability. It’s one in which the work of maintaining systems is treated seriously and as a collective effort – not rooted in the genius of rockstar or 10x engineers, but instead in the collaborative decision-making of teams that all have a stake in the things they’re building.