Credit: Crawford Jolly (via Unsplash)

The future of observability: smarter troubleshooting with change intelligence

Sponsored by 10KMedia.

Trends like observability have underlined that one of the key challenges facing engineering teams in the last few years is simply understanding what’s going on inside their infrastructure and applications. The nature of distributed systems is such that it’s difficult to gain a single coherent perspective. That’s not just an academic issue – the difficulty of uncovering a single point of truth has important implications at an organisational level, where different people possess different levels of expertise and access to the necessary information and data.

But what makes things even harder is that complex systems are so complex precisely because they are constantly changing. The rise of cloud and microservices means modern applications can have countless dependencies that no single engineer can fully understand. In short, while observing systems is vital, it’s only one piece of the puzzle when it comes to modern software engineering.

Troubleshooting

In fact, a much larger part of the puzzle – and what really matters to engineers – is plain old troubleshooting. Troubleshooting isn’t a term that sounds particularly cutting-edge. Indeed, it’s a quotidian term that has long since embedded itself in our technical lexicon. But when placed in the context of complex distributed infrastructure, it requires a level of knowledge and expertise that’s practically superhuman.

“Given an alert,” Ben Ofiri tells me, “the poor on-call or DevOps team is forced to use different tools and to be an expert in all of those different components and different layers just to be efficient and just to be able to solve an issue….”

Ofiri is the CEO and Co-Founder of Komodor, a platform for troubleshooting Kubernetes. He’s got a unique view on troubleshooting and is eager to help engineers become more efficient at it.

The three pillars of troubleshooting

Ofiri and Komodor have their own model of troubleshooting. He tells me we should see it in terms of three pillars: understanding, management, and prevention. Understanding is about identifying the problem; management is about how you respond to and mitigate the issue in real time; and prevention is about learning from incidents to ensure the issue doesn’t repeat.

It’s a nice model. But the problem is that time and resources aren’t distributed equally amongst the three pillars.

“More than 80% of time and resources goes into simply trying to understand the problem,” he estimates. While that might not be that surprising given the complexity of modern distributed software, it does mean that other aspects of troubleshooting, like management, aren’t given the seriousness they deserve.

Ofiri notes that many organisations see automation as the solution to the management part of troubleshooting. However, he’s not convinced it will ever be completely successful. “I’m a developer… we know the real pain points… The biggest challenge is finding the right data and context for any given issue.”

This, ultimately, is what it comes down to: context. You can’t automate away problems that are incredibly context specific – however much we’d like to.


From metrics to context to action

What we need instead is a way of better understanding the context of all the data that modern monitoring and observability tools make available so engineers are able to act. It’s not enough to simply provide data – it needs to be backed up by additional information that makes it actionable.

“The thing that frustrates me about the space is that you still see a lot of the conversation around observability is focused on, quote unquote, metrics, logs and traces,” says Ben Sigelman, CEO and Co-Founder of Lightstep. “Those are not products. None of them are a product. Those are just types of data – period. And then you have to find some way to take that data and solve problems with it.”

Lightstep is, in its own words, “the DevOps observability platform.” Like Komodor, it’s part of a small but growing trend of companies working in what is sometimes called day 2 ops or change intelligence. These terms might sometimes seem more like sales and marketing devices than useful engineering concepts, but they nevertheless signal that something has changed – that we’re no longer just doing normal monitoring or observability.

This is something Ofiri is keen to emphasise. “About five years ago, [if] you had an alert – and you’re an on-call developer – you probably won’t ask yourself… what changed? … you know you’re doing changes once a quarter, you know that the test coverage is almost 100%, you have a QA department, you have dev, you have staging, you have quarterly releases – so if a customer is complaining, it’s probably not related to a recent change. Today when you have an alert, in 85% of the cases, the root cause is an internal change.”

Sigelman makes a similar point. “If there’s a serious incident happening, you can find tons of evidence that something bad is happening – that isn’t difficult. In fact, it’s often everywhere… The trouble is that it’s not organized and ranked in a way to help you figure out which of these many things that are broken are the ones that you want to focus on.”

“It’s not that difficult to monitor the system right now. It’s just really hard to understand the changes that your monitoring will reveal. And that’s what change intelligence is about.”

The limits of monitoring and observability

Change intelligence, then, could be seen as a way of bridging the gap between observability (as the word suggests, observing things happening in your system), and troubleshooting – acting to correct and solve those issues you’ve identified.

To a certain extent, this is the way the market is going – it’s not enough just to know what’s happening inside your systems, you need to be able to act as well. Datadog, for example, possibly one of the most successful monitoring platforms in the cloud ecosystem, has a number of features aimed at ‘automated incident management’ and ‘actionable alerting.’

“Every alert is specific, actionable, and contextual—even in large-scale and highly ephemeral environments—which helps minimize downtime and prevents alert fatigue,” reads the copy on the Datadog website. “And with native SLO and SLA tracking, you can prioritize and address the issues that matter most to your business.”

That’s compelling, but Sigelman is wary of tools that promise a comprehensive solution to observability. “Many vendors today have a marketing message of unifying observability, but when you get down to it, they have a tab in their product for metrics, a tab in their product for logs, a tab for tracing or APM or whatever they call it – which is completely backwards.”

Ofiri agrees. “The fact that every company has a monitoring solution, every company has a logging solution – obviously with all due respect to those tools – it’s kind of a commodity to have great monitoring or great logging.”

Now what?

While he concedes that “without monitoring you wouldn’t know you have high CPU – you wouldn’t know you have pods that are doing restarts,” he makes the point that there’s another question that will always follow: “now what? And what we know is that those tools are providing almost no value in the now what part. So now you, as a developer, need to put your detective hat on and start this quest of fetching these different pieces of information.”

Developer experience

I like Ofiri’s detective metaphor. It underlines a peculiar quality of the software engineer’s role that often gets overlooked – the fact that you’re often trying to uncover things and work out exactly what’s happened. This seems strange when it’s so easy to connect cutting-edge technology to pure, ever-increasing efficiency, but it highlights that there’s still a significant degree of labour in unpicking and unpacking how and why something happened inside a complex system.

This, he points out, is where Komodor can help. In the platform, “once you have an issue you already have a digested timeline of what happened” which helps to explain how it all “actually [affected the] specific issue you’re trying to troubleshoot… you don’t need to do all of this tedious and manual job of collecting all of the events.”
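Komodor’s actual implementation isn’t described here, but the core idea of a “digested timeline” can be sketched: given an alert timestamp and a stream of change events collected from different sources, keep only the changes in a window leading up to the alert and present them newest-first. A minimal illustrative sketch (all names, sources, and the two-hour window are hypothetical, not Komodor’s API):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class ChangeEvent:
    timestamp: datetime
    source: str      # hypothetical sources: "deploy", "config-map", "feature-flag"
    summary: str

def digest_timeline(events, alert_time, window=timedelta(hours=2)):
    """Keep only changes in the window before the alert, ordered newest first."""
    relevant = [e for e in events
                if alert_time - window <= e.timestamp <= alert_time]
    return sorted(relevant, key=lambda e: e.timestamp, reverse=True)

# Example events drawn from different tools an on-call developer would
# otherwise have to check one by one.
events = [
    ChangeEvent(datetime(2021, 6, 1, 9, 0), "deploy", "checkout-service v42 rolled out"),
    ChangeEvent(datetime(2021, 6, 1, 10, 30), "config-map", "payment timeout lowered to 2s"),
    ChangeEvent(datetime(2021, 5, 31, 18, 0), "feature-flag", "new pricing page enabled"),
]
alert = datetime(2021, 6, 1, 11, 0)
for e in digest_timeline(events, alert):
    print(f"{e.timestamp:%H:%M} [{e.source}] {e.summary}")
```

The point of the sketch is the shape of the workflow: the correlation and filtering happen once, centrally, instead of the developer manually fetching each event stream after every alert.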

Red herrings

Sigelman used a similar sleuth-related metaphor when talking about change intelligence: red herrings.

He says that if you’re a developer, you need to be confident that you know why a given troubleshooting path is “the path worth investigating, because when you’re on-call and dealing with an incident, you don’t want to be chasing a red herring.” “I think,” he continues, “that’s what we’re hoping to avoid with that sort of context. It’s like statistical evidence… that will help an operator understand that they’re not about to waste their time.”

Clearly, we’re in the realm of developer experience here. Both Komodor and Lightstep exist to improve the way developers work and the way they solve problems. And you could say that in recent years developer experience – particularly in this field – has actually got worse: more manual, more arduous, more complex.

Yes, that’s an interesting corrective footnote to the myth of technological progression, but it also signals that there’s a tension between the demands of business – for speed, agility, constant delivery – and the reality of keeping those systems up and running.

“Companies are trying to adopt CI/CD; companies are trying to adopt microservices architecture to allow this very fast movement, this very high velocity – and we think it’s great, it’s the right thing to do,” Ofiri says. “But it also comes with significant challenges and downsides that you can’t just ignore. You can’t just say we’ll move 1000 times faster and we’ll use the same tools and processes that we used five years ago; it just doesn’t work. We know that it doesn’t work because we are the developers getting all of the alerts.”

This is a crucial point. It underlines one of the most valuable things about change intelligence: the fact that it restores a level of balance between the relentless speed of change and innovation and engineers’ ability to manage and monitor it to ensure it’s always up and running.

Change intelligence aligns developers with business goals

While there’s this apparent tension between innovation and resilience, ultimately, of course, resilience has real business value. Downtime costs money. Indeed, forcing on-call developers to follow an intricate hunt for clues and evidence just to fix a problem – just to get a site back up – costs money. “This time is super sensitive and critical,” Ofiri says. “This time when you have an alert – it means that a customer isn’t satisfied or that you’re losing money as a business.”

It makes sense, then, to arm them with change intelligence capabilities rather than just expecting them to unlock a magic solution after glancing at some flashing dashboards.

From machines to features

Both Lightstep and Komodor – at least as their CEOs describe them to me – aren’t simply about solving things more quickly. They’re also about enabling software developers to embrace a more product-oriented mindset – one that recognises business needs and, indeed, end users.

Many existing monitoring tools, Ofiri says, force you to think in terms of machines – real or virtual. This isn’t, however, the way the world is today. “If you’re a developer you don’t think in terms of machines. You don’t think, oh, this machine’s broken; this machine is suffering from high CPU – you think about features, you think about customers, you think about releasing new products.”

The contextual information Komodor provides means engineers can continue to think about features – what the experience is like, what alerts actually mean from a user’s perspective. “We try to gather the information from all of the different tools – whether it’s Kubernetes, CI/CD, the code repository, or observability tools – and then we try to digest them and map them into the language or into the same processes that developers really think of when they’re approaching an issue.”

To a certain extent, by providing context, the platform reduces the amount of context-switching developers need to do (i.e. from system architecture to product). The importance of this can’t be overstated – it immediately allows developers to focus and adopt a more singular mindset. However much the industry has talked about the need for commercial awareness and product nous in the past, the reality is that for developers working on hard engineering problems, it’s just too much to ask of someone. With change intelligence, however, it might just become a little easier.

Complete system context

Sigelman talks about something similar when I ask him about a phrase I notice on the Lightstep website: complete system context. He makes the point that there’s an additional dimension to troubleshooting beyond metrics, logs, and traces.

He explains “you have lots of changes that aren’t just in the logs, metrics, and traces that are actually part of business processes that have been intentionally built out for compliance reasons… and actually having a very tight integration to that side of the house is an extremely important aspect of understanding changes in the context of the actual human beings that are operating systems.”

This mirrors what Ofiri was saying. As software developers working with complex systems, it doesn’t make sense to think only in terms of machines and low-level parts of a system. Our decision-making also needs to be connected to things like product features and business processes.

So, what is change intelligence?

Change intelligence is just a concept, and whether it takes off remains to be seen. But it highlights an aspect of software engineering that has been generally underserved in the last few years – something that matters to both developers and the organisations they work for.

“Like all new terms, sometimes it sticks, sometimes it doesn’t,” says Ofiri. “And our users are smarter than we are… they don’t fall [for] any gimmick… they care about real value. They care about if it really solves problems they are having on a daily basis. They care about if an issue took them one hour and with Komodor it can take three minutes.”

While it’s always tempting to present a new idea as somewhat antagonistic, it’s important to note that change intelligence is more an extension of observability and monitoring – it adds another dimension to incident management. Sigelman says Lightstep “has always tried to be friendly to anything that’s in the space,” and notes that it is designed to work alongside other tools that clients might be using to monitor and measure system reliability and performance.

However you see it fitting into the evolving cloud ecosystem, as change becomes faster and more constant, and as unpredictability increases, it’s going to be vital.

Change intelligence isn’t “something that’s specific to Lightstep,” Sigelman remarks. But it is, he suggests, “something that anyone who’s thinking about observability should be prioritizing right now, because it’s actually the area where the tools are weakest.” That might not be the case forever – but it’s a reminder that the most interesting and urgent problems are the ones that are the hardest to commoditise.