CrowdStrike has a lot to teach about routine system maintenance




Eric White @FEDERALNEWSCAST

July 26, 2024 2:16 pm


The recent CrowdStrike outage has shown everything that can go wrong when pushing a simple update to a system that plays a role in many other systems and critical infrastructure. Routine maintenance is obviously a requirement for these mega platforms, but how can it be done without causing major disruptions? To find out more on the Federal Drive with Tom Temin, Federal News Network executive producer Eric White spoke to Michael Edenzon, co-founder and CEO of the software development firm Fianu.

Interview transcript:

Eric White
So why don’t we just start with your after-action report. When you saw everything that occurred last week, what were your main takeaways? And then we can kind of get into the work that you all do.

Michael Edenzon
Absolutely. Well, I think it’s important to note that, obviously, this is something that we saw as front-page news. It affected everybody. I even had to go take out money from the bank, and I couldn’t, because my bank was down. But I don’t want people to think that this is something that only happens to companies like CrowdStrike. This can happen to anybody; this can happen to any business. And your developers know that. You’re running tens of thousands of builds each year, you’re producing millions of software artifacts, and you’re doing tens of thousands of deployments each year. It only takes one mistake to break a business. In this case, it was a bug, and sometimes it could be a vulnerability. But the important thing to note is that this can happen. And especially as you dive down into exactly what happened here, it sounds like it was a Computer Science 101 nil pointer. And sometimes you wonder, well, how did that ever get allowed into production? And I think, when you look at the scale at which a firm like CrowdStrike operates, it’s comparable to most large enterprises in this country. And so the failures that happened as part of this outage are things that can happen to just about any business.


Eric White
And so I imagine that human error played a large part in this, but just how big of a factor was that? Was this really just maybe the case of entering in the wrong code?

Michael Edenzon
It certainly was the case of entering in the wrong code. But that’s what developers are going to do. Even with AI-generated code, we’re still going to have things that happen just like this. The important part is that there was a dependence on manual processes that wove in automated tools. And in the handoff between humans and automation, there was confusion about a specific measurement on some of the functional tests. And this is not something that’s unique; like I said, this happens to every business. And the fact of the matter is that humans are error-prone. And so the appropriate automation needs to be in place to accurately attest to all of the requirements that are needed pre-release.

Eric White
I think if you looked up the conversation that people were having in the late ’90s, you’d see a lot of similarities between this and the Y2K bug, the transformation of going from human to automation. What can be done to take a more proactive stance to make sure that something like this doesn’t occur?

Michael Edenzon
Absolutely. And I think that’s only going to be exacerbated. Obviously, this is a problem that was earlier discovered by the FAA with planes and the handoff between autopilot and manual flying. There is a handoff problem that occurs. And so especially, now more than ever, as we develop more and more abstractions, and we’re accompanied by AI generated code, I think you’re gonna see issues like this crop up even more. This is something that’s always happened, but it’s going to be exacerbated by the interweaving of humans and AI, especially in writing code.

Eric White
Yeah. And so the term automated governance comes up, and that’s your guys’ bread and butter. What is it about that that can help with this problem?

Michael Edenzon
So we say automated governance is not automated compliance; you’re responsible for being compliant. Automated governance is the evidence capture and policy enforcement that allows you to be compliant. And I think that’s where this kind of comes into play: obviously, there are policies in place to say that things must be functionally tested before their release. And automated governance is about the automated evidence capture, the automated enforcement of policy based on that evidence from a centralized policy management system, and then immutable attestations that can be used to make go or no-go decisions at the time of deployment. And that’s the enforcement mechanism. And so one of the main principles of that is taking the individual policies that more often than not live in specific toolchain platforms. So you’ll have one scanner that has policies here, one scanner that has policies here, another testing engine with policies here, and sometimes those can be misconfigured and it’s very hard to trace that. And so a core principle of automated governance is centralized policy management, so that when you capture evidence, you have an independent authority that’s evaluating based on declarative narrative criteria to produce that attestation that can then block or allow deployments based on adherence to policies and procedures.
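The flow Edenzon describes (capture evidence, evaluate it against one central policy, issue an attestation, then gate the deployment on that attestation) can be sketched roughly in Python. This is only an illustration under stated assumptions: the policy fields, thresholds, and function names below are hypothetical and do not represent Fianu’s product or any real tool’s API.

```python
import hashlib
import json
from dataclasses import dataclass

# Hypothetical central policy: one declarative rule set kept outside any single tool.
POLICY = {
    "functional_tests": {"min_pass_rate": 1.0},
    "vulnerability_scan": {"max_critical": 0},
}


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class Attestation:
    artifact: str
    passed: bool
    failures: tuple
    digest: str  # SHA-256 over the artifact name, evidence, and result


def evaluate(artifact, evidence, policy=POLICY):
    """Evaluate captured evidence against the central policy and attest to the result."""
    failures = []
    tests = evidence.get("functional_tests")
    if tests is None or tests["pass_rate"] < policy["functional_tests"]["min_pass_rate"]:
        failures.append("functional_tests")  # missing evidence counts as a failure
    scan = evidence.get("vulnerability_scan")
    if scan is None or scan["critical"] > policy["vulnerability_scan"]["max_critical"]:
        failures.append("vulnerability_scan")
    passed = not failures
    payload = json.dumps(
        {"artifact": artifact, "evidence": evidence, "passed": passed}, sort_keys=True
    )
    digest = hashlib.sha256(payload.encode()).hexdigest()
    return Attestation(artifact, passed, tuple(failures), digest)


def deployment_gate(attestation):
    """Make the go/no-go decision from the attestation alone."""
    return "go" if attestation.passed else "no-go"


if __name__ == "__main__":
    complete = {"functional_tests": {"pass_rate": 1.0}, "vulnerability_scan": {"critical": 0}}
    print(deployment_gate(evaluate("update-1", complete)))  # go
    partial = {"functional_tests": {"pass_rate": 0.98}}
    print(deployment_gate(evaluate("update-2", partial)))  # no-go
```

The point of the design is that the deployment step never re-interprets raw tool output. It only reads the attestation, so a misconfigured threshold inside one scanner cannot silently change the release decision.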

Eric White
We’re speaking with Michael Edenzon. He is co-founder and CEO of Fianu. And so you had said that this sort of thing is going to keep on happening, that it’s sort of inevitable. Is that an attestation to the idea that maybe some of these critical infrastructure systems should not be automated in making those decisions like you’ve just laid out?

Michael Edenzon
Well, yeah. I think there’s definitely a specific way in which these interdependent systems need to be treated with higher criticality, but you’re not going to take automation out of this process. They’re still businesses, they still need to compete, they still need to add new features and release updates on a regular basis; they’re still software companies. And so the important thing that we try to say is that, look, whether it’s business-critical or it’s just some internal software that you’re using for an internal platform, they can all be treated the same way. It’s not about adding manual effort to review these things, because like we said, humans are error-prone. So automation has to be part of the equation. And so the real objective here is to produce traceable evidence from each of the activities that are performed between code commit and release and validation, so that not only can you do a very quick retrospective on what went wrong, but you can actually proactively prevent events like this from happening, because you have that independent authority that’s evaluating the evidence in an auditable manner.
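One common way to make evidence traceable and auditable from commit to release is an append-only, hash-linked record, where each entry includes the hash of the one before it. The Python sketch below is a minimal illustration of that general technique, not a description of how Fianu or CrowdStrike implements it. The step names and result fields are hypothetical.

```python
import hashlib
import json


def _entry_hash(step, result, prev):
    """Hash one entry's contents together with the previous entry's hash."""
    body = {"step": step, "result": result, "prev": prev}
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def append_evidence(trail, step, result):
    """Append one pipeline step's evidence, linked to the previous entry by hash."""
    prev = trail[-1]["hash"] if trail else ""
    trail.append(
        {"step": step, "result": result, "prev": prev, "hash": _entry_hash(step, result, prev)}
    )
    return trail


def verify(trail):
    """Return True if no recorded entry has been altered or reordered."""
    prev = ""
    for entry in trail:
        expected = _entry_hash(entry["step"], entry["result"], entry["prev"])
        if entry["prev"] != prev or entry["hash"] != expected:
            return False
        prev = entry["hash"]
    return True


if __name__ == "__main__":
    trail = []
    append_evidence(trail, "commit", {"sha": "abc123"})  # hypothetical commit id
    append_evidence(trail, "functional_tests", {"pass_rate": 1.0})
    append_evidence(trail, "release", {"approved": True})
    print(verify(trail))  # True
    trail[1]["result"] = {"pass_rate": 0.5}  # tamper with recorded evidence
    print(verify(trail))  # False
```

Because every entry depends on the one before it, a retrospective can walk the chain step by step, and any after-the-fact change to a recorded result shows up as a verification failure. A truncated tail would need a separate check, such as storing the latest hash elsewhere.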


Eric White
Sort of making the AI show its work, so to speak. Exactly. And so when you talk about adding in new features, such as what you just mentioned, and making sure that we have the necessary quality tests, and as you said, these are businesses, you can’t just test yourself into oblivion and make sure that everything is airtight. Just how well will traceability make sure that everything is bulletproof?

Michael Edenzon
Well, I think it’s the only way. Because, look, as I said before, you’re doing tens of thousands of builds a year. Some of these large corporations have hundreds of thousands of code repositories, and they’re producing millions of artifacts that are deployed thousands of times every year. So evidence and traceability are the only way to put some structure into a process that was largely ad hoc up until a couple of years ago. And I think the DevOps transformation made huge strides in automating the execution of the activities that needed to be performed. But automated governance is that next evolution to say, great, you have automated execution of your pre-release activities, but we need automated evidence capture and policy enforcement so that we can take a largely subjective process, which was go or no-go decision making, and turn it into something that’s completely objective.