Effective EDR: Balancing testing rigor and velocity

Various industries are still dealing with the consequences of the July 2024 technology outage, which led to a “blue screen of death” (BSOD) and widespread disruption. A defective update from CrowdStrike caused the issue, knocking critical systems offline and bringing airlines, medical practices and financial institutions to a grinding halt. It also forced impacted organizations to fall back on call tree procedures, maintaining communications through out-of-band methods.

This outage even impacted the government. The Department of Homeland Security and the Cybersecurity and Infrastructure Security Agency worked with federal, state, local and critical infrastructure partners to assess and address outages affecting essential systems. The incident not only sparked conversations about patch management practices and system update testing approaches but also highlighted the potential cybersecurity ramifications of pervasive outages.

To prevent future outages, it’s crucial to prioritize strategies for driver testing and patch management planning. Rigorous testing, particularly for kernel-mode applications, is essential. Patch management processes must evolve toward iterative rollouts with finer-grained controls within products to minimize the impact of kernel-mode application updates. By enhancing the testing, deployment and overall management of updates, these strategies will support continuity of operations.

Despite the initial chaos, comprehensive response, communication and mitigation plans helped some organizations address what appeared to be a massive, unknown cyber event of unparalleled magnitude. There are numerous opportunities for technical and organizational improvement that can help stop, mitigate or recover from similar events. Followed appropriately, these precautions can also prevent cyberattacks that target software.

Balancing proper testing with frequency of updates

Programs that support driver testing, like Microsoft’s Windows Hardware Quality Labs (WHQL) certification process, ensure the compatibility and reliability of products that operate with kernel-level privileges. Hardware drivers and some software applications operating in kernel mode require a higher degree of testing to reduce the likelihood of system disruptions.

Regular software that runs without administrative or elevated privileges is less likely to cause severe system disruptions — but an application or driver running at the kernel level can cause major problems if not tested.

In the CrowdStrike incident, the core kernel-level application is WHQL certified and stable, but the regular detection and response content updates to that product are not individually certified because of how frequently they are released. These frequent content updates are necessary because the endpoint detection platform must change rapidly in response to emerging threats.

Given that these content updates are released too frequently for each update to be WHQL certified, a certain level of trust is placed on vendors to conduct thorough testing. This tension between rigorous testing and the velocity of deploying updates to address the rapidly changing threat landscape is core to the recent incident. Current vendor practices and testing tactics need to be reevaluated to mitigate future issues.

Software checks to avoid outages

Modifications to software testing and deployment procedures will be essential to reduce the impact and likelihood of another massive outage. Pre-deployment extended compatibility testing, performed by a software vendor and customers, would reduce the risk of incidents.

The only expense is a delay before security updates are in effect — a small price to pay for a safer approach.

Some endpoint defense systems let customers configure rolling deployments, which allow for monitoring and quick rollback if issues arise. Deploying initial updates to a small, controlled pilot group of users or systems, then monitoring performance and stability before wider rollout, is another strategy to prevent severe outages.

These approaches are bolstered when updates are deployed during low-traffic periods, such as after work hours or over weekends, because doing so minimizes disruption and allows for immediate intervention if issues occur.
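
A minimal sketch of what such a ring-based, window-aware rollout policy could look like appears below, assuming a hypothetical management console that accepts this kind of configuration; the ring names, percentages, soak periods and window times are illustrative rather than drawn from any specific vendor’s product.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass
class Ring:
    """One stage of a phased rollout."""
    name: str
    percent_of_fleet: int   # share of endpoints in this ring (illustrative)
    soak_hours: int         # observation time before promoting further


# Hypothetical rollout policy: a small pilot group first, then progressively
# larger rings, each gated on the previous ring soaking cleanly.
RINGS = [
    Ring("pilot", 1, 24),
    Ring("early_adopters", 9, 24),
    Ring("broad", 90, 0),
]


def in_deployment_window(now: datetime) -> bool:
    """True during low-traffic periods: evenings and weekends (illustrative)."""
    after_hours = now.time() >= time(20, 0) or now.time() < time(5, 0)
    weekend = now.weekday() >= 5  # Saturday or Sunday
    return after_hours or weekend


def next_ring(completed: list[str]) -> Ring | None:
    """Return the next ring to receive the update, or None when finished."""
    for ring in RINGS:
        if ring.name not in completed:
            return ring
    return None


if __name__ == "__main__":
    ring = next_ring(completed=["pilot"])
    if ring and in_deployment_window(datetime.now()):
        print(f"Deploy to ring '{ring.name}' ({ring.percent_of_fleet}% of fleet), "
              f"soak for {ring.soak_hours}h before promoting")
    else:
        print("Hold deployment until the next low-traffic window")
```

The point of the structure is that promotion to the next ring is an explicit decision gated on both observed stability and an approved deployment window, rather than an automatic fleet-wide push.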

Automation and enhanced patch management practices

Stronger patch management practices are another tactic to defend against outages. These include extended compatibility testing and checks, and the use of artificial intelligence and machine learning (AI/ML) for monitoring and rollback.

Increasing the scope of testing across a wider range of hardware and software configurations will identify potential issues before release. This broader coverage can be applied within the pilot groups, a smaller number of users spread across varied configurations. Testing patches against a replicated lab or simulated production environment will also help ensure that updates are compatible with supported hardware and software configurations, including different versions of the operating system and various third-party applications.
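
As a rough illustration of how that coverage might be enumerated, the sketch below expands a configuration matrix that a replicated lab or simulated production environment would need to work through; the OS builds and third-party applications listed are placeholders, not a vendor’s actual support matrix.

```python
from itertools import product

# Placeholder configuration matrix: OS builds crossed with third-party software
# that commonly interacts with kernel-mode components.
OS_BUILDS = ["Windows 10 22H2", "Windows 11 23H2", "Windows Server 2022"]
THIRD_PARTY = ["none", "backup agent", "VPN client", "disk encryption"]


def test_matrix():
    """Yield every OS build / third-party software combination to exercise."""
    yield from product(OS_BUILDS, THIRD_PARTY)


if __name__ == "__main__":
    for os_build, extra_software in test_matrix():
        # In a real lab, this step would provision a VM with the configuration,
        # apply the candidate update and run automated health checks.
        print(f"Queue lab test: {os_build} + {extra_software}")
```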

AI/ML can play a major role via heartbeat monitoring. Since a BSOD-causing error may not leave a system operational enough to send an error alert, heartbeats and related up/down monitoring techniques can identify inoperable hosts. Using indirect techniques like AI/ML to detect a problem state (such as a BSOD loop) and link it to recent patch deployment activity would rapidly identify the problem.
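
The up/down monitoring idea can be illustrated with a simple staleness check: flag any host whose most recent heartbeat is older than an allowed window. The data source, threshold and host names below are assumptions for the sketch, not a description of any particular EDR platform’s telemetry.

```python
from datetime import datetime, timedelta


def silent_hosts(last_heartbeat: dict[str, datetime],
                 now: datetime,
                 max_silence: timedelta = timedelta(minutes=10)) -> list[str]:
    """Return hosts that have not checked in within the allowed window.

    A host stuck in a BSOD loop cannot send an error alert, but its absence
    from the heartbeat stream is itself a detectable signal.
    """
    return [host for host, seen in last_heartbeat.items()
            if now - seen > max_silence]


if __name__ == "__main__":
    now = datetime.now()
    sample = {
        "host-a": now - timedelta(minutes=2),    # healthy
        "host-b": now - timedelta(minutes=45),   # silent: possible BSOD loop
    }
    print(silent_hosts(sample, now))  # -> ['host-b']
```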

Furthermore, hosts caught in this loop state cannot send normal heartbeat messages, but an automated, AI-driven monitoring solution can both identify the anomaly and link it to patch deployment logs. If a customer’s environment is designed to stagger updates and multiple hosts stop communicating with the heartbeat server, there is likely either a network issue blocking communications or a common host-based problem.

When combined with a staggered or phased rollout approach, early AI-driven detection could reduce the impact of a future BSOD event caused by updates. These automated monitoring tools can quickly identify issues post-deployment and enable rapid rollback if necessary.
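
A rough sketch of that correlation logic follows, assuming access to a heartbeat-derived list of silent hosts and a patch deployment log; the simple count-based rule stands in for the richer AI-driven analysis described above, and the update IDs and thresholds are illustrative.

```python
from datetime import datetime, timedelta

# Hypothetical deployment log entries: (host, update_id, time the update applied).
DeploymentLog = list[tuple[str, str, datetime]]


def suspect_update(silent: set[str],
                   deployments: DeploymentLog,
                   now: datetime,
                   window: timedelta = timedelta(hours=2),
                   min_overlap: int = 3) -> str | None:
    """Return an update ID if enough recently patched hosts have gone silent.

    A simple count-based correlation stands in for the AI-driven analysis
    described above: several hosts going quiet shortly after receiving the
    same update is a strong signal to pause the rollout and roll back.
    """
    counts: dict[str, int] = {}
    for host, update_id, applied_at in deployments:
        if host in silent and now - applied_at <= window:
            counts[update_id] = counts.get(update_id, 0) + 1
    flagged = [uid for uid, n in counts.items() if n >= min_overlap]
    return flagged[0] if flagged else None


if __name__ == "__main__":
    now = datetime.now()
    log = [("host-b", "content-407", now - timedelta(minutes=30)),
           ("host-c", "content-407", now - timedelta(minutes=25)),
           ("host-d", "content-407", now - timedelta(minutes=20))]
    bad = suspect_update({"host-b", "host-c", "host-d"}, log, now)
    if bad:
        print(f"Halt rollout of {bad} and start rollback review")
```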

AI provides predictive analytics too, reviewing historical data and usage patterns to identify potential issues. This enables adaptive testing and phased rollouts, where updates are deployed gradually and monitored closely for any signs of trouble. Done well, this approach can sharply reduce the number of systems exposed to a faulty update.
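
Under the assumption that historical rollout records are available, one lightweight way to approximate such predictive analytics is to score a candidate update by how often past updates touching the same components later required a rollback; the data shape and scoring rule below are purely illustrative stand-ins for a trained model.

```python
def risk_score(candidate_components: set[str], history: list[dict]) -> float:
    """Fraction of past updates touching the same components that were rolled back.

    Each history item looks like {"components": set_of_names, "rolled_back": bool}.
    A real predictive model would use far richer features; this is a stand-in.
    """
    related = [h for h in history if h["components"] & candidate_components]
    if not related:
        return 0.0
    return sum(h["rolled_back"] for h in related) / len(related)


if __name__ == "__main__":
    past = [
        {"components": {"sensor_content"}, "rolled_back": True},
        {"components": {"sensor_content"}, "rolled_back": False},
        {"components": {"ui"}, "rolled_back": False},
    ]
    score = risk_score({"sensor_content"}, past)
    # A higher score argues for a slower, more heavily monitored rollout.
    print(f"risk score: {score:.2f}")  # -> 0.50
```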

Integrating AI into the WHQL certification process to augment the testing regimen, or adopting AI/ML-driven monitoring solutions, can significantly enhance the reliability, efficiency and speed of testing, identification and remediation, ultimately preventing BSOD problems or reducing their impact.

Enhancing the WHQL testing program, optimizing patch deployment timing and settings, and adopting robust patch management practices can significantly reduce the risk of issues like the CrowdStrike BSOD problem. IT operations and security operations teams have also taken onboard their own lessons from the recent outage about how to operate when their own devices are inoperable, and are building more resilience into the collaboration processes they rely on during incident troubleshooting.

Industry stands to improve its support of the government mission by taking these collective measures: strengthening readiness and resilience, and ensuring that updates are thoroughly tested, carefully deployed and effectively managed.

Peter O’Donoghue is chief technology officer for Tyto Athene.
