Operational Resilience: CrowdStrike and Beyond
On July 19, 2024, around 8.5 million Windows devices worldwide crashed, displaying the infamous "blue screen of death" when CrowdStrike, a digital security vendor, released a faulty software update. The update disrupted myriad industries globally, including aviation, healthcare, financial services, and emergency services, resulting in significant ongoing remediation costs and customer disruptions.
This briefing outlines certain key tactical considerations for organizations to address following an incident of this nature. It also examines proactive measures organizations may implement to reduce operational, legal, and reputational risks and costs attendant to similar technology malfunctions.
Background and Status Today
In the early hours of July 19, 2024, CrowdStrike deployed an update to Falcon, the company's flagship endpoint monitoring and protection software, as part of its regular and ongoing efforts to detect and prevent malware attacks. The update delivered defective "Rapid Response Content" to Windows devices running Falcon that were online to receive it, rendering them inoperable and freezing them in a boot loop.
Since the update's release, CrowdStrike has issued an apology and a Preliminary Post Incident Review, deployed a fix, and begun working with Microsoft and affected organizations to restore functionality. Initially, the fix required someone with admin-level privileges to intervene manually on each affected device, a time-consuming task made harder by remote work arrangements and by the difficulty of retrieving the large number of BitLocker keys often needed to apply the fix. CrowdStrike has since developed an automatic fix.
At the federal level in the US, top House lawmakers have called on CrowdStrike CEO George Kurtz to testify, which may prompt broader congressional efforts to understand system dependency. Furthermore, the US Department of Transportation has opened an investigation into an airline's management of the incident in view of legal standards for passenger treatment.
In the EU and the UK, organizations may face scrutiny in the aftermath of the CrowdStrike incident regarding their compliance with a range of EU and UK cybersecurity, operational resilience and data protection laws and regulations. The Italian Data Protection Authority has begun probing potential impacts of the outage on users' personal data. Cybersecurity authorities in Europe also responded swiftly to the incident, providing information and support. On the day of the incident, the Italian and the French cybersecurity agencies each published a description of the incident and a list of mitigation measures to be adopted by those affected by the outage. They also issued updates in the following days to take account of official statements on the incident and to warn organizations about the increased risks of fraud and malicious activity, with recommendations to trust only official sources. Indeed, users affected by the outage have been contacted by people promising to resolve the issue whilst in fact seeking to phish, provide false information, steal data and/or spread malware.
On July 29, 2024, the German Federal Office for Information Security (BSI) stated that it has been collaborating with CrowdStrike and Microsoft to develop initial measures to prevent similar incidents in the future. BSI has outlined short-term, medium-term, and long-term measures, focusing on impact analysis, continuous recovery tracking, root cause evaluation and improvement of operational stability and resilience of customer systems.
In APAC, regulators have not, to date, announced any formal investigations. Relevant cybersecurity agencies, ministries and certain regulators (e.g., financial services regulators) have indicated that they are continuing to monitor the situation and are updating alerts and advisories.
Importantly, regulated entities such as banks and investment firms will need to ensure that this outage does not lead regulators to call into question their outsourcing and operational resilience arrangements.
Key Considerations
The CrowdStrike incident is a stark reminder of the importance of supply chain resilience and the need for comprehensive operational resilience strategies that align with organizational risk tolerance.
Key tactical considerations for organizations include:
- Incident response and activation of business continuity plans to manage operational impact and legal and reputational risks;
- Regulatory engagement considerations, including notification to regulators and monitoring for regulatory investigations, recommendations and actions aimed at avoiding future incidents;
- Roles and responsibilities in outsourcing, customer, and other relevant contracts;
- Insurance coverage and related obligations;
- Communication with impacted customers and other third parties; and
- Analysis of liability, enforcement position and any dispute resolution approach, including review of limitations and exclusions of liability and caps on losses.
Taking a forward-looking perspective, organizations should consider:
- Cross-Functional Approach. Is there a cross-functional team in place, bringing together all relevant competencies and fields, including internal and external experts, as appropriate (e.g., IT, Legal, Compliance, Operations, Forensics), to develop and pilot resilience, contingency and incident response plans, monitor developments, respond to incidents and address their consequences, both internally and externally?
- Operational Resilience / Business Continuity and Review of Related Processes. Do existing processes ensure continuity of activity if communications and/or IT systems become unavailable? Are they adequate to enable robust and speedy recovery in case of incidents? Do their substance and form meet any applicable regulatory requirements? What procedures are in place to identify and implement appropriate corrective measures when operational resilience tests reveal deficiencies or gaps?
- Redundancy and Back-Up Systems; Alternative Solutions. Is there a strategy to prevent single points of failure? Are there critical systems without effective back-up or other deficiencies that require an expedited solution? Should the organization evaluate alternative solutions that could mitigate the worst impacts of outages or disruptions to essential technology devices or services, regardless of the root cause?
- Continuous Monitoring and Updates. Are continuous monitoring and update processes adequate to mitigate incident impact? What aspects should be revisited and by whom? Is there robust governance in place for updates (e.g., are updates deployed progressively and tested in different environments)? Are the risks associated with updates thoroughly assessed?
- Vendor Arrangements Affecting Key Systems and Services. The CrowdStrike incident exposed fragility in global IT systems and highlighted the risks of broad inter-dependency. Has the organization thoughtfully weighed factors such as diversity of third-party products utilized within its IT estate, variety of its key suppliers, and reliance of such suppliers on fourth-party services and products? Is the organization over-reliant on individual providers or a handful of companies that provide services across industries, or on a single solution? If so, the organization may consider further diversifying to help mitigate the impact of possible incidents.
- Contractual and Insurance Protections. What contractual protections and rights are in place that may assist in relation to events of this nature, and are any exclusions or caps on liability appropriate? Also, does the organization have the right insurance in place as part of its operational resilience planning?
- Regulatory Implications. Although events like the CrowdStrike incident trigger the immediate attention of cyber and data protection regulators, the organization should consider broadly what rules and obligations may be triggered and what steps may be required (e.g., is the deployer of AI-based cybersecurity services required to notify relevant authorities and/or to discontinue use of AI-based software under AI-specific laws and regulations?).
- Internal Procedures. Are suitable internal reporting and management structures in place, including to enable compliance with notification obligations? Is there an appropriate communication strategy to ensure that all relevant stakeholders are aware of incidents and their remediation status?
- External Communications. Are there procedures regarding external communications in the event of an incident (e.g., with regulators, the public, media), involving management and key functions?
- Industry Engagement and Collaboration. Is the organization appropriately engaged in relevant forums where information on vulnerabilities and testing is shared? An investment in a holistic perspective is a key part of an effective strategy.
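For technical teams, the question above about whether updates are "deployed progressively and tested in different environments" can be made concrete. The following is a minimal illustrative sketch in Python of a ring-based rollout gate. It is not a description of how CrowdStrike or any other vendor deploys updates. The names Ring, staged_rollout and apply_update, and the 1% failure threshold, are assumptions introduced purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Ring:
    """Hypothetical deployment ring: a named group of devices."""
    name: str
    devices: Sequence[str]


def staged_rollout(
    rings: Sequence[Ring],
    apply_update: Callable[[str], bool],
    max_failure_rate: float = 0.01,  # assumed threshold, for illustration only
) -> List[str]:
    """Deploy ring by ring and halt at the first ring that exceeds the threshold.

    apply_update is a caller-supplied function that updates one device and
    returns True if the device is healthy afterwards. The return value lists
    the rings that completed within the failure threshold.
    """
    completed: List[str] = []
    for ring in rings:
        failures = sum(1 for device in ring.devices if not apply_update(device))
        rate = failures / len(ring.devices) if ring.devices else 0.0
        if rate > max_failure_rate:
            # Stop here: later, larger rings never receive the update.
            break
        completed.append(ring.name)
    return completed


# Usage: order rings from smallest and least critical to largest.
rings = [
    Ring("test-lab", ["t1", "t2"]),
    Ring("pilot", ["p1", "p2", "p3", "p4"]),
    Ring("broad", ["b1", "b2"]),
]
```

The design point is that a defect surfacing in an early ring stops the rollout before it reaches the wider estate, which limits the number of devices needing remediation.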