How to attack (and defend) an AI system: a primer on security
AI presents both opportunities and risks. Amongst the most serious risks are adversarial manipulations of AI systems: deliberate attempts to attack or sabotage them.
In January 2024, the US National Institute of Standards and Technology (NIST) published a report aimed at establishing a framework for analysing adversarial machine learning, with the ultimate goal of helping defend against threats. The 107-page report was not just a technical manual, but also a reminder that AI technologies are subject to unique security and privacy risks. The European Union's AI Act also addresses security risks, providing that high-risk AI systems must be "resilient as regards to attempts by unauthorised third parties to alter their use, outputs or performance by exploiting the system vulnerabilities".
With this background in mind, we discuss the key issues and considerations for security governance in this article.
TL;DR (the key takeaways)
- Adversarial attacks can be conducted with little knowledge of the underlying model architecture or training data. Attackers with no more knowledge or access than that available to the general public can extract model information and private data, and degrade the performance of both predictive and generative AI systems.
- Generative AI poses unique risks compared with predictive AI. In particular, the report discusses abuse attacks, a category unique to generative AI, in which attackers manipulate tools such as chatbots and image generators into spreading disinformation and proliferating hate speech.
- Remote data tampering can turn trusted sources into malicious arsenals. As AI models become increasingly data-hungry and rely more heavily on public data sources, attackers have more opportunities to manipulate their outputs. When poisoning as little as 0.1% of a model's training dataset can have a significant impact, more attention must be paid across the entire AI supply chain to mitigate harmful output.
- There is no silver bullet. As one author of the NIST report noted, "there are theoretical problems with securing AI algorithms that simply haven’t been solved yet. If anyone says differently, they are selling snake oil." Whilst there are many mitigation strategies, protecting AI models will require consistent attention, and trade-offs between privacy, fairness, accuracy and adversarial robustness.
For legal professionals, the implications are significant. Legal teams that are well informed about how AI systems are trained and deployed are uniquely placed to advise the business on the key security risks and challenges at every stage. This includes working with the business to identify secure AI models, conduct regular reviews and audits, and adopt mitigation strategies that protect against both known and emerging threats. Additionally, the legal implications of AI supply chain vulnerabilities, such as liability for breaches and compliance with data protection and security regulation, must be carefully navigated. Above all, the wide variety of risks, and the limits on how far they can be mitigated, emphasise the importance of principles-based policies to protect companies as they rapidly deploy AI systems.
Introduction: why security matters for AI
AI system components include the data, a machine learning (ML) model, and the infrastructure and processes required to use them. Because ML approaches rely on data, AI systems face additional security and privacy challenges beyond classical threats. Further, these systems are increasingly linked to corporate documents and databases, as businesses deploy systems customised to their own data and use cases. This integration into existing systems increases the attack surface, exposing businesses to the threat of attackers gaining access to confidential and proprietary enterprise data.
The vulnerabilities and attacks described in the report are not new. As early as 2004, John Graham-Cumming, now the CTO of Cloudflare, demonstrated that one machine learning system could be used to learn the characteristics of an ML spam filter, automatically identifying its weaknesses and learning how to write spam that would get through it.
As AI systems proliferate, so too do the vulnerabilities in, and attacks on, these systems. Every phase of the ML lifecycle can be attacked and exploited in the real world by adversarial actors. Indeed, one of the most intriguing aspects of machine learning is the transferability of attacks: adversarial examples developed for one model can often deceive another, even if the two models have different architectures or were trained on different datasets. This phenomenon points to a fundamental, shared vulnerability across machine learning models, which attackers can exploit without needing specific knowledge about the target system, and it underlines the importance of strong policies that apply regardless of application or model.
The NIST report considers the key types of attack for each of two categories of AI system: predictive AI (PredAI) and generative AI (GenAI). As the report points out, there are a number of alternative names and varying definitions for the different attacks; accordingly, one of its primary objectives is to develop a taxonomy of concepts and terminology. We discuss these attacks, and the potential mitigations, below.
PREDICTIVE AI
PredAI refers to the class of systems whose capabilities typically rely on ML to identify patterns in past events and make predictions about future events. For PredAI systems, attackers have three key objectives: availability breakdown, integrity violation and privacy compromise.
Availability Breakdown
The main goal of an availability attack is to degrade the performance of an ML deployment. Many availability attacks are accomplished through poisoning, whereby an attacker tries to corrupt a system by interfering with the training data or by adding malicious functionality at deployment. These attacks have evolved from early attacks on worm signature generation into sophisticated strategies capable of manipulating a wide array of machine learning applications.
Data poisoning can also be designed to degrade the performance of a model indiscriminately, potentially leading to a denial-of-service-like situation for users relying on the AI system. These attacks can be as straightforward as label flipping, where the adversary alters the labels of training data to corrupt the model's learning process. More advanced tactics involve optimisation-based strategies, where the attacker meticulously crafts the poisoning samples to maximise the disruption to the model's accuracy.
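To make the mechanics concrete, the following minimal sketch (our own illustration, not taken from the NIST report) shows how even naive label flipping can degrade a classifier: the attacker simply flips the labels of a small fraction of the training data before the model is trained. The dataset, model and poison rate below are arbitrary choices for demonstration only.

```python
# Illustrative label-flipping poisoning attack (demonstration only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Baseline model trained on clean data.
clean_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# The attacker flips the labels of 20% of the training samples.
rng = np.random.default_rng(0)
idx = rng.choice(len(y_train), size=int(0.2 * len(y_train)), replace=False)
y_poisoned = y_train.copy()
y_poisoned[idx] = 1 - y_poisoned[idx]  # flip the binary labels

poisoned_model = LogisticRegression(max_iter=1000).fit(X_train, y_poisoned)

print("clean accuracy:   ", clean_model.score(X_test, y_test))
print("poisoned accuracy:", poisoned_model.score(X_test, y_test))
```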
Mitigations
Availability poisoning attacks on machine learning models lead to significant performance drops, which can be spotted through declines in metrics such as precision and recall. To counter these attacks, it is preferable to address them during the model's training phase rather than at testing or deployment. Effective mitigation strategies include training data sanitisation, which removes anomalous samples to prevent poisoned data from influencing the model, and robust training, which modifies the training algorithm to strengthen the model's resilience, often through techniques such as model voting or randomised smoothing. These methods aim to create machine learning models capable of withstanding adversarial attempts to degrade their performance.
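As a rough sketch of what training data sanitisation can look like in practice (the detector and contamination rate below are our own assumptions, not recommendations from the report), an anomaly detector can be run over the training set so that suspicious samples never reach the model:

```python
# Illustrative training-data sanitisation: drop samples an anomaly detector
# flags as outliers before the model is trained on them.
from sklearn.ensemble import IsolationForest

def sanitise(X_train, y_train, contamination=0.05):
    """Return a filtered copy of the training set.

    `contamination` is an assumed upper bound on the poisoned fraction; it
    must be tuned in practice, and legitimate rare samples may be discarded.
    """
    detector = IsolationForest(contamination=contamination, random_state=0)
    keep = detector.fit_predict(X_train) == 1  # +1 = inlier, -1 = outlier
    return X_train[keep], y_train[keep]
```

The trade-off is the one noted above: the filter that removes poisoned data may also remove unusual but legitimate samples, which is why sanitisation is typically combined with robust training rather than relied on alone.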
Integrity Violations
An integrity attack targets the integrity of an ML model's output, causing the model to make incorrect predictions. This can be done by mounting an evasion attack at deployment or a poisoning attack at the training stage.
Evasion Attacks
In an evasion attack, adversaries craft inputs to deceive ML models into making incorrect predictions. For instance, in the context of an autonomous driving system, an evasion attack could involve manipulating stop signs to confuse an autonomous vehicle into interpreting them as speed limit signs, or creating misleading lane markings to cause the vehicle to deviate from its path.
Evasion attacks are not a new phenomenon. For example, in 2013, researchers demonstrated that, with minimal perturbation, an image could be altered in a way that is indistinguishable to a human yet causes an AI system to misclassify it.
Evasion attacks have spurred a surge in research aimed at understanding and mitigating such adversarial examples. Attacks are now classed into one of two categories, white-box or black-box (or grey-box, bearing characteristics of both):
- White-box attacks, where the adversary has complete access to and knowledge of the model, including its parameters and training data. In white-box scenarios, attackers exploit the very algorithms that drive the learning process to generate adversarial examples (see the sketch after this list). These attacks can be targeted, aiming to have an input misclassified as a specific incorrect class, or untargeted, where the goal is simply to induce any incorrect classification.
- Black-box attacks, where the adversary has limited or no knowledge of the model's inner workings. Black-box attacks simulate a more realistic adversarial environment where the attacker might only have access to query the model and use its predictions to inform their strategy. These attacks are particularly concerning for services that offer machine learning as a service (MLaaS), as they demonstrate that even without direct access to the model, it can still be compromised.
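For the white-box case, the fast gradient sign method (FGSM) is the textbook technique for crafting adversarial examples. The sketch below is a generic PyTorch illustration, with the trained model, input batch and labels assumed to be supplied by the reader:

```python
# Illustrative white-box evasion attack (FGSM): nudge each input feature in
# the direction that increases the model's loss, so a correctly classified
# input becomes misclassified while looking almost unchanged.
import torch
import torch.nn.functional as F

def fgsm_example(model, x, labels, epsilon=0.03):
    """Return an adversarially perturbed copy of the input batch `x`.

    `model`, `x` and `labels` are placeholders for the reader's own trained
    classifier and data; `epsilon` bounds the size of the perturbation.
    """
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), labels)
    loss.backward()
    # Step each feature by +/- epsilon in the direction of the loss gradient.
    return (x + epsilon * x.grad.sign()).detach()
```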
Mitigations
Mitigating these threats is an ongoing challenge. Adversarial training, for instance, involves incorporating adversarial examples into the training process to enhance the model's resilience. However, this can sometimes lead to a reduction in accuracy on non-adversarial inputs and requires significant computational resources, which may not be realistic in a commercial environment. Certain formal methods offer mathematical assurances of robustness, but they are also constrained by scalability issues and potential impacts on model performance.
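A bare-bones sketch of adversarial training, reusing the hypothetical fgsm_example helper from the evasion sketch above and assuming a standard PyTorch model, optimiser and data loader, might look like this:

```python
# Illustrative adversarial training loop: each batch is trained on a mix of
# clean and FGSM-perturbed inputs so the model learns to resist small
# perturbations. Note this roughly doubles the cost of each training step.
import torch.nn.functional as F

def adversarial_training_epoch(model, loader, optimiser, epsilon=0.03):
    model.train()
    for x, labels in loader:
        # Craft adversarial versions of this batch against the current model.
        x_adv = fgsm_example(model, x, labels, epsilon=epsilon)
        optimiser.zero_grad()  # clear gradients left over from crafting x_adv
        loss = 0.5 * F.cross_entropy(model(x), labels) \
             + 0.5 * F.cross_entropy(model(x_adv), labels)
        loss.backward()
        optimiser.step()
```

The extra forward and backward passes per batch are the computational overhead referred to above, which is one reason adversarial training can be difficult to justify in a commercial setting.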
Poisoning Attacks
Violations of integrity, such as by targeted and backdoor poisoning attacks, are particularly insidious. They are engineered to be stealthy, manipulating the model to misclassify specific inputs. Targeted attacks involve inserting poisoned samples with a target label so that the model will learn the wrong label. Backdoor attacks embed hidden triggers in the training data that cause the model to output incorrect results when the trigger is present in the input data.
Other capabilities can be used in poisoning attacks, including data poisoning, model poisoning, and control over labels or source code. This variety of attack vectors leads to a diverse range of methodologies, each requiring a different level of access to the model and training data.
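As a simplified illustration of the backdoor mechanism described above (the trigger pattern, poison rate and target label are arbitrary choices of our own), an attacker with access to part of the training pipeline might inject a trigger like this:

```python
# Illustrative backdoor poisoning: stamp a small trigger patch onto a
# fraction of the training images and relabel them, so the trained model
# associates the trigger with the attacker's chosen class.
import numpy as np

def add_backdoor(images, labels, target_label=0, poison_rate=0.01, seed=0):
    """Return poisoned copies of `images` (shape N x H x W) and `labels`."""
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    idx = rng.choice(len(images), size=int(poison_rate * len(images)),
                     replace=False)
    images[idx, -3:, -3:] = images.max()  # bright 3x3 patch in the corner
    labels[idx] = target_label            # relabel to the attacker's class
    return images, labels
```

At inference time, any input carrying the same patch is likely to be classified as the attacker's chosen label, while clean inputs behave normally, which is what makes the attack so hard to detect.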
Mitigations
Mitigation requires a multi-faceted approach across all stages of the lifecycle. Monitoring performance metrics can be an effective way to detect indiscriminate (availability) poisoning, but because targeted and backdoor attacks are designed to leave overall accuracy largely intact, proactive measures are also needed, such as sanitising training data and using robust training techniques, which aim to identify and remove poisoned samples from the training set in the first place.
Privacy Compromises
Privacy issues with AI systems are well known. In the context of adversarial attacks, four key attack types are identified: data reconstruction, membership inference, model extraction and property inference.
- Data reconstruction. Data reconstruction attacks aim to reverse engineer private information from aggregate data, posing a threat to individual privacy. In an early example, user data was recovered from linear statistics. Methods for reconstructing data with a feasible number of queries have only improved in recent years; researchers at the US Census Bureau studied the risk of data reconstruction from census data in 2018, and that research led to the use of differential privacy as a defence in the release of the 2020 census.
- Membership inference. Like data reconstruction, membership inference attacks seek to expose private information about an individual; here the goal is to determine whether a particular record or data sample was part of the training dataset (see the sketch after this list). Initially a concern for genomic data, these attacks now pose a risk to any sensitive dataset. The implications are significant, as the mere fact of inclusion in a dataset can reveal sensitive personal data such as health conditions or affiliations.
- Model extraction. Model extraction attacks present another avenue for privacy invasion, particularly in MLaaS environments. Here, the attacker's goal is to glean information about a model's architecture or parameters, which could lead to the replication of proprietary models or, more importantly, facilitate more targeted attacks. Whilst exact replication may not be possible, functionally equivalent models can be reconstructed, which could be leveraged as part of a more powerful attack.
- Property inference. Property inference attacks aim to deduce global properties of a training dataset. For example, an attacker can identify a part of the training dataset that contains sensitive attributes, such as sensitive demographic information.
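To illustrate how little access a privacy attack may require, the sketch below (our own simplification, not a method from the report) shows a basic confidence-thresholding membership inference test: overfitted models tend to be more confident on records they were trained on, so unusually high confidence is weak evidence of membership.

```python
# Illustrative membership inference via confidence thresholding.

def infer_membership(model, records, threshold=0.95):
    """Guess whether each record was part of the model's training data.

    `model` is any fitted classifier exposing predict_proba (e.g. scikit-learn);
    `threshold` is an assumed cut-off that a real attacker would calibrate,
    for example by training "shadow" models on similar data.
    """
    confidence = model.predict_proba(records).max(axis=1)
    return confidence > threshold  # True = "probably in the training set"
```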
Mitigations
The challenge of mitigating privacy risks is formidable. Differential privacy, a privacy-enhancing technique, can be a robust solution, offering a mathematical framework that limits what any attacker can infer about individuals from an algorithm's output. Implementing differential privacy is not without trade-offs, however, as it reduces the utility of the ML model by decreasing accuracy. Further, differential privacy does not protect against model extraction attacks, since it protects the training data rather than the model itself, and it may be weak against property inference attacks.
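At its core, differential privacy adds calibrated noise so that any one individual's presence or absence has only a provably small effect on a published result. The sketch below shows the classic Laplace mechanism for a simple count query (the epsilon value is an arbitrary assumption; training-time mechanisms such as DP-SGD are considerably more involved):

```python
# Illustrative Laplace mechanism: release a count with enough noise that
# adding or removing any single individual barely changes the output
# distribution, as governed by the privacy budget epsilon.
import numpy as np

def private_count(values, predicate, epsilon=1.0, seed=None):
    """Return an epsilon-differentially private count of matching items.

    A count query has sensitivity 1 (one person changes it by at most 1),
    so Laplace noise with scale 1/epsilon suffices. Smaller epsilon means
    stronger privacy but a noisier, less accurate answer: the utility trade-off.
    """
    rng = np.random.default_rng(seed)
    true_count = sum(1 for v in values if predicate(v))
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)
```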
Another method is to conduct privacy auditing to measure the actual privacy guarantees. This may involve mounting membership inference attacks and/or poisoning attacks to estimate privacy leakage.
GENERATIVE AI
GenAI refers to the class of AI systems whose capabilities typically rely on ML to generate text, images or other media. The availability breakdowns, integrity violations and privacy compromises associated with PredAI also apply to GenAI, but GenAI additionally presents unique opportunities for abuse violations.
Abuse Violations
Abuse violations occur when AI systems are repurposed for malicious objectives, such as fraud or the dissemination of disinformation. For example, OpenAI researchers have long been nervous about chatbots falling into nefarious hands, writing in a 2019 paper of their “concern that its capabilities could lower costs of disinformation campaigns” and aid in the malicious pursuit “of monetary gain, a particular political agenda, and/or a desire to create chaos or confusion”.
The scale of modern GenAI models, which may require trillions of tokens of training data, has led to data being scraped from an increasingly wide array of sources that have not been curated or otherwise undergone pre-processing. This practice introduces the risk of poisoning attacks, where attackers manipulate or replace the scraped content, typically by altering what is hosted at the URLs listed in a dataset. The GenAI model then 'learns' from this corrupted data and starts producing inaccurate or unwanted outcomes, such as biased or offensive content.
Mitigations
Effective mitigation strategies can be deployed along the lifecycle of the system. For inputs, a filter can be introduced to remove malicious or otherwise undesirable instructions, or an interpretability-based detection solution can be used to identify and block unusual or suspicious inputs. The model itself can be improved through reinforcement learning from human feedback, in which human reviewers are used to fine-tune the model so that its outputs align with ethical standards.
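As a toy illustration of an input filter (the patterns below are arbitrary examples of our own, not a recommended denylist; production systems typically rely on ML-based moderation classifiers rather than simple pattern matching):

```python
# Toy input filter: block prompts matching a denylist of suspicious patterns
# before they reach the generative model.
import re

DENYLIST = [
    r"ignore (all )?previous instructions",  # common prompt-injection phrasing
    r"disable (the )?safety",                # attempts to strip guardrails
]

def is_allowed(prompt: str) -> bool:
    """Return False if the prompt matches any denylisted pattern."""
    return not any(re.search(p, prompt, flags=re.IGNORECASE) for p in DENYLIST)

# A deployment would route only allowed prompts to the model, e.g.:
#   response = generate(prompt) if is_allowed(prompt) else "Request blocked."
#   (`generate` is a placeholder for the model call.)
```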