The many doors an uninvited guest might try¶

Machine learning systems are not merely software with a new name. They are software with a new name and a considerably larger attack surface than anyone mentioned during the procurement meeting. A traditional application does what its code says. A machine learning model does what its training data implied, its deployment infrastructure permits, and its users have not yet thought to abuse. These are three quite different things, and each of them has a door.

The security perimeter of an ML system extends from the raw data collected months before training begins, through the model weights serialised and shipped across a network, to the API endpoint fielding queries at three in the morning from sources that did not appear in the threat model. Protecting only one part of this chain while leaving the rest unexamined is the equivalent of installing a very good lock on the front door and leaving the window open.

The training data¶

Everything begins with data, which means everything can go wrong with data. The model is, in a meaningful sense, a compressed representation of its training set. Corrupt the training set and you corrupt the model, often in ways that are invisible until the system does something embarrassing in production.

Data poisoning¶

Data poisoning is the art of introducing malicious examples into training data so that the resulting model behaves in ways its operators did not intend. An attacker with write access to a data pipeline, or simply the patience to upload manipulated content to a platform that scrapes public data, can influence model behaviour without ever touching the code.

Clean-label poisoning is a subtler variant where the injected examples are correctly labelled but carefully crafted to shift decision boundaries in specific directions. The poisoned samples pass human review. The model learns the wrong lesson anyway. Spam filters, content moderation systems, and fraud detectors are all plausible targets.

Backdoor attacks¶

A backdoor, sometimes called a trojan, is a hidden behaviour embedded during training that activates only when a specific trigger is present in the input. The model behaves normally on everything else, passing all standard evaluations with the composure of someone who has been rehearsing for exactly this test. Present the trigger and the model does what the attacker intended rather than what its operators intended.

Triggers can be as crude as a particular pixel pattern in an image or as subtle as a specific phrase in a text input. The attack is particularly dangerous when organisations use pre-trained models or third-party datasets, since the backdoor may have been introduced before the model arrived and will not be visible in any inspection of the fine-tuning code.

The model itself¶

Once trained, the model becomes an asset that can be interrogated, copied, and used as an oracle by anyone with access to its outputs. This is not a theoretical concern. The outputs of a model contain information about its training data and its decision boundaries, and sufficiently motivated adversaries will extract that information systematically.

Model inversion¶

Model inversion attacks use the model’s outputs to reconstruct approximations of its training data. An attacker queries a model repeatedly, observing how confidence scores shift in response to different inputs, and uses this signal to reverse-engineer examples resembling the original training set. Facial recognition models have been inverted to produce recognisable reconstructions of faces from the training data. Medical models trained on patient records face the same risk.

The attack does not require any special access. It requires only the ability to query the model and the willingness to do so many thousands of times, which is a low bar for anyone with a motivation and an API key.

Membership inference¶

Membership inference is a narrower question than model inversion: was this specific record in the training data? The attack exploits the tendency of models to be more confident on data they have seen before. By comparing a model’s behaviour on a target record against its behaviour on records it definitely has not seen, an attacker can make a statistically reliable determination about whether that individual appears in the training set.

For models trained on sensitive data, this is a privacy violation with legal consequences. A membership inference attack against a model trained on medical records, financial histories, or any dataset subject to data protection law is not merely a technical curiosity. It is the kind of thing that produces regulatory correspondence.

Model extraction¶

Model extraction, also called model stealing, involves querying a model repeatedly in order to train a surrogate that approximates the original. The attacker does not need the weights, the architecture, or any internal access. They need only the outputs, and enough patience to generate a sufficient query set.

A stolen surrogate model can be used to undermine a competitive advantage, to investigate the original model’s behaviour without the operator’s knowledge, or as a stepping stone to other attacks. If the surrogate is a close enough approximation, adversarial examples crafted against it will often transfer to the original.

The inputs¶

A deployed model accepts inputs from the world and produces outputs in response. This interface is an attack surface. The model was trained on a particular distribution of inputs, and anything outside that distribution is territory the model has not practised navigating, which is exactly where an adversary will take it.

Adversarial examples¶

Adversarial examples are inputs crafted to cause misclassification, constructed by making small, deliberate perturbations that are imperceptible or innocuous to a human observer but highly significant to the model. An image classification model can be induced to confidently identify a stop sign as a speed limit sign by adding a carefully computed pattern of noise that no driver would notice.

Physical-world adversarial attacks extend this beyond digital inputs. Printed patches, modified road markings, and specially designed clothing have all been demonstrated to fool computer vision systems reliably. Autonomous vehicles, surveillance systems, and access control infrastructure are the obvious targets.

The underlying reason adversarial examples exist is that models learn statistical correlations rather than the features a human would consider meaningful. They are brittle in ways their accuracy metrics do not reveal, because standard test sets do not contain adversarial inputs.

Prompt injection¶

For systems built on large language models, prompt injection is the equivalent of adversarial examples applied to text. An attacker embeds instructions within content the model is asked to process, causing the model to treat attacker-supplied text as authoritative instructions rather than data to be handled.

Direct prompt injection targets the model through the user’s own input. Indirect prompt injection hides malicious instructions in external content the model retrieves or processes, such as a web page, a document, or an email. When a language model is given tools, memory, or the ability to act on external systems, a successful injection can cause it to exfiltrate data, take unintended actions, or produce outputs that serve the attacker rather than the user.

The attack is difficult to defend against because the model has no reliable way to distinguish between legitimate instructions and injected ones. Both arrive as text. Both are processed by the same mechanism.

The deployment¶

A model that has survived training, evaluation, and a largely fictional security review must then be deployed, which introduces the standard vulnerabilities of any networked service alongside several that are specific to machine learning.

Inference endpoints¶

The API through which a model serves predictions is an endpoint with all the usual exposure of a web service: authentication weaknesses, rate limiting failures, injection vulnerabilities in input handling, and the various consequences of running software on infrastructure maintained by humans with other things to do. A model endpoint that accepts serialised inputs deserialises them before processing, which is a well-established source of remote code execution vulnerabilities in Python environments particularly fond of pickle.

Excessive querying of an inference endpoint can exhaust compute resources, inflate cloud costs to spectacular levels, and degrade service for legitimate users. Without rate limiting and anomaly detection on query patterns, the endpoint is also providing free access to model extraction and membership inference attacks at the operator’s expense.

Model serialisation¶

Serialised model files are executable artefacts. Loading a model file from an untrusted source is equivalent to running untrusted code, because in most common formats it is. Pickle-serialised PyTorch models will execute arbitrary Python on load. This is not a subtle vulnerability. It is documented, widely known, and regularly exploited when organisations share, download, or deploy models without verifying their provenance.

Model registries with no integrity checking, public repositories with no malware scanning, and deployment pipelines that pull model artefacts from external sources without verification are all points at which a malicious model file can enter an otherwise carefully managed system.

The pipeline¶

The stages that carry a model from notebook to production, as described in the pipeline page, are not merely operational complexity. Each stage is an additional attack surface, and the transitions between stages are seams where trust boundaries are easily overlooked.

CI/CD pipeline compromise¶

The continuous integration system that tests, packages, and publishes model artefacts is a privileged environment with access to source code, secrets, model weights, and deployment infrastructure. Compromising the CI pipeline means compromising everything it touches, with the added advantage that changes introduced here may be attributed to legitimate build processes rather than external intrusion.

Common vectors include injecting malicious code into test dependencies, exploiting misconfigured pipeline permissions, poisoning the package registry the pipeline publishes to, or abusing secrets stored as environment variables. A CI pipeline that pulls dependencies at build time without pinning versions is one dependency substitution attack away from building something other than what was intended.

Artefact tampering¶

Between the point where a model is trained and the point where it serves predictions, the model artefact passes through several storage locations: a local filesystem, a model registry, a container image layer, a cloud storage bucket. Each handoff is an opportunity for tampering if the artefact is not signed and the signature is not verified at the point of use.

An artefact store with overly permissive write access, or a deployment pipeline that pulls model files without checking integrity, can be used to substitute a malicious model for a legitimate one without touching the training code. The substituted model may behave identically on standard inputs and only misbehave in specific circumstances, making the compromise difficult to detect through normal monitoring.

Container image vulnerabilities¶

A Docker image that packages the model API freezes a snapshot of the operating system, runtime, and dependencies at build time. Vulnerabilities discovered after the image is built will accumulate unremedied until the image is rebuilt and redeployed, which in practice means production environments routinely run images containing known vulnerabilities because the rebuild cadence is slower than the vulnerability disclosure cadence.

Base images pulled from public registries without digest pinning are a further risk. An image that specifies a tag rather than a digest may silently update to a different layer if the registry is compromised or the tag is overwritten. The image that deploys in production may not be the image that passed the security scan.

Infrastructure misconfiguration¶

A model deployed to cloud infrastructure inherits the security posture of that infrastructure. Overly permissive IAM roles, publicly accessible storage buckets containing training data or model artefacts, unencrypted inter-service communication, and insufficiently restricted network exposure are operational failures that create exploitable conditions regardless of how well the model itself was built.

Infrastructure-as-code disciplines the configuration of cloud resources and makes it reviewable and auditable, but it also concentrates privileged infrastructure definitions in a repository that becomes a high-value target. A committed secret, a misconfigured Terraform resource, or an overprivileged service account in an IaaS deployment can undermine every other security control in the pipeline.

The ecosystem¶

The attack surface of a machine learning system does not end at its own boundary. It extends to the pre-trained models it builds on, the libraries it uses for training and inference, the datasets assembled from external sources, and the humans with access to any part of the pipeline.

Supply chain¶

A compromised pre-trained model distributed through a public repository will propagate its compromise to every fine-tuned derivative. A malicious data collection pipeline will introduce poisoned examples at scale. A vulnerable version of a widely used ML library will expose every system that installs it without pinning dependencies.

The ML ecosystem relies heavily on public infrastructure: model hubs, dataset repositories, and open-source libraries maintained by small teams with limited security resources. This is a supply chain with a large surface area and inconsistent controls, and the value of compromising a foundational model or widely used dataset is proportional to how many downstream systems depend on it.

Federated learning¶

Federated learning distributes the training process across many participants, which distributes the trust assumptions along with it. A malicious participant can submit poisoned gradient updates designed to introduce backdoors or degrade performance for specific groups. Gradient inversion attacks can reconstruct private training data from the updates submitted by other participants. Byzantine attacks involve coordinated malicious participants acting together to overwhelm honest contributions.

The privacy guarantees of federated learning depend on the assumption that model updates do not leak information about local data, which research has repeatedly shown to be insufficiently robust without additional protections such as differential privacy and secure aggregation. These protections add complexity and cost. They are frequently omitted from production deployments that nonetheless advertise the privacy benefits of federated learning in their marketing materials.