AI Model Poisoning and Data Integrity Attacks
AI model poisoning and data integrity attacks represent a class of adversarial threats targeting the training pipelines, datasets, and inference mechanisms of machine learning systems. These attacks compromise the reliability of AI outputs — sometimes covertly, without triggering conventional security alerts — making them a distinct challenge for organizations deploying AI in high-stakes environments. The scope spans government systems, financial infrastructure, healthcare diagnostics, and autonomous systems. U.S. government bodies including NIST, CISA, and the Department of Defense have identified this threat class as a priority concern in AI risk frameworks.
- Definition and Scope
- Core Mechanics or Structure
- Causal Relationships or Drivers
- Classification Boundaries
- Tradeoffs and Tensions
- Common Misconceptions
- Checklist or Steps
- Reference Table or Matrix
Definition and Scope
Model poisoning is a category of adversarial machine learning attack in which an adversary manipulates the training data, model weights, or fine-tuning process to alter the behavior of a deployed AI system. Data integrity attacks are a broader category encompassing any unauthorized modification of the data assets — labeled datasets, feature stores, embeddings, or feedback loops — that AI systems depend on for training or inference.
NIST AI 100-1 (Artificial Intelligence Risk Management Framework) identifies "data poisoning" as a primary adversarial threat to AI systems, alongside evasion attacks and model extraction. The framework addresses these threats under its "secure and resilient" trustworthiness characteristic and treats them as distinct from conventional software vulnerabilities because they exploit the statistical learning process itself rather than code execution paths.
The scope of affected systems includes supervised classification models, large language models (LLMs), recommendation engines, anomaly detection systems, and reinforcement learning agents. In 2023, CISA's guidance on Secure AI Development — co-published with the UK National Cyber Security Centre — explicitly named training data integrity as a critical attack surface requiring supply chain-level controls.
The operational impact ranges from targeted misclassification of specific inputs (a backdoor trigger causing a facial recognition system to authenticate an unauthorized user) to systemic degradation of model accuracy across entire deployment contexts.
Core Mechanics or Structure
Model poisoning attacks operate across three primary intervention points: the data collection phase, the training phase, and the fine-tuning or feedback phase.
Data collection phase attacks involve injecting malicious samples into datasets before training begins. An adversary who can write to a public web corpus, contribute to a shared annotation pool, or access an organization's data pipeline can embed adversarial examples that shift decision boundaries in targeted ways. When large language models train on internet-scale corpora, even a small percentage of poisoned documents — studies in the machine learning security literature have demonstrated manipulations with contamination rates below 0.1% of training data — can induce measurable behavioral changes.
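The contamination arithmetic above can be sketched with a toy synthetic dataset. All sizes, the marker feature, and the attacker-chosen label here are hypothetical illustrations, not drawn from any published attack:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical clean corpus: feature rows with a benign labeling rule.
d = 16
n_clean = 20_000
clean_X = rng.normal(size=(n_clean, d))
clean_y = (clean_X[:, 0] > 0).astype(int)

# Attacker-contributed rows: a rare out-of-distribution marker feature
# paired with an attacker-chosen label, at roughly 0.05% of the final set.
n_poison = 10
trigger = np.zeros(d)
trigger[-1] = 8.0                      # marker feature, far outside N(0, 1)
poison_X = rng.normal(scale=0.1, size=(n_poison, d)) + trigger
poison_y = np.ones(n_poison, dtype=int)

X = np.vstack([clean_X, poison_X])
y = np.concatenate([clean_y, poison_y])

contamination = n_poison / len(y)
print(f"contamination rate: {contamination:.4%}")
```

The point of the sketch is the ratio: the poisoned rows are a vanishing fraction of the corpus, yet they pair a distinctive input pattern with a chosen label, which is exactly the lever the sub-0.1% results in the literature exploit.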
Training phase attacks require higher access levels, typically to the training infrastructure itself. Gradient manipulation techniques can alter the optimization trajectory of a model during training without modifying the visible dataset. Federated learning environments present a particular vulnerability: in federated setups, individual participant nodes submit gradient updates, and a compromised node can submit adversarially crafted gradients (NIST IR 8269 covers federated learning threat models).
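A minimal simulation of the federated scenario, with per-node norm clipping as one commonly discussed mitigation. Node counts, gradient sizes, and the clipping bound are illustrative assumptions:

```python
import numpy as np

def fedavg(updates):
    """Plain federated averaging of per-node gradient updates."""
    return np.mean(updates, axis=0)

def clipped_fedavg(updates, max_norm=1.0):
    """Bound each node's influence by clipping its update to max_norm."""
    clipped = []
    for u in updates:
        norm = np.linalg.norm(u)
        clipped.append(u * min(1.0, max_norm / norm) if norm > 0 else u)
    return np.mean(clipped, axis=0)

rng = np.random.default_rng(1)
honest = [rng.normal(scale=0.1, size=8) for _ in range(9)]
malicious = np.full(8, 50.0)   # compromised node: huge crafted gradient

updates = honest + [malicious]
naive = fedavg(updates)
defended = clipped_fedavg(updates, max_norm=1.0)
print(np.linalg.norm(naive), np.linalg.norm(defended))
```

With naive averaging the single compromised node dominates the aggregate; with clipping, its contribution is bounded to the same budget as every honest node. Clipping is only one defense, and a constrained attacker can still bias the aggregate within the clipping bound.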
Backdoor implantation is a specialized subcategory. A backdoored model behaves normally on clean inputs but produces attacker-specified outputs when a trigger pattern is present in the input. The trigger can be a pixel pattern in an image, a specific token sequence in text, or a sensor reading pattern in a control system. Detection is difficult because the model passes standard accuracy benchmarks until the trigger is activated.
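A toy sketch of backdoor implantation, assuming a plain logistic regression trained with NumPy. The trigger feature, target class, and a deliberately exaggerated 5% contamination rate are all illustrative choices so the effect is visible at this tiny scale:

```python
import numpy as np

rng = np.random.default_rng(42)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical inputs: 10 features, benign rule = sign of feature 0.
d, n = 10, 2000
X = rng.normal(size=(n, d))
y = (X[:, 0] > 0).astype(float)

# Implant backdoor: set a trigger feature to an out-of-range value and
# relabel those rows to the attacker's target class (1).
TRIGGER_IDX, TRIGGER_VAL, TARGET = d - 1, 6.0, 1.0
n_poison = 100   # exaggerated 5% rate for this toy demo
X[:n_poison, TRIGGER_IDX] = TRIGGER_VAL
y[:n_poison] = TARGET

# Train ordinary logistic regression by full-batch gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = sigmoid(X @ w + b)
    w -= 0.5 * (X.T @ (p - y) / n)
    b -= 0.5 * np.mean(p - y)

# Clean test set: the model looks healthy on ordinary inputs...
Xt = rng.normal(size=(1000, d))
yt = (Xt[:, 0] > 0).astype(float)
clean_acc = np.mean((sigmoid(Xt @ w + b) > 0.5) == yt)

# ...but stamping the trigger pushes class-0 inputs toward the target class.
Xtrig = Xt[yt == 0].copy()
Xtrig[:, TRIGGER_IDX] = TRIGGER_VAL
attack_success = np.mean(sigmoid(Xtrig @ w + b) > 0.5)
print(f"clean accuracy: {clean_acc:.2f}, trigger success: {attack_success:.2f}")
```

The model scores well on the clean test set while the trigger steers inputs toward the attacker's class, which is why the paragraph above notes that standard accuracy benchmarks do not surface the backdoor.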
Fine-tuning and reinforcement learning from human feedback (RLHF) attacks target post-deployment adaptation. Adversaries who can influence the reward signal or the human feedback annotations used to align LLMs can steer model behavior over time.
Causal Relationships or Drivers
The structural conditions enabling these attacks include:
Open-source training data dependence. Models trained on uncurated public data inherit the integrity profile of that data. The Common Crawl corpus, used in training a large proportion of publicly available LLMs, contains content from approximately 3 billion web pages — a surface area no organization fully audits before use.
Third-party model supply chains. Organizations that fine-tune or deploy pre-trained foundation models inherit any poisoning already embedded in the base model. The NIST Secure Software Development Framework (SSDF), SP 800-218, addresses software supply chain integrity, but the ML model supply chain lacks equivalent standardized verification tooling.
Labeling outsourcing. Human annotation is frequently outsourced through crowdsourcing platforms, creating an insertion point that is difficult to audit at scale. Adversaries with access to labeling pipelines — even a minority of annotators — can introduce systematic label flips that bias model outputs.
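The minority-annotator exposure can be sketched with an idealized simulation. The assumptions are loud ones: honest annotators are perfectly accurate, the adversarial pair is fully coordinated, and the pool and panel sizes are arbitrary:

```python
import random

random.seed(3)

POOL = list(range(7))        # hypothetical annotator pool
ADVERSARIAL = {0, 1}         # a coordinated minority of the pool
TRUE_LABEL = 1               # the item class the adversaries target

flipped = 0
n_items = 10_000
for _ in range(n_items):
    panel = random.sample(POOL, 3)   # each item labeled by 3 annotators
    votes = [0 if a in ADVERSARIAL else TRUE_LABEL for a in panel]
    majority = 1 if sum(votes) >= 2 else 0
    flipped += (majority != TRUE_LABEL)

print(f"items mislabeled despite majority vote: {flipped / n_items:.1%}")
```

Even with majority voting, any panel that happens to contain both adversaries (analytically 1/7 of panels in this setup, about 14%) receives a systematically flipped label, which is the kind of bias that is hard to spot by auditing individual annotators.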
Continuous learning deployments. Models that retrain on new production data perpetually expand their attack surface. Each retraining cycle using unvalidated production data provides a new opportunity for adversarial injection via crafted user inputs.
Limited detection tooling. Unlike malware detection, which benefits from decades of signature-based and behavioral tooling, model integrity verification tooling remains nascent. The MITRE ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems) matrix catalogs attack techniques, but the corresponding defensive coverage is uneven.
Classification Boundaries
Model poisoning and data integrity attacks are classified along four independent axes:
By objective: Targeted attacks aim to control model behavior on specific inputs (e.g., a defined trigger). Indiscriminate attacks degrade overall model performance across broad input distributions.
By access level required: Black-box attacks require no access to model internals or training data — only the ability to query the model or submit data to a public training corpus. White-box attacks require access to model architecture, weights, or gradients.
By timing: Training-time attacks (poisoning) occur before or during model training. Inference-time attacks (adversarial examples, prompt injection) occur after deployment. Data integrity attacks may span both phases if feedback loops connect deployment back to retraining.
By persistence: Ephemeral attacks affect only a single inference session. Persistent attacks alter model weights or stored training artifacts and survive across sessions and redeployments.
The MITRE ATLAS framework organizes this landscape into 14 tactics and over 80 documented techniques, providing the most comprehensive public taxonomy for cross-referencing attack variants.
Tradeoffs and Tensions
Detection vs. model utility. Provenance filtering and anomaly detection on training data reduce poisoning risk but also remove legitimate edge-case examples that improve model robustness. Aggressive filtering can homogenize training data and reduce model generalization.
Transparency vs. attack surface. Publishing model cards and training dataset documentation — as recommended by NIST AI RMF — increases accountability but also discloses information adversaries can use to craft more targeted poisoning samples.
Federated privacy vs. gradient integrity. Federated learning architectures preserve data privacy by keeping raw data decentralized, but the same decentralization prevents centralized auditing of gradient updates, increasing vulnerability to gradient poisoning from compromised participants.
Third-party models vs. internal control. Deploying pre-trained foundation models accelerates development timelines but transfers model integrity responsibility to the upstream provider, where contractual and technical verification mechanisms are still immature.
Human oversight vs. scale. RLHF-based alignment improves LLM behavior but introduces a human feedback channel that scales poorly against adversarial annotation. Automated oversight proxies reduce the annotation bottleneck but may not detect subtle behavioral manipulations.
Common Misconceptions
Misconception: Model poisoning requires insider access.
Correction: Black-box poisoning via public data contribution requires no insider access. Adversaries who can write to a publicly scraped corpus, submit pull requests to open-source training datasets, or participate in public annotation pools have demonstrated viable attack vectors. MITRE ATLAS documents public-contribution poisoning as a confirmed technique (AML.T0020).
Misconception: Small perturbations in training data have negligible effect.
Correction: Backdoor attacks in the published ML security literature have achieved high trigger success rates with contamination fractions well under 1% of total training data. The statistical amplification effect of gradient descent means small data perturbations can produce large, targeted behavioral effects.
Misconception: Accuracy benchmarks validate model integrity.
Correction: A backdoored model passes standard accuracy benchmarks by design — it behaves normally on clean test sets. Integrity validation requires specific trigger-detection methods, certified defenses, or input provenance tracking, not accuracy metrics alone.
Misconception: Production models are protected by inference-layer security.
Correction: Input validation and prompt injection defenses address inference-time attacks but do not protect against training-time poisoning already embedded in model weights. These are distinct attack surfaces requiring distinct defenses.
Misconception: Only deep learning models are vulnerable.
Correction: Classical supervised learning models — decision trees, support vector machines, gradient boosted models — are also vulnerable to label-flipping and feature-space poisoning. The NIST AI 100-1 framework treats adversarial ML threats as applicable across model classes.
Checklist or Steps
The following phases represent the structural components of a model integrity assurance assessment. This sequence reflects the operational structure documented in frameworks including NIST AI RMF and CISA Secure AI Development guidelines — not a prescriptive protocol.
Phase 1: Threat Model Construction
- Identify all training data sources (internal, third-party, public)
- Map data provenance chains from collection through ingestion
- Identify fine-tuning and continuous learning pipelines
- Document third-party and open-source base models in use
Phase 2: Data Integrity Controls Inventory
- Verify whether cryptographic hashing or signing is applied to dataset snapshots
- Confirm whether data provenance metadata is retained through preprocessing
- Identify annotation sourcing mechanisms and access controls on labeling pipelines
- Assess filtering and anomaly detection applied to incoming training data
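A minimal sketch of the snapshot-hashing control in the first item above, using only the standard library. File names and contents are placeholders:

```python
import hashlib
import tempfile
from pathlib import Path

def snapshot_digest(paths):
    """SHA-256 over filenames and contents in a fixed order, so any
    byte-level change to the snapshot changes the recorded digest."""
    h = hashlib.sha256()
    for p in sorted(paths, key=lambda p: Path(p).name):
        h.update(Path(p).name.encode())
        h.update(b"\x00")                  # separator between name and content
        h.update(Path(p).read_bytes())
    return h.hexdigest()

# Record the digest at snapshot time; re-verify before every training run.
with tempfile.TemporaryDirectory() as d:
    f = Path(d) / "train.csv"
    f.write_bytes(b"id,label\n1,0\n")
    baseline = snapshot_digest([f])
    f.write_bytes(b"id,label\n1,1\n")      # a single flipped label...
    assert snapshot_digest([f]) != baseline  # ...changes the digest
```

Hashing detects tampering after the snapshot is taken; it says nothing about whether the snapshot was already poisoned, which is why it complements rather than replaces the provenance and filtering items in this phase.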
Phase 3: Training Pipeline Security Review
- Review access controls on training infrastructure and gradient checkpoints
- Assess logging coverage for training runs (reproducibility and auditability)
- Identify whether federated learning nodes are authenticated and gradient contributions are bounded
Phase 4: Model Validation Against Adversarial Inputs
- Apply trigger-detection evaluation using held-out adversarial test sets
- Test model behavior on distribution-shifted inputs systematically
- Compare against baseline provenance-verified model versions where available
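The evaluation contrast behind the steps above can be sketched as a small harness. The backdoored model and trigger function here are hard-coded stand-ins so the metrics are easy to follow, not a real detector:

```python
import numpy as np

def evaluate_integrity(predict, X_clean, y_clean, apply_trigger, target_class):
    """Report clean accuracy alongside attack success rate (ASR): the
    fraction of non-target inputs that flip to the attacker's class when
    triggered. A model can ace the first metric while failing the second."""
    clean_acc = np.mean(predict(X_clean) == y_clean)
    victims = X_clean[y_clean != target_class]
    asr = np.mean(predict(apply_trigger(victims)) == target_class)
    return clean_acc, asr

# Stand-in backdoored model: follows the benign rule on feature 0 unless
# the last feature carries the trigger value, then outputs class 1.
def predict(X):
    out = (X[:, 0] > 0).astype(int)
    out[X[:, -1] == 9.0] = 1
    return out

def apply_trigger(X):
    Xt = X.copy()
    Xt[:, -1] = 9.0
    return Xt

rng = np.random.default_rng(7)
X = rng.normal(size=(500, 4))
y = (X[:, 0] > 0).astype(int)
acc, asr = evaluate_integrity(predict, X, y, apply_trigger, target_class=1)
print(f"clean accuracy: {acc:.2f}, attack success rate: {asr:.2f}")
```

Reporting both numbers per model version makes the comparison against provenance-verified baselines concrete: a jump in ASR with flat clean accuracy is the signature this phase is looking for.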
Phase 5: Deployment and Feedback Loop Assessment
- Identify whether production inference data feeds back into retraining pipelines
- Assess controls on human feedback annotation for RLHF-deployed models
- Confirm that model versioning and rollback capabilities are operational
Reference Table or Matrix
| Attack Type | Intervention Point | Access Required | Persistence | Primary Taxonomy Reference |
|---|---|---|---|---|
| Training data poisoning | Data collection | Black-box (public contribution) | Persistent (in weights) | MITRE ATLAS AML.T0020 |
| Label flipping | Annotation pipeline | Labeling platform access | Persistent | NIST AI 100-1, §3.6 |
| Gradient poisoning | Training phase | White-box / federated node | Persistent | NIST IR 8269 |
| Backdoor implantation | Training phase | Dataset or gradient access | Persistent (trigger-activated) | MITRE ATLAS AML.T0018 |
| Fine-tuning poisoning | Post-training adaptation | Fine-tuning data access | Persistent | CISA Secure AI Guidelines (2023) |
| RLHF feedback manipulation | Human feedback loop | Annotation channel access | Persistent (cumulative) | NIST AI RMF, Govern 1.7 |
| Supply chain model poisoning | Pre-trained model distribution | Upstream provider access | Persistent (inherited) | NIST SSDF SP 800-218 |
| Inference-time adversarial input | Deployment / inference | Query access only | Ephemeral | MITRE ATLAS AML.T0043 |
References
- NIST AI 100-1: Artificial Intelligence Risk Management Framework (AI RMF 1.0)
- NIST SP 800-218: Secure Software Development Framework (SSDF)
- NIST IR 8269: A Taxonomy and Terminology of Adversarial Machine Learning (Draft)
- CISA & NCSC: Guidelines for Secure AI System Development (2023)
- MITRE ATLAS: Adversarial Threat Landscape for Artificial-Intelligence Systems
- NIST AI RMF Playbook (Govern, Map, Measure, Manage)