Large Language Models and Cybersecurity Risks

Large language models (LLMs) introduce a distinct class of cybersecurity risks that differ structurally from traditional software vulnerabilities. This page maps the threat landscape, underlying mechanics, regulatory framing, and classification boundaries relevant to LLM deployment in enterprise and public-sector environments. Professionals assessing AI system risk, procurement officers evaluating vendor claims, and researchers characterizing the attack surface will find reference-grade coverage of how these systems fail and how those failures are categorized by standards bodies and government agencies.


Definition and scope

An LLM cybersecurity risk is any threat vector, vulnerability class, or failure mode that arises specifically from the architecture, training process, or deployment context of large language models — as distinct from general software or network risks. The scope encompasses risks to the LLM system itself (model theft, data extraction, adversarial manipulation), risks LLMs introduce to host environments (privilege escalation via autonomous agents, supply chain contamination), and risks that LLMs amplify for external threat actors (automated phishing, malware generation, social engineering at scale).

The NIST AI Risk Management Framework (AI RMF 1.0) identifies LLM-related risks under the categories of reliability, safety, security, and accountability. NIST's companion publication NIST AI 100-1 explicitly addresses generative AI and names adversarial prompting, data poisoning, and sensitive information disclosure as primary concern areas. The CISA AI Security Center, established under the CISA AI Roadmap (2023), covers LLM risks within its framework for critical infrastructure AI adoption.

The practical scope of this risk domain spans 3 layers: the model layer (weights, training data, fine-tuning pipelines), the application layer (APIs, retrieval-augmented generation systems, agent frameworks), and the operational layer (user interfaces, access controls, logging infrastructure).


Core mechanics or structure

LLM security risks emerge from the model's fundamental design: a transformer-based neural network trained on large text corpora that generates outputs by predicting probable token sequences. This architecture produces 4 structural attack surfaces.

Prompt injection exploits the model's inability to distinguish instructions from data. An attacker embeds malicious instructions inside input content — a PDF, a webpage, a database record — that the model processes and obeys. In agentic systems with tool access, a successful prompt injection can trigger filesystem reads, API calls, or lateral movement within an enterprise environment.

Training data extraction leverages the model's tendency to memorize verbatim sequences from pretraining corpora. Research published by Carlini et al. (2021) and replicated in subsequent studies demonstrated that sufficiently large models can reproduce memorized text including personally identifiable information, private credentials, and proprietary content present in training data.

Model inversion and membership inference allow adversaries to probe a model's outputs to reconstruct approximate training data or determine whether a specific record was included in training — a direct privacy risk under frameworks governed by the FTC Act Section 5 and state-level laws such as the California Consumer Privacy Act (CCPA).

Supply chain poisoning targets fine-tuning datasets, model repositories, and third-party plugins. A poisoned base model introduced to a model hub can propagate malicious behavior across downstream deployments that inherit the model without independent validation. The MITRE ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems) taxonomy documents 14 distinct attack tactics against ML systems, with supply chain compromise listed as a high-impact technique.


Causal relationships or drivers

The proliferation of LLM cybersecurity risks is driven by 5 interconnected structural factors.

Capability-boundary ambiguity. LLMs do not have deterministic, auditable logic paths. An operator cannot enumerate all inputs a deployed model will accept or all outputs it may produce, which prevents exhaustive pre-deployment testing.

Context conflation. The same token stream that carries user intent also carries data the model processes. Unlike a traditional SQL interface that separates queries from data, LLMs interpret both within a single undifferentiated context window — the root cause of prompt injection.

Accessibility of attack tools. The OWASP Top 10 for Large Language Model Applications documents that attack techniques such as jailbreaking, indirect prompt injection, and model denial-of-service require no specialized tooling — only natural language access.

Regulatory lag. Existing cybersecurity compliance frameworks — including NIST SP 800-53 Rev 5 controls and FedRAMP authorization requirements — were designed before generative AI deployment at scale. Mapping LLM risks to control families such as SI (System and Information Integrity) and AU (Audit and Accountability) requires interpretive work that no published authoritative mapping has fully resolved as of the 2024 NIST AI RMF profile cycle.

Agentic escalation. When LLMs operate as autonomous agents with access to APIs, code execution environments, or multi-step pipelines, a single successful manipulation can trigger cascading actions across systems. The EU AI Act (Regulation (EU) 2024/1689), which classifies certain AI systems by risk level, places general-purpose AI models with systemic risk (defined as those trained on compute exceeding 10^25 FLOPs) under enhanced transparency and adversarial testing obligations.

The AI Cyber Authority directory provides a structured reference for organizations seeking vendors and service providers operating in this risk domain.


Classification boundaries

LLM risks are classified along 3 primary axes in established taxonomies.

By attack target: Risks targeting the model itself (extraction, inversion, poisoning) are distinguished from risks targeting the host environment through the model (prompt injection, agent escalation, unauthorized API invocation).

By adversarial intent: MITRE ATLAS separates reconnaissance attacks (membership inference, model probing), evasion attacks (adversarial examples, jailbreaks), and integrity attacks (data poisoning, backdoor insertion). The taxonomy is maintained separately from MITRE ATT&CK but cross-referenced to it for enterprise threat intelligence integration.

By regulatory classification: Under NIST AI RMF, risks map to 4 functions — Govern, Map, Measure, Manage — rather than to traditional CVE-style vulnerability categories. CISA's AI Security Incident Taxonomy (2024 draft) proposes classifying AI-specific incidents separately from general cybersecurity incidents, which would affect mandatory reporting obligations under the Cyber Incident Reporting for Critical Infrastructure Act (CIRCIA, 6 U.S.C. § 681 et seq.).

Professionals navigating available services in this space can reference the AI Cyber listings for categorized provider information.


Tradeoffs and tensions

Safety vs. capability. Techniques used to reduce harmful outputs — RLHF (Reinforcement Learning from Human Feedback), content filters, output classifiers — degrade model utility on legitimate tasks. Overfitted safety fine-tuning produces models that refuse valid professional queries, while under-constrained models remain exploitable.

Transparency vs. security. Publishing model cards, training data details, and system prompts — as recommended by NIST AI 100-1 for transparency — simultaneously provides adversaries with intelligence useful for targeted attacks. This tension has no resolved consensus solution.

Detection vs. false positive rate. Prompt injection detection systems that flag broad patterns generate high false positive rates that undermine operational use. More permissive filters reduce false positives but miss novel attack variants. There is no published benchmark establishing an acceptable false positive rate for enterprise LLM security controls.

Open vs. closed models. Open-weight models allow independent security auditing and give operators full deployment control, but also allow adversaries to fine-tune models for offensive use without API-level guardrails. Closed API models maintain vendor-side controls but create supply chain dependency and limit auditability — a tension explicitly acknowledged in the White House Executive Order on Safe, Secure, and Trustworthy AI (EO 14110, October 2023).


Common misconceptions

Misconception: LLMs cannot be hacked because they have no traditional code execution paths.
Correction: LLMs with tool-use, code interpreter, or API access capabilities can trigger real system actions through natural language instructions. A successful prompt injection against an agentic LLM can initiate file deletions, credential exfiltration, or unauthorized transactions without any conventional code exploit.

Misconception: Fine-tuning on proprietary data removes pretraining privacy risks.
Correction: Fine-tuning does not overwrite pretraining memorization. A model fine-tuned on internal enterprise data retains the capacity to reproduce memorized pretraining content. Carlini et al. (2023) demonstrated that fine-tuned models continue to produce verbatim pretraining sequences under appropriate prompting.

Misconception: Prompt injection is only a concern for consumer-facing chatbots.
Correction: The highest-impact prompt injection vectors are in backend processing pipelines — LLMs parsing emails, summarizing documents, or processing database records — where user-facing controls are absent entirely.

Misconception: Jailbreaks are a solved problem addressed by model updates.
Correction: Jailbreak techniques evolve continuously. The OWASP LLM Top 10 classifies prompt injection as the #1 vulnerability for LLM applications precisely because no general-purpose defense has been demonstrated to eliminate it across all input domains.

Misconception: Regulatory frameworks adequately cover LLM-specific risks.
Correction: As of the NIST AI RMF 1.0 publication, existing US federal cybersecurity control frameworks do not contain LLM-specific controls. Mapping exercises are underway but no final authoritative control mapping has been published by NIST, CISA, or OMB.


Checklist or steps

The following represents the documented phases in LLM security assessment as described in the NIST AI RMF Playbook and OWASP LLM Application Security Verification Standard (LLMASVS) framework. This is a reference sequence, not a prescriptive operational procedure.

Phase 1 — Asset inventory
- Identify all LLM components in the system (base models, fine-tuned variants, embedded APIs, third-party plugins)
- Document model provenance: source, training data lineage, and fine-tuning pipeline

Phase 2 — Threat modeling
- Map applicable MITRE ATLAS attack tactics to the deployment architecture
- Identify trust boundaries between user input, model context, and tool-use environments
- Classify the system under NIST AI RMF risk categories (Govern, Map, Measure, Manage)

Phase 3 — Attack surface enumeration
- Test for direct prompt injection via all user-controlled input channels
- Test for indirect prompt injection in all data sources the model processes (documents, web retrievals, database records)
- Probe for training data memorization using known extraction techniques

Phase 4 — Control mapping
- Map existing NIST SP 800-53 Rev 5 controls to identified LLM risk areas
- Identify control gaps where no existing control addresses the LLM-specific attack vector
- Document residual risk for each unmapped attack surface

Phase 5 — Incident response integration
- Define LLM-specific incident categories aligned to CISA's AI Security Incident Taxonomy (draft)
- Establish logging and monitoring requirements for agentic LLM actions
- Verify CIRCIA reporting applicability for LLM-related incidents affecting critical infrastructure

Researchers and procurement professionals can explore how the AI Cyber Authority resource structure organizes service providers across these assessment phases.


Reference table or matrix

Risk Category Attack Technique MITRE ATLAS Tactic Relevant Control Framework Primary Regulatory Body
Prompt Injection (Direct) Malicious system prompt override AML.T0051 NIST SP 800-53 SI-10, SI-15 NIST / CISA
Prompt Injection (Indirect) Poisoned document retrieval AML.T0051.002 NIST SP 800-53 SI-7, SC-28 NIST / CISA
Training Data Extraction Verbatim memorization probing AML.T0037 NIST AI RMF (Measure 2.5) FTC, NIST
Model Inversion Output-based data reconstruction AML.T0038 NIST AI RMF (Map 5.1) FTC, HHS (for health data)
Data Poisoning Backdoor via fine-tuning dataset AML.T0020 NIST SP 800-161 (Supply Chain) CISA, DoD (for federal systems)
Model Theft Distillation via repeated querying AML.T0044 NIST AI 100-1, §3.6 FTC
Jailbreaking Adversarial prompt crafting AML.T0054 OWASP LLM Top 10 (LLM01) No dedicated US statute
Agentic Escalation Tool-use chain manipulation AML.T0051 + AML.T0043 NIST AI RMF (Govern 1.7) CISA (critical infrastructure)
Supply Chain Poisoning Malicious model hub artifact AML.T0016 NIST SP 800-161r1 CISA, NSA
Denial of Service Token flooding, resource exhaustion AML.T0029 NIST SP 800-53 SC-5 CISA

References

📜 6 regulatory citations referenced  ·  🔍 Monitored by ANA Regulatory Watch  ·  View update log

Explore This Site