arXiv — Thought-Transfer: clean-label poisoning via chain-of-thought traces

2026-01-30 • Category: Research

Context: Many teams fine-tune reasoning models using chain-of-thought (CoT) datasets from public repositories (e.g., Hugging Face).
New attack class: The paper proposes Indirect Targeted Poisoning for reasoning models, dubbed Thought-Transfer.
What’s different: Instead of poisoning the question or final answer, the attacker modifies only the CoT traces in training examples.
“Clean label” angle: The queries and answers can remain unchanged, making the poisoned dataset look legitimate under naive spot checks.
Cross-domain effect: The authors claim the induced behavior can transfer into target domains not present in training.
Reported success: The abstract reports ~70% success at injecting targeted behaviors across tasks (per their setup).
Incentive misalignment: They also claim poisoned reasoning data can improve benchmark scores by a reported 10–15%, giving adopters a perverse incentive to use tainted datasets.
Defense gap: The paper argues existing mitigations for poisoning/backdoors don’t directly address this “trace-only” manipulation.
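
To make the "clean label" point concrete, here is a minimal sketch of why a check that only compares questions and answers would miss a trace-only edit. The record layout (`Example` with `question`, `trace`, `answer`), the helper names, and the existence of a trusted reference copy are all assumptions for illustration; none of them come from the paper.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """Hypothetical record layout for a CoT training example."""
    question: str
    trace: str   # chain-of-thought text
    answer: str


def changed_labels(reference, candidate):
    """Indices where the question or the answer differs (the naive check)."""
    return [
        i for i, (r, c) in enumerate(zip(reference, candidate))
        if (r.question, r.answer) != (c.question, c.answer)
    ]


def changed_traces(reference, candidate):
    """Indices where the trace text differs, compared by SHA-256 digest."""
    def digest(ex):
        return hashlib.sha256(ex.trace.encode("utf-8")).hexdigest()

    return [
        i for i, (r, c) in enumerate(zip(reference, candidate))
        if digest(r) != digest(c)
    ]


# Trusted copy (assumed to exist) versus a copy pulled from a public source.
reference = [
    Example("What is 12 * 4?", "12 * 4 = 48.", "48"),
    Example("Is 17 prime?", "No divisor between 2 and 4 divides 17.", "yes"),
]
candidate = [
    reference[0],
    # Same question and answer; only the reasoning text was replaced.
    Example("Is 17 prime?", "<attacker-edited reasoning, same conclusion>", "yes"),
]

print(changed_labels(reference, candidate))  # the naive check finds nothing
print(changed_traces(reference, candidate))  # the trace comparison flags row 1
```

A digest comparison only helps when a trusted reference exists. For a dataset with no clean baseline, content-level checks on the traces themselves are the remaining option.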

Why it matters

CoT traces are becoming supply chain artifacts: If you treat reasoning datasets as “just text,” you may be importing behavior-level vulnerabilities.
Detection is harder than typical poisoning: Keeping answers intact reduces obvious red flags, especially for teams doing weak validation or only checking accuracy.
Agent impact: If an agent’s planning/reasoning style is subtly biased by poisoned traces, the downstream effect could be unsafe tool use or systematic policy bypass.

Pin + vet training data: Treat CoT corpora as dependencies, with version pinning, provenance checks, and signed artifacts where possible.
Prefer minimal-trace training: Where you can, avoid training on “free-form” traces from unknown sources; use curated traces or synthetic data you generate and audit.
Scan traces for invariant violations: Add static checks that enforce format constraints and flag forbidden patterns, unusual tool references, hidden triggers, and motifs repeated across samples.
Red-team with behavior tests: Build evaluation suites that look for targeted misbehavior across multiple tasks/domains, not just accuracy on the training distribution.
Separate planning from execution: In agent systems, treat reasoning outputs as untrusted; gate tool execution with deterministic policy checks.
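
The pinning and trace-scanning items above could be sketched roughly as follows. This assumes a JSONL dataset with a `trace` field and a SHA-256 digest recorded when the dataset was first reviewed. The patterns, thresholds, and function names are hypothetical placeholders, not rules from the paper. Hits from heuristics like these call for human review and do not prove poisoning.

```python
import hashlib
import json
import re
from collections import Counter

# Hypothetical deny-list; tune it to what your traces should never contain.
FORBIDDEN_PATTERNS = [
    re.compile(r"https?://", re.IGNORECASE),
    re.compile(r"ignore (all|any|previous) instructions", re.IGNORECASE),
    re.compile(r"\b(curl|wget|rm -rf|sudo)\b", re.IGNORECASE),
]
MAX_TRACE_CHARS = 8000   # assumed format constraint
MOTIF_WORDS = 6          # n-gram length used for repeated-motif detection
MOTIF_MIN_SHARE = 0.2    # flag n-grams present in at least 20% of traces


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def scan_trace(trace: str) -> list:
    """Return a list of findings for a single trace (empty if none)."""
    findings = []
    if len(trace) > MAX_TRACE_CHARS:
        findings.append("too_long")
    for pattern in FORBIDDEN_PATTERNS:
        if pattern.search(trace):
            findings.append("pattern:" + pattern.pattern)
    return findings


def repeated_motifs(traces, n=MOTIF_WORDS, min_share=MOTIF_MIN_SHARE):
    """Word n-grams that occur in an unusually large share of traces."""
    counts = Counter()
    for trace in traces:
        words = trace.lower().split()
        # A set, so each n-gram counts at most once per trace.
        counts.update({" ".join(words[i:i + n]) for i in range(len(words) - n + 1)})
    threshold = max(2, int(min_share * len(traces)))
    return sorted(gram for gram, count in counts.items() if count >= threshold)


def vet_jsonl(raw: bytes, pinned_digest: str, trace_field: str = "trace") -> dict:
    """Verify the pinned digest, then run the static checks on every trace."""
    if sha256_hex(raw) != pinned_digest:
        raise ValueError("dataset digest does not match the pinned value")
    rows = [json.loads(line) for line in raw.decode("utf-8").splitlines() if line.strip()]
    traces = [row[trace_field] for row in rows]
    flagged = {}
    for index, trace in enumerate(traces):
        findings = scan_trace(trace)
        if findings:
            flagged[index] = findings
    return {"flagged_rows": flagged, "repeated_motifs": repeated_motifs(traces)}
```

In use, the raw bytes of the dataset file and the digest stored alongside the training config would be passed to `vet_jsonl`. A digest mismatch stops the run before any trace is parsed.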
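
For the last item, a deterministic gate between the model's proposed action and actual execution might look like the sketch below. The tool names, the allowlist, the `/workspace/` root, and the `ToolCall` type are all hypothetical. The key property is that the decision depends only on the structured call, never on the model's reasoning text.

```python
import posixpath
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """A structured action proposed by the model (hypothetical shape)."""
    name: str
    args: dict


# Hypothetical policy: allowlisted tools and the argument names each accepts.
ALLOWED_TOOLS = {
    "search_docs": {"query"},
    "read_file": {"path"},
}
READ_ROOT = "/workspace/"


def check_call(call: ToolCall):
    """Return (allowed, reason) using only the call's name and arguments."""
    allowed_args = ALLOWED_TOOLS.get(call.name)
    if allowed_args is None:
        return False, "tool not allowlisted: " + call.name
    extra = set(call.args) - allowed_args
    if extra:
        return False, "unexpected arguments: " + ", ".join(sorted(extra))
    if call.name == "read_file":
        # Normalize first so "../" segments cannot escape the allowed root.
        normalized = posixpath.normpath(str(call.args.get("path", "")))
        if not normalized.startswith(READ_ROOT):
            return False, "path outside allowed root"
    return True, "ok"


def execute(call: ToolCall, registry: dict):
    """Run the tool only if the policy check passes."""
    allowed, reason = check_call(call)
    if not allowed:
        raise PermissionError(reason)
    return registry[call.name](**call.args)
```

Because the gate is plain code with no model in the loop, a reasoning style biased by poisoned traces can change what the agent proposes but not what the policy permits.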

Related