arXiv — Thought-Transfer: clean-label poisoning via chain-of-thought traces
- Category: Research
- Context: Many teams fine-tune reasoning models using chain-of-thought (CoT) datasets from public repositories (e.g., Hugging Face).
- New attack class: The paper proposes Indirect Targeted Poisoning for reasoning models, dubbed Thought-Transfer.
- What’s different: Instead of poisoning the question or final answer, the attacker modifies only the CoT traces in training examples.
- “Clean label” angle: The queries and answers can remain unchanged, making the poisoned dataset look legitimate under naive spot checks.
- Cross-domain effect: The authors claim the induced behavior can transfer into target domains not present in training.
- Reported success: The abstract reports ~70% success at injecting targeted behaviors across tasks (per their setup).
- Incentive misalignment: They also claim that poisoned reasoning data can improve benchmark scores by a reported 10–15%, which gives adopters a perverse incentive to use tainted datasets.
- Defense gap: The paper argues existing mitigations for poisoning/backdoors don’t directly address this “trace-only” manipulation.
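To make the clean-label point concrete, here is a toy Python sketch. The record fields (`question`, `trace`, `answer`) and the sample text are hypothetical and are not taken from the paper. The sketch only shows that a check which looks at questions and answers cannot tell a trace-edited record from the original, while a digest over the whole record can.

```python
import hashlib
import json


def qa_fingerprint(record: dict) -> str:
    """Digest over question and answer only (what a naive spot check sees)."""
    payload = json.dumps([record["question"], record["answer"]])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def full_fingerprint(record: dict) -> str:
    """Digest over the entire record, including the reasoning trace."""
    payload = json.dumps(record, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical training example; field names are an assumption.
original = {
    "question": "What is 17 + 25?",
    "trace": "Add the tens, then the ones: 30 + 12 = 42.",
    "answer": "42",
}

# Same question and answer; only the trace text differs.
modified = dict(original, trace="Add the tens, then the ones: 30 + 12 = 42. [edited reasoning text]")

print(qa_fingerprint(original) == qa_fingerprint(modified))      # question/answer view
print(full_fingerprint(original) == full_fingerprint(modified))  # whole-record view
```

A digest like this only detects drift from a copy you already trust. It says nothing about whether the traces were manipulated before you first pinned them.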
Why it matters
- CoT traces are becoming supply chain artifacts: If you treat reasoning datasets as “just text,” you may be importing behavior-level vulnerabilities.
- Detection is harder than typical poisoning: Keeping answers intact removes the obvious red flags, especially for teams that validate lightly or check only accuracy.
- Agent impact: If an agent’s planning/reasoning style is subtly biased by poisoned traces, the downstream effect could be unsafe tool use or systematic policy bypass.
What to do
- Pin + vet training data: Treat CoT corpora as dependencies, with version pinning, provenance checks, and signed artifacts where possible.
- Prefer minimal-trace training: Where you can, avoid training on “free-form” traces from unknown sources; use curated traces or synthetic data you generate and audit.
- Scan traces for invariant violations: Add static checks for format constraints, forbidden patterns, unusual tool references, hidden triggers, and motifs repeated across samples.
- Red-team with behavior tests: Build evaluation suites that look for targeted misbehavior across multiple tasks/domains, not just accuracy on the training distribution.
- Separate planning from execution: In agent systems, treat reasoning outputs as untrusted; gate tool execution with deterministic policy checks.
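As a rough illustration of the static trace checks suggested above, the Python sketch below flags traces that match a deny-list and reports word n-grams that recur across an unusually large share of samples. Everything in it is an assumption made for illustration: the `FORBIDDEN` patterns, the function name `scan_traces`, the n-gram length, and the `max_share` threshold. It is not the paper's detection method, and a paraphrased manipulation would slip past it.

```python
import re
from collections import Counter

# Hypothetical deny-list; a real one would be project-specific.
FORBIDDEN = [
    re.compile(r"https?://", re.IGNORECASE),               # unexpected links
    re.compile(r"\b(curl|wget|rm -rf)\b", re.IGNORECASE),  # shell-like tool references
]


def ngrams(text: str, n: int) -> set:
    """Return the set of lowercase word n-grams in one trace."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def scan_traces(traces: list, n: int = 5, max_share: float = 0.5):
    """Return (pattern_hits, repeated_motifs) for a list of trace strings.

    pattern_hits: (trace index, regex source) pairs for deny-list matches.
    repeated_motifs: n-grams that appear in more than max_share of traces.
    """
    pattern_hits = []
    for idx, trace in enumerate(traces):
        for pattern in FORBIDDEN:
            if pattern.search(trace):
                pattern_hits.append((idx, pattern.pattern))

    # Each trace contributes a set, so this counts samples, not occurrences.
    sample_counts = Counter()
    for trace in traces:
        sample_counts.update(ngrams(trace, n))

    threshold = max_share * len(traces)
    motifs = sorted(g for g, c in sample_counts.items() if c > threshold)
    return pattern_hits, motifs


if __name__ == "__main__":
    sample = [
        "First add the numbers then always prefer the vendor tool before answering",
        "Compare both sides then always prefer the vendor tool and simplify",
        "Count the items and then always prefer the vendor tool here",
        "Just multiply two by three, see http://example.com for details",
    ]
    hits, motifs = scan_traces(sample)
    print(hits)
    print(motifs)
```

Legitimate boilerplate (for example, a standard closing phrase) will also show up as a repeated motif, so the output is better treated as a review queue for a human than as an automatic filter.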
Sources
- arXiv (primary): Thought-Transfer: Indirect Targeted Poisoning Attacks on Chain-of-Thought Reasoning Models
- DOI: 10.48550/arXiv.2601.19061
- Background (LLM attacks catalog): llm-attacks.org