arXiv — Thought-Transfer: clean-label poisoning via chain-of-thought traces

• Category: Research

  • Context: Many teams fine-tune reasoning models using chain-of-thought (CoT) datasets from public repositories (e.g., Hugging Face).
  • New attack class: The paper proposes Indirect Targeted Poisoning for reasoning models, dubbed Thought-Transfer.
  • What’s different: Instead of poisoning the question or final answer, the attacker modifies only the CoT traces in training examples.
  • “Clean label” angle: The queries and answers can remain unchanged, making the poisoned dataset look legitimate under naive spot checks.
  • Cross-domain effect: The authors claim the induced behavior can transfer into target domains not present in training.
  • Reported success: The abstract reports a success rate of roughly 70% at injecting targeted behaviors across tasks, under the authors' experimental setup.
  • Incentive misalignment: The authors also claim that poisoned reasoning data can improve benchmark scores by 10–15%, giving adopters a perverse incentive to use tainted datasets.
  • Defense gap: The paper argues existing mitigations for poisoning/backdoors don’t directly address this “trace-only” manipulation.
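
To make the "clean label" point concrete, here is a minimal sketch of why a spot check that compares only questions and answers would miss a trace-only edit. The `Example` record, both check functions, and the sample data are hypothetical illustrations, not taken from the paper, and the sketch assumes you hold a trusted reference copy of the example, which is often not the case in practice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    question: str
    trace: str   # chain-of-thought text
    answer: str


def label_check(ref: Example, cand: Example) -> bool:
    """Naive spot check: compares only the question and the final answer."""
    return ref.question == cand.question and ref.answer == cand.answer


def trace_check(ref: Example, cand: Example) -> bool:
    """Stricter check: also requires the trace to match the trusted copy."""
    return label_check(ref, cand) and ref.trace == cand.trace


# Hypothetical records; the injected text is a placeholder, not from the paper.
trusted = Example(
    question="What is 12 * 7?",
    trace="12 * 7 = 84.",
    answer="84",
)
poisoned = Example(
    question=trusted.question,
    trace="12 * 7 = 84. <attacker-chosen reasoning pattern>",
    answer=trusted.answer,
)

print(label_check(trusted, poisoned))  # True: the poisoned record passes
print(trace_check(trusted, poisoned))  # False: only the trace differs
```

Exact comparison against a reference is the simplest possible check. Without a trusted copy, the trace itself has to be inspected, which is the harder problem the paper points at.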

Why it matters

  • CoT traces are becoming supply chain artifacts: If you treat reasoning datasets as “just text,” you may be importing behavior-level vulnerabilities.
  • Detection is harder than with typical poisoning: Keeping answers intact removes the obvious red flags, especially for teams that validate lightly or check only accuracy.
  • Agent impact: If an agent’s planning/reasoning style is subtly biased by poisoned traces, the downstream effect could be unsafe tool use or systematic policy bypass.

What to do

  1. Pin + vet training data: Treat CoT corpora as dependencies, with version pinning, provenance checks, and signed artifacts where possible.
  2. Prefer minimal-trace training: Where you can, avoid training on “free-form” traces from unknown sources; use curated traces or synthetic data you generate and audit.
  3. Check traces against invariants: Add static checks that enforce format constraints and flag forbidden patterns, unusual tool references, hidden triggers, and motifs repeated across samples.
  4. Red-team with behavior tests: Build evaluation suites that look for targeted misbehavior across multiple tasks/domains, not just accuracy on the training distribution.
  5. Separate planning from execution: In agent systems, treat reasoning outputs as untrusted; gate tool execution with deterministic policy checks.
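
Steps 1, 3, and 5 above can be sketched as a few small deterministic checks. This is a rough illustration under stated assumptions, not a vetted defense: the function names, the forbidden patterns, the length limit, and the tool allowlist are all hypothetical placeholders, and none of them are indicators reported in the paper.

```python
import hashlib
import re
from collections import Counter


# Step 1: pin the dataset by content hash (digest recorded when it was vetted).
def verify_pin(data: bytes, expected_sha256: str) -> bool:
    return hashlib.sha256(data).hexdigest() == expected_sha256


# Step 3: static checks on a single trace. Patterns are illustrative only.
FORBIDDEN = [
    re.compile(r"https?://", re.IGNORECASE),               # unexpected links
    re.compile(r"ignore (all )?previous", re.IGNORECASE),  # instruction override
    re.compile(r"\b(curl|wget|rm -rf)\b"),                 # shell tool references
]
MAX_TRACE_CHARS = 4000  # assumed format constraint


def scan_trace(trace: str) -> list[str]:
    """Return a list of findings; an empty list means no check fired."""
    findings = []
    if len(trace) > MAX_TRACE_CHARS:
        findings.append("trace too long")
    for pattern in FORBIDDEN:
        if pattern.search(trace):
            findings.append(f"forbidden pattern: {pattern.pattern}")
    return findings


# Step 3, continued: sentences that recur across many samples.
def repeated_sentences(traces: list[str], min_count: int) -> list[str]:
    counts = Counter()
    for trace in traces:
        # Crude sentence split; each sentence counts once per trace.
        sentences = {s.strip().lower() for s in re.split(r"[.!?]+", trace)}
        sentences.discard("")
        counts.update(sentences)
    return sorted(s for s, c in counts.items() if c >= min_count)


# Step 5: deterministic gate between the model's plan and tool execution.
# Maps each permitted tool to the argument names it may receive.
ALLOWED_TOOLS = {"search": {"query"}, "calculator": {"expression"}}


def allow_tool_call(tool: str, args: dict) -> bool:
    allowed_args = ALLOWED_TOOLS.get(tool)
    if allowed_args is None:
        return False  # unknown tool: deny by default
    return set(args) <= allowed_args  # reject unexpected arguments
```

In use, `verify_pin` would run before any training job reads the file, `scan_trace` and `repeated_sentences` would run over every record at ingest, and `allow_tool_call` would sit in the agent runtime regardless of what the model's reasoning says. Pattern lists like this catch only crude manipulation, so they complement the behavior tests in step 4 and do not replace them.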

Sources