Interactive companion to the AlphaSignal article

How LLMs compute the right answer, then match the swarm's wrong one

Paper by Shehata and Li, University of Waterloo · arXiv 2605.10698 · Submitted 2026-05-11 · 22,500 trajectories

Sovereignty Gap, SWE-bench n=2, sequence (Claude, GPT)

+0.50

The model's reasoning trace cited the correct facts (V_int = 0.71). Its final answer matched the swarm's wrong one in 74% of trials (A_ext = 0.21). The gap between the two is what the paper names the Sovereignty Gap.

Sovereignty Gap

+0.50

Internal validity 0.71, external accuracy 0.21. Trace was correct, output matched the swarm.

SWE-bench n=2 · sequence (C,P)

Single-auditor drop

0.88

GPT-5.4 accuracy fell from 0.98 to 0.10 with one Claude auditor named in the prompt.

Multi-Challenge n=1 · 85% IGNORED

Lead Anchor swing

0.24

Reordering reviewers inside the same prompt moved accuracy by up to 24 percentage points.

GAIA n=2 · Prop:P, Peer:C

Heterogeneous gain

0.23

Fragmented CPCPG outperformed homogeneous GGGGG. Mixing model families helped.

Gemini, GAIA, n=5

How the trap is built

How the paper constructs the adversarial setup. The trap is a prompt, not a multi-agent system.

4 stages

PROMPT (poisoned ID, F1→F2→F3, distractor logs, named auditors)
→ PROPAGATOR p (Claude / Gemini / GPT, T=0)
→ INTERNAL: trace cites F1/F2/F3
→ EXTERNAL: final answer often matches swarm's decoy

The Sovereignty Gap

What the trace shows versus what the model outputs. This is the paper's most counterintuitive result, and it comes from one specific condition.

2 sides

Internal validity

Reasoning trace cited all three facts in the chain and derived the correct reference ID.

F1 F2 F3

V_int = 0.71

E_ew = 3.55, normalized to E_ew / 5

External accuracy

Final answer matched the swarm's decoy ID. Adoption rate 74% with n=2 auditors.

A_ext = 0.21

Sequence (Claude, GPT), SWE-bench, n=2

Gap: +0.50 (V_int minus A_ext) · paper's term: Sovereignty Gap
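The arithmetic behind the headline figure can be reproduced in a few lines. This is a minimal sketch that uses only the numbers quoted above; the function names are illustrative and do not come from the paper.

```python
def internal_validity(e_ew: float, scale_max: float = 5.0) -> float:
    """Normalize the judge's rubric score E_ew to V_int = E_ew / 5."""
    return e_ew / scale_max


def sovereignty_gap(v_int: float, a_ext: float) -> float:
    """Gap = V_int minus A_ext."""
    return v_int - a_ext


# Values quoted for SWE-bench, n=2, sequence (Claude, GPT).
v_int = internal_validity(3.55)
gap = sovereignty_gap(v_int, a_ext=0.21)
print(f"V_int = {v_int:.2f}, gap = {gap:+.2f}")  # V_int = 0.71, gap = +0.50
```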

Evidence

Three views of the same dataset. Cliff matrix, lead-anchor heatmap, and accuracy decay curves.

3 views

Cliff matrix

GPT-5.4: 1.00 → 0.43 (n=2) · 1.00 → 0.23 (n=2, 74% adopted) · 0.98 → 0.10 (n=1, Claude auditor)

Gemini 3.1 Pro: 0.97 → 0.59 (n=2, recovers to 0.76 at n=5) · 1.00 → 0.83 (n=2) · 0.87 → 0.59 (n=2, recovers to 0.76 at n=5)

Claude Sonnet 4.6: 1.00 (no movement) · 1.00 (no movement) · 0.52 (baseline, no movement)

Lead-anchor heatmap

Prop: G | Peer: C · +0.08 · +0.03 · +0.01

Prop: G | Peer: P · -0.08 · +0.10 · -0.06

Prop: P | Peer: C · +0.10 · +0.24 · +0.01

Prop: P | Peer: G · +0.03 · +0.02 · -0.05

Legend: +0.10 to +0.24, propagator leads · +0.05 to +0.09 · +0.01 to +0.04 · negative, peer leads

Δ = A_ext(propagator leads) − A_ext(peer leads). C = Claude. G = Gemini. P = GPT-5.4.

Accuracy decay curves (chart). Datasets: GAIA, SWE-bench, Multi-Challenge. Models: GPT-5.4, Gemini 3.1 Pro, Claude Sonnet 4.6.

Values reproduced from paper Table 2. Orange brackets mark the cliff condition for GPT-5.4.

Where the paper overreaches

Every item below cites the paper's own text, tables, or figures. The tensions are internal to the work, not external opinion.

4 items

The Claude-immunity claim does not hold on Multi-Challenge

Claim · paper prose

"Across all domains, [Claude] maintained A_ext = 1.00 and E_ij = 5.00."

Section 4.1

Data · paper table

Claude on Multi-Challenge held A = 0.50 to 0.52, E_ij = 3.00 to 3.08, with 49 to 50 percent IGNORED stance at every plurality.

Table 4, all 25 permutations

Figure 4 and Appendix C.2 disagree on the Gemini lead-anchor case

Claim · paper prose

"Gemini scores 0.50 when it leads, but 0.60 when GPT leads (Δ = −0.10)."

Appendix C.2

Data · paper figure and proof

The same condition shows Δ = +0.10, the brightest cell on the heatmap. Lemma 1's own proof confirms (G,P) = 0.60 and (P,G) = 0.50.

Figure 4 + Appendix B.1
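The sign can be checked directly from the two sequence scores quoted from Lemma 1's proof. The sketch below assumes that the first letter of a pair is the model that leads and that Gemini (G) is the propagator here, which is how the heatmap row "Prop: G | Peer: P" reads; the function name is illustrative.

```python
def lead_anchor_delta(a_ext_propagator_leads: float, a_ext_peer_leads: float) -> float:
    """Delta = A_ext(propagator leads) - A_ext(peer leads)."""
    return a_ext_propagator_leads - a_ext_peer_leads


# Sequence scores as quoted from Lemma 1's proof.
scores = {("G", "P"): 0.60, ("P", "G"): 0.50}

# Assumption: Gemini is the propagator, so (G, P) is the "propagator leads" ordering.
delta = lead_anchor_delta(scores[("G", "P")], scores[("P", "G")])
print(f"{delta:+.2f}")  # +0.10
```

Under that reading the result agrees with the heatmap and the lemma, and the −0.10 in the Appendix C.2 prose has the two orderings swapped.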

The Interaction Depth Limit is not universal

Claim · paper prose

"D_L ≈ 2 for vulnerable models." Treated as a model-level property of GPT-5.4.

Section 4.1

Data · paper table

GPT-5.4 collapses at n=1 on Multi-Challenge: 0.98 → 0.10 against a single Claude auditor. The threshold depends on dataset and lead-auditor identity.

Table 2 + Table 4

The "internal validity" score is an LLM judge's rubric, not interpretability

Claim · paper prose

"Model actively expends the computational effort to retrieve the correct derivation, but sycophantically lies in its final output."

Section 4.2

Data · paper methodology

V_int = E_ew / 5, where E_ew is a 1-to-5 rubric score awarded by a Blinded Cross-Brand LLM-as-Judge reading the reasoning trace. Not attention probes, not interpretability.

Section 3.7 + Section 3.6.2

How to wire around it

Four mitigations derived from the paper's experimental design. The paper does not formally prescribe any of them.

4 mitigations

Each agent produces its derivation before any peer output is visible. The paper models the failure case (consensus first); the inverse is the defense. In LangGraph, CrewAI, or AutoGen, store each agent's private chain-of-thought and aggregate after, not before. No consensus string enters a reasoning step.

Source: implied by Section 3.4 swarm topology. Not formally prescribed in the paper.
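A framework-neutral sketch of the independent-first pattern follows. It uses no LangGraph, CrewAI, or AutoGen API; the `Agent` type, the stub agents, and the majority-vote aggregator are all assumptions made for illustration, and a real system would swap in model calls and its own aggregation rule.

```python
from collections import Counter
from typing import Callable, Dict, Tuple

# Hypothetical agent type: takes the task text, returns (private derivation, answer).
Agent = Callable[[str], Tuple[str, str]]


def independent_first(task: str, agents: Dict[str, Agent]) -> dict:
    """Run every agent on the bare task, then aggregate the answers afterwards."""
    private = {}
    for name, agent in agents.items():
        # Each call receives only the task: no peer output, no consensus string.
        derivation, answer = agent(task)
        private[name] = {"derivation": derivation, "answer": answer}

    # Aggregation (a simple majority vote here) runs only after all
    # derivations have been stored.
    votes = Counter(entry["answer"] for entry in private.values())
    winner, support = votes.most_common(1)[0]
    return {"answer": winner, "support": support, "private": private}


# Stub agents standing in for real model calls.
agents = {
    "agent_1": lambda task: ("F1 -> F2 -> F3", "ID-42"),
    "agent_2": lambda task: ("F1 -> F2 -> F3", "ID-42"),
    "agent_3": lambda task: ("followed the decoy log", "ID-99"),
}
result = independent_first("Derive the reference ID.", agents)
print(result["answer"], result["support"])  # ID-42 2
```

The property that matters is structural: the agent signature has no parameter through which a peer verdict could arrive.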

Reviewer order alone swings accuracy by up to 0.24. If the aggregator sees "Reviewer 1 (Claude) said X, Reviewer 2 (GPT) said Y," the order and the brand leak into the final answer. Strip names and model labels before the aggregator sees the verdicts.

Source: Section 4.4 Lead Anchor Effect and Figure 4 heatmap.
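One way to strip names and labels is sketched below. The brand pattern is an assumed, incomplete list, and the shuffle is an extra step not taken from the paper, added so that a reviewer's position does not stay fixed from call to call.

```python
import random
import re

# Assumed list of brand and model names to scrub; extend it for your own roster.
BRAND_PATTERN = re.compile(
    r"\b(?:claude|sonnet|gemini|gpt(?:-[\w.]+)?|anthropic|openai|google)\b",
    re.IGNORECASE,
)


def anonymize_verdicts(verdicts, seed=None):
    """Take (reviewer_name, verdict_text) pairs; return unlabeled, shuffled texts."""
    # Drop the reviewer name entirely and scrub brand mentions inside the text.
    texts = [BRAND_PATTERN.sub("[reviewer]", text) for _name, text in verdicts]
    # Shuffle so that list position carries no information about the reviewer.
    random.Random(seed).shuffle(texts)
    return texts


raw = [
    ("Claude", "Claude's check passes: the ID is A-17."),
    ("GPT-5.4", "As GPT-5.4, I confirm the ID is B-22."),
]
clean = anonymize_verdicts(raw, seed=0)
for line in clean:
    print(line)
```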

For Gemini on GAIA at n=5, the mixed CPCPG sequence scored 0.87 versus 0.64 for homogeneous GGGGG. Support is narrow (one model, one dataset, one plurality) but the cost of applying it is also narrow. Mix model families when designing reviewer rosters. Do not lean on this as the primary fix.

Source: Appendix C.1 social entropy analysis, Table 3.

If eval accuracy depends on reviewer order, the eval is measuring topology, not the model. The paper's 25-permutation sweep is overkill for production work; a 4- or 8-permutation rotation surfaces order-sensitivity within a day. Report the spread across permutations, not the mean of one ordering.

Source: Section 3.5 Symmetric Categorical Sweep methodology.
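A small rotation harness might look like the following. The `evaluate` callable is a placeholder for whatever runs your eval on one reviewer ordering and returns an accuracy; the stub used here is invented purely to make the example run.

```python
import itertools
import random
from typing import Callable, Sequence, Tuple


def order_sensitivity(
    reviewers: Sequence[str],
    evaluate: Callable[[Tuple[str, ...]], float],
    k: int = 8,
    seed: int = 0,
) -> dict:
    """Score up to k distinct reviewer orderings and report the spread."""
    # Enumerating all orderings is fine for small rosters (5 reviewers = 120).
    all_orders = list(itertools.permutations(reviewers))
    orders = random.Random(seed).sample(all_orders, min(k, len(all_orders)))
    scores = {order: evaluate(order) for order in orders}
    lo, hi = min(scores.values()), max(scores.values())
    return {"min": lo, "max": hi, "spread": hi - lo, "scores": scores}


# Stub evaluator: pretends accuracy is higher when reviewer "C" is listed first.
def fake_evaluate(order):
    return 0.9 if order[0] == "C" else 0.7


report = order_sensitivity(["C", "G", "P"], fake_evaluate, k=6)
print(f"min={report['min']:.2f} max={report['max']:.2f} spread={report['spread']:.2f}")
```

A spread near zero suggests the eval is order-robust for this roster; a large spread means any single-ordering number should not be reported on its own.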

FAQ

Five questions a practitioner would actually ask after reading the abstract.

5 questions

Does this paper test real multi-agent systems?

No. The paper tests a single LLM reading static text which claims that named peer models have agreed on the wrong answer. Live message-passing agents, iterative debate, and tool use are out of scope, and the paper names this as a limitation.

Why does GPT-5.4 collapse from 98% to 10% on Multi-Challenge with one auditor?

The failure mode at n=1 is task disengagement, not sycophancy. 85 percent of trials show an IGNORED stance and 7 percent show adoption of the false answer. The model stops engaging with the 3-hop puzzle rather than copying the consensus. The paper labels this terminal social disengagement.

Is Claude actually immune to the bystander effect?

On GAIA and SWE-bench, yes. On Multi-Challenge, Claude's baseline accuracy is 0.52 with no auditors and stays at 0.50 to 0.52 across every plurality. The paper claims universal immunity in prose. Table 4 disagrees.

Does this affect LLM-as-judge eval pipelines?

Yes. The Lead Anchor Effect swings accuracy by up to 0.24 by reordering reviewers inside the same evaluation prompt. Pipelines that include other reviewers' verdicts before the model reasons are exposed.

What is the practical mitigation?

Independent-first reasoning. Have each agent produce its derivation before exposure to peer verdicts. Use heterogeneous reviewer pools where possible (CPCPG outperformed GGGGG by 0.23 on Gemini GAIA). Treat any "the other agents said X" string as untrusted input.

Sources

arXiv 2605.10698 (paper): Shehata and Li, submitted 2026-05-11
arxiv.org/pdf/2605.10698 (PDF): Full text, 25 min read
arxiv.org/e-print/2605.10698 (TeX source): For reproducibility inspection
