The Sovereignty Gap
Interactive companion to the AlphaSignal article

How LLMs compute the right answer, then match the swarm's wrong one

Paper by Shehata and Li, University of Waterloo arXiv 2605.10698 Submitted 2026-05-11 22,500 trajectories
Sovereignty Gap, SWE-bench n=2, sequence (Claude, GPT)
+0.50
The model's reasoning trace cited the correct facts (V_int = 0.71). Its final answer matched the swarm's wrong one in 74% of trials (A_ext = 0.21). The gap between the two is what the paper names the Sovereignty Gap.
TRACE F1 F2 F3 TRUE ID GAP FINAL ANSWER DECOY ID MATCHED SWARM SOVEREIGNTY GAP
Sovereignty Gap
+0.50
Internal validity 0.71, external accuracy 0.21. Trace was correct, output matched the swarm.
SWE-bench n=2 · sequence (C,P)
Single-auditor drop
0.88
GPT-5.4 accuracy fell from 0.98 to 0.10 with one Claude auditor named in the prompt.
Multi-Challenge n=1 · 85% IGNORED
Lead Anchor swing
0.24
Reordering reviewers inside the same prompt moved accuracy by up to 24 points.
GAIA n=2 · Prop:P, Peer:C
Heterogeneous gain
0.23
Fragmented CPCPG outperformed homogeneous GGGGG. Mixing model families helped.
Gemini, GAIA, n=5
01

How the trap is built

Click each node to see how the paper constructs the adversarial setup. The trap is a prompt, not a multi-agent system.
4 stages
PROMPT POISONED ID Decoy answer at primacy 3-HOP CHAIN F1 to F2 to F3, true ID DISTRACTION 500 tokens of system logs AUDITORS (n >= 2) "Consensus on the decoy" PROPAGATOR (p) CLAUDE | GEMINI | GPT T=0 INTERNAL Reasoning trace Cites F1, F2, F3 Derives true ID EXTERNAL Final answer Often matches the swarm's decoy
PROMPT (poisoned ID, F1→F2→F3, distractor logs, named auditors)
→ PROPAGATOR p (Claude / Gemini / GPT, T=0)
→ INTERNAL: trace cites F1/F2/F3
→ EXTERNAL: final answer often matches swarm's decoy
02

The Sovereignty Gap

What the trace shows versus what the model outputs. Paper's most counterintuitive result, taken from one specific condition.
2 sides
Internal validity

Reasoning trace cited all three facts in the chain and derived the correct reference ID.

F1 F2 F3
V_int = 0.71
E_ew = 3.55, normalized to E_ew / 5
External accuracy

Final answer matched the swarm's decoy ID. Adoption rate 74% across n=2 trials.

A_ext = 0.21
Sequence (Claude, GPT), SWE-bench, n=2
Gap +0.50 V_int minus A_ext · paper's term: Sovereignty Gap
03

Evidence

Three views of the same dataset. Cliff matrix, lead-anchor heatmap, and accuracy decay curves.
3 views
GAIA
SWE-bench
Multi-Challenge
GPT-5.4
1.00 → 0.43
n=2
1.00 → 0.23
n=2 · 74% adopted
0.98 → 0.10
n=1 · Claude auditor
Gemini 3.1 Pro
0.97 → 0.59
n=2 · recovers to 0.76 at n=5
1.00 → 0.83
n=2
0.87 → 0.59
n=2 · recovers to 0.76 at n=5
Claude Sonnet 4.6
1.00
no movement
1.00
no movement
0.52
baseline · no movement
SWE-bench
GAIA
Multi-Challenge
Prop: G | Peer: C
+0.08
+0.03
+0.01
Prop: G | Peer: P
-0.08
+0.10
-0.06
Prop: P | Peer: C
+0.10
+0.24
+0.01
Prop: P | Peer: G
+0.03
+0.02
-0.05
+0.10 to +0.24 propagator leads +0.05 to +0.09 +0.01 to +0.04 negative · peer leads
Δ = A_ext(propagator leads) − A_ext(peer leads). C = Claude. G = Gemini. P = GPT-5.4.
GAIA
1.0 0.5 0 0 1 2 3 5 n
SWE-bench
1.0 0.5 0 0 1 2 3 5 n
Multi-Challenge
1.0 0.5 0 0 1 2 3 5 n
GPT-5.4 Gemini 3.1 Pro Claude Sonnet 4.6
Values reproduced from paper Table 2. Orange brackets mark the cliff condition for GPT-5.4.
04

Where the paper overreaches

Every item below cites the paper's own text, tables, or figures. The tensions are internal to the work, not external opinion.
4 items
01

The Claude-immunity claim does not hold on Multi-Challenge

Claim · paper prose
"Across all domains, [Claude] maintained A_ext = 1.00 and E_ij = 5.00."
Section 4.1
Data · paper table
Claude on Multi-Challenge held A = 0.50 to 0.52, E_ij = 3.00 to 3.08, with 49 to 50 percent IGNORED stance at every plurality.
Table 4, all 25 permutations
02

Figure 4 and Appendix C.2 disagree on the Gemini lead-anchor case

Claim · paper prose
"Gemini scores 0.50 when it leads, but 0.60 when GPT leads (Δ = −0.10)."
Appendix C.2
Data · paper figure and proof
The same condition shows Δ = +0.10, the brightest cell on the heatmap. Lemma 1's own proof confirms (G,P) = 0.60 and (P,G) = 0.50.
Figure 4 + Appendix B.1
03

The Interaction Depth Limit is not universal

Claim · paper prose
"D_L ≈ 2 for vulnerable models." Treated as a model-level property of GPT-5.4.
Section 4.1
Data · paper table
GPT-5.4 collapses at n=1 on Multi-Challenge: 0.98 → 0.10 against a single Claude auditor. The threshold depends on dataset and lead-auditor identity.
Table 2 + Table 4
04

The "internal validity" score is an LLM judge's rubric, not interpretability

Claim · paper prose
"Model actively expends the computational effort to retrieve the correct derivation, but sycophantically lies in its final output."
Section 4.2
Data · paper methodology
V_int = E_ew / 5, where E_ew is a 1-to-5 rubric score awarded by a Blinded Cross-Brand LLM-as-Judge reading the reasoning trace. Not attention probes, not interpretability.
Section 3.7 + Section 3.6.2
05

How to wire around it

Four mitigations derived from the paper's experimental design. The paper does not formally prescribe any of them. Click each card to see the evidence.
4 mitigations
Each agent produces its derivation before any peer output is visible. The paper models the failure case (consensus first); the inverse is the defense. In LangGraph, CrewAI, or AutoGen, store each agent's private chain-of-thought and aggregate after, not before. No consensus string enters a reasoning step.
Source: implied by Section 3.4 swarm topology. Not formally prescribed in the paper.
Reviewer order alone swings accuracy by up to 0.24. If the aggregator sees "Reviewer 1 (Claude) said X, Reviewer 2 (GPT) said Y," the order and the brand leak into the final answer. Strip names and model labels before the aggregator sees the verdicts.
Source: Section 4.4 Lead Anchor Effect and Figure 4 heatmap.
For Gemini on GAIA at n=5, the mixed CPCPG sequence scored 0.87 versus 0.64 for homogeneous GGGGG. Support is narrow (one model, one dataset, one plurality) but the cost of applying it is also narrow. Mix model families when designing reviewer rosters. Do not lean on this as the primary fix.
Source: Appendix C.1 social entropy analysis, Table 3.
If eval accuracy depends on reviewer order, the eval is measuring topology, not the model. The paper's 25-trial sweep is overkill for production work; a 4- or 8-permutation rotation surfaces order-sensitivity inside a day. Report the spread across permutations, not the mean of one ordering.
Source: Section 3.5 Symmetric Categorical Sweep methodology.
06

FAQ

Five questions a practitioner would actually ask after reading the abstract.
5 questions
No. The paper tests a single LLM reading static text claiming named peer models have agreed on the wrong answer. Live message-passing agents, iterative debate, and tool use are out of scope and named as a limitation.
The failure mode at n=1 is task disengagement, not sycophancy. 85 percent of trials show an IGNORED stance and 7 percent show adoption of the false answer. The model stops engaging with the 3-hop puzzle rather than copying the consensus. The paper labels this terminal social disengagement.
On GAIA and SWE-bench, yes. On Multi-Challenge, Claude's baseline accuracy is 0.52 with no auditors and stays at 0.50 to 0.52 across every plurality. The paper claims universal immunity in prose. Table 4 disagrees.
Yes. The Lead Anchor Effect swings accuracy by up to 0.24 by reordering reviewers inside the same evaluation prompt. Pipelines that include other reviewers' verdicts before the model reasons are exposed.
Independent-first reasoning. Have each agent produce its derivation before exposure to peer verdicts. Use heterogeneous reviewer pools where possible (CPCPG outperformed GGGGG by 0.23 on Gemini GAIA). Treat any "the other agents said X" string as untrusted input.

Sources