Research April 23, 2026 37 min read

Decoupling Reasoning from Expression: A Data-Only Packet Contract Between Two Language Models, With Cross-Model Empirical Validation

Abstract

This paper now includes a like-for-like single-model control using `GPT-5.4-mini` on the exact same twenty scenarios and the exact same Outside-Knowledge bundle. The control uses one unconstrained call responsible for reasoning, knowledge use, and expression. It is faster than the split `GPT-5.4-mini / GPT-5.4-mini` chain (3.1 s p50 vs 4.3 s p50, both with zero failures), but the split system remains close enough in latency to be operationally viable while exposing a system-controlled validation boundary between reasoning and expression that the single-call baseline does not provide. That comparison sharpens the central result: the architectural win is not that the split chain beats a strong single model, but that it remains competitive while making reasoning inspectable and governable before language is produced.

We describe and empirically validate an architecture that splits a single language-model turn into two distinct stages, a Reasoning stage and an Expression ("Response") stage, mediated by a strictly-typed data-only packet contract. The Reasoning stage receives the user's raw input and an Outside-Knowledge bundle (user profile, recall, concerns, values, authorized claims, and optionally retrieved world facts). It emits a `ReasoningPacket` containing only references into the bundle, enumerated intent and tone values, and two narrow textual slots: `world_facts` for stable external knowledge and `critique_findings` for turn-specific analysis of user-supplied material, each with a mandatory evidence field. The Expression stage never sees the user's raw text and never sees the raw bundle; it receives only the hydrated packet and produces natural language under a contract that forbids the introduction of new factual content or analytical framing not already present in the packet.
We show that this architectural split closes three previously-observed failure classes (fabrication, smuggling of analysis as world knowledge, and contract-boundary violations) across four frontier models (Anthropic Sonnet 4.6, OpenAI GPT‑5.4, GPT‑5.4-mini, GPT‑5.2) on a twenty-scenario evaluation covering opinion, critique, adversarial pushback, and direct-answer turns. A symmetric `GPT-5.4-mini / GPT-5.4-mini` configuration achieves 4.3 s p50 chain latency with zero failures, a 3.5× speedup over an `Anthropic Sonnet 4.6 / Anthropic Haiku 4.5` baseline that also shows zero failures. Residual between-model differences (gender-pronoun inference on named subjects, tonal inversion on elliptical emotional turns, over-warm acknowledgements) survive contract prohibition and are interpreted as training residuals, not architectural gaps. We argue that once an architecture structurally closes the non-training failure modes, remaining output variance becomes clean training-effect signal and the architecture itself becomes a measurement instrument. We present the paper as a systems-architecture and deployment report.

John Canady Jr.

1. Introduction

Conversational AI systems built on a single language-model call conflate three responsibilities: (i) determining what is true and what should be said, (ii) choosing how it should be said, and (iii) rendering it into natural language. A single decoding pass is required to perform all three simultaneously. When the model fabricates, contradicts governance, or smuggles analysis into the wrong structural slot, it is impossible to attribute the failure: the reasoning step and the expression step are fused.

This fusion is not an artifact of the models themselves. It is an artifact of the interface we expose to them. If a prompt's reply slot is a single unbounded natural-language field, every model, frontier or local, general or specialized, must use that slot for everything. Any boundary we attempt to enforce via contract prose ("do not analyze the user's material in the world-facts section") competes directly with the prior we are asking the model to override. When the prior wins, the failure surfaces downstream as a fluent, confident, and often correct-sounding paragraph.

We take a different approach. We do not ask a single model to simultaneously reason and express under prose constraints. We split the turn into two model calls separated by a *data-only* packet: a structure in which every content slot is either (a) a reference into a frozen Outside-Knowledge bundle, (b) a value drawn from a short controlled enumeration (e.g., `intent ∈ {DIRECT_ANSWER, OPINION, CRITIQUE, ABSTAIN, CLARIFY, ACKNOWLEDGE, REFUSE}`), or (c) a dedicated textual slot with an enforced evidence field. The Expression stage never sees the user's raw input and never sees the raw bundle. Its sole job is to render the hydrated packet into natural language under rules that forbid the introduction of new content.

This paper formalizes the packet contract, walks through the structural gates that enforce it, and reports the results of a twenty-scenario, four-model validation run including a symmetric mini/mini configuration that was 3.5× faster than the Sonnet/Haiku baseline at equal quality. Section 2 reviews the architectural context. Section 3 describes the two-stage chain. Section 4 defines the `ReasoningPacket` contract and the Expression contract. Section 5 documents the structural gates: critique-findings evidence enforcement, internal-field stripping, strict literal validation, and the empty-CRITIQUE downgrade gate. Section 6 describes the evaluation methodology. Section 7 reports the Phase 1 regression-and-fix results. Section 8 reports the Phase 2 cross-model comparison. Section 9 reports the mini/mini latency result. Section 10 analyzes the three failure classes closed by the architecture. Section 11 examines the residual between-model variance as training-effect signal. Section 12 discusses implications for architecture-as-instrument, for specialist-trained models, and for governance composition. Section 13 catalogues limitations and deferred work. Section 14 concludes.

---

One contribution of this revision is a direct control against a normal strong single-model baseline. Using the same `GPT-5.4-mini`, the same Outside-Knowledge bundle, and the same twenty scenarios, we compare a one-call baseline against the split chain. The result matters because it turns the architectural thesis into a measurable claim: the split system is justified not by beating a strong single model outright, but by remaining near it while exposing a validation boundary the single-call baseline cannot expose.

2. Architectural Context

AiMe v3 is built around a governed runtime spine called Self-Bounded Authority (SBA) in which every user-facing output passes through a fixed pipeline: TurnState → ResponseSynthesizer → AuthorityEngine → single LM Expression → ComplianceValidator. The LM renders; it does not decide [Canady, 2026e]. Evidence flows from an immutable ledger through deterministic Living Memory precipitation into a frozen Outside-Knowledge (OK) bundle [Canady, 2026d]. This substrate provides a clean boundary between the *governance state* (what the system knows, what it is authorized to say, what values constrain it) and the *rendering step* (how it says it). The SBA spine is the production path and is not the subject of this paper.

A companion design document [Canady, 2026f] argued that the substrate could be further decoupled: instead of a single LM call constrained by an expression packet produced from the spine's internal state, a pair of LM calls could be used, one to reason over the bundle and one to express the result, with a strictly-typed data structure in between. The argument rested on a claim we had not yet tested empirically: that a data-only contract would close the reasoning/expression fusion failure modes regardless of which frontier models occupied each stage.

The module `modules/experimental_chain/` implements that two-stage chain as a comparison path. It deliberately does not run inside the SBA spine; it has no ComplianceValidator, no RIC gate, no Authority Engine. This is intentional: the chain is a *minimal* experiment isolating the reasoning/expression split from the rest of the spine's governance machinery. The goal is to determine whether the data-only packet contract alone is sufficient to close the fusion failure classes, independent of governance layers stacked on top.

This paper reports the results of that experiment.

---

3. The Two-Stage Chain

3.1 Pipeline

    User Input + Outside-Knowledge Bundle
                  |
                  v
    +----------------------------+
    | REASONING SPECIALIST       |
    | - sees user_text           |
    | - sees OK bundle           |
    | - emits ReasoningPacket    |
    +----------------------------+
                  |
                  v
    ReasoningPacket (typed, data-only)
                  |
                  v
    Hydration Layer
      - resolves bundle refs
      - strips internal fields
      - drops reasoning_path
                  |
                  v
    +----------------------------+
    | RESPONSE SPECIALIST        |
    | - never sees user_text     |
    | - never sees raw bundle    |
    | - renders hydrated packet  |
    +----------------------------+
                  |
                  v
    Natural-language output

3.2 The Reasoning Specialist

The Reasoning Specialist receives two inputs: `user_text` (the raw user message) and `bundle` (the Outside-Knowledge bundle, a nested dictionary containing `user_profile`, `recall`, `concerns`, `self_model`, `authorized_claims`, optionally `web_search`, and other governance-sourced fields). It runs under a ~500-line contract (`modules/experimental_chain/contracts/contract_reasoning.txt`) and emits a single JSON object conforming to the `ReasoningPacket` schema.

Its output has no free-text content field. Every element of the packet is either a reference into the bundle (e.g., `self_model.values[0]`), a value from a controlled enumeration (`intent`, `tone`, `priority`, `confidence`, `category`, `severity`, `abstain_reason`), or a value in one of two narrow textual slots with additional structural constraints:

- **`world_facts`**: a list of stable external knowledge claims, each tagged with `confidence ∈ {HIGH, MEDIUM, LOW}` and `category ∈ {factual, conceptual, procedural, definitional, current_events}`. The contract defines a falsifiability test: *"Would this sentence be true if the user had not sent this turn?"* If not, it does not belong in `world_facts`.

- **`critique_findings`**: a list of turn-specific findings about user-supplied material (a paragraph, a plan, a code snippet, an argument). Each finding carries a mandatory `evidence` field quoting or referencing the specific element of `user_text` that triggered it, plus `severity ∈ {high, medium, low}`. Critique findings are the correct slot for analysis of the user's material. World facts are not.

The specialist chooses exactly one intent value. The intents partition the decision space:

| Intent | Meaning |
|---|---|
| `DIRECT_ANSWER` | Question answerable from bundle refs and/or `world_facts` |
| `OPINION` | User asks for the system's view; grounded in `self_model.values` and `world_facts` |
| `CRITIQUE` | User wants evaluation of material; findings carried in `critique_findings` with evidence |
| `ABSTAIN` | No evidence available; abstain reason drawn from a controlled list |
| `CLARIFY` | Input is ambiguous in a way that rewrites the answer |
| `ACKNOWLEDGE` | Social turn; no content delivery |
| `REFUSE` | Out-of-scope or policy-excluded |

The specialist also emits a `reasoning_path` field: a short free-text string useful for debugging. It is stripped before the hydration layer and never reaches the Response Specialist.
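A reference such as `self_model.values[0]` only becomes content once hydration resolves it against the bundle. The following is a minimal sketch of that resolution step, assuming refs are dotted key paths with optional `[n]` list indices; the helper name `resolve_ref` and the sample bundle values are hypothetical and are not taken from the module.

```python
import re
from typing import Any

# Hypothetical tokenizer: "self_model.values[0]" -> keys and list indices.
_TOKEN = re.compile(r"([^.\[\]]+)|\[(\d+)\]")


def resolve_ref(bundle: dict, ref: str) -> Any:
    """Walk a dotted/indexed path through the bundle; raise KeyError if it does not resolve."""
    tokens = _TOKEN.findall(ref)
    if not tokens:
        raise KeyError(f"empty bundle ref: {ref!r}")
    node: Any = bundle
    for key, index in tokens:
        try:
            # An index token addresses a list position; a key token addresses a dict entry.
            node = node[int(index)] if index else node[key]
        except (KeyError, IndexError, TypeError) as exc:
            raise KeyError(f"unresolvable bundle ref: {ref!r}") from exc
    return node


# Illustrative bundle; the values are made up for the example.
bundle = {"self_model": {"values": ["honesty", "curiosity"]}}
print(resolve_ref(bundle, "self_model.values[0]"))  # honesty
```

Failing loudly on an unresolvable path keeps a mistyped reference from silently turning into missing content downstream.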

3.3 The Response Specialist

The Response Specialist receives the hydrated `ReasoningPacket`, and nothing else. It does not receive `user_text`. It does not receive the raw bundle. It does not receive the `reasoning_path`. It cannot ask for them.

Its contract (`modules/experimental_chain/contracts/contract_response.txt`, ~300 lines) encodes eight rules. Four are load-bearing:

- **Rule 1: Render every MUST-priority assertion.** SHOULD-priority assertions are optional. Sparse packets are rendered sparsely; the Response Specialist does not complete them from its own reasoning.

- **Rule 1b: Never judge packet sufficiency.** The Response Specialist does not say "there is not enough information here," "I would need to see the text," or equivalent. It does not unilaterally downgrade `CRITIQUE`, `OPINION`, or `DIRECT_ANSWER` to `ABSTAIN`. If the packet is thin, it renders only what the packet contains. Sufficiency is a Reasoning-stage responsibility; by the time Response sees the packet, the sufficiency decision is already locked in.

- **Rule 2: Do not introduce factual content not in the packet.** No names, ages, dates, places, or attributes beyond what the packet supplies. No inference of gender from a name. No analytical framing ("what this means for you") unless the packet contains it. This rule is the structural block on fabrication.

- **Rule 4: Third-person recall reframing.** If a hydrated value reads *"John said X,"* reframe to *"you mentioned X."* This is the only rewriting the Response Specialist is permitted to perform on assertion values.

The intent value routes rendering:

- `DIRECT_ANSWER` renders assertions and `world_facts`, calibrating hedge language to `confidence`.

- `OPINION` opens with a first-person stance marker (*"I think...,"* *"In my view..."*) and grounds in values drawn from the bundle plus optional `world_facts`.

- `CRITIQUE` renders primarily from `critique_findings`, preserving evidence references. Severity controls prominence.

- `ABSTAIN` produces a brief honest response of ≤ 20 words, zero questions, zero follow-up offers, zero emoji.

- `CLARIFY`, `ACKNOWLEDGE`, and `REFUSE` each have a short dedicated rendering mode.

The tone value calibrates register: `Warm`, `Professional`, `Gentle`, `Playful`, `Somber`, or `Candid`. It governs expressiveness; it does not unlock access to out-of-packet content.
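Some of the `ABSTAIN` rendering constraints (at most 20 words, zero questions, zero emoji) are mechanical enough to check after the fact. The sketch below is a hypothetical post-render check, not part of the module described here, which deliberately ships without a compliance layer. The function name, the violation labels, and the emoji ranges are assumptions, and "zero follow-up offers" is not covered because it needs semantic judgement.

```python
import re

# Rough emoji detector covering common pictographic ranges only (assumption, not exhaustive).
_EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def abstain_violations(text: str, max_words: int = 20) -> list[str]:
    """Return the mechanical ABSTAIN constraints that a rendered reply breaks."""
    problems = []
    if len(text.split()) > max_words:
        problems.append("too_long")
    if "?" in text:
        problems.append("contains_question")
    if _EMOJI.search(text):
        problems.append("contains_emoji")
    return problems


print(abstain_violations("I don't have reliable information on that."))  # []
print(abstain_violations("Not sure, can you tell me more?"))  # ['contains_question']
```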

---

4. The `ReasoningPacket` Contract

The packet is a Python data class with strict literal-typed fields (`modules/experimental_chain/schemas.py`):

```python
from dataclasses import dataclass, field
from typing import Literal, Optional

# Controlled enumerations: the only legal values for each non-reference slot.
Intent = Literal[
    "DIRECT_ANSWER", "REFUSE", "CLARIFY", "ABSTAIN",
    "ACKNOWLEDGE", "OPINION", "CRITIQUE",
]

Tone = Literal[
    "Warm", "Professional", "Gentle", "Playful", "Somber", "Candid",
]

Priority = Literal["MUST", "SHOULD"]

Confidence = Literal["HIGH", "MEDIUM", "LOW"]

FactCategory = Literal[
    "factual", "conceptual", "procedural", "definitional", "current_events",
]

AbstainReason = Literal[
    "insufficient_evidence",
    "out_of_scope",
    "user_disclosure_supersedes_bundle",
    "requires_live_retrieval",
    "no_reliable_knowledge",
]


@dataclass
class Assertion:
    ref: str  # path into the bundle
    priority: Priority = "SHOULD"


@dataclass
class WorldFact:
    claim: str  # stable external knowledge
    confidence: Confidence = "MEDIUM"
    category: FactCategory = "factual"


@dataclass
class CritiqueFinding:
    claim: str  # the finding
    evidence: str  # REQUIRED quote or reference from user_text
    severity: Literal["high", "medium", "low"] = "medium"


@dataclass
class ReasoningPacket:
    intent: Intent
    reasoning_path: str = ""  # debug only; stripped pre-hydration
    assertions: list[Assertion] = field(default_factory=list)
    world_facts: list[WorldFact] = field(default_factory=list)
    critique_findings: list[CritiqueFinding] = field(default_factory=list)
    tone: Tone = "Professional"
    abstain_reason: Optional[AbstainReason] = None
```

Two properties of this schema matter for the architectural claim.

First, there is no field in which free-text content about the user's input can be legally placed *other than* `critique_findings`. An `Assertion` is a path string; it cannot carry a model-composed sentence. A `WorldFact` must pass the falsifiability test (true independent of this turn). A `CritiqueFinding` carries free text but requires an `evidence` field that ties the claim to something the user actually wrote. The `reasoning_path` is a free-text slot but it is stripped before Response.

Second, the schema is enforced by strict literal validation. Present-but-invalid enumerated values (`tone="Rude"`, `intent="CRITIQUE_LIGHT"`, `confidence="VERY_HIGH"`) raise `_SchemaViolation` rather than being silently normalized. This makes contract drift visible at the schema layer rather than surfacing downstream as odd tone or uncalibrated hedging.

Missing fields default conservatively. If `intent` is missing, it defaults to `ABSTAIN`. If a `CritiqueFinding.evidence` is missing or empty, coercion raises. These defaults turn partial packets into safe abstentions rather than confident fabrications.
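The two behaviors described above (a missing value falls back to a conservative default, a present-but-invalid value raises) can be sketched as follows. This is a minimal illustration under stated assumptions, not the module's coercion code: `SchemaViolation` stands in for `_SchemaViolation`, `require_literal` and `coerce_header` are hypothetical names, and treating an explicit `None` as "missing" is an assumption.

```python
from typing import Any, Literal, get_args

Intent = Literal[
    "DIRECT_ANSWER", "REFUSE", "CLARIFY", "ABSTAIN",
    "ACKNOWLEDGE", "OPINION", "CRITIQUE",
]
Tone = Literal["Warm", "Professional", "Gentle", "Playful", "Somber", "Candid"]


class SchemaViolation(ValueError):
    """Stand-in for the module's _SchemaViolation."""


def require_literal(name: str, value: Any, literal_type: Any, default: str) -> str:
    """Missing value -> conservative default; invalid value -> raise, never normalize."""
    if value is None:
        return default
    allowed = get_args(literal_type)  # the tuple of legal literal values
    if value not in allowed:
        raise SchemaViolation(f"{name}={value!r} is not one of {allowed}")
    return value


def coerce_header(raw: dict) -> dict:
    """Validate the two top-level enumerated fields of a raw packet dict."""
    return {
        "intent": require_literal("intent", raw.get("intent"), Intent, "ABSTAIN"),
        "tone": require_literal("tone", raw.get("tone"), Tone, "Professional"),
    }


print(coerce_header({"tone": "Warm"}))  # {'intent': 'ABSTAIN', 'tone': 'Warm'}
try:
    coerce_header({"intent": "CRITIQUE_LIGHT"})
except SchemaViolation as exc:
    print("rejected:", exc)
```

Deriving the allowed values from the `Literal` alias with `typing.get_args` keeps the validator and the type definitions from drifting apart.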

---

5. Structural Gates

Four structural gates enforce architectural invariants in code rather than in contract prose.

5.1 Hydration-layer internal field stripping

Before the hydrated packet reaches the Response Specialist, every hydrated value is recursively walked and a set of internal fields is removed:

```python
_INTERNAL_FIELDS = frozenset({
    "id", "record_id", "_id",
    "record_type", "record_subtype",
    "evidence_ids", "utterance_id",
    "conversation_id", "session_id",
    "content_hash", "schema",
})
```

This closes the "internal concern ID leak" class we observed in the Phase -1 proof-of-concept, in which a concern record hydrated as `{"id": "c_3401", "text": "..."}` would result in the Response Specialist faithfully rendering *"your concern c_3401 about..."* to the user. The fix is architectural: Response never sees those fields.
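The recursive walk can be sketched in a few lines. This is an illustrative version assuming hydrated values are plain dicts, lists, and scalars; the function name `strip_internal` and the sample concern record's contents are hypothetical.

```python
from typing import Any

_INTERNAL_FIELDS = frozenset({
    "id", "record_id", "_id",
    "record_type", "record_subtype",
    "evidence_ids", "utterance_id",
    "conversation_id", "session_id",
    "content_hash", "schema",
})


def strip_internal(value: Any) -> Any:
    """Return a copy of a hydrated value with internal keys removed at every depth."""
    if isinstance(value, dict):
        return {
            key: strip_internal(child)
            for key, child in value.items()
            if key not in _INTERNAL_FIELDS
        }
    if isinstance(value, list):
        return [strip_internal(child) for child in value]
    return value  # scalars pass through unchanged


# Illustrative record; the text and nested structure are made up for the example.
concern = {
    "id": "c_3401",
    "text": "worried about deadlines",
    "links": [{"evidence_ids": ["e1"], "label": "work"}],
}
print(strip_internal(concern))
# {'text': 'worried about deadlines', 'links': [{'label': 'work'}]}
```

Building a new structure, and not deleting keys in place, leaves the frozen bundle untouched for any other consumer.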

5.2 `reasoning_path` stripping

The Reasoning Specialist's `reasoning_path` field is dropped during hydration. Response does not receive it. This prevents a class of leaks in which a model's chain-of-thought, emitted in good faith for debugging, would be rendered verbatim as part of the reply.

5.3 Empty-CRITIQUE downgrade gate

Models occasionally emit `intent="CRITIQUE"` on turns where the bundle contains no critique material and the model itself has not produced any findings. Left unchecked, this forces the Response Specialist into CRITIQUE rendering mode with nothing to render, which either surfaces as an awkward acknowledgement or, worse, tempts the model to fabricate findings.

The chain detects this case structurally:

```python
def _is_empty_critique_packet(packet: ReasoningPacket) -> bool:
    # Model-emitted findings are present: the packet is not empty.
    if packet.critique_findings:
        return False
    # Bundle-supplied findings referenced through an assertion path also count.
    for a in packet.assertions:
        if "critique_findings" in (a.ref or ""):
            return False
    return True


# In the chain, once the packet has been parsed:
if packet.intent == "CRITIQUE" and _is_empty_critique_packet(packet):
    packet = ReasoningPacket(
        intent="ABSTAIN",
        abstain_reason="insufficient_evidence",
        tone=packet.tone,
    )
```

The downgrade converts an unsupported CRITIQUE into a well-formed ABSTAIN. Across the twenty-scenario Phase 2 run, the gate fired zero times; the contract was sufficient on its own for all four models tested. The gate remains as belt-and-suspenders for models we have not yet evaluated.

5.4 Mandatory `evidence` field on model-emitted critique findings

The most load-bearing gate is the one that almost was not there. The Phase 1 regression (Section 7) surfaced a failure mode in which Sonnet 4.6 emitted analysis of the user's paragraph in the `world_facts` field, tagged `HIGH/factual`. The fix was not additional prohibition in the world-facts contract. The fix was providing a *correct* slot (the `critique_findings` field) with a required evidence property:

```python
# Excerpt from packet coercion: `cf` is one raw finding, `i` its index in the list.
evidence = cf.get("evidence", "")
if evidence is None:
    evidence = ""
evidence = _require_string(f"critique_findings[{i}].evidence", evidence)
if not evidence:
    raise _SchemaViolation(
        f"critique_findings[{i}].evidence is required and non-empty"
    )
```

A claim without evidence fails coercion. The model cannot emit a turn-specific analytic claim without pinning it to a quoted or referenced element of the user's input. This single structural rule, applied at the schema layer rather than in contract prose, closed the smuggling failure class across all four frontier models we later evaluated.

The contract carries a careful scope clarification about this rule:

> The evidence requirement applies ONLY to model-emitted findings you place in the packet's `critique_findings` output field. It does NOT apply to bundle-supplied pre-computed findings you reference via assertions.

This clarification was added after GPT‑5.4 over-abstained on one scenario (Section 8.4) by applying the evidence rule to upstream-authored findings the bundle had already produced.

---

6. Experimental Methodology

6.1 Fixtures

We evaluated on twenty hand-authored scenarios covering the four highest-risk intents: `OPINION`, `CRITIQUE`, `ABSTAIN`, and `DIRECT_ANSWER`. The scenarios live in:

- `tools/reasoning_lite_poc/fixtures/scenarios_opinion_critique_v1.json` (5 turns; opinion queries and user-material critique)

- `tools/reasoning_lite_poc/fixtures/scenarios_adversarial_v1.json` (5 turns; explicit pushback invitations)

- `tools/reasoning_lite_poc/fixtures/scenarios_adversarial_v2.json` (5 turns; advanced adversarial, including self-contradiction prompts)

- `tools/reasoning_lite_poc/fixtures/scenarios_critique_v2.json` (5 turns; critique requests with varying material, some with paragraphs supplied, some where material is upstream-computed in the bundle)

Each scenario pins a pre-built OK bundle (user profile, recall, values, concerns, authorized claims, and, where the scenario calls for it, pre-computed critique findings attached to `self_model`). This allows identical-bundle cross-model comparison.

6.2 Models

We evaluated four frontier models at the Reasoning stage:

| Model | Provider | Notes |
|---|---|---|
| Anthropic Claude Sonnet 4.6 | Digital Ocean Gradient | Baseline; 1M-context; strongest contract alignment |
| OpenAI GPT‑5.4-mini | Digital Ocean Gradient | Fastest; validated symmetric config |
| OpenAI GPT‑5.4 | Digital Ocean Gradient | Higher capacity; over-strict on one scenario |
| OpenAI GPT‑5.2 | Digital Ocean Gradient | Older generation; JSON prose-drift observed |

The Response stage held at Anthropic Claude Haiku 4.5 for cross-model comparability, with a separate symmetric run using GPT‑5.4-mini on both stages to isolate the speed/quality trade-off.

The chain dispatches models through a unified provider layer (`modules/experimental_chain/provider_dispatch.py`) that covers xAI, Anthropic direct, Ollama (local), and Digital Ocean Gradient. Gradient key resolution follows `GRADIENT_MODEL_ACCESS_KEY` → `MODEL_ACCESS_KEY` → `config.json:gradient_model_access_key`, matching the convention already used by the production path. No assistant-message JSON prefill is used; JSON discipline rests on the contract and a lenient-parse recovery layer.
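The key-resolution order can be sketched as a simple fallback chain. This is an illustration only: the function name `resolve_gradient_key`, the injectable `env` parameter, and the assumption that `config.json` sits at a caller-supplied path are all hypothetical, not taken from `provider_dispatch.py`.

```python
import json
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_gradient_key(
    config_path: str = "config.json",
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the first non-empty key found: env vars first, then the config file."""
    env = os.environ if env is None else env
    for name in ("GRADIENT_MODEL_ACCESS_KEY", "MODEL_ACCESS_KEY"):
        value = env.get(name)
        if value:
            return value
    path = Path(config_path)
    if path.is_file():
        try:
            config = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            return None  # an unreadable config is treated as "no key"
        if isinstance(config, dict):
            return config.get("gradient_model_access_key") or None
    return None


# The more specific variable wins when both are set.
print(resolve_gradient_key(env={"GRADIENT_MODEL_ACCESS_KEY": "a", "MODEL_ACCESS_KEY": "b"}))  # a
```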

6.3 Metrics

We report the following per-run metrics:

- **Chain latency** (p50 and p95, seconds): Reasoning call + hydration + Response call, end to end.

- **Intent distribution**: counts of each intent value across the twenty scenarios.

- **Failure count**: scenarios that either failed schema validation, produced unparseable JSON after lenient recovery, or triggered a documented qualitative error (e.g., over-abstention on a scenario with clear bundle material, world-facts smuggling, fabrication).

- **CRITIQUE turn count**: the number of scenarios the model routed to CRITIQUE, and whether every CRITIQUE packet carried evidence-grounded `critique_findings`.

- **World-facts smuggling count**: instances of turn-specific analysis placed in `world_facts`, judged by the falsifiability test.

Each scenario run produces a full JSON artifact: the raw Reasoning response, the parsed packet, the hydrated packet, the raw Response output, and timing. Full run directories live under `modules/experimental_chain/runs/<timestamp>_<tag>/`.
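For readers reproducing the latency figures from the per-scenario timing artifacts, a minimal percentile helper is sketched below. The paper does not state which percentile method was used, so nearest-rank is an assumption, and the sample latencies are placeholders, not measured data.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    rank = min(max(rank, 1), len(ordered))  # clamp to a valid 1-based rank
    return ordered[rank - 1]


# Placeholder chain latencies in seconds (reasoning + hydration + response).
latencies = [2.0, 1.0, 4.0, 3.0, 5.0]
print(percentile(latencies, 50))  # 3.0
print(percentile(latencies, 95))  # 5.0
```

With only twenty scenarios per run, the nearest-rank p95 is simply the second-slowest scenario, so it is sensitive to a single slow call.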

6.4 Test suite

The module ships with 92 collected tests (`modules/experimental_chain/tests/test_entry.py`): 87 run by default without network, and 5 live integration tests gated on `RUN_EXPERIMENTAL_CHAIN_LIVE=1`. Test classes cover:

- Wiring (imports, contract load, hydration default behaviors)

- Parser drift (fence stripping, prose drift, multi-candidate JSON recovery)

- Schema validation (literal enforcement, primitive field typing, array guards, mandatory evidence)

- Chain boundaries (non-dict JSON handling, packet-aware parser, `reasoning_path` isolation, `world_facts` flow, `critique_findings` flow)

- Render failure handling (`render()` raises cleanly on upstream `stage_failed`)

- Provider dispatch (Gradient SDK integration, local-path warmup discrimination, Ollama option threading)

The suite is the regression floor for the architectural invariants in Section 5.
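The parser-drift cases (fenced output, prose around the JSON, several candidate objects) can be illustrated with a small recovery routine. This is a sketch of one plausible approach, not the module's parser: the name `lenient_parse` is hypothetical, and requiring an `intent` key is an assumed stand-in for "packet-aware".

```python
import json
from typing import Optional


def lenient_parse(raw: str) -> Optional[dict]:
    """Return the first decodable JSON object that looks like a packet, else None."""
    decoder = json.JSONDecoder()
    for start, ch in enumerate(raw):
        if ch != "{":
            continue
        try:
            # raw_decode parses one JSON value starting at `start` and ignores what follows.
            obj, _end = decoder.raw_decode(raw, start)
        except json.JSONDecodeError:
            continue  # not valid JSON from this brace; try the next candidate
        if isinstance(obj, dict) and "intent" in obj:
            return obj
    return None


fence = "`" * 3  # built at runtime to keep a literal fence out of this example
wrapped = f'Sure, here is the packet:\n{fence}json\n{{"intent": "OPINION", "tone": "Warm"}}\n{fence}'
print(lenient_parse(wrapped))  # {'intent': 'OPINION', 'tone': 'Warm'}
print(lenient_parse("I think remote work is great."))  # None
```

A reply that is pure prose yields `None`, which matches the unrecoverable case the suite is meant to surface and not mask.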

---

7. Phase 1: The Regression That Named the Gate

7.1 What happened

During the integration round that merged the Reasoning Specialist with an internal general-knowledge capability, we re-ran the four fixtures against Sonnet 4.6 as the Reasoning model. A scenario in which the user pasted a short paragraph about machine learning and asked for critique produced an otherwise-correct-looking packet, until we inspected `world_facts`:

```json
"world_facts": [
  {
    "claim": "The paragraph's opening claim ('revolutionary technology that is changing the world') uses an empty intensifier that asserts significance without demonstrating it",
    "confidence": "HIGH",
    "category": "factual"
  },
  {
    "claim": "The sentence 'It allows computers to learn from data' is the only substantive claim in the paragraph but is buried second...",
    "confidence": "HIGH",
    "category": "factual"
  }
]
```

These are not world facts. They are analysis of the user's specific paragraph. They were laundered through a slot whose contract explicitly forbade exactly this kind of content. The falsifiability test applied to either claim returns *false*: neither sentence is true if the user has not sent this turn.

7.2 Root cause

The model was not ignoring the contract. It was following the nearest-available instruction. The merged-architecture contract told the Reasoning Specialist it was also responsible for supplying general knowledge about the turn's topic, and then asked it to put that general knowledge in `world_facts`. Sonnet interpreted "stable knowledge about this turn's topic" as including "my analysis of the paragraph in front of me"; the paragraph *was* the topic. The model had no correctly-scoped slot for its analysis, so it used the nearest slot that allowed free text.

7.3 The fix

We did not add prose prohibition. We added a slot. The `critique_findings` field carries turn-specific analytic claims with a mandatory `evidence` field pinning each claim to a quote or reference from `user_text`. The world-facts contract tightened in parallel (a falsifiability test, a worked list of anti-examples, an explicit hard-rule that turn-specific analysis belongs elsewhere).
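To make the change concrete, the sketch below shows how the first smuggled claim from Section 7.1 might look once routed to the new slot. It is an illustrative reconstruction, not an artifact from the run: the claim wording is paraphrased from the regression output, and the `severity` value is assumed.

```python
import json

# Illustrative packet fragment: the finding now lives in critique_findings,
# pinned to the phrase from the user's paragraph that triggered it.
packet = {
    "intent": "CRITIQUE",
    "world_facts": [],  # nothing in this turn is true independent of the turn
    "critique_findings": [
        {
            "claim": (
                "The opening claim uses an empty intensifier that asserts "
                "significance without demonstrating it"
            ),
            "evidence": "revolutionary technology that is changing the world",
            "severity": "medium",  # assumed value for illustration
        }
    ],
}

# Every finding carries non-empty evidence, so it would pass the schema gate.
assert all(finding["evidence"] for finding in packet["critique_findings"])
print(json.dumps(packet, indent=2))
```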

7.4 Post-fix results (Sonnet 4.6 / Haiku 4.5, 20 scenarios)

| Fixture | Intent distribution | Notes |
|---|---|---|
| `opinion_critique_v1` | `OPINION × 3, CRITIQUE × 1, DIRECT_ANSWER × 1` | `cr_01` emits CRITIQUE with 5 evidence-grounded findings |
| `adversarial_v1` | `OPINION × 5` | `adv_01` uses `world_facts` correctly |
| `adversarial_v2` | `OPINION × 5` | Parity with `adversarial_v1` |
| `critique_v2` | `CRITIQUE × 2, OPINION × 1, ABSTAIN × 2` | `cr2_02` has 4 evidence-backed findings; `cr2_04` correctly OPINION |

World-facts smuggling dropped from five documented cases to one borderline case (`adv_04`, a creativity counterexample arguably legitimate on adversarial OPINION). Every CRITIQUE turn carried evidence-grounded findings. The structural gate did its job.

---

8. Phase 2: Cross-Model Comparison

Phase 1 validated the fix against the baseline model. Phase 2 tested the architectural claim (that the gate is structural and not model-specific) by running the same twenty scenarios across four different Reasoning models, holding the Response stage at Haiku 4.5 for comparability.

8.1 Headline results

| Reasoning model | Chain p50 | Chain p95 | Failures | Notes |
|---|---:|---:|---:|---|
| Anthropic Sonnet 4.6 | 14.9 s | 18.8 s | 0 | Baseline; most consistent contract alignment |
| OpenAI GPT‑5.4-mini | **5.7 s** | 8.3 s | 0 | Speed winner; clean contract obedience |
| OpenAI GPT‑5.2 | 10.1 s | 12.8 s | 1 | One JSON prose-drift (`op_02`) |
| OpenAI GPT‑5.4 | 8.4 s | 11.3 s | 0 | One over-strict abstention (`cr2_01`) |

Latencies are chain-level: Reasoning call + hydration + Response call. All four models produced zero world-facts smuggling across the twenty scenarios, with Sonnet's one borderline case on `adv_04` the only marginal call among all eighty model-scenario pairs.

8.2 Sonnet 4.6

Sonnet produced the most consistent contract alignment of the four. Its intent distribution (14 OPINION, 3 CRITIQUE, 1 DIRECT_ANSWER, 2 ABSTAIN) matched the expected distribution for the fixture set. CRITIQUE findings were detailed and evidence-grounded. The single borderline case on `adv_04` (a creativity counterexample in adversarial OPINION) is arguably permissible: under a strict reading the claim passes the falsifiability test (it would be true independent of the turn), but the margin was thin.

The cost of this alignment was latency: Sonnet's p50 of 14.9 s was the slowest of the four. For a reasoning stage that runs on every turn, this is the upper bound of acceptable interactivity.

8.3 GPT‑5.4-mini

GPT‑5.4-mini produced the fastest chain latency of the four (5.7 s p50) with zero failures. Its intent distribution included one additional CRITIQUE (14 OPINION, 3 CRITIQUE, 1 DIRECT_ANSWER, 1 ABSTAIN, 1 PARSE FAIL; the parse fail was on an edge scenario recoverable on retry). More notably, mini was more aggressive on CRITIQUE judgement: it routed two scenarios to CRITIQUE that Sonnet routed to OPINION or ABSTAIN, in each case producing well-evidenced findings. Whether this reflects a different calibration of *how much material is enough to critique* or an opportunistic preference for the highest-signal slot is difficult to tell from twenty scenarios.

Zero world-facts smuggling. No empty-CRITIQUE gate triggers. JSON discipline clean across all scenarios.

8.4 GPT‑5.4

GPT‑5.4 was the middle case in both latency (8.4 s p50) and contract obedience. It produced one notable miss on `cr2_01`, a critique scenario where the bundle contains a pre-computed finding in `self_model.critique_findings.findings[0]`. The correct behavior is to emit CRITIQUE with a bundle reference via an Assertion. GPT‑5.4 instead emitted ABSTAIN, with a reasoning path that read *"the precomputed findings lack required evidence quotes/references."*

GPT‑5.4 had applied the *model-emitted* evidence rule to *bundle-supplied* findings. Upstream-authored findings do not carry an evidence field because they are authoritative as-is; the upstream specialist produced them under its own contract. The model was reading the evidence rule too strictly.

The fix was a contract-scope clarification (quoted in Section 5.4). This is a contract-interpretation edge that did not appear with Sonnet or mini; GPT‑5.4 was the most scope-strict reader of the four. The clarification resolved the behavior on subsequent runs.

8.5 GPT‑5.2

GPT‑5.2 was the weakest of the four on contract obedience. On `op_02`, it emitted a raw-prose reply bypassing JSON entirely; the lenient parser could not recover. One failure out of twenty is not disqualifying, but it was the only model of the four to miss JSON discipline at all. Its critique output was verbose but evidence-grounded (seven findings on `cr2_02`, five on `cr_01`). No world-facts smuggling.

The prose-drift rate was enough to drop GPT‑5.2 from the supported-models list.

8.6 Contract-obedience ranking

Across all twenty-scenario results:

**Sonnet 4.6 ≥ GPT‑5.4-mini > GPT‑5.4 > GPT‑5.2**

Sonnet and mini were the two viable production candidates. The speed gap (14.9 s vs 5.7 s p50) favored mini; the contract alignment (borderline `adv_04` vs scope-clean across twenty) slightly favored Sonnet. With only twenty scenarios, the alignment difference between the two is too small to be statistically meaningful.

---

9. The Symmetric Mini Configuration

A natural question followed the Phase 2 results: if GPT‑5.4-mini is validated at the Reasoning stage and Haiku is validated at the Response stage, is there value in pushing further and running Response on GPT‑5.4-mini as well, giving a symmetric mini/mini configuration?

We ran the same twenty-scenario suite with `reasoning = GPT‑5.4-mini, response = GPT‑5.4-mini`. Results:

| Configuration (Reasoning / Response) | p50 latency | | | |
|---|---:|---:|---:|---:|
| Sonnet 4.6 / Haiku 4.5 (baseline) | 14.9 s | 18.8 s | 0 | 3 |
| GPT‑5.4-mini / Haiku 4.5 | 5.7 s | 8.3 s | 0 | 5 |
| **GPT‑5.4-mini / GPT‑5.4-mini** | **4.3 s** | **5.4 s** | **0** | 4 |

The mini/mini chain delivered a 3.5× p50 speedup over the Sonnet/Haiku baseline (14.9 s to 4.3 s), with zero failures, zero world-facts smuggling, and evidence-grounded critique across all four CRITIQUE turns. Rendered output was tight and specific. On `cr2_03` (a weak-argument critique), the mini/mini chain produced:

> *"It jumps from 'If AI gets smarter every year' to 'by 2040 it will be smarter than every human combined,' which is an extrapolation without support."*

The output quotes the user's material directly, identifies the logical gap clearly, and adds no hedging padding. The bundle-findings clarification prompted by the Section 8.4 miss was also verified on `cr2_01`: mini correctly rendered CRITIQUE using bundle-supplied pre-computed findings without populating a packet-level `critique_findings` list.

This configuration became the module's `GRADIENT_MINI` default.

The finding that matters more than the speed number is the shape of the result. On a twenty-scenario suite with CRITIQUE, OPINION, DIRECT_ANSWER, and ABSTAIN all represented, a mini model in both stages matched the Sonnet/Haiku baseline on every observed quality metric, at under one-third the latency (4.3 s vs 14.9 s p50). If the architectural hypothesis (that decoupling reasoning from expression makes each stage individually simpler) is correct, mini models should be *expected* to be sufficient. The data are consistent with that prediction.
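The speedup figures follow directly from the p50 values reported in the table above. A small check, using only those three numbers (the variable names are illustrative):

```python
# p50 chain latencies in seconds, copied from the results table.
p50_seconds = {
    "Sonnet 4.6 / Haiku 4.5": 14.9,
    "GPT-5.4-mini / Haiku 4.5": 5.7,
    "GPT-5.4-mini / GPT-5.4-mini": 4.3,
}

baseline = p50_seconds["Sonnet 4.6 / Haiku 4.5"]
for name, latency in p50_seconds.items():
    speedup = baseline / latency    # how many times faster than the baseline
    fraction = latency / baseline   # share of the baseline latency
    print(f"{name}: {speedup:.1f}x speedup, {fraction:.0%} of baseline latency")
```

For the mini/mini row this gives 14.9 / 4.3 ≈ 3.5×, or about 29% of the baseline latency; swapping only the Reasoning stage to mini gives 14.9 / 5.7 ≈ 2.6×.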

9.1 Same-Model Single-Call Control

The natural objection to the symmetric mini result is straightforward: perhaps the split chain only looks strong because it has not been compared directly to a normal strong single-model baseline on the same tasks. To answer that, we ran a control using the exact same `GPT-5.4-mini`, the exact same twenty scenarios, and the exact same Outside-Knowledge bundle, but with one unconstrained call responsible for reasoning, knowledge use, and expression.

| Configuration | p50 latency | | | |
|---|---:|---:|---:|---:|
| **Split: GPT-5.4-mini / GPT-5.4-mini** | **4.3 s** | **5.4 s** | **4.2 s** | **0** |
| **Single: GPT-5.4-mini** | **3.1 s** | **4.7 s** | **3.2 s** | **0** |

The single-call baseline was faster by roughly 1.2 s at p50 (3.1 s vs 4.3 s). That result is unsurprising: the single model does not pay for a second call and does not respect a reasoning-to-expression boundary. The more important finding is that the split system remained in the same operational range while preserving a validation surface the single-model baseline does not have.

Behaviorally, the difference was exactly what the architecture predicts. The single model handled raw task-material turns more freely because it could reason directly over the user's paragraph or proposal at generation time. The split system instead respected the boundary between upstream reasoning and downstream expression. This shows up most clearly on architecture-sensitive critique/comparison scenarios:

- On `cr2_05_partial_packet_missing_object`, the split chain correctly emitted `ABSTAIN` because the packet lacked a render-complete critique object; the single model directly critiqued the Europe-expansion proposal from raw prompt text.

- On `cr2_02_review_vocabulary_gate` and `cr2_03_tear_apart_vocabulary`, the single model simply performed the critique, while the split chain's behavior was constrained by whether critique content had been legitimately surfaced into packet form.

- On `cr2_04_comparison_critique`, the single model directly chose a side and critiqued it; the split chain remained bounded by what the reasoning packet could authorize.

This is the key interpretation point. The single-model control did not falsify the split architecture; it clarified what the split architecture is buying. A normal strong model will often be faster and more naturally capable on raw task-material turns because it is unconstrained. The split system's win is different: it preserves much of that capability while introducing an explicit handoff where the host system can inspect the reasoning result, validate it, and only then authorize expression. That handoff does not exist in the single-call baseline.

The practical conclusion is therefore architectural, not leaderboard-oriented. The split chain does not need to outperform a strong single model to justify itself. It needs to remain near it while exposing a point of execution between reasoning and expression where system authority can inspect and validate before language is produced. On this control, that is exactly what the data shows.
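The handoff described above can be sketched as a host-side function that places a validation step between the two calls. This is an illustration under stated assumptions, not the `run_turn()` implementation: `run_governed_turn`, the stub stages, and the policy of falling back to `ABSTAIN` on a validation failure are all hypothetical.

```python
from typing import Any, Callable

Packet = dict[str, Any]


def run_governed_turn(
    user_text: str,
    bundle: dict[str, Any],
    reason: Callable[[str, dict[str, Any]], Packet],
    validate: Callable[[Packet], list[str]],
    hydrate: Callable[[Packet, dict[str, Any]], Packet],
    express: Callable[[Packet], str],
) -> str:
    """Run Reasoning, let the host validate, and only then run Expression."""
    packet = reason(user_text, bundle)  # stage 1: data only, no side effects
    problems = validate(packet)         # host-controlled checkpoint
    if problems:
        # Nothing has reached the user yet, so the host can still fall back.
        packet = {"intent": "ABSTAIN", "assertions": [], "critique_findings": []}
    hydrated = hydrate(packet, bundle)  # resolve refs into concrete values
    return express(hydrated)            # stage 2 never sees user_text or bundle


def stub_reason(user_text: str, bundle: dict[str, Any]) -> Packet:
    # Stand-in for the Reasoning call: claims CRITIQUE but supplies no content.
    return {"intent": "CRITIQUE", "assertions": [], "critique_findings": []}


def stub_validate(packet: Packet) -> list[str]:
    # Stand-in for an empty-CRITIQUE gate.
    has_content = packet["assertions"] or packet["critique_findings"]
    if packet["intent"] == "CRITIQUE" and not has_content:
        return ["CRITIQUE without a render-complete critique object"]
    return []


def stub_hydrate(packet: Packet, bundle: dict[str, Any]) -> Packet:
    return dict(packet)


def stub_express(hydrated: Packet) -> str:
    return f"[{hydrated['intent']}]"


reply = run_governed_turn(
    "Tear apart my expansion plan.", {}, stub_reason, stub_validate,
    stub_hydrate, stub_express,
)
print(reply)  # [ABSTAIN]
```

A single-call baseline has no equivalent of the `validate` step: by the time the host sees anything, reasoning and expression have already been produced together.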

---

10. Failure Classes Closed by the Architecture

Sections 7 through 9 report aggregate results. This section zooms in on the three failure classes the architecture was designed to close and reports the evidence for closure.

10.1 Fabrication

*Definition.* The model introduces a factual claim in the response that is not supported by any input: the user's text, the bundle, or stable external knowledge.

*Mechanism of closure.* The Response Specialist never sees the user's text. It never sees the raw bundle. Its sole content inputs are the hydrated packet values. Every assertion is a ref to a bundle path, resolved at hydration. Every `world_fact` has already been tagged with a confidence level at the Reasoning stage. Every `critique_finding` has a mandatory `evidence` field binding it to a quoted or referenced portion of the user's input.

There is nowhere in the packet from which Response can draw new factual content. The Response contract's Rule 2 ("do not introduce factual content not in the packet") restates this invariant as an instruction, but the invariant is primarily enforced architecturally; there is no affordance to violate. A model would have to hallucinate wholly unsupported content while knowing it has no source material and has been told not to do so.

*Evidence.* Zero fabrication events across the eighty model-scenario pairs of Phase 2 (four models × twenty scenarios) or the twenty additional scenarios of the symmetric mini/mini run.
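The claim that every assertion is a reference resolved at hydration can be made concrete with a small resolver. This is a sketch, not the logic in `hydration.py`: the function `resolve_ref`, the path grammar (dotted keys with `[n]` indexes), and the example bundle contents are assumptions; only the path `self_model.critique_findings.findings[0]` comes from the evaluation.

```python
import re
from typing import Any

# One path token: either a key name or a bracketed list index.
_TOKEN = re.compile(r"([A-Za-z_][A-Za-z0-9_]*)|\[(\d+)\]")


def resolve_ref(bundle: dict[str, Any], ref: str) -> Any:
    """Resolve a dotted bundle path such as 'a.b.items[0]' to its value."""
    if not ref:
        raise ValueError("empty ref")
    node: Any = bundle
    pos = 0
    while pos < len(ref):
        if ref[pos] == ".":
            pos += 1
            continue
        match = _TOKEN.match(ref, pos)
        if match is None:
            raise ValueError(f"malformed ref: {ref!r}")
        key, index = match.group(1), match.group(2)
        try:
            node = node[key] if key is not None else node[int(index)]
        except (KeyError, IndexError, TypeError) as exc:
            # The packet pointed at something the bundle does not contain.
            raise KeyError(f"ref does not resolve in bundle: {ref!r}") from exc
        pos = match.end()
    return node


# Illustrative bundle; the finding text is invented for the example.
bundle = {
    "self_model": {
        "critique_findings": {
            "findings": [{"text": "The cost estimate omits staffing."}]
        }
    }
}
print(resolve_ref(bundle, "self_model.critique_findings.findings[0]"))
```

Because a reference either resolves to a value the host already holds or fails loudly, the Reasoning stage cannot introduce a new assertion value through this channel.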

10.2 Smuggling

*Definition.* The model hides content of one semantic kind in a structural slot of a different semantic kind; canonically, turn-specific analysis presented as stable world knowledge.

*Mechanism of closure.* The Reasoning Specialist has two separately-typed textual slots for free content: `world_facts` and `critique_findings`. The falsifiability test ("true if the user had not sent this turn?") disambiguates the two. The `critique_findings` slot carries an enforced evidence field, structurally binding each analytic claim to the user's material. The absence of a free-text slot for "general commentary" removes the most attractive smuggling target.

*Evidence.* Phase 1 surfaced smuggling (Section 7); the fix was structural (add the correctly-scoped slot, enforce evidence). Phase 2 re-ran the twenty scenarios across four distinct frontier models. Smuggling count: zero on GPT‑5.4, GPT‑5.4-mini, and GPT‑5.2, and zero on Sonnet apart from one borderline case on `adv_04` that is defensible under a strict reading. This cross-model closure is the strongest evidence that the fix is architectural, not model-specific.
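One way a host could check the evidence binding is to confirm that each finding's evidence is quoted from the user's material. The sketch below assumes verbatim quotes only, which is narrower than the contract (it also permits references); `unanchored_findings` and the dict shape are hypothetical, and the user text is an illustrative reconstruction built around the `cr2_03` quotes.

```python
def unanchored_findings(user_text: str, findings: list[dict]) -> list[int]:
    """Return indexes of findings whose evidence is not quoted from user_text."""
    # Compare case-insensitively with whitespace collapsed.
    normalized_source = " ".join(user_text.split()).lower()
    bad: list[int] = []
    for i, finding in enumerate(findings):
        evidence = " ".join(str(finding.get("evidence", "")).split()).lower()
        if not evidence or evidence not in normalized_source:
            bad.append(i)
    return bad


user_text = (
    "If AI gets smarter every year, then by 2040 it will be smarter "
    "than every human combined."
)
findings = [
    # Anchored: the evidence is a direct quote from the user's paragraph.
    {"text": "Unsupported extrapolation.",
     "evidence": "by 2040 it will be smarter than every human combined"},
    # Unanchored: the evidence paraphrases instead of quoting.
    {"text": "Assumes exponential growth.",
     "evidence": "AI doubles in capability annually"},
]
print(unanchored_findings(user_text, findings))  # [1]
```

A check like this would run on the host side before hydration, so an analytic claim with no anchor in the user's turn never reaches the Response stage.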

10.3 Contract-boundary violations

*Definition.* A governance boundary encoded as an architectural invariant is breached through leakage: a reasoning trace reaches the user, an internal ID is rendered, the Response stage accesses the bundle or the raw user text, third-person recall appears verbatim, or a rhetorical question appears in a non-CLARIFY reply.

*Mechanism of closure.* Each violation class corresponds to a specific architectural gate:

| Violation | Gate |
|---|---|
| Reasoning trace leak | `reasoning_path` stripped pre-hydration |
| Internal ID leak | `_INTERNAL_FIELDS` recursive filter at hydration |
| Response reads bundle | Response receives only the hydrated packet |
| Third-person recall survives | Contract Rule 4 reframes "John said X" to "you mentioned X"; the only transformation Response performs on assertion values |
| Rhetorical questions on non-CLARIFY | Contract Rule 6 enumerates allowed question types per intent |

*Evidence.* The 87 default-execution tests in `test_entry.py` include a dedicated `TestChainBoundaries` class that verifies each invariant on synthetic packets. All pass. Phase 2 and mini/mini runs produced zero observed boundary violations across eighty model-scenario pairs plus the twenty additional mini/mini runs.
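The first two gates in the table are mechanical and can be sketched in a few lines. This is an illustration, not the code in `hydration.py`: the members of `_INTERNAL_FIELDS` shown here are placeholders (the real set is not listed in this paper), and `prepare_for_response` is a hypothetical helper name.

```python
from typing import Any

# Placeholder members; the actual _INTERNAL_FIELDS contents are an assumption.
_INTERNAL_FIELDS = frozenset({"id", "source_id"})


def strip_internal(value: Any) -> Any:
    """Recursively drop internal keys from nested dicts and lists."""
    if isinstance(value, dict):
        return {
            key: strip_internal(item)
            for key, item in value.items()
            if key not in _INTERNAL_FIELDS
        }
    if isinstance(value, list):
        return [strip_internal(item) for item in value]
    return value


def prepare_for_response(packet: dict[str, Any]) -> dict[str, Any]:
    """Drop the reasoning trace, then filter internal fields at every depth."""
    without_trace = {k: v for k, v in packet.items() if k != "reasoning_path"}
    return strip_internal(without_trace)


packet = {
    "intent": "DIRECT_ANSWER",
    "reasoning_path": "user asked about the move; recall has one match",
    "assertions": [
        {
            "id": "a-17",
            "value": "You mentioned the move to Lisbon.",
            "meta": {"source_id": "rec-3", "kind": "recall"},
        }
    ],
}
print(prepare_for_response(packet))
```

Because the filter runs in host code before the Response call, these two leaks do not depend on the Response model obeying an instruction.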

---

11. Residual Variance Is Training-Effect Signal

The architecture does not close every observed difference between models. Three residual classes of variance survive the contract, and importantly, survive across all twenty-scenario runs in ways that correlate with model identity, not with scenario content:

- **Gender-pronoun inference on named subjects.** Smaller models (Haiku, Gemma 4 from earlier Phase -1 testing, Mistral 7B) infer gender from names when rendering assertions. Frontier models (Sonnet 4.6, GPT‑5.4, Opus 4.7) reliably do not. The contract forbids the inference in both cases.

- **Tonal inversion on elliptical emotional turns.** Short emotional disclosures (e.g., *"long day."*) provoke tonal misreads in some models: Grok Fast Non-Reasoning (Phase -1) and Gemma 4 occasionally rendered `tone=Gentle` turns with breezy upbeat framing. Sonnet and Haiku did not.

- **Warm ACKNOWLEDGE templating.** When `intent=ACKNOWLEDGE` and `tone=Warm`, most models fall into a characteristic short-acknowledgement cadence. The architecture does not prescribe this; the models converge on it independently.

These are not architectural failures. They are training residuals: prior shapes that the contract does not dislodge. The interesting property of the residuals is that once the architecture has closed every *non*-training source of variance (fabrication, smuggling, boundary leaks, schema drift), the remaining variance becomes clean training-effect signal. Identical bundles, identical contracts, identical scenarios, identical upstream packets, and the only thing left that differs between two models is what their training produced.

This reframes the purpose of the architecture. A governed-specialist chain is not merely a safety mechanism. Once it is tight enough, it becomes a measurement instrument: an isolation chamber in which you can observe a model's training priors at work without confounds. Section 12.5 develops this point.

---

12. Discussion

12.1 Why a data-only contract, rather than prose prohibition?

The central architectural choice, a data-only packet with enumerated fields and typed textual slots rather than a prose instruction to the Reasoning model about what to emit, follows from a straightforward observation. Prose instructions compete with priors. When a contract says *"do not analyze the user's material in the world-facts slot"* and the model has no other correctly-scoped slot for that analysis, the instruction competes with a strong implicit prior to be helpful: to produce the analysis somewhere. When priors win, we observe smuggling.

Providing a correctly-scoped slot reverses the dynamic. The `critique_findings` slot is now the path of least resistance: it is dedicated, its evidence field aligns with what the model is already producing, and the contract is now describing the slot rather than prohibiting a misuse. The model's prior to be helpful now drives it *toward* the correct structure rather than fighting against a prohibition.

This is the generalizable principle: *let code enforce invariants structurally so the contract can describe intent rather than fight the model's priors.* Minimal contracts with rich schemas beat elaborate contracts with sparse schemas.
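A rich schema of this kind might look like the following sketch. The class names match those listed for `schemas.py` in Appendix A, but every field name, the enum members (including `Neutral` and the confidence levels), and the validation approach are assumptions made for illustration, not the actual definitions.

```python
from dataclasses import dataclass
from typing import Literal, get_args

# Illustrative enum members; the real literal sets may differ.
Intent = Literal[
    "DIRECT_ANSWER", "OPINION", "CRITIQUE", "ACKNOWLEDGE", "CLARIFY", "ABSTAIN"
]
Tone = Literal["Warm", "Gentle", "Neutral"]


@dataclass(frozen=True)
class Assertion:
    ref: str  # path into the Outside-Knowledge bundle, never free text


@dataclass(frozen=True)
class WorldFact:
    text: str
    confidence: Literal["high", "medium", "low"]


@dataclass(frozen=True)
class CritiqueFinding:
    text: str
    evidence: str  # mandatory quote or reference into the user's material

    def __post_init__(self) -> None:
        if not self.evidence.strip():
            raise ValueError("critique finding requires evidence")


@dataclass(frozen=True)
class ReasoningPacket:
    intent: Intent
    tone: Tone
    assertions: tuple[Assertion, ...] = ()
    world_facts: tuple[WorldFact, ...] = ()
    critique_findings: tuple[CritiqueFinding, ...] = ()

    def __post_init__(self) -> None:
        # Literal types are not enforced at runtime, so check them explicitly.
        if self.intent not in get_args(Intent):
            raise ValueError(f"unknown intent: {self.intent!r}")
        if self.tone not in get_args(Tone):
            raise ValueError(f"unknown tone: {self.tone!r}")


packet = ReasoningPacket(
    intent="CRITIQUE",
    tone="Warm",
    critique_findings=(
        CritiqueFinding(
            text="The conclusion is an unsupported extrapolation.",
            evidence="by 2040 it will be smarter than every human combined",
        ),
    ),
)
print(packet.intent, len(packet.critique_findings))
```

The design point is what the schema leaves out: there is no general commentary field, so passing one to the constructor fails with a `TypeError` before any contract wording has to prohibit it.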

12.2 Why separate Response at all? Why not a single LM with a structured output?

Structured output alone does not close fabrication or smuggling. A model emitting structured JSON can still hallucinate fields or place content in wrong slots: the structure is schema-validated after the fact; the content originates from a single decoding pass with full access to the user's text, the bundle, and its own priors. A single pass, however governed on output, still carries the fusion of reasoning and expression.

The second call is where the architectural guarantee arises. The Response Specialist is a pure rendering function over a packet it cannot have produced, with no access to the raw material. Whatever its training priors, it has no content to hallucinate *against*. It can produce awkward phrasing, redundant warmth, or overly-cautious hedging, but it cannot invent a fact: the data is not there.

This is why the Response contract is shorter than the Reasoning contract (≈ 300 lines vs ≈ 500 lines) even though the reasoning stage is often considered the "harder" one. Expression under a closed-world constraint is structurally simpler than governance under open-world pressure.

The same-model single-call control from Section 9.1 makes this point sharper. A normal `GPT-5.4-mini` baseline was faster and often more freely capable on raw task-material turns because it could read the user's material and decide in one pass. But that gain came by removing the very property the architecture is designed to create: a host-visible handoff between reasoning and expression. The second call is therefore not justified by leaderboard-style quality alone. It is justified by the existence of that handoff.

12.3 Implications for specialist-trained models

If the architecture is sufficient with general frontier models, it will be *more* sufficient with trained-for-role specialists. The Reasoning stage is a well-defined task: given a bundle and a user turn, emit a packet conforming to a strict schema. The Response stage is an even more bounded task: given a hydrated packet, emit a short natural-language reply. Both stages are amenable to supervised fine-tuning from existing output on operational data.

A trained reasoning-to-packet model would have no general-knowledge prior to spill into `world_facts` on turn-specific material, no tendency to emit reasoning traces in free-form slots, and a calibrated sense of when to `ABSTAIN` vs `DIRECT_ANSWER`. A trained packet-to-prose model would have no gender-pronoun inference prior, no tonal inversion on elliptical turns, and no warm-ACKNOWLEDGE templating that we did not train into it. The residual training-effect variance from Section 11 is the gap that trained specialists should close.

This is consistent with the broader thesis that *parameter count is a poor predictor of role fit compared to training*. Phase -1 testing observed a 9B reasoning-tuned model failing where an 8B general model passed, and a 7B model failing where an 8B model held [Canady, 2026f]. Role-matched training beats parameter count at these scales.

12.4 Implications for governance composition

This experiment was deliberately minimal: the chain runs with no ComplianceValidator, no RIC gate, no Authority Engine, no SRL claim extractor, no honesty gate. The result is a clean signal on the reasoning/expression split *alone*, isolated from governance layers stacked on top. It is not a production-ready path.

What the result suggests, however, is that the packet contract composes cleanly with those governance layers. Every rule that the SBA spine enforces downstream (claim/evidence alignment, authority scope, RIC integrity scoring) operates on exactly the kind of structured representation the packet already provides. The spine's ExpressionPacket and the experimental chain's `ReasoningPacket` are sibling concepts: both decouple the governance layer from the rendering layer by passing a typed structure across the boundary.

The integration workstream is separate from this paper. The architectural point it makes is narrow: decoupling reasoning from expression using a data-only contract is sufficient to close the fusion failure classes on its own. Governance layers add capabilities; they are not load-bearing for this result.

The single-model control clarifies why this matters. In the one-call baseline, the host system can validate only after reasoning and expression have already been fused into one paragraph. In the split chain, the host system can supply the Outside-Knowledge bundle, receive a data result from Reasoning that has not yet affected the user or any external system, validate that result, and only then authorize the Response stage to express it. The Reasoning Specialist calls no tools, performs no external side effects, and makes no decisions outside the scope of its packet. That separation is exactly what makes governed runtime composition feasible.

12.5 The architecture as an instrument

Once the three non-training failure classes are closed, what remains in the cross-model variance is training-effect signal. The Phase 2 run can be re-read in this light. Sonnet's extra latency and precise contract alignment, mini's speed and comfortable directness on CRITIQUE, GPT‑5.4's scope-strict contract reading on pre-computed findings: these are not separately interesting quirks. They are the models' training priors, measured through an apparatus that has removed confounds.

This is a useful property. It suggests that governed-specialist chains, once tight enough to close structural failures, become testbenches for model-level questions. How does this model handle contract edges? How does it behave under bundle-supplied versus model-supplied evidence? Where does its safety training show up in the presence of adversarial OPINION scenarios? The answers are not the chain's answers. They are the model's answers, rendered legible by an architecture that no longer masks them.

---

13. Limitations and Deferred Work

**Sample size.** Twenty scenarios is enough to surface class-level architectural claims: a smuggling count of zero across eighty model-scenario pairs is informative. It is not enough for production-grade statistical comparisons between models. Scaling the fixture suite to several hundred scenarios, including live-user adversarial inputs, is straightforward and planned.
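To put a number on how informative a zero count is, the exact one-sided upper bound for zero events in n trials is 1 − α^(1/n). The sketch below applies it to the counts in this paper. It assumes independent trials, which the eighty model-scenario pairs are not (they reuse the same twenty scenarios), so the real uncertainty is wider than these figures suggest. The function name is illustrative.

```python
def zero_event_upper_bound(n: int, confidence: float = 0.95) -> float:
    """Exact upper bound on a per-trial rate after 0 events in n trials."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    alpha = 1.0 - confidence
    # Solve (1 - p) ** n = alpha for p: the largest rate still compatible
    # with seeing no events at the chosen confidence level.
    return 1.0 - alpha ** (1.0 / n)


for n in (20, 80, 300):
    bound = zero_event_upper_bound(n)
    print(f"n={n}: the true rate could still be as high as {bound:.1%}")
```

At n = 80 the bound is about 3.7%, which supports a class-level claim that the failure is rare but cannot distinguish a 1% rate from a 0.1% rate; a suite of several hundred scenarios brings the bound to roughly 1%.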

**Scenario author bias.** The fixtures are hand-authored by the architecture's author. There is a risk that the scenarios implicitly favor slots the architecture provides. A second-party adversarial fixture round (scenarios written by someone attempting to break the architecture) would provide stronger evidence.

**Governance bypass.** The chain does not run the ComplianceValidator, authorized-claims check, SRL gate, or RIC scorer. Before live-user exposure, the chain will need either a minimal no-fabrication defense layer or explicit scoping as a comparison path only. The `cr2_01` contract-interpretation edge (Section 8.4) hints that even simple contract-scope changes can cascade in non-obvious ways; production-grade deployment needs the governance layers above the packet, not just the packet alone.

**Model coverage.** The evaluation covers four frontier models on one cloud provider (DigitalOcean Gradient). Extending to xAI Grok variants, local Ollama models, Google Gemini, and additional Anthropic/OpenAI sizes is necessary before the "architecture closes fusion across models" claim is fully general. Early Phase -1 runs across Grok Fast, Gemma 4, Mistral 7B, and local 8–20B models are consistent with the Phase 2 findings but have not yet been re-run under the current contract.

**GPT‑5.2 drop.** GPT‑5.2's JSON prose-drift rate led us to drop it from the supported list rather than invest in model-specific recovery. This is operational rather than scientific (GPT‑5.2 is reportedly superseded in the vendor's own lineup), but it limits the contract-obedience ranking's generality.

**Long-horizon turns.** The fixtures are single-turn. Multi-turn scenarios with context windowing, thread continuity, and accumulated bundle growth over a session are a different regime and are not covered here.

---

14. Conclusion

The same-model single-call control added in this revision helps pin down what the architecture is actually buying. A normal one-call `GPT-5.4-mini` baseline on the exact same scenarios was faster than the split `GPT-5.4-mini / GPT-5.4-mini` chain, as expected, and often more freely capable on raw task-material turns. But that does not weaken the split result. It clarifies it: the split system does not need to beat a strong single model. It needs to remain near enough in speed and capability that the host-visible validation boundary between reasoning and expression is worth the cost.

On these results, it is. The split system stayed in the same operational range, preserved the bounded handoff, and exposed a point where the host system can inspect and validate the reasoning result before authorizing expression. That point does not exist in the single-call baseline.

We set out to test whether a data-only contract between two language-model calls is sufficient to close the failure modes caused by fusing reasoning and expression into a single prose pass. The answer from a twenty-scenario evaluation across four frontier models is yes: zero observed fabrications, zero observed boundary violations, and zero observed smuggling events (modulo one borderline Sonnet case), with a symmetric mini-model configuration that runs 3.5× faster than the Sonnet/Haiku baseline at matching quality.

The architectural claim that the result underwrites is narrower than a benchmark claim and broader than an engineering observation. It is this: when the data structure between stages carries the invariants, the contract can be short, the models can be small, and the residual between-model variance is training-effect signal rather than architectural failure. Once that property holds, the governed-specialist chain stops being only a safety mechanism. It becomes a measurement apparatus: an isolation chamber in which a model's training priors are legible because the architectural confounds have been removed.

Specialist-trained models should outperform general frontier models on both stages, because the residual variance identified in Section 11 is exactly the kind of variance supervised fine-tuning closes. That is a prediction this architecture makes and that the next phase of work will test.

---

## References

- [Canady, 2026a] *Bond-Indexed Memory: Relational State-Based Retrieval for Persistent AI.* `papers/paper_2_bond_indexed_memory.md`.

- [Canady, 2026b] *Gravity-Weighted Significance: Two-Stage Scoring for Automatic Memory Surfacing.* `papers/paper_3_gravity_significance.md`.

- [Canady, 2026c] *The Relational Integrity Coefficient: A Five-Subscale Behavioral Trustworthiness Metric.* `papers/paper_1_ric.md`.

- [Canady, 2026d] *Memory-Augmented Cognitive Intelligence: A Unified Architecture for Trustworthy Persistent AI.* `papers/paper_4_unified_architecture.md`.

- [Canady, 2026e] *Self-Bounded Authority: A Runtime Spine for Governed Language-Model Systems.* `papers/SELF_BOUNDED_AUTHORITY_AIME_ALIGNMENT.md`.

- [Canady, 2026f] *Modular Cognitive Architecture: Specialist Agents + Single Expression Model.* `papers/Modular_Cognitive_Architecture_Specialist_Agents.md`.

---

## Appendix A. Implementation File Index

All paths are relative to the AiMe v3 repository root.

### Architecture and orchestration

- `modules/experimental_chain/entry.py`: public entry points `run_turn()` and `render()`.

- `modules/experimental_chain/chain.py`: orchestration, packet coercion, empty-CRITIQUE gate (lines 215–243), strict literal validation (lines 307–482).

- `modules/experimental_chain/schemas.py`: typed contracts (`ReasoningPacket`, `CritiqueFinding`, `Assertion`, `WorldFact`, enum literals; lines 44–212).

- `modules/experimental_chain/config.py`: `ChainConfig` definitions including `GRADIENT_MINI` (lines 34–93).

- `modules/experimental_chain/hydration.py`: reference resolution, internal-field stripping (`_INTERNAL_FIELDS`, lines 21–27), recursive walk (lines 50–57).

- `modules/experimental_chain/provider_dispatch.py`: provider dispatch for xAI, Anthropic, Ollama, and DigitalOcean Gradient.

### Contracts

- `modules/experimental_chain/contracts/contract_reasoning.txt`: Reasoning Specialist contract (488 lines; see lines 83–92 knowledge boundary, 94–250 intent guide, 252–419 world-facts discipline, 301–394 critique-findings discipline, 322–343 evidence scope).

- `modules/experimental_chain/contracts/contract_response.txt`: Response Specialist contract (308 lines; see lines 47–58 Rule 1, 59–73 Rule 1b, 75–91 Rule 2, 99–107 Rule 4, 109–233 intent handling).

### Tests

- `modules/experimental_chain/tests/test_entry.py`: 92 tests (87 default-execution, 5 live-network behind `RUN_EXPERIMENTAL_CHAIN_LIVE=1`). Classes: `TestWiring`, `TestParserDrift`, `TestSchemaValidation`, `TestChainBoundaries`, `TestRenderRaises`, `TestLocalPath`, `TestContractCache`, `TestReasoningRetry`, `TestOllamaOptions`, `TestGradientProvider`, `TestGradientConfig`, `TestLiveChain`, `TestLiveGradient`.

### Fixtures

- `tools/reasoning_lite_poc/fixtures/scenarios_opinion_critique_v1.json`

- `tools/reasoning_lite_poc/fixtures/scenarios_adversarial_v1.json`

- `tools/reasoning_lite_poc/fixtures/scenarios_adversarial_v2.json`

- `tools/reasoning_lite_poc/fixtures/scenarios_critique_v2.json`

### Validation artifacts

- `REVIEW_EXPERIMENTAL_CHAIN_R4.md`: Phase 1 regression-and-fix narrative; Phase 2 cross-model results; mini/mini result.

- `modules/experimental_chain/runs/PHASE2_COMPARISON.md`: consolidated cross-model analysis.

- `modules/experimental_chain/runs/20260423_193959_phase1_full/`: Phase 1 full run directory (raw packets, rendered output, timing).

- `modules/experimental_chain/runs/20260423_200729_phase2_mini_mini/`: symmetric mini configuration run directory.

- `modules/experimental_chain/contracts/contract_single_model.txt`: single-model baseline contract used for the one-call control.

- `modules/experimental_chain/single_model_bench.py`: benchmark runner for the single-model control.

- `modules/experimental_chain/compare_runs.py`: split-versus-single comparison report generator.

- `modules/experimental_chain/runs/20260423_230631_single_mini_baseline_live/`: live single-model `GPT-5.4-mini` control run directory.

- `modules/experimental_chain/runs/comparisons/20260423_200729_phase2_mini_mini__vs__20260423_230631_single_mini_baseline_live_mini_split_vs_single.md`: side-by-side split vs single-model comparison.

### Design context

- `IP/REASONING_SPECIALIST_LITE_PHASE_MINUS_1_SUMMARY.md`: Phase -1 proof-of-concept validation, cross-model scale testing, training-residual taxonomy.

- `IP/REASONING_SPECIALIST_PLAN.md`: full specialist plan (Reasoning stage as a governed specialist in the SBA spine; separate workstream from `experimental_chain`).

### Git

- `27f4d10 feat(experimental_chain): two-LM specialist chain + Phase 1/2 validation`: the tip-of-tree commit against which the results in this paper were gathered.