At-Scale Reliability Validation of a Data-Only Packet Chain: 5,000-Run Stability Bench on the Decoupled Reasoning/Expression Architecture

Abstract

We report the results of a 5,000-run stability benchmark (250 unique scenarios × 20 runs each) on the `GRADIENT_MINI` configuration of the two-stage decoupled reasoning/expression chain described in [Canady, 2026f]. The benchmark was conducted on `GPT-5.4-mini / GPT-5.4-mini` (symmetric configuration) via Digital Ocean Gradient. Headline results: 15 observed failures across 5,000 runs (0.30% raw failure rate); true architectural failure rate 0/5,000, as every failure was caught by a pre-Response structural gate before any non-compliant output could reach a user. Intent stability was 88.0% (220/250 scenarios produced identical intent across all 20 runs). Chain latency at scale improved over the 20-scenario Phase 2 result (p50 3,837 ms vs. 4,261 ms), consistent with prompt-cache activation on the ~10k-token contract prefix. The benchmark surfaces two results not present in [Canady, 2026f]. First, strict schema enforcement at scale functions as a *schema feedback mechanism*: 9 of 15 failures were the reasoning model independently and repeatedly emitting `category="etymological"` on a scenario whose bundle contained a value named `etymological_curiosity`, signaling that the `FactCategory` schema was missing a distinction the model considered load-bearing. Adding the category resolved all 9 failures on plain single-pass re-run. Second, a retry experiment that fed the validator's exact error message back to the reasoning stage in a single additional call recovered 14 of 15 failures, projecting a post-retry failure rate of approximately 0.02%. These results extend and deepen the architectural claims of [Canady, 2026f]: a data-only packet contract with strict literal validation does not merely close structural failure classes at 20-scenario scale; it does so reliably at 5,000-run scale, and when a gate does fire, the failure is informative rather than silent.

1. Introduction

[Canady, 2026f] established three claims about the two-stage decoupled reasoning/expression architecture:

1. A data-only packet contract with strict typed fields is sufficient to close the fabrication, smuggling, and contract-boundary-violation failure classes across four frontier models on a 20-scenario evaluation.

2. A symmetric `GPT-5.4-mini / GPT-5.4-mini` configuration achieves 4.3 s p50 chain latency with zero failures, a 3.5× speedup over the `Sonnet 4.6 / Haiku 4.5` baseline.

3. Once architectural failure modes are closed, residual cross-model variance reduces to training-effect signal, making the governed-specialist chain a measurement instrument.

All three claims rest on a 20-scenario evaluation. While the cross-model comparison (four models, twenty scenarios each, eighty model-scenario pairs) provides meaningful coverage, it is not a production-scale reliability number. A rigorous reliability claim requires more runs, more scenarios, and the kind of stochastic variation that only emerges at volume.

This paper reports that volume result. The 5,000-run bench was designed as a direct extension of [Canady, 2026f]: same architecture, same `GRADIENT_MINI` configuration, same provider, larger fixture set. Its purpose is to answer three questions [Canady, 2026f] could not:

- Does the 0-failure result from Phase 2 hold at 5,000 runs?

- How stable are intent classifications across repeated runs of the same scenario?

- What do the failures, when they occur, look like and what do they reveal about the architecture?

The answers bear directly on whether the decoupled chain is deployable in production and on what the failure mode of a strictly-governed specialist chain looks like in practice.

2. Experimental Setup

2.1 Configuration

All runs used the `GRADIENT_MINI` configuration:

- **Reasoning stage:** `gradient:openai-gpt-5.4-mini`

- **Response stage:** `gradient:openai-gpt-5.4-mini`

- **Provider:** Digital Ocean Gradient

- **Contract:** `contract_reasoning.txt` (488 lines), `contract_response.txt` (308 lines)

- **Schema:** `ReasoningPacket` as defined in [Canady, 2026f] Section 4, with strict `Literal[...]` validation

2.2 Fixture Set

250 unique scenarios were used, generated by `Sonnet 4.6` to provide broad intent-taxonomy coverage. Scenarios span the full intent space: `DIRECT_ANSWER`, `OPINION`, `CRITIQUE`, `ABSTAIN`, `CLARIFY`, `ACKNOWLEDGE`, and `REFUSE`. The fixture set extends the 20 scenarios of [Canady, 2026f] with additional coverage of:

- Boundary-thin intent decisions (OPINION vs. ABSTAIN on underspecified material)

- Code and logic critique scenarios

- User-disclosure-superseding-bundle scenarios (new profile facts contradicting stored bundle data)

- Temporal ambiguity scenarios (references to prior sessions with ambiguous recency)

- Sycophancy-trap opinion scenarios (user presents a position and asks for validation)

- Adversarial out-of-scope queries (legal strategy, tax evasion, psychiatric medication changes)

Each scenario pins a pre-built Outside-Knowledge (OK) bundle matching the format used in [Canady, 2026f].

2.3 Run Protocol

Each of the 250 scenarios was run independently 20 times, for 5,000 total reasoning-stage calls. Each run produced: the raw Reasoning response, the parsed `ReasoningPacket`, the hydrated packet, the raw Response output, and per-stage timing. Runs were collected over approximately 5.5 hours (2026-04-24T16:03:57Z → 2026-04-24T21:35:33Z).
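
The per-run record and the 250 × 20 loop described above can be sketched as follows. This is an illustration only: `RunRecord`, `run_bench`, and every field name are hypothetical and are not taken from the repository; the fields simply mirror the artifacts listed in the paragraph above.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class RunRecord:
    """One run of one scenario (hypothetical layout)."""
    scenario_id: str
    run_index: int
    raw_reasoning: str                                  # raw Reasoning response
    parsed_packet: Optional[dict[str, Any]] = None      # parsed ReasoningPacket
    hydrated_packet: Optional[dict[str, Any]] = None    # packet after hydration
    raw_response: Optional[str] = None                  # raw Response output
    timing_ms: dict[str, float] = field(default_factory=dict)  # per-stage timing
    failure_class: Optional[str] = None                 # set only when a gate fires


def run_bench(
    scenario_ids: list[str],
    run_once: Callable[[str, int], RunRecord],
    runs_per_scenario: int = 20,
) -> list[RunRecord]:
    """Run every scenario independently `runs_per_scenario` times."""
    records: list[RunRecord] = []
    for scenario_id in scenario_ids:
        for run_index in range(runs_per_scenario):
            records.append(run_once(scenario_id, run_index))
    return records
```

With 250 scenario IDs and the default of 20 runs each, `run_bench` returns 5,000 records, one per reasoning-stage call.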

Failures were classified at the stage where the structural gate fired:

| Failure class | Definition |
| --- | --- |
| `reasoning_json_parse` | Reasoning output was empty or unparseable |
| `reasoning_json_shape` | JSON parsed but top-level was not an object |
| `reasoning_schema_violation` | Parsed JSON contained an invalid `Literal[...]` value |

No failure class corresponds to output reaching the Response stage or the user. The chain is fail-closed: a gate firing at any stage terminates the run before downstream stages execute.
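
The gate ordering described above can be illustrated with a minimal sketch. It assumes a packet whose `world_facts` entries each carry a `category` field; `GateFailure` and `run_reasoning_gates` are hypothetical names rather than the repository's API, and the category list is the `FactCategory` definition quoted in Section 4.2.

```python
import json
from typing import Any, Literal, get_args

# FactCategory as it stood at the time of the run (quoted in Section 4.2).
FactCategory = Literal[
    "factual", "conceptual", "procedural", "definitional", "current_events",
]


class GateFailure(Exception):
    """Raised when a structural gate fires; carries the failure class."""

    def __init__(self, failure_class: str, detail: str) -> None:
        super().__init__(f"{failure_class}: {detail}")
        self.failure_class = failure_class
        self.detail = detail


def run_reasoning_gates(raw: str) -> dict[str, Any]:
    """Return the parsed packet, or raise GateFailure at the first gate that fires."""
    # Gate 1: empty or unparseable output.
    if not raw.strip():
        raise GateFailure("reasoning_json_parse", "empty response body")
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise GateFailure("reasoning_json_parse", str(exc)) from exc

    # Gate 2: the top-level JSON value must be an object.
    if not isinstance(parsed, dict):
        raise GateFailure(
            "reasoning_json_shape",
            f"top-level is {type(parsed).__name__}, expected object",
        )

    # Gate 3: strict literal check on each fact's category (assumed layout).
    allowed = get_args(FactCategory)
    for fact in parsed.get("world_facts", []):
        category = fact.get("category") if isinstance(fact, dict) else None
        if category not in allowed:
            raise GateFailure(
                "reasoning_schema_violation",
                f"category={category!r} not in {allowed}",
            )
    return parsed
```

Because the function raises instead of returning a partial packet, a caller cannot hand a rejected packet to the Response stage by accident, which is the fail-closed property the text describes.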

3. Headline Results

| Metric | Value |
| --- | --- |
| Total runs | 5,000 |
| Total failures (raw) | 15 (0.30%) |
| True architectural failure rate | **0 / 5,000** |
| Scenarios with zero failures | 244 / 250 (97.6%) |
| Intent-stable scenarios (identical intent, all 20 runs) | 220 / 250 (88.0%) |
| Tone-stable scenarios (identical tone, all 20 runs) | 179 / 250 (71.6%) |
| Chain latency p50 | 3,837.5 ms |
| Chain latency p95 | 6,312.6 ms |
| Chain latency mean | 3,982.8 ms |

The latency numbers bear emphasis. [Canady, 2026f] reported a p50 of 4.3 s (4,261 ms) for the symmetric `GPT-5.4-mini / GPT-5.4-mini` configuration on 20 scenarios. At 5,000 runs, p50 dropped to 3,837 ms, a ~10% improvement. This is consistent with Gradient prompt-cache activation once the ~10k-token contract prefix became a hot cache entry. If that explanation holds, the efficiency gain compounds: the chain's narrow-scope design produces shorter generations per call, and at production-scale call volume the cost of the fixed contract prefix is amortized by the cache rather than paid in full on every call.
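
For readers reproducing the latency summary, the sketch below shows one way to compute percentiles from per-run chain timings and to check the relative improvement. The nearest-rank method is an assumption; the text does not say which percentile definition the bench used, and `percentile` and `relative_improvement` are illustrative names.

```python
import math


def percentile(samples: list[float], q: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least q% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def relative_improvement(before_ms: float, after_ms: float) -> float:
    """Fractional reduction in latency from `before_ms` to `after_ms`."""
    return (before_ms - after_ms) / before_ms


# p50 figures reported in the text: 4,261 ms (Phase 2) vs. 3,837 ms (this bench).
print(f"{relative_improvement(4261, 3837):.1%}")  # 10.0%
```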

4. Failure Analysis

4.1 Failure breakdown

| Failure class | Count | Root cause |
| --- | --- | --- |
| `reasoning_schema_violation` | 12 | Model emitted an enum value not present in `Literal[...]` schema |
| `reasoning_json_parse` | 2 | Empty response body (transient provider issue) |
| `reasoning_json_shape` | 1 | Top-level JSON was a list, not an object |
| **Total** | **15** | |

Every failure was caught before the Response stage. No non-compliant output reached any downstream component. The true architectural failure rate, defined as the fraction of runs in which the chain produced output that violated an invariant without being caught, is **0 / 5,000**.

4.2 Failure clustering

Of the 15 failures, 9 cluster on a single scenario: `dir_knowable_world_facts__hi_periodic_table_element_gold`. The scenario asks about gold's chemical symbol and the Latin origin of `Au`. Its OK bundle contains a value named `etymological_curiosity`. The reasoning model, across 9 of 20 runs on this scenario, independently emitted `category="etymological"` when classifying the `world_facts` entry for *"Au comes from the Latin word aurum."* The strict `Literal[...]` validator rejected this value, since `FactCategory` at the time of the run was defined as:

```python
from typing import Literal

# Schema at the time of the run: no "etymological" category yet.
FactCategory = Literal[
    "factual", "conceptual", "procedural", "definitional", "current_events",
]
```

Three additional failures follow the same pattern, with an invented `category="medical"` on health-adjacent scenarios (diet plan, blood pressure supplements, child vaccination decision). Two failures were empty-body responses from the provider: transient infrastructure events, not architectural failures. One failure was a top-level JSON list rather than an object, caught at the shape gate.

4.3 The correct interpretation of schema violations

The 12 schema-violation failures are not failures of reasoning quality. In each case, the model's underlying judgment was sound: the gold-etymology fact genuinely has a distinct epistemic character from `factual`, `conceptual`, or `definitional`; the health-adjacent facts were plausibly of a `medical` kind. The failures occurred because the model reached for a category the schema had not provided.

This is the architecture doing its job. A prose-only contract would have allowed `category="etymological"` to propagate to the Response stage and from there to the user, where it would have been semantically invisible but structurally invalid. The strict validator intercepted it, terminated the run cleanly, and made the violation legible. The failure was informative rather than silent.

Section 6 returns to the significance of this property.

5. Intent Stability Analysis

5.1 Fully stable scenarios

220 of 250 scenarios (88.0%) produced identical intent across all 20 runs. This is the core stability claim: in nearly nine-tenths of cases, the chain's intent classification is deterministic in practice even under stochastic temperature sampling.

5.2 Stable disagreements with the fixture generator

28 scenarios produced a single non-Sonnet intent stably across all 20 runs: the chain and the fixture generator (Sonnet 4.6) consistently classify the same turn differently. These are not instabilities. They are cases where two frontier models, operating under different training priors and different system contexts, reach different but defensible intent conclusions.

The disagreements cluster into five training-pattern categories:

| Pattern | Count | Description |
| --- | --- | --- |
| A | 2 | Code/logic debugging: Sonnet routes to `CRITIQUE`, chain routes to `DIRECT_ANSWER` |
| B | 2 | Illegal-activity queries: chain correctly escalates from `ABSTAIN` to `REFUSE` |
| C | 10 | User-disclosure-then-question: chain treats disclosure as context and answers; Sonnet abstains |
| D | 10 | "Should I X?" turns: chain offers stance (`OPINION`); Sonnet wants disambiguation (`CLARIFY`) |
| E | 3 | Temporal-context resumption: chain engages from bundle context; Sonnet wants clarification |

Several of these disagreements favor the chain's interpretation on pragmatic grounds. Pattern B is the clearest: on a turn asking for tax-evasion strategy and a turn asking for advice on exaggerating workers' compensation injury symptoms, Sonnet emitted `ABSTAIN` with `out_of_scope`; the chain emitted `REFUSE`. The chain's escalation is architecturally correct: these turns are policy-excluded, not merely outside the system's knowledge. Pattern C is similarly notable: when a user discloses a new diabetes diagnosis and then asks a diet question, Sonnet abstained on the disclosure-supersedes-bundle logic; the chain correctly treated the disclosure as context and answered the actual question.

Per the analysis in [Canady, 2026f] Section 11, stable cross-model disagreements of this kind are training-prior signal, not architectural failure. With identical bundles, contracts, and scenarios, the only remaining source of variance is which model is running, so the difference reflects the models' training, not the architecture.
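
The distinction between stable agreement, stable disagreement, and cross-run variance used in this section can be expressed as a small classifier. This is a sketch under the assumption that each fixture records the intent assigned by the fixture generator; `classify_scenario` and the label strings are illustrative names.

```python
from collections import Counter


def classify_scenario(expected_intent: str, run_intents: list[str]) -> str:
    """Label a scenario by how its runs relate to the fixture generator's intent."""
    if not run_intents:
        raise ValueError("scenario has no runs")
    distinct = set(run_intents)
    if len(distinct) != 1:
        return "cross_run_variance"       # two or more intents across runs
    (only_intent,) = distinct
    if only_intent == expected_intent:
        return "stable_agreement"
    return "stable_disagreement"          # stable, but differs from the generator


# Toy data shaped like the cases in the text (not the actual bench records).
toy = [
    ("ABSTAIN", ["REFUSE"] * 20),                       # Pattern B shape
    ("OPINION", ["OPINION"] * 20),
    ("CRITIQUE", ["CRITIQUE"] * 18 + ["ABSTAIN"] * 2),  # thin-boundary shape
]
labels = Counter(classify_scenario(expected, runs) for expected, runs in toy)
print(labels)
```

Under this labelling, intent stability counts both stable categories together, since a stable disagreement is still a single intent across all 20 runs.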

5.3 Cross-run intent variance

30 scenarios produced 2 or more distinct intents across 20 runs. Of these, the dominant pattern is heavy concentration on one intent with a small minority on another, typically 18/20 or 19/20 on the primary intent. This matches the characterization of thin-boundary cases in [Canady, 2026f]: the empty-CRITIQUE downgrade gate fires on 1–3 of 20 runs when a CRITIQUE packet would be empty of findings, correctly choosing `ABSTAIN` in those runs. The variance is bounded and structurally explained.

One scenario showed meaningful spread: `cri_critique_requests_on_wea_travel_itinerary_notes` (CRITIQUE × 10, OPINION × 6, DIRECT_ANSWER × 4). The user's rough travel itinerary contains little critiqueable material; the chain is legitimately ambivalent across runs about whether to critique, suggest, or inform. This is an honest reflection of a genuinely thin boundary, not instability.
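
The per-scenario spread discussed above can be summarized as a dominant intent and its share of runs. The sketch below uses the counts reported for the travel-itinerary scenario; `intent_distribution` is an illustrative helper, not part of the bench tooling.

```python
from collections import Counter


def intent_distribution(run_intents: list[str]) -> tuple[str, float, Counter]:
    """Return the dominant intent, its share of runs, and the full tally."""
    if not run_intents:
        raise ValueError("scenario has no runs")
    counts = Counter(run_intents)
    dominant, n = counts.most_common(1)[0]
    return dominant, n / len(run_intents), counts


# Counts reported for cri_critique_requests_on_wea_travel_itinerary_notes.
runs = ["CRITIQUE"] * 10 + ["OPINION"] * 6 + ["DIRECT_ANSWER"] * 4
dominant, share, counts = intent_distribution(runs)
print(dominant, share)  # CRITIQUE 0.5
```

A thin-boundary scenario with an 18/20 or 19/20 split would show a dominant share of 0.90 or 0.95 by the same measure, which makes the travel-itinerary case easy to separate from the rest.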

6. Schema Feedback: Failures as Signals

The most methodologically significant result of this benchmark is not the failure rate — it is what the failures communicate.

When a strict literal validator fires 9 times on the same scenario across a 5,000-run bench, consistently on the same invented enum value, the model is not malfunctioning. It is repeatedly reaching for a distinction the schema does not provide. The `world_facts` slot asks the model to categorize knowledge by epistemic kind. Word-origin facts — *"Au comes from the Latin word aurum"* — have a distinct character. They are not `factual` in the sense of a measurement or physical property. They are not `conceptual`. They are not `definitional` in the sense of a technical definition. The model reached for `etymological` because that category genuinely fits, and the schema had not given it.

This is a property unique to architectures with strict structural validation. In a prose-only system, the invented category would flow through silently — the user would never see it, and the schema gap would never surface. In a strictly-validated packet chain, the gap becomes a visible, countable, analyzable signal. The architecture exposes the schema's incompleteness rather than hiding it.
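
One way to turn rejected values into a countable signal is to aggregate schema violations by field and rejected value. The sketch below is illustrative: `SchemaViolation` and `schema_gap_candidates` are hypothetical names, and the `health_scenario_*` IDs are placeholders, since the text does not give the IDs of the health-adjacent scenarios.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class SchemaViolation(NamedTuple):
    scenario_id: str
    field: str
    rejected_value: str


def schema_gap_candidates(
    violations: list[SchemaViolation], min_count: int = 2
) -> list[tuple[str, str, int, int]]:
    """Return (field, value, n_failures, n_scenarios) for repeatedly rejected values."""
    counts: Counter = Counter()
    scenarios: dict = defaultdict(set)
    for v in violations:
        key = (v.field, v.rejected_value)
        counts[key] += 1
        scenarios[key].add(v.scenario_id)
    return [
        (field, value, n, len(scenarios[(field, value)]))
        for (field, value), n in counts.most_common()
        if n >= min_count
    ]


GOLD = "dir_knowable_world_facts__hi_periodic_table_element_gold"
log = [SchemaViolation(GOLD, "category", "etymological")] * 9 + [
    # Placeholder scenario IDs; the real IDs are not given in the text.
    SchemaViolation(f"health_scenario_{suffix}", "category", "medical")
    for suffix in "abc"
]
for row in schema_gap_candidates(log):
    print(row)
```

On this toy log the helper reports `etymological` with 9 failures on one scenario and `medical` with 3 failures across three scenarios, the two shapes of recurring rejection described in Section 4.2.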

The fix was a local, surgical schema expansion:

```python
from typing import Literal

# Expanded schema: "etymological" added as a sixth category.
FactCategory = Literal[
    "factual", "conceptual", "procedural", "definitional",
    "current_events", "etymological",
]
```

With corresponding additions to the reasoning and response contracts, the model's preferred category was legitimized rather than suppressed. On plain single-pass re-run under the expanded schema, all 9 gold-scenario failures recovered, with the model naturally emitting `category="etymological"` on the Latin-origin fact — exactly where it belongs.

The general principle this establishes: **in a governed specialist chain with strict literal validation, the failure mode is schema feedback, not silent corruption.** This property is load-bearing for the governance claim. A system that fails visibly and informatively, and whose failures translate directly into schema improvements, is more maintainable and more trustworthy than one that accepts all output silently.

7. Retry Recovery Experiment

The 15 failures were re-run through a single additional reasoning call with the validator's exact error message prepended to the original user turn:

> *"A prior attempt failed with: `<error>`. Please emit a valid ReasoningPacket that fixes that issue."*

No contract change, no schema change, no retry loop inside the chain itself — one extra call per failure with the failure reason visible.

| Outcome | Count |
| --- | --- |
| Recovered on single retry | 14 / 15 |
| Failed again | 1 / 15 |

The one non-recovery was a transient infrastructure failure (empty body); the retry produced a response but with a different invented category (`practical`), suggesting the model had no clean alternative given the schema gap at that time.
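
The retry protocol described in this section can be sketched as a single extra call with the error message prepended. The template text is quoted from the protocol above; `retry_with_failure_reason`, `call_reasoning`, and `validate` are hypothetical names, and the actual runner in `retry_failures_poc.py` may be structured differently.

```python
from typing import Callable

# Wording taken from the retry message quoted in this section.
RETRY_TEMPLATE = (
    "A prior attempt failed with: {error}. "
    "Please emit a valid ReasoningPacket that fixes that issue."
)


def retry_with_failure_reason(
    user_turn: str,
    error: str,
    call_reasoning: Callable[[str], str],
    validate: Callable[[str], object],
) -> tuple[bool, str]:
    """Make exactly one extra reasoning call with the validator error prepended.

    Returns (True, raw_output) on success or (False, new_error) on failure.
    `validate` is assumed to raise ValueError (or a subclass) on a bad packet.
    """
    prompt = RETRY_TEMPLATE.format(error=error) + "\n\n" + user_turn
    raw = call_reasoning(prompt)
    try:
        validate(raw)
    except ValueError as exc:
        return False, str(exc)   # still fail-closed: no second retry
    return True, raw
```

The function deliberately makes one call and no more, matching the experiment's constraint that there is no retry loop inside the chain itself.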

The recovery rate of 14/15 on a single retry with zero contract changes establishes a near-free failure-recovery path orthogonal to schema fixes. Layering retry-with-failure-reason on top of the schema expansion projects a combined failure rate of approximately 0.02% — two runs in ten thousand.

| Projection | Raw failure rate |
| --- | --- |
| Original 5,000-run bench | 0.30% |
| Post-schema-expansion (single pass) | ≤ 0.12% |
| With retry-with-failure-reason layered on | ~0.02% |
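
The projections in the table follow from the failure counts by simple division. The sketch below shows one reading of that arithmetic; it assumes the post-schema figure removes only the 9 gold-scenario failures and that the single unrecovered retry is the residual.

```python
TOTAL_RUNS = 5_000

raw_failures = 15
fixed_by_schema_expansion = 9    # gold-scenario failures resolved on re-run
unrecovered_after_retry = 1      # the one failure the single retry did not fix


def rate(failures: int, total: int = TOTAL_RUNS) -> float:
    """Failure count as a fraction of total runs."""
    return failures / total


print(f"{rate(raw_failures):.2%}")                              # 0.30%
print(f"{rate(raw_failures - fixed_by_schema_expansion):.2%}")  # 0.12%
print(f"{rate(unrecovered_after_retry):.2%}")                   # 0.02%
```

The middle figure is an upper bound because it counts every failure other than the nine gold-scenario cases as still failing after the schema expansion.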

The retry mechanism's significance extends beyond the numbers. It demonstrates that the chain's fail-closed discipline is composable with recovery: the validator's structured error message is itself a governed artifact that can be passed back to the reasoning stage as input, producing a corrected packet without any change to the chain's architecture or contracts. The failure closes the loop rather than terminating it.

8. Discussion

8.1 What this benchmark establishes

**Claim 1: The 0-failure result of [Canady, 2026f] Phase 2 generalizes to 5,000-run scale.** No non-compliant output reached the Response stage or the user across any of the 5,000 runs. The true architectural failure rate is 0/5,000.

**Claim 2: Intent classification is reliable at 88% strict stability.** The 12% of scenarios with cross-run variance concentrate on genuine boundary cases — primarily the CRITIQUE-downgrade gate firing on thin material. The variance is bounded and structurally explained.

**Claim 3: Latency improves at production scale.** Prompt-cache activation on the fixed contract prefix reduces p50 from 4,261 ms (20-scenario Phase 2) to 3,837 ms (5,000-run bench) — a ~10% improvement that compounds with volume.

**Claim 4: Strict schema validation is a feedback mechanism, not just an enforcement mechanism.** Repeated failure on a missing category is a legible signal that the schema is incomplete. The chain does not need to be perfect the first time; it needs to fail visibly, capture the failure's structure, and compose cleanly with schema evolution and retry recovery.

8.2 What this benchmark does not establish

**Content-level fabrication at scale.** The schema validator catches invalid enum values but not a model inventing a plausible-looking `world_facts` claim that is factually wrong. Content-level validation at 5,000-run scale — requiring either human review or an automated fact-check layer — is deferred work. [Canady, 2026f] addresses content-level correctness qualitatively across 80 model-scenario pairs; this benchmark addresses structural reliability.

**Real-traffic representativeness.** The 250 scenarios were model-generated. They cover the intent taxonomy broadly but do not reproduce the distributional shape of real user turns. Fixture sets mined from ledger data would add realism.

**Multi-turn stability.** All scenarios are single-turn. Multi-turn context accumulation, bundle growth across a session, and thread-continuity behavior at scale are not covered here.

8.3 Implications for production deployment

The combined picture from [Canady, 2026f] and this benchmark supports a production deployment argument. The architectural failure classes identified in [Canady, 2026f] — fabrication, smuggling, contract-boundary violations — were closed at 20-scenario scale. This benchmark shows that closure holds at 5,000-run scale. The failure mode when the architecture does fire is informative rather than silent. Recovery via retry-with-failure-reason is near-free and effective. Latency improves rather than degrades at volume.

The remaining gap between this benchmark and production is the governance layer stack: the ComplianceValidator, RIC gate, Authority Engine, and SRL claim extractor described in [Canady, 2026e] are not active in the experimental chain. Those layers add capabilities; they are not load-bearing for the structural reliability result. The integration workstream that places the decoupled chain inside the SBA spine, adding governance layers above the packet, is the next step toward production deployment.

9. Conclusion

The 5,000-run stability benchmark extends the claims of [Canady, 2026f] from a 20-scenario proof to a production-scale reliability number. The chain's failure mode — strict literal validation intercepting invalid enum values before they reach the Response stage or the user — is exactly the behavior the architecture predicts and exactly what a governed specialist chain should do.

The headline numbers are clean: 0.30% raw failure rate, 0/5,000 true architectural failures, 88% intent stability, latency improving at scale. But the more durable result is the schema-feedback property. An architecture that makes its failures legible, translates them directly into schema improvements, and composes cleanly with retry recovery is not merely reliable; it is maintainable in a way that architectures with silent failure modes are not.

The prediction from [Canady, 2026f] Section 12.3, that specialist-trained models should outperform general frontier models on both stages, because the residual variance identified in cross-model comparison is exactly the kind of variance supervised fine-tuning closes, is unchanged by this benchmark. What this benchmark adds is the empirical ground on which to stand while building toward those specialist models: a documented, reproducible, at-scale reliability result for the general-model configuration that will serve as the training-data source for the specialists that follow.

## References

- [Canady, 2026a] *Bond-Indexed Memory: Relational State-Based Retrieval for Persistent AI.* `papers/paper_2_bond_indexed_memory.md`.

- [Canady, 2026b] *Gravity-Weighted Significance: Two-Stage Scoring for Automatic Memory Surfacing.* `papers/paper_3_gravity_significance.md`.

- [Canady, 2026c] *The Relational Integrity Coefficient: A Five-Subscale Behavioral Trustworthiness Metric.* `papers/paper_1_ric.md`.

- [Canady, 2026d] *Memory-Augmented Cognitive Intelligence: A Unified Architecture for Trustworthy Persistent AI.* `papers/paper_4_unified_architecture.md`.

- [Canady, 2026e] *Self-Bounded Authority: A Runtime Spine for Governed Language-Model Systems.* `papers/SELF_BOUNDED_AUTHORITY_AIME_ALIGNMENT.md`.

- [Canady, 2026f] *Decoupling Reasoning from Expression: A Data-Only Packet Contract Between Two Language Models, With Cross-Model Empirical Validation.* `papers/paper_6_decoupling_reasoning_expression.md`.

## Appendix A. Implementation and Run Artifacts

All paths relative to the AiMe v3 repository root.

### Run artifacts

- `modules/experimental_chain/runs/20260424_160357_stability_5000/` — full 5,000-run directory (raw packets, rendered output, timing, per-scenario intent distributions)

- `IP/stability_5000_audit.md` — complete failure listing, intent disagreement cases, cross-run variance analysis, and Part 5 follow-up experiments (source document for this paper)

### Schema and contract changes

- `modules/experimental_chain/schemas.py` — `FactCategory` Literal expanded to include `"etymological"`; `VALID_FACT_CATEGORIES` frozenset updated

- `modules/experimental_chain/contracts/contract_reasoning.txt` — inline schema example and CATEGORY description block updated

- `modules/experimental_chain/contracts/contract_response.txt` — category list in data-shape reference updated

### Follow-up experiment scripts

- `tools/reasoning_lite_poc/retry_failures_poc.py` — retry-with-failure-reason experiment runner

- `modules/experimental_chain/runs/20260424_160357_stability_5000/RETRY_REPORT.md` — per-failure retry detail

### Fixture

- `tools/reasoning_lite_poc/fixtures/scenarios_stability_v1.json` — 250-scenario fixture set (Sonnet 4.6 generated)

### Configuration

- `modules/experimental_chain/config.py` — `GRADIENT_MINI` configuration definition (lines 34–93)