We introduce the Cognitive Workspace Transformer (CWT), a neural architecture that replaces the standard transformer’s residual stream with a structured hub-and-spoke workspace featuring content-addressed decay gates, dual-system processing, and PonderNet-style adaptive compute. We train CWT at 57.8M parameters and evaluate it against two controlled baselines: a parameter-matched 8-layer standard transformer (57.9M) and a compute-matched 13-layer standard transformer (67.5M), all trained on identical data (FineWeb-Edu, 5.2B tokens) with identical hyperparameters. CWT achieves PPL 29.54 — beating the parameter-matched baseline (30.67, a 3.7% improvement) despite allocating only 22.9M parameters to core attention and FFN computation. Against the 13-layer baseline (PPL 29.04), which devotes 41.7M to attention and FFN, CWT comes within 1.7% in quality with 45% less core compute capacity, demonstrating that the workspace infrastructure makes each compute parameter substantially more effective. Comprehensive ablation analysis across 30+ interventions confirms that workspace structural components (hub, tags, decay gates) are load-bearing, with the hub shared region alone causing +8,114% degradation when zeroed. The architecture additionally provides smooth inference-time compute/quality tradeoffs, honest epistemic self-monitoring through hub delta dynamics, and robust long-context extrapolation up to 2× training length with under 10% degradation.
Standard transformers [23] encode all information — content, positional context, routing signals, and layer-to-layer communication — into a single undifferentiated residual stream. Each layer reads and writes to this shared vector without structural guidance about what information belongs where, who wrote it, or whether it should persist. As models scale, an increasing fraction of parameters is spent on implicit routing and demultiplexing rather than useful computation.
We propose the Cognitive Workspace Transformer, which replaces the residual stream with a structured workspace organized into distinct memory regions: private per-layer spokes, permanent broadcast billboards, a decay-managed shared hub, and identity tags that drive content-addressed forgetting. This organization provides several advantages that compound with scale:
Structural addressing eliminates the need for layers to learn routing from scratch, freeing parameters for semantic computation. Our controlled baseline comparisons show that CWT matches a standard transformer that has 45% more core compute capacity (41.7M vs 22.9M in attention+FFN), demonstrating that the workspace infrastructure is not merely overhead but an active force multiplier on the parameters that remain.
Content-addressed decay gates enable selective forgetting based on who wrote what. Ablation shows hub write sensitivity increases from +76% to +547% degradation between steps 8K and 20K, revealing that the hub develops precise calibration over training — each dimension comes to carry specific information at carefully tuned magnitudes.
Dual-system processing with adaptive compute allows the model to allocate different amounts of deliberation to tokens of varying difficulty, providing a smooth compute/quality tradeoff at inference time that standard transformers cannot offer.
Hub-derived epistemic signals provide honest uncertainty estimation without auxiliary classifiers, derived directly from the dynamics of hub state evolution across layers.
CWT draws inspiration from Global Workspace Theory in cognitive science (Baars, 1988 [3]), where a shared broadcast medium enables specialized processors to communicate and coordinate. The hub serves as this global workspace — all layers can read from it, but writes are mediated by decay gates that evaluate the relevance and freshness of existing content before allowing updates.
The workspace state tensor \(S \in \mathbb{R}^{B \times T \times d_s}\) replaces the residual stream. It is partitioned into four semantically distinct regions:
Spokes (\(d_{\text{spoke}} = 48\) per layer): Private scratch memory for each layer. A layer can read and write its own spoke but cannot access other layers’ spokes. This provides guaranteed private computation space that cannot be corrupted by other layers’ writes.
Hub Private / Billboards (\(d_{\text{hub\_priv}} = 16\) per layer): Permanent broadcast channels. Each layer writes to its billboard; all subsequent layers can read from all billboards. Unlike the hub shared region, billboards are never decayed — they serve as persistent signals about what each layer discovered.
Hub Shared (\(d_{\text{hub\_shared}} = 256\)): The central communication channel. All layers read from and write to this shared region, mediated by tag-addressed decay gates. This is the primary information pathway and the region where most semantic content resides. At 256 dimensions, it carries half the bandwidth of the baseline’s 512-dimensional residual stream, yet ablation shows it is the most critical component (+8,114% degradation when zeroed).
Tags (\(d_{\text{tag}} = 16\) per layer): Identity markers embedded in the workspace that identify which layer wrote which content. Decay gates read tags to make content-addressed forgetting decisions — suppressing stale information from earlier layers when it is no longer relevant.
The total workspace dimension is \(d_s = n_{\text{layers}} \times (d_{\text{spoke}} + d_{\text{hub\_priv}} + d_{\text{tag}}) + d_{\text{hub\_shared}}\). For our 8-layer configuration: \(d_s = 8 \times (48 + 16 + 16) + 256 = 896\).
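As a quick sanity check, the partition arithmetic can be reproduced directly (sizes taken from the configuration above):

```python
# Workspace partition sizes from the 8-layer configuration.
N_LAYERS = 8
D_SPOKE = 48        # private per-layer scratch
D_HUB_PRIV = 16     # permanent per-layer billboard
D_TAG = 16          # per-layer identity tag
D_HUB_SHARED = 256  # decay-gated shared hub

def workspace_dim(n_layers: int) -> int:
    """Total workspace width: d_s = n * (spoke + billboard + tag) + shared hub."""
    return n_layers * (D_SPOKE + D_HUB_PRIV + D_TAG) + D_HUB_SHARED

print(workspace_dim(N_LAYERS))  # → 896
```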
Each layer’s write to hub shared is gated by a decay mechanism that examines existing tags to decide what to suppress:
\[g_n = \sigma(W_g \cdot [\text{query}(H); \text{mean\_tag}] + b_n)\]
where \(b_n\) is a per-system bias (3.0 for S1, 5.0 for S2) that provides a “soft ponder lock” — S1 layers default to moderate retention while S2 layers default to strong retention, ensuring that S2’s refinements are not easily overwritten. The query is derived from the layer’s hidden state projected to tag dimension, and the mean tag is averaged over a sliding window of recent layers’ tags.
The gate values are applied through a custom FlooredMultiply operation with straight-through gradient estimation:
\[\text{hub\_new} = \text{FlooredMultiply}(\text{hub\_old}, g_n, \text{floor}) + \Delta_n\]
The gradient floor (0.5 for S1, 0.85 for S2) ensures that gradients always flow through all layers regardless of gate values, preventing dead gradient paths. The forward pass applies the true gate value; the backward pass clamps it to the floor, providing a form of straight-through estimation that preserves gradient magnitude.
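A minimal sketch of this rule, with the backward pass written out by hand rather than through an autograd framework (the function names are illustrative, not the implementation's):

```python
def floored_multiply_forward(hub_old: float, gate: float, delta: float) -> float:
    """Forward pass: the true gate value scales the old hub content."""
    return hub_old * gate + delta

def floored_multiply_backward(grad_out: float, gate: float, floor: float) -> float:
    """Backward pass w.r.t. hub_old: the gate is clamped to the floor,
    so gradients keep flowing even when the gate is nearly closed."""
    return grad_out * max(gate, floor)

# A nearly closed S1 gate (0.05) would scale gradients by 0.05;
# the 0.5 gradient floor keeps the backward path alive.
g, floor = 0.05, 0.5
print(floored_multiply_forward(1.0, g, 0.0))    # forward uses the true gate: 0.05
print(floored_multiply_backward(1.0, g, floor)) # backward uses the floor: 0.5
```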
CWT organizes its layers into two systems inspired by dual-process theory:
System 1 (S1): Layers 0–5 (6 unique weight matrices). These process every token exactly once, building the initial representation. S1 layers have lower decay gate bias (3.0), allowing more aggressive information turnover as the representation is constructed.
System 2 (S2): Layers 6–7 (2 unique weight matrices). These are reused during pondering, with each iteration seeing the updated hub content from the previous pass. S2 layers have a higher decay gate bias (5.0 = base 3.0 + 2.0 S2 offset), producing the "soft ponder lock" described above: strong default retention that prevents S2's refinements from being easily overwritten. Because the S2 bias is defined as an offset from the S1 base, retuning the base at a different scale automatically carries over to S2. The characteristic reconsider/commit oscillation pattern has S2-L0 making larger hub modifications and S2-L1 making smaller refinements.
An S1 exit loss (\(\lambda_{\text{s1}} = 0.1\), Phase 2 only) trains S1 to produce standalone-quality predictions by running the collapse and decoder on the workspace state at the S1/S2 boundary. This enables a no-ponder deployment mode (PPL 36.33) that uses only 8 layer passes.
After S1 processing, S2 layers are applied iteratively with PonderNet-style halting. A learned halt head examines hub content and produces a per-token halting probability at each step:
\[p_{\text{halt}}^{(t)} = \sigma(W_{h2} \cdot \text{ReLU}(W_{h1} \cdot \text{hub}^{(t)}))\]
The final output is a weighted combination across all ponder steps:
\[S_{\text{out}} = \sum_{t=0}^{T_{\max}} w_t \cdot S^{(t)}, \quad w_t = p_{\text{halt}}^{(t)} \cdot \prod_{i<t}(1 - p_{\text{halt}}^{(i)})\]
with a geometric prior regularization loss that encourages the model to use fewer steps when possible:
\[\mathcal{L}_{\text{ponder}} = \text{KL}(q_{\text{halt}} \| \text{Geometric}(\lambda_p))\]
At convergence (step 20K), the halt distribution settles at: step 0 = 23.9%, step 1 = 16.6%, step 2 = 12.5%, step 3 = 9.7%, step 4 = 7.7%, remainder = 29.6%, giving an expected depth of 2.49 S2 iterations. A blind halt ablation confirmed that KL divergence and entropy inputs to the halt head contributed only +0.3% over hub-content-only input, so the halt head uses hub content exclusively.
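The expected depth can be recovered from the reported halt mass, under the assumption that the 29.6% remainder is forced onto a final step:

```python
# Reported halt mass per S2 step at step 20K; the 29.6% remainder is
# treated here as mass forced onto a final step (one plausible reading).
halt_mass = [0.239, 0.166, 0.125, 0.097, 0.077, 0.296]

expected_depth = sum(step * w for step, w in enumerate(halt_mass))
print(expected_depth)  # close to the reported 2.49 S2 iterations
```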
CWT uses Multi-head Latent Attention (MLA) with an architectural privacy constraint: queries are computed from the full readable workspace (spoke + hub + tags), but keys and values are derived only from the hub content:
\[Q = W_Q \cdot \text{read\_project}(S_{\text{readable}}), \quad c_{\text{KV}} = W_{\text{down}} \cdot \text{norm}(S_{\text{hub}})\] \[[K; V] = W_{\text{up}} \cdot c_{\text{KV}}\]
where \(c_{\text{KV}} \in \mathbb{R}^{d_{\text{kv\_latent}}}\) is a compressed latent representation with \(d_{\text{kv\_latent}} = 128\). This provides 8× KV cache compression compared to standard multi-head attention while architecturally enforcing spoke privacy — a layer’s queries can attend based on private scratch state, but the information retrieved is always from the shared hub.
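The 8× figure follows from per-token, per-layer cache sizes, assuming the standard cache stores full keys and values at d_model:

```python
D_MODEL = 512
D_KV_LATENT = 128

# Standard MHA caches K and V per token per layer (2 * d_model values);
# MLA caches only the compressed latent c_KV (d_kv_latent values).
mha_cache_per_token = 2 * D_MODEL  # 1024
mla_cache_per_token = D_KV_LATENT  # 128

print(mha_cache_per_token // mla_cache_per_token)  # → 8
```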
Hub self-distillation replaces the expensive vocabulary-projection probes used in earlier versions (v5.4). Each layer's hub output is supervised via mean squared error against the final hub state through a lightweight bottleneck:
\[\mathcal{L}_{\text{distill}} = \frac{1}{|P|} \sum_{n \in P} \| f_{\theta}(\text{hub}_n) - \text{hub}_{\text{final}} \|^2\]
where \(f_\theta\) is a two-layer MLP (hub → 128 → hub) and \(P\) is a stratified set of probed layers (the last 5 layers plus 1 random layer). The final hub state is detached — gradients flow only through intermediate layers, encouraging each layer to progressively approximate the final state.
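A forward-only numpy sketch of the distillation objective, with random illustrative weights and an assumed ReLU bottleneck (copying the target stands in for the `.detach()`):

```python
import numpy as np

rng = np.random.default_rng(0)
D_HUB, D_BOTTLENECK = 256, 128

# Illustrative bottleneck MLP f_theta: hub -> 128 -> hub.
W1 = rng.normal(scale=0.02, size=(D_BOTTLENECK, D_HUB))
W2 = rng.normal(scale=0.02, size=(D_HUB, D_BOTTLENECK))

def distill_loss(probed_hubs, hub_final):
    """MSE between f_theta(hub_n) and the (detached) final hub state,
    averaged over the probed layer set P."""
    target = hub_final.copy()  # no gradient would flow to the target
    losses = []
    for hub_n in probed_hubs:
        pred = W2 @ np.maximum(W1 @ hub_n, 0.0)  # ReLU bottleneck (assumed)
        losses.append(np.mean((pred - target) ** 2))
    return float(np.mean(losses))

probed = [rng.normal(size=D_HUB) for _ in range(6)]  # last 5 layers + 1 random
loss = distill_loss(probed, rng.normal(size=D_HUB))
print(loss)
```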
This provides deep supervision at 250× lower cost than the probe-based approach. The earlier probes required a vocabulary-sized matmul (\(512 \times 50{,}257 \approx 25\)M elements) per probed layer, accounting for a 4.5× throughput difference between v5.4 (6.5K tok/s) and v5.6 (29K tok/s in Phase 1).
At convergence, distillation error decreases monotonically across layers (L0: 0.228, L1: 0.150, L2: 0.118, L3: 0.090, L4: 0.056, L5: 0.024, L6: 0.009, L7: 0.000), confirming that each layer progressively refines the hub toward its final state.
The workspace is collapsed to \(d_{\text{model}}\) via a learned projection over the spokes and hub regions (excluding tags), normalized, then passed through a two-layer SwiGLU [20] FFN stack before the tied language model head:
\[\text{collapsed} = W_{\text{collapse}} \cdot \text{RMSNorm}([\text{spokes}; \text{hub}])\] \[x_0 = \text{RMSNorm}(\text{collapsed} \cdot g_{\text{epistemic}})\] \[x_1 = \text{RMSNorm}_1(x_0 + \text{FFN}_1(x_0))\] \[x_2 = \text{RMSNorm}_2(x_1 + \text{FFN}_2(x_1))\] \[\text{logits} = W_{\text{lm\_head}} \cdot x_2\]
where \(g_{\text{epistemic}}\) is a learned multiplicative gate conditioned on hub delta dynamics and ponder workload. The lm_head weights are tied with the input embedding. We note in Section 9 that this decoder design is a candidate for simplification — the two FFN layers consume ~4.3M parameters without access to attention or the workspace, making them the least efficient parameters in the model.
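The decoder stack can be sketched in a few lines of numpy (random illustrative weights; the epistemic gate and the tied lm_head are omitted):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """RMSNorm without a learned scale, for illustration."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def swiglu_ffn(x, W_gate, W_up, W_down):
    """SwiGLU FFN: silu(x W_gate) * (x W_up), projected back down."""
    gate = x @ W_gate
    silu = gate / (1.0 + np.exp(-gate))
    return (silu * (x @ W_up)) @ W_down

# Illustrative shapes from the config table (d_model=512, d_ffn=1408).
rng = np.random.default_rng(0)
D_MODEL, D_FFN = 512, 1408
params = [rng.normal(scale=0.02, size=s)
          for s in 2 * [(D_MODEL, D_FFN), (D_MODEL, D_FFN), (D_FFN, D_MODEL)]]

x0 = rms_norm(rng.normal(size=D_MODEL))          # stands in for the collapsed, gated workspace
x1 = rms_norm(x0 + swiglu_ffn(x0, *params[:3]))  # FFN_1 + residual
x2 = rms_norm(x1 + swiglu_ffn(x1, *params[3:]))  # FFN_2 + residual
print(x2.shape)
```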
| Parameter | Value |
|---|---|
| Total parameters | 57.8M (weight-tied) |
| Core compute (attn + FFN) | 22.9M |
| Workspace overhead | ~9.2M |
| d_model | 512 |
| d_s (workspace) | 896 |
| Layers | 8 (6 S1 + 2 S2) |
| Attention heads | 8 |
| FFN dimension | 1408 |
| KV latent dimension | 128 |
| Max ponder steps | 5 |
| Sequence length | 4,096 |
| Vocabulary | 50,257 (GPT-2 [19]) |
We train on FineWeb-Edu (sample-10BT) [18], a curated educational web corpus. Training uses 4× NVIDIA RTX 3090 GPUs with DDP, AdamW optimizer [15] (\(\beta_1 = 0.9\), \(\beta_2 = 0.95\)), peak learning rate \(3 \times 10^{-4}\) with cosine decay to \(10^{-5}\), 1,000-step warmup, and gradient clipping at 1.0. Effective batch size is 96 sequences (batch size 3 × 8 gradient accumulation steps × 4 GPUs). The training corpus contains approximately 5.2 billion tokens; over 20,000 optimizer steps the model processes approximately 7.9 billion tokens (~1.5 epochs).
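The batch and token budgets can be checked with a few lines:

```python
SEQ_LEN = 4096
BATCH_PER_GPU, GRAD_ACCUM, N_GPUS = 3, 8, 4
STEPS = 20_000
CORPUS_TOKENS = 5.2e9

eff_batch = BATCH_PER_GPU * GRAD_ACCUM * N_GPUS  # 96 sequences
tokens_seen = eff_batch * SEQ_LEN * STEPS        # ~7.9B tokens
epochs = tokens_seen / CORPUS_TOKENS             # ~1.5 epochs

print(eff_batch, tokens_seen / 1e9, epochs)
```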
Phase 1 (steps 0–2,000): S1 layers only. Pondering is disabled; all tokens receive exactly 8 layer passes. Hub self-distillation loss ramps from 0 to full over the first 500 steps. This establishes stable workspace dynamics before introducing adaptive compute. Phase 1 throughput: ~29,000 tokens/second.
Phase 2 (steps 2,000–20,000): Full adaptive pondering enabled. S2 layers iterate up to 5 times with learned halting. The ponder loss ramps in over 1,500 steps (steps 2,000–3,500). The S1 exit loss activates (\(\lambda_{\text{s1}} = 0.1\)), training S1 to produce standalone-quality predictions. Phase 2 throughput: ~16,000 tokens/second.
The total loss is a weighted combination:
\[\mathcal{L} = \mathcal{L}_{\text{LM}} + \lambda_{\text{depth}} \mathcal{L}_{\text{depth}} + \lambda_{\text{distill}} \mathcal{L}_{\text{distill}} + \lambda_{\text{ponder}} \mathcal{L}_{\text{ponder}} + \lambda_{\text{conv}} \mathcal{L}_{\text{conv}} + \lambda_{\text{s1}} \mathcal{L}_{\text{s1\_exit}}\]
where \(\lambda_{\text{depth}} = 0.01\), \(\lambda_{\text{distill}} = 0.1\), \(\lambda_{\text{ponder}} = 0.01\), \(\lambda_{\text{conv}} = 0.001\), and \(\lambda_{\text{s1}} = 0.1\) (Phase 2 only). The depth regularization loss penalizes write magnitudes with a variance term that anneals over training, encouraging uniform contribution across layers.
Training throughput improved from 5.8K to 29K tokens/second (Phase 1) through systematic elimination of CUDA synchronization barriers: setting OMP_NUM_THREADS=8 before imports (eliminating 212 threads per process), removing 8 unnecessary .item() calls per forward pass, returning .detach() tensors in loss dictionaries instead of .item() scalars (eliminating 96 syncs per optimizer step), enabling cudnn.benchmark, and using non_blocking=True on data transfers. Total: ~160 CUDA syncs eliminated per optimizer step.
| Step | Val PPL (pondered) | Val PPL (no ponder) | Pondering Benefit |
|---|---|---|---|
| 2,500 | 60.26 | 68.61 | 12.2% |
| 3,000 | 52.67 | 62.66 | 15.9% |
| 4,500 | 42.82 | 52.31 | 18.1% |
| 6,000 | 38.21 | 47.00 | 18.7% |
| 8,000 | 34.89 | 42.71 | 18.3% |
| 10,000 | 32.91 | 40.17 | 18.1% |
| 12,000 | 31.54 | 38.66 | 18.4% |
| 17,000 | 29.81 | 36.64 | 18.6% |
| 20,000 | 29.54 | 36.33 | 18.7% |
Pondering benefit grows from 12% at step 2,500 (when pondering first activates) to 16% at step 3,000, then stabilizes at 18–19% from step 4,500 onward. This progressive growth indicates that S1/S2 specialization develops over ~2,500 steps rather than appearing instantly, and the stable plateau indicates healthy specialization that does not degrade as both pathways improve.
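The benefit column above is consistent with computing pondering benefit as relative PPL reduction (our assumed definition):

```python
def ponder_benefit(ppl_no_ponder: float, ppl_pondered: float) -> float:
    """Relative PPL reduction from pondering (assumed definition)."""
    return (ppl_no_ponder - ppl_pondered) / ppl_no_ponder

print(f"{ponder_benefit(68.61, 60.26):.1%}")  # step 2,500  → 12.2%
print(f"{ponder_benefit(36.33, 29.54):.1%}")  # step 20,000 → 18.7%
```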
We train two controlled baselines on identical data, tokenizer, hyperparameters, and hardware — standard Llama-style transformers [22] (RoPE [21], SwiGLU [20], RMSNorm [24], weight tying) with identical effective batch size (96 sequences), cosine LR schedule, and the same FineWeb-Edu data/validation splits:
Parameter-matched baseline: 8-layer, d_model=512, ~57.9M parameters. Matches CWT’s total parameter count.
Compute-matched baseline: 13-layer, d_model=512, ~67.5M parameters. Its attention+FFN capacity (41.7M) approximately matches the effective per-token compute CWT expends once pondering is included.
| Model | Total Params | Attn+FFN | Layers | Val PPL |
|---|---|---|---|---|
| CWT v5.6 (pondered) | 57.8M | 22.9M | 8 | 29.54 |
| Parameter-matched baseline | 57.9M | ~32M | 8 | 30.67 |
| CWT v5.6 (no ponder) | 57.8M | 22.9M | 8 | 36.33 |
| Compute-matched baseline | 67.5M | 41.7M | 13 | 29.04 |
| SmolLM2-70M [2] | 69M | — | 32 | 37.72* |
*SmolLM2-70M trained on 30.6B tokens (6× more data); different validation protocol.
CWT achieves several notable results. Against the parameter-matched baseline (same total parameters, same layer count), CWT wins by 3.7% despite having fewer attention+FFN parameters — the workspace overhead pays for itself and then some. Against the compute-matched baseline (82% more attention+FFN capacity, 5 more layers), CWT comes within 1.7% in PPL (29.54 vs 29.04), demonstrating that the workspace makes 22.9M compute parameters do nearly the work of 41.7M. CWT's S1-only path (PPL 36.33) outperforms SmolLM2-70M (PPL 37.72) with 6× fewer training tokens.
The central finding of the baseline comparisons is that CWT achieves its quality not through total parameter count but through compute efficiency — the workspace infrastructure makes each attention+FFN parameter substantially more productive.
Parameter allocation comparison (CWT vs compute-matched baseline):
| Component | 13-Layer Baseline | CWT (8-layer) |
|---|---|---|
| Embedding (tied) | 25.7M | 25.7M |
| Attention | 13.6M | 5.6M |
| FFN | 28.1M | 17.3M |
| Core compute subtotal | 41.7M | 22.9M |
| Workspace I/O (read/write proj.) | — | 3.6M |
| Output decoder FFNs | — | 4.3M |
| Embedding→state projection | — | 0.5M |
| Subspace collapse | — | 0.4M |
| Workspace infra (gates, tags, norms) | — | 0.1M |
| Epistemic modules | — | 0.3M |
| Norms + misc | 0.1M | — |
| Workspace overhead subtotal | — | ~9.2M |
CWT dedicates 22.9M parameters to attention and FFN — the components that perform actual sequence modeling computation. The compute-matched baseline dedicates 41.7M to the same components. CWT’s workspace infrastructure costs ~9.2M parameters but makes each remaining compute parameter substantially more effective, achieving near-equivalent perplexity (within 1.7%) with a 45% smaller core compute budget.
This reframes the architecture’s value proposition: the workspace provides structural addressing that standard transformers must learn implicitly, and this implicit routing consumes a significant fraction of a standard transformer’s parameter budget. CWT’s explicit structural addressing costs ~16% of total parameters but recovers more than it costs.
Scaling implication: The workspace overhead (~9.2M) is roughly fixed regardless of depth, while attention+FFN budget grows linearly with layers. At 130M total parameters, the overhead drops from ~16% to ~5–7%. At 300M+ it becomes noise. CWT’s efficiency advantage should therefore increase with scale.
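The overhead fractions quoted here follow from a simple ratio, assuming the ~9.2M overhead stays fixed as the model grows:

```python
OVERHEAD = 9.2e6  # workspace infrastructure, assumed roughly depth-independent

fractions = {total: OVERHEAD / total for total in (57.8e6, 130e6, 300e6)}
for total, frac in fractions.items():
    print(f"{total / 1e6:.0f}M total -> {frac:.1%} overhead")
```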
We evaluate CWT and both baselines on standard zero-shot benchmarks using lm-evaluation-harness [10]:
| Benchmark | CWT (no ponder) | CWT (pondered) | Param-Matched (8L) | Compute-Matched (13L) |
|---|---|---|---|---|
| ARC-Easy (acc) | 45.2 | 45.6 | 44.5 | 45.9 |
| ARC-Challenge (acc_norm) | 22.9 | 23.0 | 23.6 | 23.7 |
| BoolQ | 56.5 | 58.8 | 55.2 | 58.6 |
| CommonsenseQA | 19.7 | 19.6 | 19.6 | 19.6 |
| COPA | 62.0 | 60.0 | 60.0 | 57.0 |
| HellaSwag (acc_norm) | 27.3 | 27.3 | 27.3 | 27.7 |
| LAMBADA (ppl) | 878.0 | 578.2 | 647.4 | 562.3 |
| OpenBookQA (acc_norm) | 29.0 | 29.8 | 28.8 | 29.6 |
| PIQA (acc) | 58.9 | 59.8 | 59.5 | 59.2 |
| SciQ (acc) | 66.7 | 65.5 | 66.8 | 70.2 |
| WinoGrande | 51.1 | 49.5 | 51.3 | 52.9 |
At 57–58M parameters with zero-shot evaluation, most benchmarks are near random chance (CommonsenseQA: 19.7% vs 20% random, WinoGrande: ~51% vs 50% random) and differences are within standard error margins. We include these results for completeness but note that zero-shot benchmarks at this scale lack the statistical power to meaningfully differentiate architectures. The perplexity comparison on held-out data from the training distribution (Section 4.2) provides a more reliable signal.
Two observations merit noting. Pondering provides the largest gains on context-dependent tasks: LAMBADA perplexity improves 34% (878 → 578) and BoolQ accuracy improves 2.3 points, both tasks that require integrating passage-level context to select answers. Conversely, pondering slightly hurts on pattern-matching tasks (SciQ -1.2, WinoGrande -1.6), suggesting that the halt head’s notion of “which tokens need more processing” is calibrated for language modeling loss on educational text rather than for multiple-choice reasoning. The deployment compute tradeoffs in Section 5.5 provide knobs to address this.
The high LAMBADA perplexity across all models (~578–878) reflects the training data distribution: CWT and both baselines were trained exclusively on educational text and have never seen the narrative prose that LAMBADA requires. This is a data limitation, not an architectural one.
We perform a comprehensive ablation suite of 30+ interventions on the step 20,000 checkpoint (baseline PPL 28.49). Ablations are organized into four tiers by impact severity.
| Ablation | PPL | Degradation |
|---|---|---|
| Max Amnesia (zero all memory between layers) | 2,668.20 | +9,263% |
| Zero Hub Shared (clear central hub) | 2,341.01 | +8,114% |
| Zero Tags (remove identity markers) | 1,279.94 | +4,392% |
| Shuffle Tags (randomize tag positions) | 515.40 | +1,709% |
| Skip Decoder FFN (remove output processing) | 161.72 | +468% |
The hub shared region and tags are existentially critical. Removing either produces catastrophic failure, confirming that the workspace architecture is not merely a wrapper around standard attention — the structural addressing is load-bearing. The gap between Zero Tags (+4,392%) and Shuffle Tags (+1,709%) shows that tag presence matters, but correct tag-to-layer assignment nearly triples the benefit.
| Ablation | PPL | Degradation |
|---|---|---|
| Noise Hub (Gaussian noise injection) | 274.05 | +862% |
| Hub Writes ×2 (double write magnitudes) | 184.28 | +547% |
| Zero Billboards (clear broadcast channels) | 78.75 | +177% |
| Zero Designator Signature (remove layer identity) | 62.42 | +119% |
Hub Writes ×2 increased from +76% at step 8,000 to +547% at step 20,000. This reveals that the hub develops precise calibration during training — each dimension carries specific information at carefully tuned magnitudes. Doubling all writes at convergence is catastrophic, while early in training the model can partially tolerate it. This has direct implications for scaling: hub write magnitudes become an increasingly tight invariant that the model’s computation relies upon.
| Ablation | PPL | Degradation |
|---|---|---|
| Zero Spokes (clear private memory) | 48.44 | +70% |
| Dead Gates (set all gates to 1.0) | 43.80 | +54% |
| Hub Writes ×0.5 | 39.14 | +37% |
| Freeze Hub During S2 | 37.02 | +30% |
| Kill System 2 (skip S2 entirely) | 36.75 | +29% |
| No Soft Ponder Lock | 34.99 | +23% |
| Kill Pondering (run S2 once) | 34.82 | +22% |
| Max Pondering (force all 5 steps) | 33.74 | +18% |
Kill Pondering (+22%) versus Max Pondering (+18%) confirms that the learned halt distribution adds value — the weighted combination outperforms both no pondering and forced maximum pondering. The halt head is making useful per-token allocation decisions rather than simply averaging across all steps.
| Ablation | PPL | Degradation |
|---|---|---|
| Collapse Noise | 31.70 | +11% |
| Inverted Epistemic Gate | 31.06 | +9.0% |
| Zero Convergence Head | 28.49 | 0.0% |
| Epistemic Gate | 28.50 | +0.04% |
The epistemic gate and convergence head have no measurable impact on language modeling quality. This is by design — they are monitoring infrastructure, not core computation components. The convergence head accurately predicts hub delta norms (Pearson correlation 0.72 at step 20K) but this prediction is only useful at inference for early stopping decisions, not for training quality. The clean separation between load-bearing structure (Tiers 1–3) and dispensable instrumentation (Tier 4) validates the architecture’s modularity.
| Mode | Effective Layers | PPL | Relative Compute |
|---|---|---|---|
| No ponder | 8 | 34.82 | 1.0× |
| Low effort (1 step) | 10 | 33.09 | 1.25× |
| Medium effort (2 steps) | 12 | 31.14 | 1.5× |
| Full (5 steps) | 18 | 28.49 | 2.25× |
The architecture provides a smooth compute/quality tradeoff at deployment time. Medium effort (2 ponder steps) achieves PPL 31.14 at only 1.5× the compute of the no-ponder path. The parameter-matched baseline transformer offers no such flexibility — it always requires its fixed layer count with no ability to trade compute for quality at inference time.
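The relative-compute column follows from counting layer passes, reading the table as 8 base passes plus 2 S2 passes per ponder step:

```python
def effective_layers(ponder_steps: int) -> int:
    """6 S1 passes + 2 S2 passes, plus 2 more S2 passes per ponder step."""
    return 8 + 2 * ponder_steps

for steps in (0, 1, 2, 5):
    layers = effective_layers(steps)
    print(steps, layers, layers / 8)  # relative compute vs the no-ponder path
```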
Rather than training auxiliary classifiers for uncertainty estimation, CWT derives epistemic signals directly from hub dynamics. At each layer, the hub delta norm — the magnitude of change to the hub shared region — provides a natural measure of how much the model is still revising its understanding:
\[\delta_n = \| \text{hub}_{\text{after\_layer\_n}} - \text{hub}_{\text{before\_layer\_n}} \|_2\]
A DeltaNormalizer module tracks running statistics of delta norms via exponential moving average and maps raw deltas to calibrated uncertainty scores through sigmoid normalization:
\[u = \sigma\left(\frac{\delta - \mu_{\text{EMA}}}{\sigma_{\text{EMA}}}\right)\]
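A buffer-only sketch of this normalizer; the decay constant and epsilon are assumptions, not values from the implementation:

```python
import math

class DeltaNormalizer:
    """EMA-based normalizer (pure buffers, no learned parameters).
    Illustrative sketch; decay and eps are assumed hyperparameters."""

    def __init__(self, decay: float = 0.99, eps: float = 1e-6):
        self.decay, self.eps = decay, eps
        self.mu = None   # EMA of delta norms
        self.var = None  # EMA of squared deviations

    def update(self, delta: float) -> float:
        if self.mu is None:  # cold start from the first observed delta
            self.mu, self.var = delta, 0.0
        else:
            self.mu = self.decay * self.mu + (1 - self.decay) * delta
            dev = (delta - self.mu) ** 2
            self.var = self.decay * self.var + (1 - self.decay) * dev
        z = (delta - self.mu) / (math.sqrt(self.var) + self.eps)
        return 1.0 / (1.0 + math.exp(-z))  # sigmoid -> score in (0, 1)

norm = DeltaNormalizer()
for d in (7.0, 8.0, 9.0, 2.0):  # early deltas high, then a small one
    u = norm.update(d)
# a delta well below the running mean maps below 0.5 (low uncertainty)
print(u)
```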
This self-calibrating approach produces honest uncertainty from step 1 of training. The normalizer uses pure buffers (no learned parameters), cold-starts from the first batch, and requires no hyperparameter tuning.
At step 20,000, the hub delta-based classification correctly differentiates content types:
| Category | Mean Non-Convergence | Unresolvable % |
|---|---|---|
| In-distribution (educational) | 2.26 | 37% |
| Domain gradient (easy → hard) | 2.23 | 24% |
| Edge cases (mixed) | 2.32 | 32% |
| Out-of-distribution | 2.51 | 61% |
Internet slang (“lmao bruh fr fr no cap”) produces the highest unresolvable rate at 90%, correctly reflecting the model’s inability to process this content after training exclusively on educational text. The temporal edge case “According to recent studies in 2025” dropped from 100% unresolvable at step 8,000 to 28% at step 20,000 — the model correctly learned that this is a common format pattern in FineWeb-Edu, even though the specific date is unfamiliar.
An earlier version (v5.4) used vocabulary-projection probes at each layer to compute KL divergence between consecutive layers’ predictions. This approach suffered from two fundamental problems:
False early convergence: At step 25, all layers predicted near-identical random distributions over the vocabulary, producing low KL divergence. The probe reported “convergent” when the model was actually confidently clueless. Hub deltas of 7–9 at the same step correctly report high uncertainty.
Prohibitive compute cost: Each probed layer required a matmul against the full vocabulary (\(512 \times 50{,}257 \approx 25\)M elements), plus softmax and KL over 50K logits. Replacing probes with hub self-distillation (hub → 128 bottleneck → hub MSE) yielded a 250× reduction in supervision compute and a 4.5× throughput improvement.
We evaluate CWT and the parameter-matched baseline on contexts longer than the 4,096-token training length using four RoPE [21] scaling strategies: direct (no modification), linear position interpolation, NTK frequency scaling [5], and YaRN [17].
| Context | CWT (pondered) | CWT Degradation | Baseline | Baseline Degradation |
|---|---|---|---|---|
| 4,096 (1.0×) | 28.48 | baseline | 29.01 | baseline |
| 5,120 (1.25×) | 29.68 | +4.2% | 29.97 | +3.3% |
| 6,144 (1.5×) | 28.69 | +0.7% | 29.06 | +0.2% |
| 8,192 (2.0×) | 31.29 | +9.9% | 31.97 | +10.2% |
| 12,288 (3.0×) | 42.12 | +47.9% | 39.94 | +37.7% |
| 16,384 (4.0×) | 90.72 | +218.5% | 61.15 | +110.8% |
Both architectures degrade at comparable rates up to 2× training length, with CWT maintaining a slight absolute PPL advantage inherited from its lower starting point. Beyond 3×, the baseline degrades more gracefully — CWT’s pondering loop becomes a liability at extreme extrapolation lengths where positional encoding degradation corrupts the iterative refinement process.
The practical takeaway is that long-context extrapolation behavior is primarily determined by the RoPE scaling strategy rather than the model architecture. Both models achieve under 10% degradation at 2× with NTK scaling, confirming that 50–100% context extension is viable for deployment without fine-tuning.
The relationship between pondering benefit and context length is strategy-dependent:
| Context | NTK Ponder Benefit | YaRN Ponder Benefit | Direct Ponder Benefit |
|---|---|---|---|
| 4,096 (1.0×) | 18.7% | — | 18.7% |
| 5,120 (1.25×) | 19.6% | 18.8% | 20.2% |
| 6,144 (1.5×) | 20.1% | 18.2% | 12.0% |
| 8,192 (2.0×) | 20.7% | 13.8% | 0.8% |
| 12,288 (3.0×) | 28.4% | 8.4% | 1.2% |
| 16,384 (4.0×) | 17.1% | 5.0% | -11.6% |
Under NTK scaling, pondering benefit increases with context length up to 3×, peaking at 28.4%. NTK’s preservation of local attention patterns provides stable enough representations for the S2 ponder loop to perform useful iterative refinement on longer contexts. Under YaRN and direct extrapolation, pondering benefit decreases with context length — the positional encoding degradation corrupts the representations that the ponder loop operates on, making additional iterations counterproductive.
This has practical implications: the choice of RoPE scaling strategy determines not only base extrapolation quality but also whether adaptive compute remains beneficial at extended lengths. NTK scaling is recommended for CWT deployments that require both long-context support and full pondering.
The structured workspace enables direct 3D visualization of internal processing states — an interpretability capability that standard transformers lack because their residual stream is an opaque superposition of all information.
We record hub shared state vectors at every layer pass during generation using a SQLite-backed workspace state recorder, then apply UMAP [16] dimensionality reduction to project 256-dimensional hub states into 3D space. Each point represents the hub state at one (token, layer) position, colored by layer index (purple = early S1, yellow = late S2). Interactive Plotly visualizations allow rotation and inspection of trajectories.
For the prompt “The process of photosynthesis involves,” the hub trajectory shows a clear three-phase structure:
Foundation (layers 0–2, purple): All tokens cluster in a tight central region. Early S1 layers build a shared contextual representation.
Differentiation (layers 3–5, teal/green): Trajectories branch outward as each token’s representation diverges based on semantic content.
Convergence (layers 6–7, yellow): Tokens settle into distinct, tightly clustered final positions. S2 pondering produces small refinements around settled representations.
Figure 8.1 — 3D UMAP hub trajectory for "The process of photosynthesis involves…". Purple = early S1 layers; yellow = late S2 pondering. Drag to rotate.
Figure 8.2 — Animated topology showing hub state evolution across all layer passes.
Diagnostic panels: workspace region activity; hub delta norms (epistemic signals); inter-layer hub state similarity; decay gate values per layer; gate selectivity patterns; S2 ponder oscillation dynamics; hub write magnitudes per layer; layer contribution ranking.
For the OOD prompt “hey bud no cap fo real fo real,” the visualization reveals representational collapse: no central foundation cluster forms, layer trajectories are compressed to near-points (minimal hub modification per layer), S2 oscillation fires mechanically but produces minimal hub changes, and yellow S2 clusters show poor separation with high cross-position similarity (0.64 vs 0.35 for in-distribution).
Figure 8.3 — 3D UMAP hub trajectory for "hey bud no cap fo real fo real". Note the compressed, near-point trajectories vs. the rich branching in Figure 8.1.
The corresponding OOD diagnostic panels show workspace region activity, persistently high hub delta norms (sustained uncertainty), decay gate values, mechanical S2 ponder oscillation with minimal refinement, hub write magnitudes, and hub delta comparisons across ponder steps.
Overlaying the in-distribution and OOD hub trajectories in a shared UMAP projection makes the contrast explicit: the model builds rich, differentiated representations for familiar content and collapses to undifferentiated states for unfamiliar content, a spatial manifestation of the epistemic uncertainty signals measured in Section 6.
Figure 8.4 — Overlaid UMAP projection of in-distribution (photosynthesis) and OOD (internet slang) hub trajectories.
The controlled baseline comparisons reveal that CWT’s value is best understood through compute efficiency rather than total parameter counts. Against the parameter-matched baseline (same total params, same layers), CWT wins by 3.7% in PPL despite having fewer attention+FFN parameters. Against the compute-matched baseline (82% more attention+FFN), CWT comes within 1.7% PPL (29.54 vs 29.04).
The architectural insight is that standard transformers spend a significant fraction of their parameters on implicit routing — learning to demultiplex a superposed residual stream, determining what information to read, where to write, and what to forget. CWT handles this at fixed cost (~9.2M) through explicit structure (tags identify writers, spokes guarantee privacy, decay gates manage persistence), freeing the remaining parameters for semantic computation.
The more informative metric for evaluating workspace architectures is PPL vs core compute parameters (attention + FFN), on which CWT is nearly 2× more efficient than a standard transformer. The conventional comparison (PPL vs total parameters) penalizes CWT for structural addressing overhead that actively improves parameter efficiency.
The current output decoder contains two full SwiGLU FFN layers consuming ~4.3M parameters — equivalent to 2 additional ExpertBlocks. These parameters cannot perform attention, cannot read the workspace, and cannot interact with context. They are the least efficient parameters in the model.
The baseline transformers use only RMSNorm → tied lm_head after their final layer. Since CWT’s SubspaceCollapse already projects workspace state back to d_model, a minimal decoder (RMSNorm → tied lm_head) would free ~4.3M parameters — enough to fund 1–2 additional ExpertBlocks with attention and workspace access.
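A minimal sketch of that slim decoder in PyTorch, assuming the hidden states arriving from SubspaceCollapse are already in d_model. Module names and dimensions are illustrative, and RMSNorm is written out explicitly for portability; the key point is that weight tying makes the output head cost zero additional parameters.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class SlimDecoder(nn.Module):
    """RMSNorm -> tied lm_head, matching the baselines' output path."""

    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.norm = RMSNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight  # tying: no extra parameters

    def forward(self, hidden):
        # hidden: (batch, seq, d_model), e.g. SubspaceCollapse output
        return self.lm_head(self.norm(hidden))

dec = SlimDecoder(d_model=512, vocab_size=32000)
logits = dec(torch.randn(2, 16, 512))
print(logits.shape)  # torch.Size([2, 16, 32000])
```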
More aggressively, reallocating both the decoder savings and reducing spoke dimensions from 48d to 32d (supported by the relatively mild +70% ablation for Zero Spokes) could fund 4 additional layers. A 12-layer CWT (10 S1 + 2 S2) with a slim decoder would achieve compute parity with the 13-layer baseline while retaining all workspace benefits and pondering capability. Given that CWT already matches the 13-layer baseline at 55% core compute, achieving compute parity should produce a decisive win.
The Hub Writes ×2 ablation revealed that the hub becomes precisely calibrated during training — degradation increased from +76% at step 8,000 to +547% at step 20,000. This has direct implications for scaling CWT to larger models. Hub shared dimensions should grow sub-linearly with model depth, as each additional layer adds write pressure to the shared channel. Write magnitudes may require learnable per-layer scaling at larger sizes to maintain calibration. The decay gate mechanism naturally handles increasing write pressure from more layers, but the calibration sensitivity suggests that gate initialization and bias values become increasingly important at scale.
The transition from probe-based to hub delta-based epistemic signals represents a principled shift from “what does the model predict” to “how much is the model still changing.” The former requires auxiliary classifiers that can be dishonest (producing false convergence on random inputs). The latter directly measures the model’s internal dynamics and is honest by construction — a model that is uncertain will have large hub deltas because its layers are actively revising the representation.
This foundation enables a future phase where a lightweight epistemic gate is trained post-hoc against frozen model weights, using the calibrated delta norms as supervision signal. The gate would learn to scale output logits based on genuine model uncertainty, enabling deployment-time confidence calibration that is architecturally grounded rather than heuristic.
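One simple form such a gate could take is a delta-conditioned temperature: logits are left untouched when the final ponder step's hub delta sits at its calibrated reference, and flattened toward uniform as deltas grow. The linear parameterization and its coefficients below are assumptions for illustration only; the paper leaves the gate's form to future work.

```python
import numpy as np

def epistemic_gate(logits, delta_norm, delta_ref=1.8, alpha=0.5):
    """Temper logits toward uniform when hub deltas stay large.

    delta_ref: calibrated delta norm for converged, confident states.
    alpha: sensitivity coefficient (fixed here; learned in the proposal).
    """
    excess = max(0.0, delta_norm - delta_ref)       # uncertainty beyond baseline
    temperature = 1.0 + alpha * excess
    return logits / temperature

logits = np.array([4.0, 1.0, 0.5])
confident = epistemic_gate(logits, delta_norm=1.8)  # at reference: unchanged
uncertain = epistemic_gate(logits, delta_norm=3.8)  # high delta: flattened
print(confident.max(), uncertain.max())
```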
Evaluation scope: Zero-shot benchmarks at 57–58M parameters are near random chance on most tasks and lack statistical power to differentiate architectures. Perplexity on held-out data provides the most reliable comparison at this scale. Larger-scale training (350M+) is needed for meaningful benchmark discrimination.
Scale: We demonstrate CWT at 57.8M parameters only. The architectural advantages (structural addressing, adaptive compute) are theorized to increase with scale as workspace overhead becomes proportionally smaller, but this requires experimental validation at 130M+ parameters.
Learning rate schedule: The current cosine decay schedule leaves learning budget on the table. A warmup-stable-decay (WSD) schedule — holding at peak LR for steps 1,000–14,000, then cosine decaying to \(10^{-5}\) — would provide approximately 1.7× more effective high-LR training, with projected improvement of 2–3 PPL points.
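The proposed schedule can be sketched directly from those numbers: linear warmup over the first 1,000 steps, a stable phase at peak through step 14,000, then cosine decay to 1e-5. The peak LR and total step count below are assumptions chosen for illustration.

```python
import math

PEAK_LR, FLOOR_LR = 3e-4, 1e-5          # peak is an assumed value
WARMUP, HOLD_END, TOTAL = 1_000, 14_000, 20_000

def wsd_lr(step: int) -> float:
    """Warmup-stable-decay learning rate at a given training step."""
    if step < WARMUP:                    # linear warmup to peak
        return PEAK_LR * step / WARMUP
    if step <= HOLD_END:                 # stable phase at peak LR
        return PEAK_LR
    progress = (step - HOLD_END) / (TOTAL - HOLD_END)
    cos = 0.5 * (1 + math.cos(math.pi * progress))  # cosine decay to floor
    return FLOOR_LR + (PEAK_LR - FLOOR_LR) * cos

print(wsd_lr(500), wsd_lr(8_000), wsd_lr(20_000))
```

Compared with cosine decay from step 1,000 onward, the stable phase keeps the model at peak LR for roughly 1.7× as many steps, which is the source of the projected gain.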
Inference cost: CWT with full pondering uses up to 18 effective layer passes vs the 8-layer baseline’s 8 passes. The compute-efficiency argument holds on a per-parameter basis but not on a per-FLOP basis at full pondering depth. The no-ponder and low-effort deployment modes partially address this. The comparison is most favorable when framed as: CWT provides near-equivalent quality to a much larger baseline at equal total parameters, with a smooth inference-time compute dial that standard transformers cannot offer.
Decoder overhead: The current two-FFN output decoder consumes ~4.3M parameters suboptimally. This is a known inefficiency identified during this work and will be addressed in future versions.
Long-context extrapolation: CWT’s pondering loop becomes counterproductive beyond 3× training length under most RoPE scaling strategies. The interaction between adaptive compute and positional encoding extrapolation is an open area for investigation.
Training data distribution: All models were trained exclusively on educational text (FineWeb-Edu). Performance on tasks requiring narrative prose, dialogue, or broad world knowledge (e.g., LAMBADA, CommonsenseQA) reflects data coverage limitations, not architectural ones.
Adaptive Compute Transformers. Universal Transformer (Dehghani et al., 2019) [8] introduced weight sharing across layers with adaptive halting. PonderNet (Banino et al., 2021) [4] extended this with differentiable halting probabilities. CWT combines adaptive halting with structured workspace, giving each iteration a qualitatively different function (via evolving hub state and decay gate dynamics) rather than simply repeating the same computation on the same residual stream.
Memory-Augmented Neural Networks. Neural Turing Machines (Graves et al., 2014) [12] and Differentiable Neural Computers (Graves et al., 2016) [13] introduced external memory with read/write/erase operations. CWT’s workspace is conceptually similar but implemented as a partitioned state tensor rather than an external memory — the layers themselves serve as read/write controllers through the decay gate mechanism, and the structured partitioning (spokes, hub, tags) provides architectural guarantees that external memory schemes must learn.
Global Workspace Theory. Baars' (1988) [3] cognitive architecture posits a shared broadcast medium where specialized processors compete for access. CWT’s hub shared region implements this directly — all layers can read, but writes are mediated by gates that evaluate content relevance. The billboard regions provide guaranteed broadcast channels. The workspace visualization (Section 8) provides empirical evidence of the global workspace dynamics.
State Space Models. Mamba (Gu & Dao, 2023) [14] and related architectures learn structured state management across sequence positions. CWT manages state across layers rather than positions, but shares the principle of gated state updates with learned forgetting. The approaches are complementary.
Efficient Attention. Multi-head Latent Attention (DeepSeek, 2024) [7] compresses KV cache through latent projection. CWT-MLA extends this with an architectural privacy constraint that enforces spoke separation while sharing hub content through attention, and derives KV from hub content rather than the full hidden state.
Parameter Efficiency in Transformers. Mixture-of-Experts architectures (Fedus et al., 2022) [9] achieve parameter efficiency through sparse activation. CWT achieves efficiency through structural addressing — all parameters are active for every token, but the workspace eliminates the routing overhead that dense transformers spend significant capacity on. The two approaches are orthogonal and could be combined.
The Cognitive Workspace Transformer demonstrates that replacing the transformer’s undifferentiated residual stream with structured state management yields substantial compute efficiency gains. At 57.8M parameters, CWT beats a parameter-matched standard transformer by 3.7% in perplexity, and comes within 1.7% of a 13-layer baseline that has 82% more core compute capacity — the workspace infrastructure makes each attention+FFN parameter substantially more effective by providing structural addressing that standard transformers must learn implicitly.
Analysis of the current parameter budget identifies clear optimization opportunities: the output decoder consumes ~4.3M parameters without workspace access, and reallocating these to additional layers with attention could produce further significant gains. The workspace overhead (~9.2M) is fixed cost that becomes proportionally smaller at scale, predicting increasing compute efficiency advantages at 130M+ parameters.
The architecture additionally provides capabilities unavailable in standard transformers: smooth inference-time compute/quality tradeoffs (PPL 34.82 at 1.0× compute to PPL 28.49 at 2.25×), honest epistemic self-monitoring through hub delta dynamics, interpretable 3D visualization of internal processing states, and robust long-context extrapolation up to 2× training length with under 10% degradation.
These results suggest that as language models scale, structured state management offers a path toward architectures where explicit addressing replaces learned routing, adaptive depth replaces fixed depth, and internal dynamics provide interpretable uncertainty signals without auxiliary classifiers.
| Layer | Type | MSE at Step 8K | MSE at Step 20K |
|---|---|---|---|
| L0 | S1 | 0.374 | 0.228 |
| L1 | S1 | 0.265 | 0.150 |
| L2 | S1 | 0.210 | 0.118 |
| L3 | S1 | 0.157 | 0.090 |
| L4 | S1 | 0.100 | 0.056 |
| L5 | S1 | 0.050 | 0.024 |
| L6 | S2 | 0.021 | 0.009 |
| L7 | S2 | 0.000 | 0.000 |
Monotonically decreasing error at both checkpoints confirms that each layer progressively approximates the final hub state, validating the deep supervision approach. The error reduction between 8K and 20K (e.g., L0: 0.374 → 0.228) shows that early layers continue improving their hub predictions throughout training.
The convergence head predicts hub delta norms from hub content alone, enabling inference-time convergence estimation without tracking actual deltas. At step 20,000:
| Layer Pass | Actual Delta | Predicted Delta | Absolute Error |
|---|---|---|---|
| L8 (S2-L1, step 0) | 1.805 | 1.841 | 0.036 |
| L10 (S2-L1, step 1) | 1.796 | 1.807 | 0.011 |
| L12 (S2-L1, step 2) | 1.894 | 1.882 | 0.012 |
| L14 (S2-L1, step 3) | 1.915 | 1.908 | 0.007 |
| L16 (S2-L1, step 4) | 1.870 | 1.876 | 0.006 |
| L18 (S2-L1, step 5) | 1.833 | 1.843 | 0.010 |
Pearson correlation: 0.72 across 240 layer-steps. Mean absolute error on S2 layers: 0.014. The convergence head can reliably replace actual delta computation at inference time, enabling cheap early-stopping decisions.
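The early-stopping decision this enables is straightforward: halt pondering once the *predicted* delta falls below a convergence threshold, without ever computing actual deltas. A minimal sketch, where the threshold and the toy prediction sequence are illustrative assumptions:

```python
def ponder_steps_taken(predicted_deltas, threshold, max_steps=6):
    """Return how many S2 ponder steps run before predicted convergence."""
    for step, delta in enumerate(predicted_deltas[:max_steps]):
        if delta < threshold:
            return step + 1          # run this step, then halt
    return max_steps                 # hit the ponder cap

preds = [1.84, 1.81, 1.88, 1.91, 1.88, 1.84]   # shaped like the table above
print(ponder_steps_taken(preds, threshold=1.82))  # 2
```

With a mean absolute error of 0.014 on S2 layers, a threshold-based rule on predicted deltas makes nearly the same halting decisions as one on actual deltas.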
Full NTK degradation comparison between CWT and the parameter-matched baseline:
| Context | CWT Pondered | CWT No-Ponder | Baseline | CWT NTK Degrad. | Baseline NTK Degrad. |
|---|---|---|---|---|---|
| 4,096 (1.0×) | 28.48 | 35.03 | 29.01 | (ref.) | (ref.) |
| 6,144 (1.5×) | 28.69 | 35.91 | 29.06 | +0.7% | +0.2% |
| 8,192 (2.0×) | 31.29 | 39.48 | 31.97 | +9.9% | +10.2% |
| 12,288 (3.0×) | 42.12 | 58.85 | 39.94 | +47.9% | +37.7% |
| 16,384 (4.0×) | 90.72 | 109.43 | 61.15 | +218.5% | +110.8% |
Both architectures degrade at comparable rates up to 2× training length. Beyond 3×, the baseline degrades more gracefully, as CWT’s pondering loop amplifies positional encoding errors across iterations.
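For reference, the NTK scaling used in these extrapolation runs follows the standard NTK-aware adjustment, which raises the rotary base so low frequencies stretch across the longer context. The head dimension and base value below are illustrative assumptions, not CWT's actual configuration.

```python
def ntk_scaled_base(base: float, scale: float, head_dim: int) -> float:
    """Standard NTK-aware RoPE adjustment: base' = base * scale^(d / (d - 2))."""
    return base * scale ** (head_dim / (head_dim - 2))

# Extending a 4,096-token training context to 8,192 tokens (2.0x):
print(ntk_scaled_base(10_000.0, scale=2.0, head_dim=64))
```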
The full ablation suite includes 30+ interventions covering workspace structure, pondering dynamics, epistemic monitoring, output processing, and deployment configurations. See Section 5 for the tier-organized presentation. Key temporal dynamics: Hub Writes ×2 degradation increased from +76% (step 8K) to +547% (step 20K), demonstrating that ablation sensitivity is not static — workspace components become more critical as the model develops precise calibration over training.