
The Narrative Compiler Framework: Fixing LLM Hallucination & Tokenomics


Chapter 1: Setting the Stage - The Deloitte AI Scandal

In December 2024, the Australian government commissioned a report from Deloitte at a cost of about US$290,000 (AU$440,000). The delivered report appeared complete and professionally written but contained fabricated material throughout. Several citations referred to sources that do not exist, some quotations were attributed to judges who never made them, and multiple references pointed to academic work that cannot be found in any database. The content was generated using GPT-4o and delivered to the client without these issues being identified during internal review. The problems were discovered by a university researcher after the report had already been submitted, which led Deloitte to issue a corrected version and return the final payment.

The failure originates from how current systems handle data-to-text generation. A single prompt is expected to read structured data, compute derived values, apply classification logic, organize content, and produce readable prose while preserving exact numerical and factual accuracy. These steps require different forms of reasoning, yet they are executed inside one probabilistic generation process without separation or verification between them. The result is text that is coherent at the surface level but unreliable when examined against the underlying data.

This is a scaling problem rather than a one-off mistake. When document production relies on this approach, teams must allocate time to verify outputs, reconcile inconsistencies, and correct numerical or factual errors. As volume increases, the cost of review grows in proportion, often offsetting the time saved during generation. Attempts to improve reliability by adding more prompts or introducing agent-based workflows tend to repeat the same operations more often without establishing a stable mechanism for verification.

The approach presented in this series replaces that structure with a defined pipeline in which data processing, classification, generation, and validation are separated into distinct stages. Each stage has a fixed role, and outputs from earlier stages are treated as immutable inputs for later ones. The model is limited to producing language from already verified inputs rather than participating in computation or decision-making about the data itself.

Chapter 2: Why Agents, MCP, and RAG Fail for Data-to-Text

The current default approach to generating documents from data combines agents, multi-chain prompting (MCP), and retrieval-augmented generation (RAG). These methods are often grouped together in practice, and they share the same structural issue: the model repeatedly interprets and transforms the same data without a fixed, verifiable intermediate state.

Start with agent workflows. A typical setup assigns roles such as writer, reviewer, and editor. Each role operates on text produced by the previous step while also referencing the original data. The data is not processed once and stored as a stable representation; it is re-read and reinterpreted at every stage. Derived values are recomputed multiple times, sometimes with small differences. The final document depends on a chain of generated text rather than a single transformation from source data. When a number is incorrect, there is no clear point in the process where the error can be isolated, because each stage mixes interpretation with generation.

Multi-chain prompting attempts to impose order by splitting the task into explicit steps within a single workflow. One step extracts information, another computes metrics, another organizes structure, and a final step generates the document. This looks closer to a pipeline, but the boundaries are not enforced. Each step still depends on the model to preserve exact values from the previous step. Intermediate outputs remain probabilistic. A value that is slightly altered during extraction will be used as input for all subsequent steps. The system accumulates small inconsistencies rather than preventing them.

Retrieval-augmented generation changes how data is accessed, not how it is processed. Relevant documents or records are retrieved and inserted into the prompt. The model then reads and synthesizes them. For data-to-text tasks, this means that the model is responsible for selecting, combining, and expressing values from retrieved sources. If multiple sources contain overlapping or conflicting information, the model resolves them implicitly during generation. There is no requirement that the output match any single source exactly. Retrieval improves coverage but does not enforce consistency.

These methods are often combined. A system may retrieve data, process it through multiple prompting steps, and coordinate the process with agents. The number of transformations applied to the same data increases. Each transformation introduces another opportunity for deviation. Token usage grows because the same information is processed repeatedly. The final output reflects a sequence of interpretations rather than a controlled mapping from input to output.

Data-to-text generation requires a different structure. Numerical values must remain exact. Classifications must follow defined rules. Every statement must be traceable to a source. These requirements assume that data is processed once, stored in a stable form, and then used consistently throughout the pipeline. Agents, MCP, and RAG do not provide this property because they rely on iterative interpretation.

They remain useful in earlier stages where the goal is to gather information, explore alternatives, or synthesize unstructured inputs. In those contexts, variation is acceptable and often necessary. Once the data is fixed and the task is to produce a document that must align exactly with that data, the process must shift to a deterministic pipeline where computation, classification, and generation are separated and verified.

Chapter 3: Prior Art and Pipeline Structure

The problem of translating structured input into structured output has been addressed in other domains through staged processing. Compiler design separates parsing, semantic analysis, transformation, and code generation into distinct phases, each operating on well-defined representations. Natural language generation research formalized a similar sequence, separating content selection, organization, lexical choice, and surface realization. These designs isolate responsibilities and prevent later stages from altering the assumptions established earlier in the pipeline.

End-to-end neural generation replaced these staged systems with a single model that maps input directly to output. This removes explicit intermediate representations and shifts all responsibilities into one probabilistic process. While this simplifies implementation, it removes the boundaries that make verification and auditing feasible. When a model both computes values and expresses them, there is no clear point at which correctness can be enforced.

A staged approach restores those boundaries. Data is transformed into a set of derived values using deterministic computation. These values are then mapped to semantic categories using explicit rules. Only after these steps are complete is text generated, and the generation step is constrained to use the prepared inputs. A final validation stage compares the generated text against the deterministic outputs to detect discrepancies.

This structure ensures that computation, classification, and expression are handled independently. The model is not responsible for deriving facts, only for expressing them. Each stage produces artifacts that can be inspected, tested, and reused.

The framework operates as a directed sequence of transformations from input data to validated text. Each layer has a defined input and output, and data flows forward without feedback into earlier stages.

The input layer accepts structured records or extracts them from unstructured sources into a predefined schema. When extraction is required, it is limited to identifying and normalizing explicit facts without inference or aggregation. The goal is to produce a stable, typed representation of the data that downstream stages can consume.

The feature layer performs deterministic computation. This includes arithmetic operations, aggregations, formatting, and lookups. The implementation can use SQL, Python, or any environment that produces consistent outputs for identical inputs. Results from this layer are cacheable and reusable, since they depend only on the input data.

The semantic layer applies rule-based classification to the computed features. Rules encode domain definitions such as thresholds, categories, or states. These rules are externalized as data so they can be modified without changing application code. The output of this layer is a set of labels or tags that describe the state of the input according to business logic.

The generation layer receives the original inputs, computed features, and semantic tags. The prompt specifies exactly which values must be included and prohibits the introduction of additional facts. Structured output constraints restrict the format of the response. The model converts the provided values into text without performing new calculations or introducing new data.

The validation layer inspects the generated text and compares it against the outputs of the feature and semantic layers. Numeric values, percentages, and categorical statements are extracted and checked for agreement. Any mismatch results in rejection or routing to review. No document proceeds without passing this reconciliation step.

This sequence enforces separation between computation, interpretation, and expression. It also creates a complete lineage from each statement in the text back to a deterministic source.
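
As a rough illustration of the feature, semantic, and validation layers described above, the following Python sketch computes one derived value, classifies it with rules held as data, and checks a sentence against the computed values. The record fields, thresholds, labels, and function names are hypothetical and chosen only for this example. The generation step is omitted; a hand-written sentence stands in for model output.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class QuarterlyRecord:
    """Hypothetical typed input record (illustrative field names)."""
    revenue_current: float
    revenue_previous: float


def compute_features(rec: QuarterlyRecord) -> dict:
    """Feature layer: deterministic arithmetic, identical output for identical input."""
    growth = (rec.revenue_current - rec.revenue_previous) / rec.revenue_previous * 100
    return {"revenue_current": rec.revenue_current, "growth_pct": round(growth, 1)}


# Semantic layer: rules kept as data (they could be loaded from a JSON or YAML file).
# Rules are evaluated top to bottom and the first match wins. Thresholds are made up.
GROWTH_RULES = [
    {"min": 10.0, "label": "strong growth"},
    {"min": 0.0, "label": "modest growth"},
    {"min": float("-inf"), "label": "decline"},
]


def classify(features: dict, rules: list) -> str:
    for rule in rules:
        if features["growth_pct"] >= rule["min"]:
            return rule["label"]
    raise ValueError("no rule matched")


# Simplified number pattern: does not handle thousands separators or dates.
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def validate(text: str, features: dict, label: str) -> bool:
    """Validation layer: every number in the text must equal a computed
    feature value, and the semantic label must appear verbatim."""
    allowed = {float(v) for v in features.values()}
    found = [float(m) for m in NUMBER.findall(text)]
    unknown = [x for x in found if x not in allowed]
    return not unknown and label in text


record = QuarterlyRecord(revenue_current=1200.0, revenue_previous=1000.0)
features = compute_features(record)
label = classify(features, GROWTH_RULES)

good = "Revenue reached 1200 with strong growth of 20.0%."
bad = "Revenue reached 1200 with strong growth of 25%."
print(validate(good, features, label))
print(validate(bad, features, label))
```

The second sentence is rejected because 25 does not correspond to any computed feature. A production validator would need a more careful extractor, but the comparison target stays the same: the deterministic outputs, not the raw input.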

Chapter 4: Tokenomics & Failure

Token usage in direct generation scales with both input size and document count. When identical datasets are used repeatedly, the same information is reintroduced into prompts and reprocessed each time. This creates redundancy across runs.

A staged pipeline changes this behavior by separating computation from generation. Feature computation runs once per dataset. The results are stored and reused. The generation step receives only derived values and semantic tags rather than raw input data. Let T_in represent the original input size in tokens and T'_in the size of the reduced representation produced after feature extraction. For n documents derived from the same dataset, direct generation cost scales with n · T_in. In the staged system, cost splits into a one-time computation cost plus n · T'_in. Because T'_in is smaller than T_in, each additional document saves T_in − T'_in tokens, and as n increases the amortized cost of preprocessing becomes negligible relative to these repeated savings.

This structure also changes verification cost. When outputs depend on raw inputs embedded inside prompts, validation requires rechecking both computation and interpretation. When outputs depend on precomputed features, verification reduces to checking alignment between text and deterministic values. This reduces the scope of manual review.

A second effect concerns failure containment. In end-to-end generation, errors in reasoning, calculation, and phrasing occur in the same process, making attribution difficult. A staged pipeline isolates these responsibilities. Feature computation is deterministic and testable. Semantic classification is rule-based and auditable. Generation is constrained to express only pre-validated inputs. Validation operates as a final comparison layer between text and deterministic outputs.

In practical terms, this structure prevents entire classes of errors that arise when models are allowed to both compute and express facts. Numerical inconsistencies, misapplied rules, and unsupported claims can be traced back to specific layers and eliminated without affecting unrelated parts of the system. The result is a system where cost and correctness are both controlled through separation of responsibilities rather than increased model complexity.
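
The cost comparison can be made concrete with a few lines of arithmetic. The sketch below assumes that direct generation costs n times the raw input size and that the staged pipeline costs a one-time preprocessing amount plus n times the reduced representation. The function names and all token counts are hypothetical, and output tokens are ignored for simplicity.

```python
from typing import Optional


def direct_cost(n: int, t_in: int) -> int:
    """Input tokens for n documents when the raw data is re-sent every time."""
    return n * t_in


def staged_cost(n: int, t_reduced: int, c_pre: int) -> int:
    """One-time preprocessing cost plus n prompts over the reduced representation."""
    return c_pre + n * t_reduced


def break_even_docs(t_in: int, t_reduced: int, c_pre: int) -> Optional[int]:
    """Smallest n for which the staged pipeline is strictly cheaper.
    Returns None when the reduced representation saves nothing per document."""
    saving_per_doc = t_in - t_reduced
    if saving_per_doc <= 0:
        return None
    # staged < direct  <=>  n > c_pre / saving_per_doc
    return c_pre // saving_per_doc + 1


# Illustrative numbers only: 8,000 raw tokens, 500 reduced tokens,
# and a one-time preprocessing cost equivalent to 10,000 tokens.
T_IN, T_REDUCED, C_PRE = 8_000, 500, 10_000

print(break_even_docs(T_IN, T_REDUCED, C_PRE))
for n in (1, 2, 50):
    print(n, direct_cost(n, T_IN), staged_cost(n, T_REDUCED, C_PRE))
```

With these made-up figures the staged pipeline costs more for a single document and less from the second document onward. When feature computation runs entirely in SQL or Python, the preprocessing term carries no token cost at all and the staged pipeline is cheaper from the first document.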

Chapter 5: Formalize & Systemize

A working implementation begins with a narrowly defined document type. The unit of construction is a skill, which combines input schema, feature computation, semantic rules, generation constraints, and validation logic into a single packaged pipeline.

The input schema defines the structure of accepted data. Each field has a fixed type and meaning. Inputs outside this structure are rejected or normalized before processing. This step removes ambiguity at the entry point.

The feature layer computes derived values from the input schema. These computations are deterministic and expressed in standard tooling such as SQL or Python. The outputs include numerical transformations, aggregations, and formatted representations. Once computed, these values are stored and reused across all downstream operations for the same input.

The semantic layer maps computed features into categorical labels. These mappings are expressed as explicit rules that define thresholds and conditions. The rules function as a translation layer between raw computation and narrative intent. Changes in business definition are reflected by modifying rules rather than rewriting logic.

The generation layer receives three inputs: original data, computed features, and semantic labels. It produces structured text under strict constraints. The model is restricted to expressing provided values. No additional facts are introduced. Output formats are predefined, often as structured JSON containing narrative sections.

The validation layer compares generated text against deterministic outputs. It extracts numerical values, categorical claims, and references, then checks them against the feature and semantic layers. Any deviation indicates failure. Output is either accepted or routed for correction.

A complete skill behaves like a compiled artifact. Input enters through a fixed interface. Output is produced in a predictable format. Internal logic remains inspectable and versioned.
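
One way to picture a skill as a packaged, versioned artifact is a small immutable container that bundles the five parts and runs them in order. The sketch below is an assumption about how such a package could look, not a prescribed interface: the `Skill` class, its field names, and the sales example are all hypothetical, and a plain string template stands in for the constrained model call.

```python
from dataclasses import dataclass
from typing import Callable, Mapping


@dataclass(frozen=True)
class Skill:
    """Hypothetical packaging of one document type as a fixed pipeline."""
    name: str
    version: str
    schema: Mapping[str, type]  # field name -> required Python type
    compute_features: Callable[[Mapping], Mapping]
    classify: Callable[[Mapping], Mapping]
    generate: Callable[[Mapping, Mapping, Mapping], str]
    validate: Callable[[str, Mapping, Mapping], bool]

    def run(self, record: Mapping) -> str:
        # Input schema: reject unknown fields and wrong or missing types.
        unexpected = set(record) - set(self.schema)
        if unexpected:
            raise TypeError(f"unexpected fields: {sorted(unexpected)}")
        for field, expected in self.schema.items():
            if not isinstance(record.get(field), expected):
                raise TypeError(f"field {field!r} must be {expected.__name__}")
        features = self.compute_features(record)        # deterministic
        labels = self.classify(features)                # rule-based
        text = self.generate(record, features, labels)  # expression only
        if not self.validate(text, features, labels):   # reconciliation
            raise ValueError("generated text does not match computed values")
        return text


def _features(r):
    return {"attainment_pct": round(r["units_sold"] / r["units_target"] * 100, 1)}


def _classify(f):
    return {"status": "on target" if f["attainment_pct"] >= 100 else "below target"}


def _generate(r, f, labels):
    # Stand-in for a constrained model call: a template that uses only provided values.
    return f"{r['region']} is {labels['status']} at {f['attainment_pct']}% of plan."


def _validate(text, f, labels):
    return str(f["attainment_pct"]) in text and labels["status"] in text


SALES_SKILL = Skill(
    name="regional-sales-summary",
    version="0.1.0",
    schema={"region": str, "units_sold": int, "units_target": int},
    compute_features=_features,
    classify=_classify,
    generate=_generate,
    validate=_validate,
)

print(SALES_SKILL.run({"region": "North", "units_sold": 450, "units_target": 500}))
```

Because the container is frozen and carries a version string, two runs with the same skill version and the same record go through the same feature and rule logic, which is what makes the output auditable.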

Once a single skill is stable, the same structure can be replicated across multiple document types. Financial reports, product summaries, operational dashboards, and compliance documents follow identical architectural patterns. Variation exists only in schema definitions, feature logic, and semantic rules.

As the number of skills increases, duplication appears in semantic definitions. Terms such as “strong performance,” “declining trend,” or “high risk” recur across domains, often with subtle differences in meaning depending on context. A static rule system cannot represent these contextual variations efficiently. Each skill encodes its own version of definitions, which leads to inconsistency and maintenance overhead.

A knowledge graph introduces a shared semantic layer. Concepts are represented as nodes, and relationships between them are explicitly defined. Each concept carries attributes such as context, domain, and threshold values. This allows meaning to vary based on surrounding conditions rather than fixed rule files embedded in individual skills.

In this structure, a query retrieves the appropriate definition of a concept based on context parameters such as industry, market state, or organizational role. The semantic layer no longer evaluates rules directly. It resolves references into context-specific definitions drawn from the graph.

Feature computation remains unchanged. Inputs are still transformed into deterministic values. The difference lies in how those values are interpreted. Instead of fixed thresholds embedded in code or configuration files, interpretation depends on graph queries that return context-aware mappings.

This creates composability across systems. Multiple skills reference the same underlying semantic nodes. A change in definition propagates through the graph without modifying individual pipelines. Consistency emerges from shared structure rather than replicated configuration.

The generation layer remains unchanged. It still receives features and resolved semantic labels. The difference lies upstream, where those labels are derived from a shared semantic space rather than isolated rule sets.

Validation also extends naturally. Outputs can be traced not only to feature computations but also to the specific semantic definitions used during interpretation. This adds a second layer of provenance, linking each statement to both numerical derivation and contextual meaning.

The system shifts from isolated pipelines to a connected network of shared meaning, where document generation becomes an application of structured knowledge rather than repeated local interpretation.
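
A very small in-memory stand-in can show how context-dependent resolution might work. The sketch below assumes concepts are stored per context, that contexts are linked by a "broader than" relationship used as a fallback, and that each definition carries a version for provenance. The concept names, contexts, thresholds, and function names are invented for illustration; a real deployment would presumably use a graph database and its query language rather than Python dictionaries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Definition:
    """One context-specific definition of a shared concept (hypothetical shape)."""
    concept: str
    context: str
    threshold: float  # minimum feature value for the concept to apply
    version: str


# Nodes: definitions keyed by (concept, context). Threshold values are made up.
DEFINITIONS = {
    ("strong performance", "default"): Definition("strong performance", "default", 10.0, "v1"),
    ("strong performance", "retail"): Definition("strong performance", "retail", 4.0, "v3"),
    ("strong performance", "software"): Definition("strong performance", "software", 25.0, "v2"),
}

# Edges: each context points to the broader context it falls back to.
BROADER = {"grocery": "retail", "retail": "default", "software": "default"}


def resolve(concept: str, context: str) -> Definition:
    """Walk from the given context toward broader ones until a definition is found."""
    seen = set()
    current = context
    while current is not None and current not in seen:
        seen.add(current)
        definition = DEFINITIONS.get((concept, current))
        if definition is not None:
            return definition
        current = BROADER.get(current)
    raise KeyError(f"no definition of {concept!r} reachable from {context!r}")


def interpret(concept: str, context: str, value: float) -> dict:
    """Semantic layer: apply the resolved definition and record which one was used."""
    d = resolve(concept, context)
    return {
        "concept": concept,
        "applies": value >= d.threshold,
        "definition": f"{d.concept}@{d.context}:{d.version}",  # provenance
    }


# The same computed feature (6.0% growth) is interpreted differently by context.
print(interpret("strong performance", "grocery", 6.0))
print(interpret("strong performance", "software", 6.0))
```

The `definition` field in the result is the second layer of provenance mentioned above: a validated statement can be traced to both the computed value and the exact definition version that labeled it. Changing the retail threshold in one place changes the interpretation for every skill that resolves through that node.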