Knowledge Operations: A Capability Model for AI Systems
Classifying AI systems by the knowledge operations they perform, not the data formats they use.
Gerasimos Xydas¹
Version 0.1.0 · 2026-05-21 · DOI 10.5281/zenodo.20083456
© 2026 Gerasimos Xydas. Licensed under CC BY 4.0.
Executive Summary
Two AI systems can both be called "RAG" while performing fundamentally different kinds of knowledge work; conversely, two systems can use different substrates—vectors, SQL, graphs, or tools—while satisfying the same workload requirement. This mismatch makes architecture-centered labels a poor guide to capability. This whitepaper proposes the inverse:
A system's capability in handling knowledge-augmented work is determined by the knowledge operations it performs reliably, not by the format of the data it retrieves.
Storage and architecture are inputs. Knowledge operations define capability: retrieving, scoping, interpreting, combining, computing, traversing, orchestrating, governing, and evaluating. Some are reasoning operations (synthesis, multi-hop traversal, contradiction detection, evaluator-optimizer judgement). Others (scoping, permission enforcement, audit logging) are control operations that support reasoning instead of performing it. Both count toward fitness. A knowledge graph does not make a system more capable than one using vector search; ten agents are not automatically more capable than one. The right architecture matches the epistemic demands of the task.
We call the broader category Knowledge-Augmented Systems (KAS): AI systems that combine language models with external knowledge, structure, computation, and execution. The model proposes seven capability archetypes (K0–K6) wrapped by a governance scale (G0–G5) and an evaluation discipline applied to every operation.
Together they produce a capability profile rather than a single level: a system can be strong at scoped retrieval and cross-source synthesis while having no need for relational reasoning or computation. That is an appropriate design choice for the workload, not a deficit. Buyers and architects should evaluate fit per workload, not per industry or per vendor stack.
"Fitness to task" is not relativism. The framework retains prescriptive force: a profile is insufficient when the task demands operations the system cannot perform, when governance is below the risk class's prerequisite, or when the system over-engineers operations the task doesn't require, paying latency, cost, and complexity the task cannot absorb.
1. Introduction
Retrieval-Augmented Generation (RAG) was originally described as a simple "retrieve, inject, generate" pipeline. However, systems built around large language models (LLMs) now combine retrieval with iterative search, structured querying, graph traversal, tool orchestration, planning, and execution.
Enterprise buyers and architects need a language for this broader class of systems. This whitepaper proposes a capability model for Knowledge-Augmented Systems: AI systems that couple language models with external knowledge, structure, and execution.
This paper sits within a broader shift: enterprise software is increasingly being designed for AI agents as first-class consumers, not only for human users. Gartner forecasts that up to 40 percent of enterprise applications will feature task-specific AI agents by the end of 2026, up from less than 5 percent in 2025 [1]. Some practitioners describe the consumer-facing side of this shift as Agent Experience (AX): the experience an agent has when it attempts to understand a task, access the right context, act within policy, and produce traceable outcomes inside a system built for it. AX is not yet a settled industry discipline with anchoring citations, but it is useful shorthand for the consumer-facing complement to the system-side framework this paper proposes. The two are complements, not alternatives. §3.4 returns to the architectural implications.
1.1 Reasoning operations and control operations
Not every knowledge operation is a reasoning operation. Synthesis under contradiction, multi-hop traversal under constraint, and evaluator-optimizer judgement are reasoning. Scoping, permission enforcement, and audit logging are control operations: they shape what gets reasoned over and what happens with the result. Both classes count toward capability because a workload's reliability depends on the whole set, but conflating them obscures where to invest. A team that buys "agentic reasoning" capability when the actual failure was permission propagation has misdiagnosed the problem and will not fix it by adding agents.
If you take one thing from this paper: the right architecture for a knowledge-augmented workload is determined by the knowledge operations the task class demands, not by the latest vendor narrative or the most fashionable storage choice. Three different teams in the same organization may need three very different KAS profiles. Treating them as a single "RAG initiative" is a common and costly organizational failure mode in this space.
1.2 How this model was derived
The eight capability dimensions and seven capability archetypes are a practitioner synthesis, not an empirical derivation. Their inputs are: (a) the academic literature on retrieval-augmented generation, agentic systems, and multi-hop reasoning cited throughout this paper; (b) the evaluation criteria used by major analyst frameworks for adjacent categories (Forrester's Cognitive Search Wave, Gartner's Market Guide for Enterprise AI Search, IDC's MarketScape for Knowledge Discovery); (c) the failure-mode taxonomies surfaced in agentic RAG surveys [2] and applied-agent guidance [3]; and (d) recurring patterns observed in enterprise RAG deployments. The model is offered as a thinking framework, not a benchmark, and should be revised as evidence accumulates.
1.3 Contributions
This paper makes four contributions. First, it defines Knowledge-Augmented Systems as a workload-oriented category of AI systems in which external knowledge, computation, or state is load-bearing for the answer. Second, it proposes knowledge operations as the unit of capability analysis, in deliberate contrast to format-anchored or architecture-anchored taxonomies. Third, it introduces seven operation archetypes (K0–K6), a parallel governance scale (G0–G5), and an evaluation discipline that treats abstention and calibrated uncertainty as first-class system properties. Fourth, it provides a workload-level profiling method — including a membership test, a five-orthogonal-layer model, per-class evaluation artifacts, and a threat-model mapping — intended for system design, vendor evaluation, and procurement.
2. Limitations of Existing Maturity Ladders
A common market trope presents capability as a linear maturity ladder of formats and architectures:
chunks → metadata → hierarchy → graph → agents
This implies that every system must "graduate" from flat chunks through metadata and hierarchy to graphs and multi-agent orchestration. Real workloads do not follow such a sequence. More importantly, the items in this ladder are storage and architecture choices, not knowledge operations. A vector database and a graph database are different substrates, each convenient for different knowledge operations; neither is inherently more capable than the other.
A customer-support bot may need orchestration over unstructured documents but never require graph traversal. A financial analyst may need exact computation but not cross-document synthesis. A compliance assistant may require permission-aware retrieval and provenance more than agentic autonomy.
Format-anchored ladders also create a misclassification trap: a system that uses a graph database is read as "more advanced" than a system that uses recursive SQL, even when both perform the same multi-hop knowledge operation.
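The point can be made concrete with a minimal sketch, assuming a hypothetical four-edge service dependency set. The same multi-hop question ("which services depend on this component, directly or transitively?") is answered once with a recursive SQL CTE and once with a breadth-first walk over an adjacency map. The table name, edge data, and function names are illustrative assumptions, not part of the framework.

```python
import sqlite3
from collections import deque

# Hypothetical dependency edges: (service, depends_on).
EDGES = [
    ("checkout", "payments"),
    ("payments", "ledger"),
    ("reporting", "ledger"),
    ("ledger", "storage"),
]


def dependents_sql(component: str) -> set[str]:
    """Multi-hop traversal as a recursive CTE over a relational table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE deps (service TEXT, depends_on TEXT)")
    conn.executemany("INSERT INTO deps VALUES (?, ?)", EDGES)
    rows = conn.execute(
        """
        WITH RECURSIVE upstream(name) AS (
            SELECT service FROM deps WHERE depends_on = ?
            UNION
            SELECT d.service FROM deps d JOIN upstream u ON d.depends_on = u.name
        )
        SELECT name FROM upstream
        """,
        (component,),
    ).fetchall()
    conn.close()
    return {name for (name,) in rows}


def dependents_graph(component: str) -> set[str]:
    """The same traversal as a breadth-first walk over an adjacency map."""
    reverse: dict[str, list[str]] = {}
    for service, depends_on in EDGES:
        reverse.setdefault(depends_on, []).append(service)
    seen: set[str] = set()
    queue = deque([component])
    while queue:
        for service in reverse.get(queue.popleft(), []):
            if service not in seen:
                seen.add(service)
                queue.append(service)
    return seen
```

Both functions return the same set for any component in this toy data, which is the sense in which the substrate differs while the knowledge operation does not.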
Knowledge operations can be ordered by complexity (retrieving is simpler than synthesizing, which is simpler than orchestrating), and an ordering of knowledge operations is a useful descriptive device. But ordering knowledge operations is not the same as prescribing that every system must climb them in sequence. Different workloads demand different subsets, and the right system implements only the operations its task requires. This is why we use the term capability archetypes instead of maturity levels: K0–K6 are an ordered enumeration of operation classes, not steps on a staircase.
2.1 Capability matters only relative to task fitness
We use two distinct terms throughout this paper. Capability describes what knowledge operations a system can perform reliably (the K-class set, the governance level, the evaluation discipline). Fitness describes whether those capabilities meet a workload's required operations, governance prerequisites, evaluation rigor, and operational constraints (latency, cost, reliability). A system can be highly capable on every dimension and still be unfit for a specific workload: over-engineered, too slow, too expensive, or governed at a level the workload doesn't justify. Conversely, a narrow capability profile can be a perfect fit if it matches what the workload actually demands.
For many workloads, a well-governed system that performs only K1 retrieval and K3 synthesis is a better fit than a fragile K6 workflow that attempts all eight operations badly. The first is narrowly capable but well-fitted; the second is broadly capable but poorly fitted.
2.2 The prescriptive floor
"Fitness to task" is not relativism. The framework retains prescriptive force: a capability profile is insufficient when any of the following hold.
- Missing required operation. The task class demands a knowledge operation the system cannot perform reliably. A compliance task that requires temporal reconciliation, run on a system with no temporal reasoning capability, is undersized regardless of how well it performs its other operations.
- Inadequate governance for risk class. The governance level is below the prerequisite for the task's risk class. An irreversible-action workflow operating without human-approval gates is unsafe regardless of how well it reasons.
- Disproportionate operation set. The system performs knowledge operations the task does not require, at a latency, cost, or complexity the task cannot absorb. A K6 orchestration workflow deployed for a task that needs only K1 scoped retrieval is over-engineered, not "more capable", and on a real-world cost-quality tradeoff it is inferior to the simpler system.
This is what allows the framework to declare specific systems unfit for specific workloads, even when the same architecture would be appropriate elsewhere.
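The three conditions above can be sketched as a simple check. This is an illustration under stated assumptions, not a scoring method defined by the framework: the class names, the field names, and the choice to express latency and cost as single numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemProfile:
    k_classes: frozenset[str]   # operations the system performs reliably
    g_level: int                # position on the G0-G5 scale
    p95_latency_s: float        # measured latency (illustrative field)
    cost_per_query: float       # measured cost (illustrative field)


@dataclass(frozen=True)
class WorkloadRequirement:
    k_classes: frozenset[str]   # operations the task class demands
    min_g_level: int            # governance prerequisite for the risk class
    max_latency_s: float        # latency the task can absorb
    max_cost_per_query: float   # cost the task can absorb


def fitness_findings(system: SystemProfile, workload: WorkloadRequirement) -> list[str]:
    """Return the prescriptive-floor violations; an empty list means none found."""
    findings = []

    # 1. Missing required operation.
    missing = workload.k_classes - system.k_classes
    if missing:
        findings.append("missing required operation: " + ", ".join(sorted(missing)))

    # 2. Inadequate governance for the risk class.
    if system.g_level < workload.min_g_level:
        findings.append(f"inadequate governance: G{system.g_level} < G{workload.min_g_level}")

    # 3. Disproportionate operation set: surplus operations that also
    #    push the system past the workload's latency or cost budget.
    surplus = system.k_classes - workload.k_classes
    over_budget = (
        system.p95_latency_s > workload.max_latency_s
        or system.cost_per_query > workload.max_cost_per_query
    )
    if surplus and over_budget:
        findings.append("disproportionate operation set: " + ", ".join(sorted(surplus)))

    return findings
```

Note that surplus operations alone do not trigger the third finding in this sketch; they count against the system only when the workload cannot absorb their latency or cost, which mirrors the wording of the third condition.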
3. From RAG to Knowledge-Augmented Systems
RAG, introduced by Lewis et al. (2020) [4], remains an important early pattern. It helps LLMs ground responses in external information and reduce dependence on static model weights. The pattern sits inside a longer line of work on retrieval-augmented and non-parametric memory: DPR [5] established that dense vector retrieval could outperform BM25 on open-domain QA; REALM [6] showed that retrieval can be incorporated into pretraining itself rather than bolted on at inference time; RETRO [7] demonstrated that a much smaller model with retrieval can match the performance of a much larger model without it. The relevance to KAS is not which mechanism wins. It is that all three measure success on retrieval-component metrics, not on whether the surrounding system performs the synthesis, computation, traversal, governance, or orchestration the workload requires. Retrieval alone, by any of these mechanisms, is insufficient for complex enterprise knowledge work.
Modern systems increasingly need to:
- retrieve relevant evidence;
- scope by metadata, permission, tenant, product, geography, or time;
- preserve document structure and provenance;
- synthesize across multiple sources;
- compute over authoritative records;
- traverse relationships between entities;
- call tools and APIs;
- validate intermediate and final outputs;
- abstain when evidence is insufficient;
- escalate when human approval is required.
We call this broader category Knowledge-Augmented Systems (KAS).
3.1 Working definition
A Knowledge-Augmented System is an AI system that combines language models with external knowledge, structure, computation, and execution in order to answer questions, support decisions, or perform workflows grounded in authoritative information.
A knowledge operation is a system-level action that transforms, constrains, validates, computes over, or acts upon external knowledge in a way that is observable at the workload boundary. To qualify, an operation must be (1) a verb the system performs at its boundary, observable as a typed result; (2) not constitutively tied to a single storage format — the same operation is implementable over vectors, graphs, SQL, APIs, or text, even if a given deployment chooses one; (3) what the system does, not what it stores or how it reaches the source (§6.1 elaborates the layer distinction); (4) falsifiable at the boundary — a §8-style benchmark question and a §8.1-style recommended-failure-behavior row can be written for it; and (5) conceptually non-reducible to a single existing K-class without information loss. The full membership test, including the K6 meta-irreducibility carve-out, is in Appendix A.1. The distinction between operations and the substrates that implement them (vector stores, graphs, SQL engines, agent loops) is the move §1's thesis depends on.
3.2 What KAS is not
KAS is a deliberately bounded category. Two boundary conditions are obvious enough to dispose of in a sentence: a model answering from parametric memory alone is outside KAS (no external knowledge participates), and raw search that returns ranked documents for a human reader is not a KAS by itself (no model-mediated operation transforms, computes over, or acts on the results). Three adjacent labels need more careful distinction:
- Not Compound AI Systems in the broad sense. Compound AI Systems (Zaharia et al. [8]) is a useful umbrella term covering any system composed of multiple AI components, including pure orchestration of LLM calls without external knowledge access. KAS does not propose a new technical primitive: the label is a workload-evaluation lens for the subset of compound AI systems where external knowledge, computation, or state is load-bearing for the answer.
- Not Augmented Language Models. Augmented Language Models (Mialon et al. [9]) and Tool-Augmented LLMs (Toolformer [10] and successors) describe LLMs with reasoning, retrieval, or tool-use capabilities trained or prompted into the model. KAS describes the system around such a model: the typed contracts, governance, evaluation, and consumer-facing operations a workload requires. An Augmented LM is a substrate; a KAS is an architecture.
- Not RAG-only. RAG is one access pattern within KAS, not a synonym for it. A KAS may use RAG, NL→SQL, graph traversal, tool orchestration, or any combination; the architecture is determined by the workload, not by the pattern.
What is new about KAS as a label is that it scopes the category by operation classes (the K-dimensions), governance, evaluation, and consumer (human or agent), instead of by any single access pattern, model architecture, or vendor capability. The framing is the novelty, not any one operation.
3.3 Relationship to existing categories
KAS is a refinement of, not a replacement for, several existing categories. Naming them is necessary because each carries an audience that the KAS framing should connect to rather than ignore.
- "Augmented Intelligence" (IBM, in long-running use) describes the human-AI collaboration mode. KAS describes the system architecture independently of how it is consumed; the two are orthogonal.
- "Cognitive Search" (Forrester, established Wave category since 2017) overlaps most strongly with the K1–K3 portion of this model: retrieval, scoping, structure, and synthesis. Vendor capabilities in the category vary widely. KAS extends the framing to include computation, relational reasoning, orchestration, and explicit governance, and replaces feature-based scoring with knowledge-operation profiling.
- "Agentic RAG" (Singh et al., 2025 [2]) primarily emphasizes orchestration over retrieval workflows and maps most directly to K6, but in practice spans retrieval control, synthesis, tool use, and evaluation depending on implementation. KAS situates these within a broader capability framework that does not assume agents are the destination.
- "Naive / Advanced / Modular RAG" (Gao et al., 2023/2024 [11]) is the most cited maturity taxonomy in the space, and the one this paper most directly disagrees with. Gao et al. stage RAG architectures by pipeline shape: Naive (retrieve-then-generate), Advanced (with pre- and post-retrieval refinement), Modular (with rewriteable components). That staging reproduces the format-anchored mistake §2 names. It tells you what the system is built like, not what knowledge operations it performs. As a consequence, Modular RAG and Agentic RAG become difficult to distinguish under the Gao et al. framing — both involve "modules" — even though Agentic RAG introduces a planner that is a categorically different operation (K6) than parameterized module composition. KAS treats the Gao et al. stages as cells in a wider profile rather than as the profile itself.
The contribution of KAS is not the discovery of any one of these capabilities, but the reframing of capability around knowledge operations (including governance and evaluation as cross-cutting axes) and the multi-dimensional capability profile that follows.
3.4 KAS as agent-native architecture
The operation-over-format reframing aligns naturally with the rise of agents as the primary consumers of enterprise knowledge systems. Format-anchored thinking ("we use a vector DB") describes how a system stores knowledge; operation-anchored thinking describes what an agent can ask the system to do. Traditional search was built around a human user examining ranked results; a KAS is built around an agent consuming typed contracts that expose knowledge operations.
The eight capability dimensions correspond, in practice, to the operation and control-plane needs that arise when agents consume enterprise knowledge systems: retrieval and scoping (the two halves of retrieval control), interpretation, synthesis, computation, traversal, orchestration, governance, and evaluation. The first seven are agent-facing primitives, not user-facing features. Governance and evaluation are not primitives an agent invokes directly. They are cross-cutting requirements the system must enforce so that an agent can act responsibly on its outputs.
This connects KAS to the broader Agent Experience (AX) framing: software designed primarily for agent consumption instead of human UI. AX is not yet a settled industry discipline with anchoring citations, but practitioners use it as shorthand for the consumer-facing side of the agentic shift, complementary to the system-side framing this paper proposes. The framework remains useful for human-consumed systems (analyst-facing search, document-QA assistants), but it shows its strongest value where the consumer is an agent: an Evidence Pack returned to an LLM planner, a typed artifact passed to a downstream tool, a calibrated-uncertainty signal an agent uses to decide whether to abstain or escalate.
Where those agents need to discover and invoke each other across vendors, the integration surface is a separate framework. Three open standards are converging on this layer. MCP [12] (Model Context Protocol, Anthropic, 2024) connects LLM applications to external data sources and tools. A2A [13][14] (Agent2Agent Protocol, Google 2025, donated to the Linux Foundation in 2025) handles communication and interoperability between opaque agentic applications. AGNTCY [15][16] (Linux Foundation, 2025) covers discovery, identity, messaging, and the OASF schema [17]. These standards tell one agent how to invoke another. KAS tells you what either agent's knowledge operations actually do. The two layers are orthogonal, and Appendix C maps the OASF vocabulary to the K-classes as a worked example; equivalent mappings for MCP tools and A2A agent-card capabilities are deferred to companion work.
4. A Capability Stack for Knowledge-Augmented Systems
Instead of classifying systems along a single ladder, we propose evaluating them across independent dimensions. Each dimension reflects a fundamental knowledge operation.
| Dimension | Question answered |
|---|---|
| Retrieval control | Can the system reliably find the right evidence? |
| Context modeling | Does it preserve structure, hierarchy, and provenance? |
| Synthesis | Can it combine evidence across sources and surface candidate contradictions, scope differences, or conflicts? |
| Computation | Can it generate exact, executable answers through DSLs, SQL, APIs, or other tools? |
| Relational reasoning | Can it traverse entities, dependencies, lineage, and multi-hop links? |
| Orchestration | Can it plan, route, validate, recover, and coordinate tools? |
| Governance | Can it enforce permissions, provenance, policy, and auditability on every knowledge operation it performs? |
| Evaluation | Does it know when to answer, when to abstain, and how to measure correctness? |
This capability-stack view avoids the mistaken assumption that every organization must adopt graphs, agents, or SQL to be "advanced". Different workloads require different profiles.
A note on governance. Governance appears twice in this framework: as one of the eight capability dimensions above, and as a parallel governance scale (G0–G5) developed in §9. These are not duplicates. The §4 entry asks whether the system can enforce governance at all; the §9 scale asks how deep that enforcement goes. A system either has the capability or it does not (the §4 axis); among systems that have it, governance depth ranges from "no citations, no controls" (G0) to "runtime monitoring with human approval" (G5). The two together let architects evaluate both the presence and the depth of governance, treating it as a cross-cutting requirement on every other knowledge operation rather than a separate track.
A parallel note on evaluation. Evaluation is listed as a dimension because evidence sufficiency, calibrated abstention, and correctness measurement are observable system capabilities — a system either produces these signals or it does not. It is also treated as cross-cutting because those capabilities must be applied to every K-operation rather than isolated in a separate K-class. The same dual treatment that justifies governance's presence in both the dimension table and the G0–G5 scale (§9) justifies evaluation's presence as both a dimension and the normative answerability table of §8.1.
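As a minimal sketch of what "knowing when to answer" can look like at the system boundary, the function below turns an evidence count, a confidence estimate, and an approval flag into one of three outcomes. The thresholds, parameter names, and the order of the checks are assumptions made for illustration; the framework does not prescribe them, and a real system would need calibrated confidence values for the comparison to mean anything.

```python
from enum import Enum


class Decision(Enum):
    ANSWER = "answer"
    ABSTAIN = "abstain"
    ESCALATE = "escalate"


def decide(
    evidence_count: int,
    confidence: float,
    needs_human_approval: bool,
    *,
    min_evidence: int = 2,        # hypothetical sufficiency threshold
    min_confidence: float = 0.7,  # hypothetical calibrated-confidence floor
) -> Decision:
    """Gate an output on evidence sufficiency before considering approval."""
    # Insufficient or weak evidence: abstain instead of guessing.
    if evidence_count < min_evidence or confidence < min_confidence:
        return Decision.ABSTAIN
    # Evidence is sufficient, but the action requires a human sign-off.
    if needs_human_approval:
        return Decision.ESCALATE
    return Decision.ANSWER
```

The same gate would wrap every K-operation in this sketch, which is what makes evaluation cross-cutting instead of a separate class.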
4.1 Operational constraints
The eight dimensions describe what a system can do. Operational constraints govern how it should do it. Three are non-negotiable in production:
- Latency. What response time is acceptable for the workload? Iterative multi-hop retrieval and orchestrated tool calls compound latency; this is not free.
- Cost. Per-query LLM and infrastructure cost scales with orchestration depth, context size, and retrieval breadth. A broader operation set generally costs more per query, not less.
- Reliability. What is the acceptable failure rate, and what happens when a retrieval or tool call fails?
These constraints bound every capability choice. An architecture that scores well across all eight dimensions but fails the latency or cost budget for the workload is over-engineered, not capable.
For the deliberate question of why the K-class set stops at K6 (and where memory, multimodality, temporal reasoning, reflection, multi-agent coordination, generation, and translation live instead), see Appendix A.
4.2 Knowledge preparation prerequisites
The capability stack assumes that retrievable evidence, queryable records, and traversable relationships exist in a form the system can use. They do not appear by accident. Document parsing, OCR, table and layout extraction, chunking strategy, metadata assignment, entity resolution, schema mapping, lineage capture, and index freshness all sit upstream of K0 and silently determine whether any K-class above can perform.
Knowledge preparation does not satisfy the §3.1 membership test for knowledge operations and is not itself a K-class. It is a layer-1 prerequisite (per §6.1) on which the operations above depend. A K1 system cannot enforce permission-aware retrieval if documents arrived without ACL metadata. A K3 system cannot reconcile contracts if the same legal entity appears as "Acme Inc.", "Acme, Incorporated", and "ACME INC" across the index without entity resolution upstream. A K4 system cannot compute over an authoritative record if the schema mapping conflates `created_at` and `updated_at`.
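To illustrate the entity-resolution example, here is a deliberately small sketch of name normalization applied at ingestion. It only handles case, punctuation, and a short, assumed table of legal-suffix spellings; production entity resolution is a much harder problem, and the function name and suffix table are hypothetical.

```python
import re

# Assumed mapping of legal-suffix spellings to one canonical token.
_SUFFIXES = {
    "inc": "inc",
    "incorporated": "inc",
    "corp": "corp",
    "corporation": "corp",
}


def entity_key(name: str) -> str:
    """Build a canonical key so spelling variants index to the same entity."""
    # Lowercase and keep only alphanumeric tokens, dropping punctuation.
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(_SUFFIXES.get(token, token) for token in tokens)
```

With this key assigned upstream, the three spellings in the paragraph above collapse to one entity before any K3 synthesis runs; without it, the synthesis layer sees three unrelated parties.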
Three implications follow. First, KAS profiles for procurement and architecture should specify the preparation budget alongside the operation set; a K3+G3 workload over enterprise PDFs needs a layout-aware extraction pipeline before any K3 work begins to matter. Second, the most common reason K-class capability appears to fail in deployment is ingestion-side, not operation-side: the operation is correct but the inputs are malformed. Third, evaluation harnesses (§8.2) should test preparation quality independently from operation quality; otherwise improvements at one layer get attributed to the other.
5. Capability Archetypes
How §4 and §5 relate. §4's eight dimensions are evaluation axes: they score what a system can do. §5's seven archetypes are workload-shape labels: they name what a task needs as its dominant operation. Six of the eight dimensions map directly to K-classes (Retrieval splits into K0/K1 by governance presence; Context modeling ↔︎ K2; Synthesis ↔︎ K3; Computation ↔︎ K4; Relational reasoning ↔︎ K5; Orchestration ↔︎ K6). The two cross-cutting dimensions, Governance and Evaluation, wrap every K-operation and have their own scales (G0–G5 in §9 and the normative §8.1 table) instead of dedicated K-classes.
The K-classes below should be read as archetypes, not as steps on a maturity ladder. A real system may combine K1 scoped retrieval, K4 computation, and K2 document understanding without needing K5 graph traversal. The ordering reflects increasing operation complexity, not a sequence every system must climb.
A system should not be assigned a single K-class unless one operation clearly dominates; most real systems should be described as a vector of K-classes plus a G-level (e.g., K1 + K3 + K4; G3). The K-classes are not mutually exclusive, and the framework's value comes from the combination, not from picking one.
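The vector notation can be sketched as a small data structure. This is a hypothetical encoding: the paper gives the notation only by example (K1 + K3 + K4; G3), so the class name, the parser, and the validation rules below are assumptions.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class CapabilityProfile:
    k_classes: tuple[str, ...]  # sorted, de-duplicated K-classes
    g_level: str                # a single G-level, "G0" through "G5"

    def __str__(self) -> str:
        return f"{' + '.join(self.k_classes)}; {self.g_level}"


def parse_profile(text: str) -> CapabilityProfile:
    """Parse a string such as 'K1 + K3 + K4; G3' into a profile."""
    k_part, separator, g_part = text.partition(";")
    k_classes = tuple(sorted({item.strip() for item in k_part.split("+")}))
    g_level = g_part.strip()
    valid = (
        bool(separator)
        and all(re.fullmatch(r"K[0-6]", k) for k in k_classes)
        and re.fullmatch(r"G[0-5]", g_level) is not None
    )
    if not valid:
        raise ValueError(f"not a valid capability profile: {text!r}")
    return CapabilityProfile(k_classes, g_level)
```

Representing the K-classes as a set-like collection instead of a single maximum value keeps the profile from being read as a level.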
| Class | Name | Primary operation | Typical architecture | Example question | Key risk |
|---|---|---|---|---|---|
| K0 | Basic Retrieval | Retrieve | Vector search plus prompt injection | "What does this document say?" | Shallow grounding |
| K1 | Scoped Retrieval | Scope | Hybrid search, metadata, ACLs | "Find policy for EU enterprise customers." | Bad filters or stale permissions |
| K2 | Contextual Understanding | Interpret | Hierarchical retrieval and section awareness | "Explain obligations in section 5." | Losing document structure |
| K3 | Cross-Source Synthesis | Combine | Multi-index retrieval, entity/source mapping | "Compare the policy, contract, and ticket history." | Poor contradiction handling |
| K4 | Computable Knowledge | Compute | SQL, DSL, APIs, validation | "How many accounts breached SLA?" | Generated query errors |
| K5 | Relational Reasoning | Traverse | Graph/entity model and path reasoning | "Which services depend on this component?" | Graph incompleteness |
| K6 | Orchestrated Workflows | Plan and act | Planner/router, tools, memory, guardrails | "Investigate churn risk and draft next actions." | Unsafe autonomy |
5.1 K0 — Basic Retrieval
K0 systems retrieve unstructured chunks and summarize them: semantic search, prompt injection of retrieved context, generation with little or no validation. This is the classic naive RAG pattern, and most public RAG demos sit here. Its failure modes are well understood (irrelevant chunks, hallucination when retrieval is weak,² poor source attribution, no awareness of when evidence is insufficient), and they motivate every K-class above. K0 systems are not deficient. They are the right shape for FAQ bots, document search, and simple policy lookup over public content. They become deficient only when deployed as the answer to a workload that needed K1's scoping or K3's contradiction handling.
(Absence of permission awareness is a property of K0, not a failure of it; K1 introduces scoped retrieval as a deliberate next step.)
5.2 K1 — Scoped Retrieval
K1 systems add control by scoping retrieval. They filter and rank evidence using metadata, permission filters, tenant boundaries, product scope, geography, date ranges, role, or other constraints.
Typical capabilities:
- metadata filtering;
- hybrid keyword/vector search;
- reranking;
- permission-aware retrieval;
- citations and source display.
Typical use cases:
- enterprise search;
- customer support scoped by product version;
- policy retrieval by region or role.
Main failure modes:
- stale metadata;
- incorrectly inherited permissions;
- over-filtering that hides relevant evidence.
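A minimal sketch of the scoping step, assuming chunks already carry region and role metadata and a relevance score from some upstream retriever. The `Chunk` fields and the `scoped_retrieve` function are hypothetical names chosen for illustration; the point is only that scope and permission filters apply before ranking, so out-of-scope evidence never competes for the top slots.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    region: str                     # metadata assigned at ingestion
    allowed_roles: frozenset[str]   # ACL metadata assigned at ingestion
    score: float                    # relevance from an upstream retriever (assumed)


def scoped_retrieve(
    chunks: list[Chunk], *, region: str, role: str, top_k: int = 3
) -> list[Chunk]:
    """Filter by scope and permission first, then rank what remains."""
    in_scope = [
        chunk for chunk in chunks
        if chunk.region == region and role in chunk.allowed_roles
    ]
    return sorted(in_scope, key=lambda chunk: chunk.score, reverse=True)[:top_k]
```

The failure modes listed above show up directly in this shape: stale `allowed_roles` values leak or hide evidence, and an overly narrow `region` filter returns an empty list even when relevant material exists.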
5.3 K2 — Contextual Understanding
K2 systems preserve and operate over document structure. Instead of treating documents as flat chunks, they represent and use sections, headings, hierarchy, tables, appendices, and local scope.
Typical capabilities:
- hierarchical retrieval;
- section-aware QA;
- cross-section coherence (relating content in one section to its parent or sibling sections);
- local context preservation;
- provenance at section or paragraph level.
Typical use cases:
- contract interpretation;
- manual or policy analysis;
- technical documentation assistants.
Main failure modes:
- losing section boundaries;
- mixing obligations from unrelated parts of a document;
- weak table interpretation.
5.4 K3 — Cross-Source Synthesis
K3 systems combine evidence across documents and sources. They can compare, reconcile, summarize, and surface candidate contradictions for review. In enterprise settings, contradiction detection is rarely binary: most apparent contradictions stay unresolved until scope, authority, effective date, jurisdiction, and version precedence are known. Source authority is itself a first-class K3 input: a signed contract outranks a sales deck, a production database outranks a copied spreadsheet, an official policy outranks a Slack message. A K3 system that treats sources as equally weighted is averaging, not synthesizing.
Typical capabilities:
- multi-source retrieval;
- aligning evidence across sources around a shared subject (the system operates over entity resolution instead of performing it; entity resolution itself is a data-engineering and NLP problem upstream of the synthesis layer, and K3 systems consume its outputs);
- source mapping;
- candidate contradiction and scope-conflict detection (validated contradiction detection across enterprise documents, including temporal, jurisdictional, conditional, and version-scoped variants, remains a hard evaluation problem; K3 systems should surface candidates, not assert resolutions);
- comparative answers.
Typical use cases:
- comparing contracts;
- synthesizing market research;
- combining ticket history, product docs, and customer notes.
Main failure modes:
- flattening disagreement into false consensus;
- over-weighting recent or more verbose sources;
- weak provenance.
(Temporal reasoning is a frequent modifier at K3; see Appendix B.3 for boundary-case treatment.)
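The K3 behavior described in this section, surfacing candidates ordered by source authority without asserting a resolution, can be sketched as follows. The numeric authority ranks, the `Claim` fields, and the function name are assumptions for illustration; only the pairwise orderings (contract over sales deck, official policy over Slack message) come from the text.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical authority ranks; higher outranks lower.
AUTHORITY = {
    "signed_contract": 3,
    "official_policy": 3,
    "sales_deck": 1,
    "slack_message": 0,
}


@dataclass(frozen=True)
class Claim:
    subject: str      # resolved upstream, e.g. "acme:payment_terms"
    value: str
    source_type: str
    source_id: str


def candidate_conflicts(claims: list[Claim]) -> list[dict]:
    """Group claims by subject and surface disagreements as candidates."""
    by_subject: dict[str, list[Claim]] = defaultdict(list)
    for claim in claims:
        by_subject[claim.subject].append(claim)

    candidates = []
    for subject, group in by_subject.items():
        if len({claim.value for claim in group}) > 1:
            # Order by authority for the reviewer; do not pick a winner.
            ranked = sorted(
                group,
                key=lambda claim: AUTHORITY.get(claim.source_type, 0),
                reverse=True,
            )
            candidates.append(
                {"subject": subject, "status": "candidate", "claims": ranked}
            )
    return candidates
```

Exact string comparison of values is a simplification; scope, effective date, and jurisdiction would need to be compared before a candidate could be confirmed or dismissed.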
5.5 K4 — Computable Knowledge
K4 systems do not merely summarize text. They compute exact, executable answers from authoritative records. The knowledge operation is producing a verifiable result over structured data; in production today this almost always means delegating the computation to a system designed for it (a query engine, an API, a domain-specific calculator) instead of asking the LLM to compute internally. Tool-augmented LLMs that learn when to call APIs (Toolformer [10] and successors) push the boundary on this delegation, but LLM-internal computation with self-verification remains an active research direction, not yet a production pattern, and treating it as one is a recurring failure mode of K4 systems.
Typical capabilities:
- aggregation, filtering, and counting over authoritative records;
- exact numeric answers with traceable derivation;
- query validation and result verification;
- composing multi-step computations that an end user can audit.
Typical implementations:
- SQL generation against an enterprise database;
- Elasticsearch or domain-specific DSL queries;
- API calls to authoritative systems;
- domain-specific calculators or rules engines invoked as tools.
Typical use cases:
- analytics assistants;
- SLA reporting;
- financial exposure analysis;
- operational dashboards.
Main failure modes:
- invalid generated queries;
- schema misunderstanding;
- column-grounding errors (NL→SQL ambiguity, e.g., picking `created_at` when the user meant `updated_at`) [18];
- computing over the wrong source of truth;
- presenting unverified outputs as exact results (regardless of whether they came from a query or model inference).
(On the distinction between K4 result-verification and system-level claim-level fact-checking, see Appendix B.5.)
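As a minimal sketch of the delegation pattern described in this section, the example below hands a generated query to a query engine (here Python's standard-library `sqlite3`) and returns the result together with its derivation. The `run_verified` helper, the `tickets` schema, and the read-only check are assumptions made for illustration; the check is deliberately conservative and would reject some valid read-only queries.

```python
import sqlite3

def run_verified(conn, sql, params=()):
    """Execute a generated query only if it is a single read-only SELECT.

    Fails closed otherwise. The check is intentionally naive (it also rejects
    CTEs and any statement containing ';'), which errs on the side of refusal.
    """
    stmt = sql.strip().rstrip(";")
    if ";" in stmt or not stmt.lower().startswith("select"):
        raise ValueError("fail closed: only a single SELECT statement is allowed")
    rows = conn.execute(stmt, params).fetchall()
    # Return the derivation alongside the rows so an end user can audit it.
    return {"rows": rows, "derivation": stmt, "params": params}

# Hypothetical authoritative records, kept in memory for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tickets (id INTEGER, region TEXT, sla_breached INTEGER)")
conn.executemany(
    "INSERT INTO tickets VALUES (?, ?, ?)",
    [(1, "EU", 1), (2, "EU", 0), (3, "US", 1), (4, "EU", 1)],
)

result = run_verified(
    conn,
    "SELECT region, COUNT(*) FROM tickets "
    "WHERE sla_breached = 1 GROUP BY region ORDER BY region",
)
print(result["rows"])        # [('EU', 2), ('US', 1)]
print(result["derivation"])  # the exact query that produced the answer
```

The language model's role in this sketch is limited to producing the query text; the count itself comes from the engine, and the derivation travels with the answer.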
5.6 K5 — Relational Reasoning
K5 systems traverse and reason over relationships between entities. The knowledge operation is following chains of inference across connected facts, independent of whether those relationships are stored in a property graph, a relational schema with foreign keys, an RDF triplestore, or extracted on the fly from text.
Typical capabilities:
- multi-hop inference across entity relationships;
- dependency and lineage tracing;
- path explanation (showing which relationships justified the conclusion);
- relationship grounding (verifying claimed relationships against source data).
Typical implementations:
- graph traversal over property graphs (e.g., Neo4j), RDF triplestores, or other knowledge graphs;
- recursive SQL CTEs over relational schemas;
- knowledge-base traversal;
- hybrid graph + LLM-summarization approaches (e.g., GraphRAG [19]) and hybrid graph + vector retrieval combinations.
Typical use cases:
- software dependency analysis;
- supply-chain risk;
- legal entity ownership;
- fraud networks;
- infrastructure impact analysis.
Main failure modes:
- incomplete or partial relational coverage;
- stale relationships;
- schema drift / ontology mismatch between the relational model and the source data;
- path explosion;
- ranking traversal results by structural proximity (e.g., shortest path) rather than semantic relevance;
- mistaking traversal for validated reasoning (paths exist in the data but do not justify the inference).
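Multi-hop traversal with path explanation can be sketched independently of the storage substrate. The example below is a minimal illustration over a plain edge list; the function name, the edge format `(dependent, dependency)`, and the service names are hypothetical. A hop limit bounds path explosion, and each result carries the path that justified its inclusion.

```python
from collections import deque

def dependents_with_paths(edges, start, max_hops=3):
    """Breadth-first traversal from `start` to everything that depends on it.

    `edges` is a list of (dependent, dependency) pairs. Returns a dict mapping
    each affected node to the path of hops that reached it.
    """
    reverse = {}
    for dependent, dependency in edges:
        reverse.setdefault(dependency, []).append(dependent)

    paths = {start: [start]}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if len(paths[node]) - 1 >= max_hops:
            continue  # hop budget spent on this branch
        for nxt in reverse.get(node, []):
            if nxt not in paths:  # first (shortest) path wins; also breaks cycles
                paths[nxt] = paths[node] + [nxt]
                queue.append(nxt)
    del paths[start]
    return paths

edges = [
    ("billing-api", "orders-db"),
    ("checkout", "billing-api"),
    ("reports", "orders-db"),
]
affected = dependents_with_paths(edges, "orders-db")
```

The returned paths show that a connection exists in the data. Whether that connection justifies the conclusion is a separate validation step, which is the last failure mode listed above.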
5.7 K6 — Orchestrated Workflows
K6 is the K-class where the framework is tested hardest, because what makes a system K6 is not what it can do but what it decides to do in real time. A K6 system has a planning policy that, given a task, selects which K0–K5 operations to run, in what order, with what tools, and with what stopping criterion. The planning policy is the operation. A pipeline that hard-codes K0 → K3 → K4 with no choice point is not K6 even when it touches three K-classes (this is the meta-irreducibility argument settled in §A.2), and a system that wraps an LLM in a `while` loop with tool-calling is not K6 either if the loop has no policy beyond "keep going until done."
The two most-cited reasoning-and-acting patterns sit on the spectrum this distinction names. ReAct [20] interleaves a reasoning trace with tool actions inside a single chain. Its planning policy is "decide what to think about next given what just happened" — sufficient for short bounded tasks, but it degrades on long horizons because the policy has no memory of why earlier branches failed. Tree of Thoughts [21] generalizes the same idea into deliberate search over reasoning trajectories with explicit backtracking and a policy of "evaluate partial states, expand the most promising ones, prune the rest"; it pays a coordination cost ReAct does not. Production K6 systems borrow from both: ReAct-style traces inside individual sub-tasks, ToT-style search at the task-decomposition layer, and evaluator-optimizer loops [3] at the boundary between them. None of these is a complete blueprint. They are starting points for the planning policy a specific workload demands. Singh et al.'s agentic RAG survey [2] catalogues the productized variants; Anthropic's Building Effective Agents [3] catalogues the engineering patterns.
Typical capabilities — task decomposition and tool selection; evaluator-optimizer loops where an evaluator judges intermediate output and an optimizer revises the plan; recovery from tool failure (retry, substitute, escalate, abort); memory across steps (within-session at minimum, cross-session for case-based work); approval gates for consequential actions; and traceable plan execution that an auditor can reconstruct after the fact.
Typical use cases include customer support resolution workflows, research and compliance investigations, software engineering agents that touch real repositories, and revenue or churn-risk workflows that span retrieval, computation, and recommendation.
The failure modes are categorically harder than in single-operation systems. Unsafe autonomy and hidden tool failures are well-known. What receives less attention: reward / objective hacking, where the system optimizes a proxy of the goal (typically a high evaluator score) instead of the goal itself; circular planning, where the policy revisits a path it has already exhausted because it has no memory that it did so; escalating cost and latency from over-long traces or unnecessary parallel branches; and poor auditability when the planning trace is verbose enough to be present but disorganized enough to be unusable in a regulator-facing review. The first two are research-active. The last two are engineering problems most production K6 systems have not yet solved.
(Memory, multi-agent topologies, reflection, and the meta-irreducibility of K6 are recurring boundary-case questions; see Appendix A.2 and Appendix B.)
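The distinction between a bare loop and a planning policy can be made concrete with a toy sketch. Everything named here (`run_plan`, `toy_policy`, the tool names, the step budget) is hypothetical; the policy is a hand-written function standing in for whatever planner a real system would use. The loop has an explicit stopping criterion, a step budget, a memory of what was already tried, and an escalation path.

```python
def run_plan(task, policy, tools, max_steps=5):
    """Run a policy-driven loop: the policy picks the next tool or stops."""
    state = {"task": task, "evidence": []}
    trace, tried = [], set()
    for _ in range(max_steps):
        action = policy(state, tried)
        if action is None:  # the policy's own stopping criterion
            return {"status": "done", "trace": trace}
        if action in tried or action not in tools:
            # Revisiting an exhausted path or requesting an unknown tool.
            trace.append((action, "rejected"))
            return {"status": "escalate", "trace": trace}
        tried.add(action)
        try:
            state["evidence"].append(tools[action](state))
            trace.append((action, "ok"))
        except Exception as exc:  # tool failure is recorded, not hidden
            trace.append((action, f"failed: {exc}"))
    return {"status": "escalate", "trace": trace}  # step budget exhausted

def toy_policy(state, tried):
    """Choose the next operation from what the evidence so far requires."""
    if "retrieve" not in tried:
        return "retrieve"
    needs_metric = any(e.get("needs_metric") for e in state["evidence"])
    if needs_metric and "compute" not in tried:
        return "compute"
    return None

tools = {
    "retrieve": lambda state: {"text": "churn rose in Q2", "needs_metric": True},
    "compute": lambda state: {"churn_delta": 0.04},
}
outcome = run_plan("explain churn", toy_policy, tools)
```

The choice point is the `needs_metric` check: computation runs only when retrieved evidence calls for it. A policy that keeps returning the same action is stopped and escalated rather than looping until the budget runs out.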
6. Retrieval and Access Patterns
A common mistake is to treat retrieval technologies as capability classes. They are not. Vector search, hybrid search, SQL, and graphs are access patterns that can be composed.
| Pattern | Best for |
|---|---|
| Keyword / BM25 | Exact terms, names, IDs, legal clauses |
| Vector search | Semantic similarity and paraphrase |
| Hybrid search | Balancing lexical and semantic matching |
| Metadata filtering | Scoped retrieval, permissions, tenant boundaries |
| Hierarchical retrieval | Long documents, manuals, policies, specifications |
| Entity retrieval | People, products, accounts, assets, organizations |
| Structured querying | Exact facts, metrics, filters, aggregations |
| Graph traversal | Dependencies, lineage, ownership, multi-hop relationships |
| Iterative retrieval | Ambiguous, exploratory, or multi-step questions where initial results inform subsequent queries. Iterative retrieval sits at the boundary between this layer and the operation layer above: the act of retrieving repeatedly is an access pattern, but the decision of when, what, and how to retrieve next is a knowledge operation that belongs to K6 orchestration. The two are coupled but distinct; a system can offer iterative-retrieval primitives without being capable of agentic query planning. |
These patterns are not ordered by sophistication. They are design choices.
The future is not "RAG versus GraphRAG versus agents." The future is compositional systems that choose the right knowledge operation for the task.
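As one illustration of composing access patterns, the sketch below applies a metadata scope first and then fuses a lexical ranking with a semantic ranking. The fusion step uses reciprocal rank fusion, a commonly used technique that this paper does not prescribe; the function names, document IDs, and the constant `k=60` are assumptions made for the example.

```python
def rrf(rankings, k=60):
    """Reciprocal rank fusion: each list contributes 1 / (k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

def scoped_hybrid(lexical, semantic, allowed_ids, k=60):
    """Apply the metadata scope before fusion so out-of-scope IDs never surface."""
    scoped = [[d for d in ranking if d in allowed_ids] for ranking in (lexical, semantic)]
    return rrf(scoped, k)

lexical = ["a", "b", "c"]    # e.g. a BM25 ranking
semantic = ["b", "d", "a"]   # e.g. a vector-search ranking
fused = rrf([lexical, semantic])
scoped = scoped_hybrid(lexical, semantic, allowed_ids={"a", "b", "d"})
```

Neither ranking is treated as the more sophisticated one; the composition is a design choice driven by the workload.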
6.1 Five orthogonal layers
A workload profile becomes more rigorous when it distinguishes five independent layers, each of which can vary without forcing changes in the others:
- Source of truth — documents, relational database, knowledge graph, API, ticketing system, observability store. What the system reasons over.
- Access pattern — BM25, vector search, hybrid search, SQL, graph traversal, API call (the patterns enumerated in §6 above). How the system reaches the source.
- Knowledge operation — retrieve, scope, structure, synthesize, compute, traverse, orchestrate (the K0–K6 framework in §5). What the system does with the evidence it reaches.
- Control plane — permissions, provenance, evaluation, approval, audit (the governance scale in §9 and the evaluation discipline in §8). Under what constraints the system operates.
- Consumer — human, agent, downstream system (the AX framing in §3.4). Who the system serves and what contract that consumer expects.
The KAS framework in this paper lives at layer 3 (operations). §6 patterns sit at layer 2. §9 governance and §8 evaluation are the control plane at layer 4. The AX discussion is layer 5. The five-layer view is what allows the same system to serve different workloads. A single retrieval engine (layer 2) can support a K1-only human-facing search (layers 3 and 5) and a K6 agent-facing orchestration (layers 3 and 5) by recomposing layers 3 and 4 without touching layer 1 or 2.
7. Four Ways Systems Use Knowledge
A key distinction between less capable and more capable systems is whether retrieved information is treated as context, evidence, data, or state.
| Mode | Knowledge is treated as | Example |
|---|---|---|
| Context | Text that conditions the model | "Summarize this policy." |
| Evidence | Support for a claim | "Which clause justifies this answer?" |
| Data | Computable records | "Count overdue invoices by region." |
| State | Something that can be changed | "Update the ticket and notify the owner." |
This distinction helps separate simple RAG assistants, analytics assistants, decision-support systems, and agents.
State mutation requires action governance and usually implies K6 when the action is selected dynamically. A static pipeline that performs a fixed update may mutate state without being K6, but it should still inherit the relevant governance requirements.
8. Evaluation Criteria
A capability model is only useful if it can be tested. The following benchmark-style questions illustrate how to assess each capability.
Retrieval benchmarks like BEIR [22] and embedding benchmarks like MTEB [23] are useful for evaluating components, but they do not validate a full KAS. Full-system evaluation must additionally test answer faithfulness, citation correctness, permission enforcement, tool-call correctness, abstention behavior, and latency/cost distributions under realistic workloads. RAG-specific evaluation frameworks like RAGAS [24] address part of this gap (faithfulness, context relevance, answer relevance); BIRD [18] is the reference benchmark for K4 NL→SQL correctness; GAIA [25] is the closest analogue for K6 tool-use and multi-step reasoning. A complete evaluation harness for a KAS extends beyond retrieval and generation to cover the orchestration, governance, and computation dimensions.
A K-class score is meaningless without specifying which kind of correctness was tested. A vendor that scores high on retrieval-component metrics but has not been tested for governance correctness or abstention behavior is a retrieval engine that has not been governance-tested, not yet a K1 or K3 system in the prescriptive sense of §2.2.
| Capability | Test question | Expected behavior |
|---|---|---|
| Retrieval3 | "What does policy X say about refunds?" | Retrieve a relevant passage and summarize it accurately, over text and over images, tables, and document layouts when those modalities are indexed. |
| Filter | "Show refund rules for EU enterprise customers only." | Apply metadata and security filters correctly and indicate when nothing matches. |
| Interpretation | "Explain the obligations in section 5.2 of contract A." | Respect document structure and scope the answer to the specified section, including tabular and visual content where present. |
| Synthesis | "Compare the refund policy in contract A versus contract B." | Merge evidence from both documents, cite sources, expose contradictions, reconcile versions if either contract was amended, and produce a generalization only with the source coverage that supports it. |
| Computation | "How many open tickets violate the SLA by region?" | Produce an exact, verifiable answer against authoritative records (via SQL, API, or verified inference), surface the derivation, and reproduce the same answer when re-run with an `as_of` parameter against the historical record. |
| Relational reasoning | "Which services depend on this database?" | Traverse dependencies in the relational model (graph, schema, or knowledge base) and enumerate affected services with the path of inference. |
| Orchestration | "Investigate why churn risk increased and draft a plan." | Decompose the task, retrieve relevant evidence, compute metrics, validate results, and generate recommendations. |
| Governance | "Show only the policies this user is authorized to see, and produce the audit trail." | Enforce permissions before retrieval, log every accessed document, and produce a complete audit trail on demand. |
| Evaluation | "How confident are you in this answer? Should you abstain?" | Surface a calibrated uncertainty / evidence-sufficiency signal, ground individual factual claims against authoritative records when claims are factual, abstain when evidence is below threshold, and explain why. Raw model probabilities are not enough; uncertainty must be empirically calibrated against a task-specific evaluation set. |
8.1 Answerability and abstention
Higher capability does not mean more answers. It means better control over when to retrieve, compute, escalate, or abstain.
A capable system should know when evidence is insufficient. Here, "know" means producing an empirically calibrated signal, not a self-report of model confidence.
The table below is normative: it describes what a capable system at each class ought to do when evidence is insufficient. With the exception of the K0 row, these behaviors are not yet standard in deployed systems. Surfacing ambiguity, contradiction, incomplete paths, and conflict are active research and engineering frontiers, not commodity features. Self-reflective retrieval [26] and corrective retrieval [27] are early steps toward retrieval-control and critique at K1/K3; they do not yet constitute a solved production-grade abstention framework. Selective-prediction techniques are advancing toward K4 fail-closed behavior; agent benchmarks like GAIA [25] are surfacing the gap between K6 systems' demonstrated reasoning + tool-use performance and the level of reliability real-world deployments require.
| Class | Recommended failure behavior |
|---|---|
| K0 | May hallucinate if retrieval is weak (descriptive: this is the actual behavior, not the recommendation). |
| K1 | Should say "no matching source under the selected filters." |
| K2 | Should flag ambiguity within a document. |
| K3 | Should surface candidate contradictions between sources. |
| K4 | Should fail closed when computation cannot be executed or verified, or when the requested time window falls outside the indexed historical range. |
| K5 | Should expose incomplete or uncertain relational paths. |
| K6 | Should escalate when plans, tool outputs, or evidence conflict. |
| Any class with temporal modifier | Should expose validity windows, reject or warn on as-of queries outside indexed range with "not valid as of |
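The requirement that abstention rest on an empirically calibrated signal can be sketched with a simple selective-prediction threshold. This is an illustration under stated assumptions, not a full calibration method: `scores` stands for any hypothetical evidence-sufficiency score the system emits, and `correct` holds graded outcomes from a task-specific evaluation set.

```python
def pick_threshold(scores, correct, target_precision=0.95):
    """Lowest score threshold whose answered subset meets the target precision.

    `scores` and `correct` come from a labeled evaluation set. Returns None
    when no threshold qualifies, in which case the system should always abstain.
    """
    best = None
    for t in sorted(set(scores), reverse=True):
        answered = [c for s, c in zip(scores, correct) if s >= t]
        if sum(answered) / len(answered) >= target_precision:
            best = t
    return best

def answer_or_abstain(score, threshold):
    """Fail closed when the score is below the threshold or none was found."""
    if threshold is None or score < threshold:
        return "abstain"
    return "answer"

# Hypothetical evaluation set: score per question and whether the answer was right.
scores = [0.9, 0.8, 0.7, 0.6, 0.4]
correct = [1, 1, 1, 0, 0]
threshold = pick_threshold(scores, correct, target_precision=1.0)
```

The threshold comes from measured outcomes, not from the model's own report of confidence, and it has to be re-derived whenever the corpus or the workload shifts.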
8.2 Evaluation artifacts per capability class
The criteria in §8 describe what to test. Procurement and architecture teams need a more concrete answer: what artifacts must a vendor produce to demonstrate that a given K-class is exercised reliably on the buyer's data? The table below names the minimum artifact set per class. Each artifact is owned by the buyer, runnable on the buyer's corpus, and produced before the vendor passes acceptance. A vendor demo is not a substitute.
| Class | Minimum evaluation artifacts |
|---|---|
| K0 | Retrieval judgment set on the buyer's corpus; faithfulness evaluation set with ground-truth answers and grounding spans |
| K1 | ACL adversarial test set (queries that should return zero results when the requesting role lacks access); negative-filter coverage set (queries scoped to a tenant/region/time window with verified false positives held to zero) |
| K2 | Section-boundary test set (queries that ask "what does §X say about Y" with verified section-scoped answers); table and layout-element extraction tests |
| K3 | Contradiction-and-version test set (multi-document fixtures with known temporal, jurisdictional, and version conflicts); source-authority ranking tests with hand-graded outcomes |
| K4 | Executable-query gold set with answers reproducible across runs; schema-ambiguity test set (column-grounding edge cases); `as_of` reproducibility tests against historical snapshots |
| K5 | Path-completeness tests (known multi-hop paths that must surface); path-validity tests (paths that exist structurally but should not justify the inference) |
| K6 | Task-trace fixtures with expected tool-call sequences; failure-injection tests (tool unavailable, tool returns wrong type, tool returns slowly); per-task cost and latency caps; escalation-trigger tests |
| Cross-cutting (G) | Audit-replay tests (reconstruct the answer from the trace); permission-regression tests (changes to ACLs reflected at retrieval time); provenance-verification tests (every cited claim resolves to a source within scope) |
RAGAS [24], BIRD [18], and GAIA [25] provide reusable harnesses for retrieval/synthesis, K4 NL→SQL, and K6 tool-use respectively, but the test data must be drawn from the buyer's corpus to detect workload-specific failures the public benchmarks miss. A high score on a public benchmark is necessary, not sufficient.
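To make one of these artifacts concrete, the following minimal sketch shows what a K1 ACL adversarial test set might look like as a runnable harness. The `run_acl_suite` function, the case format, and the toy `DOCS` index are hypothetical; a real suite would call the vendor's retrieval endpoint against the buyer's corpus.

```python
def run_acl_suite(search, cases):
    """Run ACL adversarial cases against a search function.

    Each case is (role, query, forbidden_doc_ids). A case fails if any
    forbidden document is returned for that role.
    """
    failures = []
    for role, query, forbidden in cases:
        leaked = set(search(query, role)) & set(forbidden)
        if leaked:
            failures.append((role, query, sorted(leaked)))
    return failures

# Toy index standing in for the buyer's corpus.
DOCS = {
    "hr-001": {"text": "salary bands", "roles": {"hr"}},
    "kb-001": {"text": "refund policy", "roles": {"hr", "support"}},
}

def search(query, role):
    """Permission-aware retrieval: filter by role before matching."""
    return [d for d, m in DOCS.items() if role in m["roles"] and query in m["text"]]

def leaky_search(query, role):
    """Deliberately broken variant that ignores the role."""
    return [d for d, m in DOCS.items() if query in m["text"]]

cases = [
    ("support", "salary", ["hr-001"]),
    ("support", "refund", ["hr-001"]),
]
clean = run_acl_suite(search, cases)
leaks = run_acl_suite(leaky_search, cases)
```

Acceptance for this artifact is an empty failure list, and the same cases can be re-run as a permission-regression test after every ACL change.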
9. Trust and Governance
In regulated environments, correct answers are not enough. Systems must respect access controls, show provenance, detect stale sources, and provide audit trails. The categories below are aligned with the broader principles of the NIST AI Risk Management Framework [28] (Govern, Map, Measure, Manage), but are scoped to runtime properties of a knowledge-augmented system rather than the full risk-management lifecycle. The same scope maps onto the EU AI Act's risk-based obligations: high-risk AI systems require risk management, transparency, human oversight, and post-market monitoring, and a workload's KAS profile is one of the inputs an organization can use to demonstrate which of those obligations are technically satisfied.
We propose a parallel governance scale.
| Level | Governance capability |
|---|---|
| G0 | No citations, no access controls, no auditability. |
| G1 | Display basic citations and source IDs. |
| G2 | Permission-aware retrieval and context filtering. |
| G3 | Provenance tracking, freshness checks,4 and calibrated uncertainty / evidence-sufficiency signals. |
| G4 | Policy enforcement with redaction and audit logs. |
| G5 | Runtime monitoring and human-approval workflows for tool calls. |
Governance should not be added at the end. It must shape retrieval, context construction, tool access, and response generation from the start.
9.1 Governance prerequisites by capability class
Not every capability class requires full governance, but certain classes make specific governance capabilities non-negotiable. The table below shows the minimum recommended governance level for responsible deployment at each capability class.
A first-order rule overrides every row in this table: the moment private or enterprise data is indexed, G1 becomes the practical floor regardless of capability class (at minimum, basic citations and source IDs, so users can trace what was returned to them). G0 is acceptable only over public, non-sensitive data used for non-decision-support purposes (an FAQ over a public manual, a demo over open-domain content). The rows below describe minimum governance given the capability class; private-data sensitivity raises the floor independently.
| Capability class | Minimum G level | Rationale |
|---|---|---|
| K0 Basic Retrieval | G0 (public data) / G1–G2 (private data) | Acceptable for exploratory or non-sensitive use; the moment private enterprise data is indexed, G1 is the floor and G2 (permission-aware retrieval) the practical default. Internal-only systems are where data leakage typically happens first |
| K1 Scoped Retrieval | G2 | Permission-aware retrieval without enforced access controls defeats the purpose |
| K2 Contextual Understanding | G1 | Section-attributed citations are the floor; permission-awareness is orthogonal to section-aware retrieval and depends on data sensitivity, not on structural understanding |
| K3 Cross-Source Synthesis | G3 | Combining sources without freshness checks and calibrated uncertainty signals obscures contradictions |
| K4 Computable Knowledge | G3 (analytic) / G4 (business-critical) | Computed answers must carry provenance and audit trail; results must not be presented as exact when their derivation is unverified. Business-critical workloads (financial reporting, regulatory submissions) raise the floor to G4 |
| K5 Relational Reasoning | G3 | Multi-hop traversal without relationship provenance produces untraceable conclusions |
| K6 Orchestrated Workflows | G3 (bounded tools) / G5 (consequential decisions) | Bounded-tool orchestration over well-scoped knowledge can operate at G3/G4; orchestration that issues consequential, hard-to-reverse actions requires G5 human-approval gates |
These are floors, not ceilings. Regulated industries should treat G4 as the default baseline regardless of capability class. The K6 row is the most domain-sensitive: a code-completion agent and a payment-execution agent are both K6 systems, but they sit on opposite ends of the consequential-action spectrum and require very different governance.
This domain-sensitivity is general, not specific to K6. The governance floor for any workload is the maximum of five factors: data sensitivity (PII, PHI, financial, regulated content), action consequence (read-only versus reversible-write versus irreversible-or-financial), regulatory burden (HIPAA, SOX, GDPR, EU AI Act high-risk), reversibility (whether a wrong answer can be undone before it propagates), and automation level (human-in-the-loop, human-on-the-loop, fully autonomous). The K-class table sets the floor given the operation; this max-of-five formula sets the floor given the workload. Whichever is higher wins.
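The "whichever is higher wins" rule can be written down directly. In the sketch below, `K_CLASS_FLOOR` takes the lower bound of each row in the table above (for example G0 for K0 over public data, G3 for bounded-tool K6). The mapping of each workload factor to a G level from 0 to 5 is an assumption left to the assessor, since the paper does not define one.

```python
# Lower bound of each row in the K-class table, as an integer G level.
K_CLASS_FLOOR = {"K0": 0, "K1": 2, "K2": 1, "K3": 3, "K4": 3, "K5": 3, "K6": 3}

FACTORS = (
    "data_sensitivity",
    "action_consequence",
    "regulatory_burden",
    "reversibility",
    "automation_level",
)

def governance_floor(k_classes, factor_levels):
    """Return the governance floor as the higher of two floors.

    `factor_levels` maps each of the five factors to a required G level (0-5).
    Those levels are the assessor's judgment, not values defined by the paper.
    """
    missing = [name for name in FACTORS if name not in factor_levels]
    if missing:
        raise ValueError(f"missing factor levels: {missing}")
    workload_floor = max(factor_levels[name] for name in FACTORS)
    class_floor = max(K_CLASS_FLOOR[k] for k in k_classes)
    return f"G{max(workload_floor, class_floor)}"

# Hypothetical analytics workload over sensitive financial data.
analytics = governance_floor(
    ["K1", "K4"],
    {
        "data_sensitivity": 4,
        "action_consequence": 1,
        "regulatory_burden": 3,
        "reversibility": 1,
        "automation_level": 2,
    },
)
```

In this example the K-class floor is G3, but data sensitivity alone raises the workload floor to G4, so the result is G4.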
9.2 Threat model for KAS
The governance scale in §9 addresses correctness and accountability under good-faith inputs. It does not, by itself, address adversarial inputs. KAS systems also face a category of threats a buyer's RFP should explicitly cover. The OWASP Top 10 for LLM Applications [29] catalogues the application-layer risks; the table below maps the ones that affect KAS specifically to the capability layer where they land, with the mitigation pattern named in KAS terms.
| Threat | KAS layer affected | Mitigation pattern |
|---|---|---|
| Prompt injection in retrieved documents | K0–K3, K6 | Treat retrieved content as untrusted; isolate instruction-bearing tokens from agent prompts; use structured tool schemas at K6 instead of free-form delegation |
| Retrieval poisoning (malicious or polluted corpus content) | K0–K3 | Source-authority ranking (§5.4); ingestion-time content scanning; provenance trails that survive synthesis |
| Permission leakage through retrieval cache or embeddings | K1, G2–G4 | ACL-aware caching keyed by user/role; embedding-level scope filters; G4 redaction applied at retrieval time, not generation time |
| Tool misuse / excessive agency | K6, G5 | Bounded tool catalogues per agent role; approval gates on consequential actions; rate limits per tool per session |
| Insecure output handling into downstream systems | K4, K6 | Output validation against typed schemas; never pass raw model output as a downstream command without parser validation |
| Stale or poisoned source-of-truth connectors | K1–K6 | Connector health checks; index-freshness SLOs (§9 G3); source-authority precedence rules |
| Cross-tenant memory leakage | K6 (memory layer) | Tenant-scoped memory backends; cross-tenant retrieval blocked at the K1 layer; audit logs partitioned per tenant |
A G-level alone does not capture adversarial robustness. A G5 system with no prompt-injection defenses is governed but unsafe. Threat-model coverage should be tested on the same evaluation cadence as the K-class artifacts in §8.2, and the test data should include adversarial fixtures the buyer constructs (not only public red-team corpora).
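One mitigation from the table, validating model output against a typed schema before it reaches a downstream system, can be sketched with the standard library alone. The action catalogue, the field names, and the `parse_action` helper are hypothetical; the point is that raw model output is parsed and checked, and anything unexpected is rejected.

```python
import json

ALLOWED_ACTIONS = {"refund", "escalate"}  # hypothetical bounded tool catalogue

def parse_action(raw):
    """Validate raw model output before passing it downstream; reject otherwise."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        raise ValueError("output is not valid JSON") from None
    if not isinstance(data, dict) or set(data) != {"action", "ticket_id"}:
        raise ValueError("output must contain exactly 'action' and 'ticket_id'")
    if not isinstance(data["action"], str) or data["action"] not in ALLOWED_ACTIONS:
        raise ValueError("action is not in the approved catalogue")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(data["ticket_id"], int) or isinstance(data["ticket_id"], bool):
        raise ValueError("ticket_id must be an integer")
    return data

validated = parse_action('{"action": "refund", "ticket_id": 7}')
```

The same fixtures used to exercise this parser (malformed output, unknown actions, injected commands) are the kind of adversarial test data a buyer would construct for the evaluation cadence described above.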
10. Common Misclassifications and Anti-Patterns
The market is full of claims that don't hold up, and the cost is starting to show. Gartner has forecast that more than 40% of agentic AI projects will be canceled by the end of 2027 [30]. This paper interprets one recurring contributor as a capability mismatch: systems are sold as orchestrated, K6-capable agents when the underlying governance, recovery, and evaluation machinery is not present.
| Claim | Reality |
|---|---|
| "We have a vector database, therefore we do RAG." | Vector search is retrieval infrastructure. Without filters, structure, provenance, or answer validation, it remains basic retrieval. |
| "Using a knowledge graph automatically provides reasoning." | A graph is a storage medium. Reasoning requires traversal, grounding, validation, and explanation. |
| "Adding agents makes us K6." | Agentic orchestration without tool governance, recovery, and evaluation is unsafe automation. |
| "SQL access implies exact answers." | Query generation must be correct, schema-grounded, and permission-aware. |
| "More context means better answers." | More context can increase distraction, contradiction, leakage, and cost; the "lost in the middle" effect [31] is well-documented. |
| "The most advanced system is the one with the most tools." | The most capable system is the one whose tools are necessary, governed, and measurable. |
11. Illustrative Workload Profiles
The most important point about industry examples is that industry is the wrong unit of analysis. The right unit is the workload. Two workloads in the same industry, even within the same team, can demand radically different KAS profiles. The examples below illustrate this directly: each industry shows two contrasting workloads that produce different profiles.
The placements are illustrative, not empirical. They reflect typical workload demands as observed in practitioner discussion and analyst category descriptions; they are not published deployment data. Architects should refine each row against their own use cases.
11.1 Financial services — two workloads, two profiles
| Workload | Capability profile | Why |
|---|---|---|
| Portfolio analytics assistant ("What is our exposure to issuer X across funds?") | K1 + K4 dominant; G3 | Permission-scoped retrieval into structured data, then computation against authoritative records. Synthesis and orchestration are low-value here; the answer is a query, not a narrative. |
| AML alert investigation ("Investigate alert #12345 across communications, transactions, sanctions lists, news.") | K1 + K3 + K5 + K6; G4–G5 | Cross-source synthesis over comms and structured data, multi-hop entity traversal across counterparties, orchestrated case-file assembly. Outputs feed regulatory filings, so governance must be strict. |
Both workloads are "Financial Services" but require different vendor strengths, different governance baselines, and different architectural choices.
The single most expensive KAS-procurement error. Buying one platform to serve a portfolio that spans K1+K4 analytics and K1+K3+K5+K6 investigative work systematically over-pays for the analytics workload and under-serves the investigative one. Both pass surface-level vendor demos because the demo corpus exercises neither extreme. The fix is per-workload procurement, even when the workloads share an industry label. This is the most common failure mode named in §2.2 — disproportionate operation set in one direction, missing required operations in the other — and it is the one that motivates this paper's insistence on workload-level rather than industry-level evaluation.
11.2 Legal — two workloads, two profiles
| Workload | Capability profile | Why |
|---|---|---|
| Contract clause lookup ("What does §4.2 of contract A say about indemnification?") | K1 + K2 dominant | Section-aware retrieval and structure preservation are everything. Synthesis is only needed across a handful of similar contracts. |
| Litigation discovery synthesis (Identify and reconcile claims across thousands of communications and depositions.) | K1 + K2 + K3; G4 | Adds heavy cross-source synthesis with contradiction detection and audit-grade governance. Orchestration is optional; computation is irrelevant. |
11.3 Customer support — two workloads, two profiles
| Workload | Capability profile | Why |
|---|---|---|
| Help-article search | K0 + K1 dominant | Simple retrieval scoped by product and version. |
| Resolution agent (Resolve a refund request: retrieve relevant policy, check ticket history, compute refund eligibility, escalate or act.) | K1 + K3 + K4 + K6; G3 | Pulls structured and unstructured sources, computes against billing data, orchestrates tool calls and escalations. |
11.4 Software engineering — two workloads, two profiles
| Workload | Capability profile | Why |
|---|---|---|
| Documentation search | K0 + K1 dominant | Hybrid retrieval over docs and code; little else needed. |
| Repository agent (Diagnose a failing test, trace dependencies, propose a fix, open a PR.) | K1 + K3 + K5 + K6; G3 | Multi-source synthesis across docs/PRs/issues, dependency-graph traversal, orchestrated tool calls into the repository. |
11.5 Compliance — two workloads, two profiles
| Workload | Capability profile | Why |
|---|---|---|
| Policy lookup by jurisdiction | K1 + K2; G2 | Permission-aware, section-aware retrieval over policy documents. |
| Cross-jurisdictional compliance report (Compare obligations across jurisdictions, produce auditable evidence.) | K1 + K2 + K3; G4 | Adds cross-source synthesis and audit-grade governance, but explicitly not orchestration; the value is in the analysis, not in autonomous action. |
The lesson across all five industries is the same: profile follows workload, not industry. Procurement decisions, vendor selection, and architectural investment should be made per-workload (or per-workload-class), not per-industry.
A second lesson is implicit in the contrasts above. Operational constraints (§4.1) form a parallel signature alongside the K-class profile, and two workloads with similar K-class shapes can sit on opposite ends of the latency, cost, and reliability budgets. The customer-support Resolution agent in §11.3 (K1+K3+K4+K6; G3) operates under interactive-latency caps that disqualify architectures the Litigation discovery synthesis workload in §11.2 (K1+K2+K3; G4) can absorb without difficulty, even though the latter's K-class profile is narrower and superficially "simpler". The capability-profile template (§13) should be filled in alongside the operational-constraint budgets, not as a substitute for them: a profile is a description of what the system does; an operational signature is a description of how fast, how cheaply, and how reliably it must do it.
12. Recommendations for Buyers and Architects
When evaluating a Knowledge-Augmented System, ask the following questions:
- What kinds of questions must the system answer: lookup, synthesis, computation, traversal, or workflow?
- Which sources of truth are required: documents, structured data, graphs, APIs, or all of the above?
- Do answers need to be computed exactly, or is approximate summarization acceptable?
- Which permissions and governance policies apply to the underlying data and tools?
- How will correctness be measured?
- What happens when sources disagree?
- What should the system do when evidence is insufficient?
- Which actions require human approval?
- What latency, cost, and reliability constraints are acceptable?
- Is the architecture solving a real task requirement, or merely adopting a fashionable pattern?
12.1 Practical design guidance
- Start with task requirements, not architecture.
- Treat retrieval quality as a hard prerequisite.
- Add structure only where it improves correctness or controllability.
- Add computation only where exactness matters.
- Add graph traversal only where relationships are central to the task.
- Add orchestration only where dynamic multi-step workflows are required.
- Add autonomy slowly and with explicit constraints.
- Measure process quality, not just final-answer quality.
- Design for abstention and escalation.
12.2 Practical transition guidance
Organizations rarely upgrade capabilities in a clean sequence. Capability investment should be signal-driven, not stage-driven. Many systems will jump directly to higher-numbered operations without ever needing the intermediate ones.
- Retrieval quality (K0/K1). Invest when irrelevant or incorrect retrieval is the root cause of user-visible errors. This is a prerequisite only when retrieval is in scope at all; a NL→SQL analytics assistant may need almost no K0/K1 work.
- Structure preservation (K2). Invest when users consistently ask section- or clause-level questions and flat chunking is demonstrably failing them. A system can need K2 without ever needing K1 scoping (e.g., a public manual reader) and vice versa.
- Cross-source synthesis (K3). Invest when single-source answers are demonstrably incomplete and the cost of contradiction is non-trivial. Independent of whether structure preservation is in place.
- Computation (K4). Invest when exact, verifiable answers are required. A system can need K4 directly without any K0–K3 work; a NL→SQL analytics assistant over a clean schema is K4-dominant.
- Relational reasoning (K5). Invest when relationship traversal is the primary operation, not an incidental one. Independent of synthesis or computation; a software dependency analyzer is K5-dominant with little need for K3 or K4.
- Orchestration (K6). Invest when static pipelines cannot handle the dynamic, multi-step nature of the workload. Carries non-trivial latency, cost, and recovery complexity. K6 is justified by task dynamism, not by ambition.
- Governance. Advance in parallel with whichever capabilities are added; do not let it lag. A system that reaches K3 synthesis without G3 provenance is producing unchecked claims regardless of which other capabilities it has.
The pattern across these is that signals from the workload, not a desired position on a ladder, dictate which capability to invest in next. A system whose users never ask multi-hop relational questions has no business adding K5, no matter how sophisticated it sounds.
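The signal-driven rule in the list above can be sketched as a lookup from observed workload signals to the capability each one justifies. This is a minimal illustration and not part of the model: the signal names, `SIGNAL_TO_CAPABILITY`, and `next_investments` are hypothetical, and a real assessment would rest on measured evidence instead of boolean flags.

```python
# Hypothetical signal names paraphrasing the investment triggers listed above.
SIGNAL_TO_CAPABILITY = {
    "retrieval_errors_reach_users": "K0/K1 retrieval quality",
    "clause_level_questions_fail": "K2 structure preservation",
    "single_source_answers_incomplete": "K3 cross-source synthesis",
    "exact_verifiable_answers_required": "K4 computation",
    "relationship_traversal_is_primary": "K5 relational reasoning",
    "static_pipeline_cannot_cope": "K6 orchestration",
}


def next_investments(observed_signals):
    """Return the capabilities justified by the observed signals, in K order.

    Capabilities with no observed signal are skipped, so a workload can land
    on K4 or K5 without passing through the lower-numbered classes.
    """
    observed = set(observed_signals)
    unknown = observed - set(SIGNAL_TO_CAPABILITY)
    if unknown:
        raise ValueError(f"unrecognised signals: {sorted(unknown)}")
    return [
        capability
        for signal, capability in SIGNAL_TO_CAPABILITY.items()
        if signal in observed
    ]


# An NL-to-SQL analytics assistant over a clean schema: only exactness matters.
analytics_assistant = next_investments(["exact_verifiable_answers_required"])

# A software dependency analyzer: traversal is the primary operation.
dependency_analyzer = next_investments(["relationship_traversal_is_primary"])
```

Governance is left out of the lookup on purpose: the guidance above asks for it to advance in parallel with whichever capability is added, so it is not triggered by a single signal.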
12.3 Vendor evaluation checklist
When evaluating a system or vendor against this model, ask:
- Which capability dimensions does your system operate on? Can you demonstrate each with a live query against our documents (ingested under our metadata schema, with our access controls applied), not a vendor-prepared sample?
- What is retrieval recall on our specific corpus, measured at recall@10 or recall@20 against a judgment set we own, rather than on a benchmark corpus (BEIR, MTEB) or at recall@100 on a vendor demo set?
- How does the system behave when evidence is insufficient? What is the false-abstention rate: how often does the system refuse to answer questions for which we know an answer exists in the corpus? A system that abstains on 40% of answerable questions to avoid hallucination is conservative, not capable.
- Are access controls enforced at retrieval time or only at generation time? What happens if the retrieval layer returns unauthorized content? Show the code path.
- How is provenance tracked across synthesis? Specifically: is citation at the chunk level, the passage level with character offsets, or the claim level (each sentence individually sourced)? For high-risk audit-grade workloads, claim-level provenance is the target state; passage-level citation may be sufficient when paired with reviewer approval and tamper-evident audit logs, but chunk-level alone rarely is.
- What governance controls exist for tool calls and orchestration steps? Can individual tools be restricted by role or approval level? Are tool-call logs tamper-evident? Can the agent bypass approval gates under failure conditions?
- What are per-query latency and cost at our expected volume and document scale? Provide P50, P95, and P99 latency, plus behavior under adversarial query patterns (very long documents, very broad queries), not just median against a vendor-curated workload.
- How is correctness measured? Can we run our own evaluation set (with evaluation-set rights in the contract, not just in the conversation), and on a timeline compatible with our procurement window?
- How does the system handle temporally-sensitive data? Specifically: does it detect stale sources at retrieval time or only at generation time? What is the measured lag between source update and index update? How does it answer as-of queries that fall outside the indexed historical range; does it surface "not valid as of " or silently return a current-as-of-now answer?
13. Capability Profile Template
This template produces a capability profile for a single workload, not a vendor score. It is a thinking tool that anchors a conversation with engineering, security, and procurement; it does not replace empirical evaluation. Two practitioners filling it in for the same workload will arrive at similar (not identical) profiles; the value is in the structured comparison, not in inter-rater precision.
How to apply. Run the template once per workload (not once per system or once per industry). For organizations evaluating multiple workloads, a portfolio is a set of profiles, and different workloads will legitimately produce different profiles, requiring different architectural choices.
| Dimension | Anchor questions to characterize current and target capability | Current | Target | Gap |
|---|---|---|---|---|
| Retrieval control | What retrieval recall is required at what cutoff? What filter granularity? Permission-aware? | | | |
| Context modeling | Does the workload demand section-level scoping? Hierarchical structure? Cross-section coherence? | | | |
| Synthesis | Must the system reconcile multiple sources? Surface contradictions? Generate comparative answers? | | | |
| Computation | Must answers be exact and verifiable against authoritative records? At what tolerance for error? | | | |
| Relational reasoning | Are multi-hop relationships central to the task? At what traversal depth? With what path explanation? | | | |
| Orchestration | Is the workload genuinely multi-step and dynamic, or does a static pipeline suffice? Failure recovery requirements? Does the workload require stateful memory, and at what scope (within-session, cross-session, cross-user)? | | | |
| Governance | Permission-aware retrieval? Source-level provenance? Audit trail? Approval gates? Risk class of the workload? | | | |
| Evaluation | Calibrated uncertainty / evidence-sufficiency signal? Abstention threshold? False-abstention tolerance? Reasoning-trace exposure? | | | |
This produces a profile rather than a single label. A workload might be:
K1 scoping + K2 context + K4 computation + G4 governance + low K6 orchestration.
That profile is more useful than calling the system "K4-class", and it makes vendor and architectural choices traceable to specific workload demands instead of to ladder position.
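A filled-in template can also be held as data, so that gaps are computed instead of eyeballed. The sketch below assumes a hypothetical 0 to 3 ordinal rating applied uniformly to every row for brevity; the paper does not define a numeric scale for the Current and Target columns, and Governance in particular is rated on G0–G5 elsewhere. `DimensionRating` and `open_gaps` are illustrative names.

```python
from dataclasses import dataclass

# The eight template dimensions, in table order.
DIMENSIONS = (
    "Retrieval control", "Context modeling", "Synthesis", "Computation",
    "Relational reasoning", "Orchestration", "Governance", "Evaluation",
)


@dataclass(frozen=True)
class DimensionRating:
    current: int  # hypothetical 0-3 ordinal rating of what exists today
    target: int   # hypothetical 0-3 ordinal rating the workload demands

    @property
    def gap(self) -> int:
        # Exceeding the target is treated as no gap, not as a negative gap.
        return max(self.target - self.current, 0)


def open_gaps(profile):
    """Return {dimension: gap} for every dimension that is below target."""
    missing = [d for d in DIMENSIONS if d not in profile]
    if missing:
        raise ValueError(f"profile is missing dimensions: {missing}")
    return {d: profile[d].gap for d in DIMENSIONS if profile[d].gap > 0}


# Illustrative ratings for one workload, not a measurement of any deployment.
profile = {
    "Retrieval control": DimensionRating(current=2, target=3),
    "Context modeling": DimensionRating(current=1, target=2),
    "Synthesis": DimensionRating(current=0, target=0),
    "Computation": DimensionRating(current=1, target=3),
    "Relational reasoning": DimensionRating(current=0, target=0),
    "Orchestration": DimensionRating(current=2, target=1),
    "Governance": DimensionRating(current=2, target=3),
    "Evaluation": DimensionRating(current=1, target=2),
}
gaps = open_gaps(profile)
```

Dimensions with a target of zero drop out of the result, which matches the point that an unneeded capability is a design choice and not a deficit.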
14. Limitations and Future Work
This paper offers a thinking framework for system design, evaluation, and procurement. Three scope bounds and one sensitivity to underlying assumptions are worth naming explicitly so that the framework is read for what it is.
- Scope of claim. The K0–K6 archetypes, the G0–G5 governance scale, and the per-class evaluation artifacts of §8.2 are a synthesis of academic literature, analyst-category structure, and practitioner observation (§1). They are offered as a profiling vocabulary and a structured comparison tool. Two practitioners filling in the §13 template for the same workload will arrive at similar but not identical profiles; the value lies in the structured comparison, not in inter-rater precision.
- Illustrative, not measured, examples. The workload profiles in §11 are illustrative shapes consistent with practitioner discussion and analyst category descriptions. They are intended as starting points for the §13 template, not as measurements of named deployments. Architects should refine each row against their own use cases.
- Coverage is bounded. The K-class set deliberately stops at K6 (the membership argument is in Appendix A). Causal/counterfactual reasoning, runtime online learning, and cross-vendor agent coordination beyond OASF are outside the framework's scope and may demand their own treatments. The paper's silence on those is intentional, not an omission.
- Sensitivity to LLM evolution. The dimension count assumes the limitations of currently deployed LLMs. As models get better at multi-hop reasoning over long contexts and at LLM-internal computation with self-verification, the K-class requirement does not necessarily disappear; it relocates. A workload that today demands K5 external infrastructure (a graph database, recursive SQL) may, with sufficiently capable models, demand K5-grade input preparation instead: ensuring the model receives the right structured context to perform multi-hop reasoning over flat text. §4.2 already names knowledge preparation as a layer-1 prerequisite that silently determines whether any K-class above can perform; what shifts as model capability advances is the boundary between layer 1 (preparation) and layer 3 (operation), not the existence of the requirement itself. The framework's update cadence should track that boundary; the kill-criteria appendix is the trigger.
Future work that would most strengthen the framework: case studies applying the §13 template across heterogeneous workloads; an inter-rater reliability study among architects from different organizations; a public benchmark that operationalizes the per-class evaluation artifacts of §8.2; a controlled-vocabulary mapping from K-classes to the OASF skill taxonomy proposed back to the OASF maintainers (Appendix C); and longitudinal tracking of how production KAS deployments fail and what the failures retroactively predict about the responsible K-class plus G-level combination.
15. Conclusion
Knowledge-Augmented Systems represent the evolution from simple retrieval to compositional knowledge work over external sources. Retrieval helps systems find information; structure helps organize it; synthesis helps combine it; computation helps prove it; relational reasoning helps traverse it; orchestration helps solve work across tools. But higher complexity is not inherently better: the right architecture depends on the task, the source of truth, the required precision, the governance environment, and the acceptable level of autonomy.
The market needs a classification model that resists hype. This whitepaper proposes such a model: not a ladder to climb blindly, but a capability stack for designing, evaluating, and buying knowledge-augmented systems.
Appendix A. K-class membership and what's deliberately excluded
This appendix answers the recurring question: why does the K-class numbering stop at K6, and where do memory, multimodality, temporal reasoning, reflection, multi-agent coordination, generation, and translation belong if not as new K-classes?
A.1 The membership test
A knowledge operation belongs at the K-class level when all of the following hold:
- Verb at the system boundary. The system performs it; an agent or human can request it and observe a typed result. Not a property of inputs, not a topology of components.
- Not constitutively tied to a single technology product or storage format. The same operation is implementable over vectors, graphs, SQL, APIs, or text, even if a given deployment chooses one.
- Layer 3, not layer 1 or 2. What the system does, not what it stores or how it reaches the source (see §6.1).
- Falsifiable at the boundary. A §8-style benchmark question and a §8.1-style recommended-failure-behavior row can be written for it.
- Conceptually non-reducible to a single existing K-class without information loss. It is not a composition or recursion of existing K-class operations, except for K6, which is meta-irreducible (see A.2).
A.2 K6 meta-irreducibility
K6 is the only K-class whose substance is composition: the operation is the planning over K0–K5 calls, not the calls the planner dispatches. A system that hard-codes a K0→K3→K4 sequence without a planner is a static pipeline that happens to use multiple operations, not K6. The carve-out is intentional: the fifth condition of the membership test (non-reducibility) admits K6 because the planning policy itself is what is being tested, not the operations the planner dispatches.
A.3 Govern and Evaluate as layer-4 cross-cutters
Governance and Evaluation pass all five conditions of the membership test but are not assigned K-numbers. They operate as layer-4 cross-cutters: they wrap every K-operation instead of being one. They have their own scales (G0–G5 in §9 and the §8.1 normative table), and the §4 dimensions table lists them deliberately as cross-cutting instead of as K7 / K8. The K-class numbering accordingly stops at K6.
A.4 Excluded operation candidates
The three candidates readers most often nominate as missing K-classes, and where they actually live:
| Excluded candidate | Why not K-class | Where it lives |
|---|---|---|
| Memory / state recall | Substrate for K6, not a separate operation | K6 sub-capability (§5.7); §13 template Orchestration row; Appendix B.1 |
| Reflection / self-critique | Composition of Evaluate over a prior K-op | K6 evaluator-optimizer loop; §8.1 Self-RAG/CRAG; Appendix B.4 |
| Multi-agent coordination | Topology of K6, not a new operation | Appendix B.2 |
Other candidates surface less frequently and are dispatched in a sentence each. Planning is K6's primary verb (not a separate class). Multimodality and temporal reasoning are cross-cutting modifiers on every K-class (see the §8 footnote and Appendix B.3). Translation / format conversion is a layer-1/2 representation problem (§6.1). Verification / fact-checking is a sub-operation of Evaluate plus K4 result-verification (§8 and Appendix B.5). Generation is the LLM substrate every K-class assumes. Disambiguation / entity linking is the upstream NLP problem K3 explicitly defers to (§5.4). World-model construction and runtime online learning are research frontiers, not production-KAS verbs, and are out of scope.
A system that retrieves, synthesizes, computes, traverses, and orchestrates across these handled-elsewhere concerns is K0–K6; there is no missing K-class hiding in any of them.
Appendix B. Boundary cases
These are the five concerns most often raised as "where's X in your model?" Each is treated below as a sub-operation, modifier, topology, or substrate inside the existing model, not as a new K-class.
B.1 Memory in K6
K6 systems that operate across turns, sessions, or users require a memory layer. Memory is a K6 sub-concern, not a distinct knowledge operation: a stateful K6 system requires the same planning, tool governance, recovery, and evaluation as a stateless one, plus session-state management with its own staleness, permission, and audit implications. Three rough scopes show up in practice: within-session (working memory across the steps of a single workflow); cross-session (episodic memory tied to a user or case); and cross-user (long-term semantic memory shared across the deployment). Each scope inherits the K-class's governance prerequisites; cross-user memory at K6 + G3 needs the same provenance and freshness checks as the underlying retrievals. The §13 template's Orchestration row should be answered together with "does this workload require stateful memory, and at what scope?"
B.2 Multi-agent topologies
K6 orchestration can be implemented by a single planner routing to tools, or by a network of specialized agents coordinating via message passing. The capability requirements (task decomposition, tool governance, failure recovery, human approval gates) are unchanged regardless of topology. Multi-agent architectures add coordination overhead and introduce agent-to-agent trust as a new governance concern (which agent can invoke which, with what approval level); these map to the existing K6 + G4/G5 cells in the §9.1 prerequisites table. Multi-agent is therefore a topology choice within K6, not a new K-class.
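The agent-to-agent trust question ("which agent can invoke which, with what approval level") can be sketched as a deny-by-default lookup. The agent names, the integer approval levels, `INVOCATION_POLICY`, and `may_invoke` are all hypothetical; the paper does not define a policy format, and a real deployment would tie this to its G4/G5 approval gates and audit log.

```python
# Hypothetical policy: (caller, callee) -> minimum approval level required.
# Pairs that are absent are denied, so a newly added agent gets no implicit trust.
INVOCATION_POLICY = {
    ("planner", "retriever"): 0,       # no approval needed
    ("planner", "sql_runner"): 1,      # assumed: automated policy check passed
    ("planner", "payments_agent"): 2,  # assumed: explicit human approval recorded
}


def may_invoke(caller, callee, approval_level):
    """Allow the call only if the pair is listed and the approval level suffices."""
    required = INVOCATION_POLICY.get((caller, callee))
    return required is not None and approval_level >= required
```

Keeping the rule keyed on the caller and callee pair, instead of on the callee alone, is what makes the topology visible to governance: the same tool can be reachable from one agent and closed to another.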
B.3 Temporal reasoning
K3 contradictions are frequently temporal: clauses that were superseded, policies that changed effective date, contracts amended after signing, regulatory rules with different in-force windows. Temporal reasoning (point-in-time validity, version reconciliation, "what was true on date X") is not a separate K-class but a cross-cutting modifier. It shows up most often inside K3 synthesis and §8.1 abstention behavior, and inside §9 G3 freshness checks. A K3 system that flattens "policy A says X" and "policy A v2 says ¬X" into a single contradiction without surfacing which version is currently in force has done a partial job. The §8.1 normative table includes a "Any class with temporal modifier" row precisely so that temporal failure modes are evaluable without inventing a K-class for them.
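The "what was true on date X" question can be made concrete with a small as-of resolver. This is a sketch under stated assumptions: each version stays in force until the next one takes effect, the earliest indexed version marks the start of the indexed range, and `PolicyVersion` and `in_force_as_of` are hypothetical names not taken from the paper.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Sequence


@dataclass(frozen=True)
class PolicyVersion:
    label: str
    effective_from: date
    statement: str


def in_force_as_of(versions: Sequence[PolicyVersion], as_of: date) -> Optional[PolicyVersion]:
    """Return the version in force on `as_of`.

    Returns None when `as_of` precedes every indexed version, so the caller
    can abstain instead of silently answering with the current version.
    """
    candidates = [v for v in versions if v.effective_from <= as_of]
    if not candidates:
        return None
    return max(candidates, key=lambda v: v.effective_from)


# Illustrative data mirroring the "policy A" / "policy A v2" example above.
versions = [
    PolicyVersion("policy A", date(2023, 1, 1), "X"),
    PolicyVersion("policy A v2", date(2024, 6, 1), "not X"),
]
before_amendment = in_force_as_of(versions, date(2024, 1, 15))
outside_range = in_force_as_of(versions, date(2022, 12, 31))
```

With the versions resolved this way, the two statements stop looking like a flat contradiction: each is attached to the window in which it applied, and a query outside the indexed range yields an explicit "no version known" result.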
B.4 Reflection / self-correction
Reflection (the pattern where a system inspects its own intermediate output and decides whether to re-retrieve, re-plan, or revise) is the K6-internal application of Evaluate (§8) over the output of K0–K5. Self-RAG [26] and Corrective RAG [27] (cited in §8.1) are early productized forms; the operation is K6, the discipline is §8. Treating reflection as its own K-class would double-count the underlying Evaluate that already governs it.
B.5 Verification vs evaluation
Claim-level verification (checking a specific factual claim against an authoritative source) is a sub-operation of Evaluate (§8) and Govern (§9 G3 provenance), not a separate K-class. The §8 Evaluation row covers the system-level form (calibrated uncertainty / evidence-sufficiency); K4 result-verification covers the computational form. A system that conflates "the LLM thinks this is right" with "this claim has been grounded against a record" has lost the distinction; the §8 row's "ground individual factual claims against authoritative records when claims are factual" clause is the relevant test.
Appendix C. AGNTCY / OASF reference and skill-ID mapping
KAS and AGNTCY (Linux Foundation, 2025 [15][16]) address different problems. AGNTCY is an agent-to-agent integration framework: protocols, identity, discovery, messaging, and a schema (OASF) for inter-agent interoperability across vendors and frameworks. KAS is a capability-evaluation framework: a per-workload, per-system profile of which knowledge operations a single system performs reliably and to what governance and evaluation discipline. Neither framework subsumes the other. A system can publish a KAS profile without ever participating in an AGNTCY network, and an AGNTCY-compliant agent can have no knowledge operations at all (a pure tool-execution agent, for example). The two frameworks co-exist; they do not stack.
That said, AGNTCY's OASF needs some vocabulary to describe what each agent does so that other agents can discover it. That vocabulary, OASF's 15-category skill taxonomy [17], independently names several operations the KAS dimensions also name. The overlap is incidental, not architectural: OASF positions skills as something an agent can perform for other agents to invoke; KAS positions operations as something a system performs with knowledge, evaluable on its own outputs. Where the vocabulary surface coincides, KAS should reference OASF skill IDs instead of redefining them, and an agent that is both AGNTCY-discoverable and KAS-profiled can optionally expose its KAS profile through OASF as a distribution choice, not a dependency.
The mapping below is descriptive: a KAS dimension at a given K-class implies that the named OASF skills are performed to a normative standard described in §8.1 / §12.3, not merely that the agent self-attests them.
| KAS dimension (K-class) | OASF skill IDs | Notes |
|---|---|---|
| Retrieval control (K0/K1) | 6 (RAG family); 601 Retrieval of Information; 60101–60103 Indexing / Search / Document Retrieval; 103 Information Retrieval and Synthesis | OASF treats RAG as one skill family; KAS splits K0 vs K1 by governance presence (§5.1, §5.2). |
| Context modeling (K2) | No direct OASF skill; closest is 103 Information Retrieval and Synthesis | Section-aware / hierarchical retrieval is not yet a named OASF skill, and is a candidate for a controlled-vocabulary extension. |
| Synthesis (K3) | 103 Information Retrieval and Synthesis (leaf 10303 Knowledge Synthesis); partial 1504 Hypothesis Generation | OASF does not name contradiction detection; KAS §5.4 makes it explicit. |
| Computation (K4) | 14 Tool Interaction: 1401 API Schema Understanding, 1402 Workflow Automation, 1403 Tool Use Planning | OASF treats computation as tool-use; KAS treats it as a knowledge operation regardless of whether tools mediate it. |
| Relational reasoning (K5) | 15 Advanced Reasoning & Planning: 1502 Long-Horizon Reasoning | Multi-hop traversal is not a named OASF skill at the leaf level. |
| Orchestration (K6) | 10 Agent Orchestration: 1001 Task Decomposition, 1003 Multi-Agent Planning | Direct overlap; K6 maps cleanly. |
| Governance (G0–G5, cross-cutting) | 13 Governance & Compliance: 1301 Policy Mapping, 1302 Compliance Assessment, 1303 Audit Trail Summarization, 1304 Risk Classification | OASF positions governance as skills an agent performs for others; KAS positions it as cross-cutting properties an agent must enforce over its own outputs. Same vocabulary surface, different stance. |
| Evaluation (cross-cutting) | 11 Evaluation & Monitoring: 1101 Benchmark Execution, 1103 Quality Evaluation, 1104 Anomaly Detection, 1105 Performance Monitoring | OASF's evaluation is what an agent does for others; KAS §8.1 evaluation is the discipline applied to the agent (calibrated abstention, false-abstention rate, claim-level provenance). Floor-vs-content split. |
A separate companion artifact specifies how a KAS capability profile (eight-axis vector + G-level + evaluation discipline) maps to the OASF profile object as a controlled vocabulary, suitable for proposal back to the OASF maintainers.
References
[1] Gartner. (2025, August 26). Gartner Predicts 40% of Enterprise Apps Will Feature Task-Specific AI Agents by 2026, Up From Less Than 5% in 2025 [Press release]. https://www.gartner.com/en/newsroom/press-releases/2025-08-26-gartner-predicts-40-percent-of-enterprise-apps-will-feature-task-specific-ai-agents-by-2026-up-from-less-than-5-percent-in-2025
[2] Singh, A., Ehtesham, A., Kumar, S., Talaei Khoei, T., & Vasilakos, A. (2025). Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG. arXiv. https://arxiv.org/abs/2501.09136
[3] Schluntz, E., & Zhang, B. (2024). Building Effective Agents. Anthropic. https://www.anthropic.com/research/building-effective-agents
[4] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W., Rocktäschel, T., Riedel, S., & Kiela, D. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Advances in Neural Information Processing Systems 33 (NeurIPS 2020). arXiv:2005.11401.
[5] Karpukhin, V., Oğuz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D., & Yih, W. (2020). Dense Passage Retrieval for Open-Domain Question Answering. Proceedings of EMNLP 2020. https://aclanthology.org/2020.emnlp-main.550/
[6] Guu, K., Lee, K., Tung, Z., Pasupat, P., & Chang, M.-W. (2020). REALM: Retrieval-Augmented Language Model Pre-Training. Proceedings of ICML 2020. https://proceedings.mlr.press/v119/guu20a.html
[7] Borgeaud, S., Mensch, A., Hoffmann, J., Cai, T., Rutherford, E., Millican, K., van den Driessche, G., Lespiau, J.-B., Damoc, B., Clark, A., de Las Casas, D., Guy, A., Menick, J., Ring, R., Hennigan, T., Huang, S., Maggiore, L., Jones, C., Cassirer, A., Brock, A., Paganini, M., Irving, G., Vinyals, O., Osindero, S., Simonyan, K., Rae, J. W., Elsen, E., & Sifre, L. (2022). Improving Language Models by Retrieving from Trillions of Tokens (RETRO). Proceedings of ICML 2022. https://proceedings.mlr.press/v162/borgeaud22a.html
[8] Zaharia, M., Khattab, O., Chen, L., Davis, J. Q., Miller, H., Potts, C., Zou, J., Carbin, M., Frankle, J., Rao, N., & Ghodsi, A. (2024, February 18). The Shift from Models to Compound AI Systems. Berkeley Artificial Intelligence Research (BAIR) Blog. https://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/ (position piece, not peer-reviewed)
[9] Mialon, G., Dessì, R., Lomeli, M., Nalmpantis, C., Pasunuru, R., Raileanu, R., Rozière, B., Schick, T., Dwivedi-Yu, J., Celikyilmaz, A., Grave, E., LeCun, Y., & Scialom, T. (2023). Augmented Language Models: a Survey. Transactions on Machine Learning Research (TMLR), 2023. https://openreview.net/forum?id=jh7wH2AzKK
[10] Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., Cancedda, N., & Scialom, T. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools. Advances in Neural Information Processing Systems 36 (NeurIPS 2023). arXiv:2302.04761.
[11] Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., Wang, M., & Wang, H. (2024). Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv. https://arxiv.org/abs/2312.10997 (original submission 2023-12-18; current major revision 2024).
[12] Anthropic. (2024, November 25). Introducing the Model Context Protocol. https://www.anthropic.com/news/model-context-protocol
[13] Google Developers Blog. (2025, April 9). Announcing the Agent2Agent Protocol (A2A). https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/
[14] Linux Foundation. (2025, June). Linux Foundation Launches the Agent2Agent Protocol Project to Enable Secure, Intelligent Communication Between AI Agents [Press release]. https://www.linuxfoundation.org/press/linux-foundation-launches-the-agent2agent-protocol-project-to-enable-secure-intelligent-communication-between-ai-agents
[15] AGNTCY Collective. (2025). AGNTCY — Internet of Agents (project home). https://agntcy.org
[16] Linux Foundation. (2025, July 29). Linux Foundation Welcomes the AGNTCY Project to Build the Internet of Agents [Press release]. https://www.linuxfoundation.org/press/linux-foundation-welcomes-agntcy-project
[17] AGNTCY. (2025). Open Agentic Schema Framework (OASF) — schema browser. https://schema.oasf.outshift.com (top-level skill categories 1–15; leaf skill IDs verified 2026-05-04, including 103, 601 / 60101–60103, 1001, 1003, 1101, 1103, 1301, 1302, 1502).
[18] Li, J., Hui, B., Qu, G., Yang, J., Li, B., Li, B., Wang, B., Qin, B., Cao, R., Geng, R., Huo, N., Zhou, X., Ma, C., Li, G., Chang, K. C.-C., Huang, F., Cheng, R., & Li, Y. (2023). Can LLM Already Serve as a Database Interface? A Big Bench for Large-Scale Database Grounded Text-to-SQLs (BIRD). Proceedings of NeurIPS 2023 Datasets and Benchmarks Track. arXiv:2305.03111.
[19] Edge, D., Trinh, H., Cheng, N., Bradley, J., Chao, A., Mody, A., Truitt, S., Metropolitansky, D., Ness, R. O., & Larson, J. (2024). From Local to Global: A Graph RAG Approach to Query-Focused Summarization. arXiv. https://arxiv.org/abs/2404.16130
[20] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. Proceedings of ICLR 2023. https://openreview.net/forum?id=WE_vluYUL-X
[21] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao, Y., & Narasimhan, K. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. Advances in Neural Information Processing Systems 36 (NeurIPS 2023). arXiv:2305.10601.
[22] Thakur, N., Reimers, N., Rücklé, A., Srivastava, A., & Gurevych, I. (2021). BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models. Proceedings of NeurIPS 2021 Datasets and Benchmarks Track. arXiv:2104.08663.
[23] Muennighoff, N., Tazi, N., Magne, L., & Reimers, N. (2023). MTEB: Massive Text Embedding Benchmark. Proceedings of EACL 2023. https://aclanthology.org/2023.eacl-main.148/
[24] Es, S., James, J., Espinosa-Anke, L., & Schockaert, S. (2024). RAGAs: Automated Evaluation of Retrieval Augmented Generation. Proceedings of EACL 2024 System Demonstrations. https://aclanthology.org/2024.eacl-demo.16/
[25] Mialon, G., Fourrier, C., Swift, C., Wolf, T., LeCun, Y., & Scialom, T. (2024). GAIA: A Benchmark for General AI Assistants. Proceedings of ICLR 2024. https://openreview.net/forum?id=fibxvahvs3
[26] Asai, A., Wu, Z., Wang, Y., Sil, A., & Hajishirzi, H. (2024). Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. Proceedings of ICLR 2024. https://openreview.net/forum?id=hSyW5go0v8
[27] Yan, S.-Q., Gu, J.-C., Zhu, Y., & Ling, Z.-H. (2024). Corrective Retrieval Augmented Generation. arXiv. https://arxiv.org/abs/2401.15884
[28] National Institute of Standards and Technology. (2023). AI Risk Management Framework (AI RMF 1.0). https://www.nist.gov/itl/ai-risk-management-framework
[29] OWASP Foundation. (2025). OWASP Top 10 for Large Language Model Applications (Version 2025). https://owasp.org/www-project-top-10-for-large-language-model-applications/
[30] Gartner. (2025, June 25). Gartner Predicts Over 40% of Agentic AI Projects Will Be Canceled by End of 2027 [Press release]. https://www.gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027
[31] Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2024). Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics, Vol. 12 (2024), 157–173. https://aclanthology.org/2024.tacl-1.9/
PhD, National and Kapodistrian University of Athens. ORCID: 0009-0003-7817-5791. LinkedIn.
We use hallucination throughout this paper because it is the field-standard term. Confabulation is arguably more technically precise (fabricated content the system presents as true, with no underlying perceptual error), and several recent works prefer it on those grounds. We treat the two as interchangeable. Thanks to Harry Mamangakis, who pointed this out.
Multimodality is a cross-cut, not a separate dimension (see §4.2). Every row above applies equally when the input or evidence is non-text: images, tables, scanned PDFs, video, audio, layout. Failure to process a modality should fail closed at the corresponding K-class; for example, a K0 system unable to read an image should report that, not silently fall back to text-only retrieval. Recommended evaluation practice is to re-run each row above against a multimodal corpus fixture instead of adding a separate "Multimodal" row.
"Freshness checks" at G3 means more than cache invalidation. For temporally-sensitive workloads (financial reporting, GDPR right-to-be-forgotten, contract effective dates), G3 freshness checks must include detection of stale sources at retrieval time (not only at generation time), measured lag between source update and index update, and explicit handling of as-of queries that fall outside the indexed historical range. See §4.2 on temporal reasoning as a cross-cutting modifier.