Measuring What Matters: The AI Capability Maturity Model

The Numbers That Should Alarm You

McKinsey’s 2025 State of AI survey found that 78% of organisations now use AI in at least one business function. Only 6% qualify as “AI high performers”, meaning organisations where AI contributes more than 5% of EBIT. BCG independently found that 74% of companies see no real value from their AI investments.

Read those numbers together. Nearly four out of five organisations have adopted AI. Fewer than one in fifteen have made it work.

The obvious question is: why? The less obvious but more useful question is: how would you tell the difference between a team that is extracting value and one that is accumulating waste? What would you measure? What would you look for?

In our earlier work, we named the failure mode — LOSS, Lines Of Stochastic Source — and we described the discipline that separates productive AI use from entropy generation — context engineering, not prompt engineering. This article completes the sequence. It describes the measurement.

The Pre-Zachman Moment

In 1987, John Zachman published “A Framework for Information Systems Architecture” in the IBM Systems Journal. In the form it later matured into, the framework is a two-dimensional grid: six perspectives crossed with six interrogatives. It gave enterprise architecture something it had never had: a systematic way to assess what had previously been judged by intuition and experience.

Before Zachman, organisations knew their IT architecture was either “good” or “a mess.” They could not articulate why, could not identify which specific dimension was mature and which was failing, and could not construct a targeted improvement plan. The framework did not build architectures. It made them assessable.

AI-first software development is in that pre-Zachman moment right now. Every engineering leader has an opinion about whether their AI tools are helping. Most of those opinions are based on perceived velocity — “the team feels faster.” None of them are grounded in a structured assessment of what maturity actually looks like across the dimensions that matter. The AI Capability Maturity Model is our contribution to closing that gap. Like Zachman, it crosses dimensions with levels. Unlike Zachman, it is specialised for AI-first development, and it assesses both human and AI practitioners.

Three Dimensions

The ACMM assesses across three dimensions at every maturity level. Each dimension answers a different question. All three must advance together — a team with Level 4 technology and Level 1 process is not a Level 4 team.

People

Can they do the work? Both human operators and AI agents are assessed — because in an AI-first team, both are practitioners. People maturity is about observable skill, not available tooling.

Process

Is the work structured? The difference between “we use Claude” and “we execute a directed development lifecycle with plan, groom, verify, do, verify, done, and retro phases” is the difference between Level 1 and Level 2.

Technology

Do the tools support the discipline? Technology maturity is necessary but never sufficient — it enables process and people maturity but does not replace them.

The Zachman echo is deliberate: Who (People), How (Process), What (Technology), crossed with maturity levels instead of architectural perspectives.
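One way to read the rule that all three dimensions must advance together is that a team's overall level is capped by its weakest dimension. The sketch below assumes that reading. The names `DimensionLevels` and `overall_level` are hypothetical and chosen for illustration; they are not part of the ACMM itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DimensionLevels:
    """Assessed maturity level (1 to 5) for each ACMM dimension."""
    people: int
    process: int
    technology: int


def overall_level(levels: DimensionLevels) -> int:
    """Assumption: the weakest dimension caps the team's overall level."""
    return min(levels.people, levels.process, levels.technology)


# Level 4 technology with Level 1 process is not a Level 4 team.
team = DimensionLevels(people=2, process=1, technology=4)
print(overall_level(team))  # 1
```

Taking the minimum rather than an average is deliberate in this sketch: an average would let strong tooling mask a missing process, which is exactly the case the rule excludes.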

Five Levels, Five Exit Gates

Each level represents a qualitative shift — a phase transition, not a percentage improvement. What distinguishes the ACMM from aspirational frameworks is that every level has an exit gate: a concrete, observable criterion that must be met before a team can claim it. No self-assessment. No “we’re probably a Level 3.” Evidence or nothing.

Level 1: Exploratory

The team uses AI for ad-hoc tasks. Conversations are undocumented. There are no quality gates on AI output. The AI is stateless — every interaction starts from zero.

Observable evidence: no session logs, no version control of AI artefacts, no persistent context across sessions, no shared vocabulary between human and AI practitioners. This is vibe coding, and it is where LOSS accumulates fastest.

Exit gate: The team recognises the need for structure, and at least one person has used an agentic AI tool rather than a chat interface.

Level 2: Structured

The team has basic discipline. Sessions exist. Handoffs work. There is a development lifecycle understood by both human and AI practitioners. Commits carry semantic metadata.

Exit gate: The team executes a complete directed development lifecycle chain — plan, groom, verify, do, verify, done, retro — end-to-end on a real feature. Not a training exercise. A production feature.
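The Level 2 gate can be checked mechanically if each feature keeps a log of the phases it went through. The following is a minimal sketch under that assumption. The log format (a plain list of phase names) and the function names are hypothetical, and only the phase order comes from the text above.

```python
from typing import List, Optional

# The phase chain named in the Level 2 exit gate, in order.
LIFECYCLE_CHAIN = ("plan", "groom", "verify", "do", "verify", "done", "retro")


def completed_full_chain(logged_phases: List[str]) -> bool:
    """True only if the log records every phase of the chain, in order, with none skipped."""
    return tuple(logged_phases) == LIFECYCLE_CHAIN


def first_deviation(logged_phases: List[str]) -> Optional[int]:
    """Index of the first position where the log departs from the chain, or None."""
    for i, expected in enumerate(LIFECYCLE_CHAIN):
        if i >= len(logged_phases) or logged_phases[i] != expected:
            return i
    if len(logged_phases) > len(LIFECYCLE_CHAIN):
        return len(LIFECYCLE_CHAIN)  # extra phases logged after "retro"
    return None


# Hypothetical log for a feature that skipped both verify steps.
feature_log = ["plan", "groom", "do", "done", "retro"]
print(completed_full_chain(feature_log))  # False
print(first_deviation(feature_log))       # 2 (expected "verify", found "do")
```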

Level 3: Semantic (The Threshold)

The team does domain modelling. Formal schemas exist. A linter runs. Invariants are tested. The inbox-to-model pipeline is operational — source material goes in, validated domain models come out.

Exit gate: The team has produced a domain specification that passes formal validation with no more than five by-design warnings, and includes test witnesses for every declared invariant. This is not a low bar.
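To show how the Level 3 gate could be made checkable, here is a small sketch. It assumes the validator reports a pass/fail result and a count of by-design warnings, and that test witnesses are recorded per invariant name. Those inputs, the function names, and the example invariant names are all hypothetical; only the limit of five warnings and the requirement of a witness for every invariant come from the gate itself.

```python
from typing import Dict, List, Set

MAX_BY_DESIGN_WARNINGS = 5  # limit stated in the Level 3 exit gate


def unwitnessed_invariants(invariants: Set[str], witnesses: Dict[str, List[str]]) -> Set[str]:
    """Declared invariants that have no test witness recorded against them."""
    return {name for name in invariants if not witnesses.get(name)}


def passes_level3_gate(
    validation_passed: bool,
    by_design_warnings: int,
    invariants: Set[str],
    witnesses: Dict[str, List[str]],
) -> bool:
    """All three conditions of the gate must hold at once."""
    return (
        validation_passed
        and by_design_warnings <= MAX_BY_DESIGN_WARNINGS
        and not unwitnessed_invariants(invariants, witnesses)
    )


# Hypothetical specification: validation passes, but one invariant has no witness.
invariants = {"order_total_non_negative", "shipment_has_one_origin"}
witnesses = {"order_total_non_negative": ["test_total_never_negative"]}
print(passes_level3_gate(True, 3, invariants, witnesses))  # False
print(unwitnessed_invariants(invariants, witnesses))       # {'shipment_has_one_origin'}
```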

Level 4: Compositional

The team works across sessions. Cross-session coordination is active. Models are deployed and certified against formal criteria. Schema evolution includes data migration. Adversarial review of specifications is standard practice.

Exit gate: The team has a certified application running on a production-capable server with passing certification and active cross-session coordination.

Level 5: Self-Governing

The methodology governs itself. AI agents onboard new AI agents. Humans set strategic direction; AI executes autonomously within it. Objective tracking operates on session counts, velocity, and defect rates — never calendar time. Continuous improvement is structural, not aspirational.

Exit gate: The team has contributed at least one methodology improvement — a skill, a protocol, or a governance decision — back to the core. This separates users of a methodology from co-authors of it.

The Industry Already Has the Numbers

The data exists. What has been missing is the framework to interpret it.

GitClear’s 2025 analysis of 211 million lines of code across Google, Microsoft, and Meta repositories found that since the adoption of AI coding tools, code duplication rose from 8.3% to 12.3%, refactoring dropped from 25% to less than 10% of changed lines, and code churn increased from 3.1% to 5.7%. These are the measurable signatures of Level 1. Code that is generated, not engineered.

Apiiro’s 2025 security research found that AI-assisted developers commit at three to four times the rate of unassisted developers, but introduce security findings at ten times the rate. Veracode independently found that 45% of AI-generated code samples introduce OWASP Top 10 vulnerabilities. These numbers map directly to the Process dimension — the absence of quality gates, the absence of structured review, the absence of governance.

The Stack Overflow 2025 Developer Survey reported that 84% of developers are now using or planning to use AI coding tools. In the same survey, trust in AI accuracy fell from 40% to 29% year over year. Adoption up. Trust down. This is the empirical signature of an industry that has adopted tools without adopting discipline — Level 1 maturity at scale.

How the Assessment Works

An ACMM assessment is not a survey. It is a structured evaluation with three observation methods, each targeting a different subject.

Human practitioners are assessed through a structured interview and artefact review — typically 45 to 60 minutes. The assessor examines inbox documents the practitioner has authored, model reviews they have conducted, and session logs from their AI interactions. The questions are concrete: “Describe the session lifecycle.” “Show me a domain extraction you performed.” “Walk me through a model review where you challenged the AI’s output.”

AI agents are assessed through transcript review and live task observation, taking 30 to 45 minutes. The assessor examines session transcripts for evidence of correct protocol. Does the agent maintain state across restores? Does it follow instructions without drift? Does it make correct compositional decisions, knowing when a part-whole relationship requires composition rather than a simple reference?

Teams are assessed through end-to-end execution — 2 to 4 hours. The assessor watches the team execute a full development lifecycle on a test domain. Not a rehearsed demonstration. A cold domain with source material, executed from extraction through modelling, validation, and deployment.

Each checklist item is scored: Evidenced (observable artefact or demonstration exists), Partial (capability exists but is inconsistent), Not Present (no evidence), or Not Applicable. A level is achieved when 80% or more of items score Evidenced or Partial. A level is “in progress” when that share is at least 50% but below 80%.
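The scoring rule is simple enough to express directly. The sketch below makes two assumptions that the text does not state: Not Applicable items are excluded from the denominator, and “in progress” covers the band from 50% up to, but not including, 80%. The `Score` enum, `level_status` function, and the labels “not achieved” and “not assessed” are hypothetical names for illustration.

```python
from enum import Enum
from typing import List


class Score(Enum):
    EVIDENCED = "evidenced"
    PARTIAL = "partial"
    NOT_PRESENT = "not_present"
    NOT_APPLICABLE = "not_applicable"


def level_status(scores: List[Score]) -> str:
    """Classify one level from its checklist scores."""
    # Assumption: Not Applicable items do not count for or against the level.
    applicable = [s for s in scores if s is not Score.NOT_APPLICABLE]
    if not applicable:
        return "not assessed"
    met = sum(1 for s in applicable if s in (Score.EVIDENCED, Score.PARTIAL))
    total = len(applicable)
    # Integer comparison avoids floating-point edge cases at exactly 80% or 50%.
    if 100 * met >= 80 * total:
        return "achieved"
    if 100 * met >= 50 * total:
        return "in progress"
    return "not achieved"


# Ten items: 6 Evidenced, 2 Partial, 1 Not Present, 1 Not Applicable.
scores = (
    [Score.EVIDENCED] * 6
    + [Score.PARTIAL] * 2
    + [Score.NOT_PRESENT]
    + [Score.NOT_APPLICABLE]
)
print(level_status(scores))  # achieved (8 of 9 applicable items)
```

If Not Applicable items were instead kept in the denominator, the same team would score 8 of 10, which still clears the bar here but would not in every case, so the choice is worth fixing before an assessment starts.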

To make this concrete, here are four items from the operational questionnaire:

L2-P1: Understands session lifecycle. Human evidence: can describe the init-restore-compress cycle in interview. AI evidence: transcript shows correct restore sequence with state loaded.

L3-P4: Understands mereological composition. Human evidence: knows when to use composition versus reference. AI evidence: makes correct compositional decisions when modelling a new domain.

L4-P3: Understands certification output. Human evidence: can interpret a certification report and identify which findings require action. AI evidence: addresses certification findings systematically in subsequent sessions.

L5-P3: Onboards new practitioners. Human evidence: has onboarded another team member using the methodology. AI evidence: configures a new session for a colleague with correct role, mission, and context.

The output of an assessment is a scored report with dimension breakdown, specific findings, and a prioritised improvement plan. Each finding maps to a training recommendation and a reassessment trigger — not “revisit in three months,” but “reassess after five sessions using the directed development lifecycle.”
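A finding with a session-based reassessment trigger could be represented as a small record. This is a sketch only: the `Finding` fields, the `due_for_reassessment` function, and the example finding text are hypothetical, and the actual report format may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    item_id: str                  # checklist item, e.g. "L2-P1"
    dimension: str                # "People", "Process" or "Technology"
    summary: str
    training_recommendation: str
    reassess_after_sessions: int  # trigger counted in sessions, not calendar time


def due_for_reassessment(finding: Finding, sessions_since_assessment: int) -> bool:
    """True once enough sessions have run since the finding was recorded."""
    return sessions_since_assessment >= finding.reassess_after_sessions


# Illustrative finding; the wording is invented for this example.
finding = Finding(
    item_id="L2-P1",
    dimension="People",
    summary="Could not describe the restore step of the session lifecycle",
    training_recommendation="Run sessions using the directed development lifecycle",
    reassess_after_sessions=5,
)
print(due_for_reassessment(finding, sessions_since_assessment=3))  # False
print(due_for_reassessment(finding, sessions_since_assessment=5))  # True
```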

The 6% Are Not Lucky

The McKinsey finding that 6% of organisations are AI high performers maps onto the ACMM uncomfortably well. In our informal assessments across financial services, agriculture, and logistics teams, the organisations that extract genuine value from AI are consistently at Level 3 or above. They have crossed the Schmitt trigger threshold that Level 3 represents. They manage context systematically, not reactively. They capture session knowledge. They track performance with objective metrics.

The far larger group that has adopted AI without extracting value sits at Level 1 or 2. They have the tools. They lack the discipline. And critically, they lack the framework to diagnose which specific dimension (People, Process, or Technology) is holding them back.

The commercial thesis is direct: the gap between Level 2 and Level 4 is where value concentrates. Level 2 teams generate code. Level 4 teams generate knowledge that compounds across sessions, across team members, and across projects. The ACMM makes that gap measurable, and therefore closeable.

You cannot improve what you cannot measure. The AI Capability Maturity Model gives you the measurement. The improvement is the work that follows.