Context Engineering Is Not Prompt Engineering

The Newest Hottest Title

Andrej Karpathy recently suggested that the hottest new role in AI is “context engineer.” The industry pivoted overnight. Prompt engineering is dead. Context engineering is the future. Conferences are updating their tracks. LinkedIn titles are changing.

The name is better. The practice is still early.

What most teams call “context engineering” — crafting richer prompts, stuffing context windows with documentation, building retrieval pipelines that surface relevant code — is what we would classify as Level 2 on a five-level maturity model. Maybe early Level 3, for the teams that are genuinely systematic about it. The industry has found a better label for a discipline it has barely started practising.

The Schmitt Trigger

If you read our earlier piece on LOSS — Lines Of Stochastic Source — you already know the core problem: AI generates code that is statistically plausible but semantically ungrounded. It looks right. It compiles. The tests pass. The business logic is wrong.

What we have observed since is that AI code quality does not degrade gradually. It behaves like what electrical engineers call a Schmitt trigger: a circuit that snaps a varying input into one of two output states. There are two stable states and almost nothing in between.

Below a critical threshold of domain context, AI output is stochastic. Every line is an inference built on inferences. Errors compound silently. You are manufacturing LOSS at machine speed.

Above that threshold, the same AI produces grounded, reliable, genuinely brilliant work. The output is not merely correct — it is semantically faithful to the domain. The threshold is not a slider you tune. It is a cliff you either clear or you do not.
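For readers who have not met the circuit, the analogy can be made concrete with a minimal sketch. A real Schmitt trigger has two thresholds rather than one: the output switches on only when the input rises above the upper threshold, and switches off only when it falls below the lower one. The threshold values and the idea of a numeric “context level” between 0 and 1 are illustrative assumptions, not measurements.

```python
from dataclasses import dataclass


@dataclass
class SchmittTrigger:
    """Two-state comparator with hysteresis (illustrative thresholds)."""
    low: float = 0.4     # assumed lower threshold: fall below this to switch off
    high: float = 0.6    # assumed upper threshold: rise above this to switch on
    state: bool = False  # False = "stochastic", True = "grounded"

    def update(self, level: float) -> bool:
        # Only crossing a threshold changes the state; any input between
        # the two thresholds leaves the output where it already was.
        if level > self.high:
            self.state = True
        elif level < self.low:
            self.state = False
        return self.state


trigger = SchmittTrigger()
# Hypothetical "domain context" readings on a 0..1 scale.
readings = [0.1, 0.5, 0.65, 0.5, 0.3]
states = [trigger.update(r) for r in readings]
print(states)
```

The reading 0.5 appears twice and produces a different output each time, depending on which side it was approached from. That is the property the analogy borrows: the output is always one state or the other, never a stable halfway point.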

The question that matters is: what determines where the threshold sits? The answer is not the AI’s capability. It is the maturity of the team operating it.

Five Levels of AI Developer Maturity

Over the past three years, working across financial services, precision agriculture, energy, and logistics, we have observed a consistent pattern in how teams use AI for software development. The pattern resolves into five discrete levels. Each level represents a qualitative shift in how the team relates to the AI — not a gradual improvement, but a phase transition.

Level 1

Chaos

The industry calls this vibe coding. Copy from ChatGPT. Accept the first suggestion. Marvel at the speed.

There is no context management. Every conversation starts from zero. The AI has no memory of your domain, your conventions, your constraints, or what it told you yesterday. One person on the team “knows how to prompt” — they are the hero, and the methodology lives entirely in their head.

The dangerous characteristic of Level 1 is that LOSS accumulates with no feedback loop. There is no mechanism to detect it, no metric to track it, no signal that anything is wrong until a domain expert manually inspects the output. Which, at the speed teams are generating code, happens less and less.

Level 2

Repeatable

This is prompt engineering. The team has templates. Someone has documented what works. There are shared prompt libraries, system prompts that set a baseline, maybe a retrieval pipeline that injects relevant documentation into the context window.

The practice is reactive. The team generates code, discovers problems, adjusts the prompt, regenerates. The AI is still treated as autocomplete — a faster way to type what the developer already knows how to write.

This is where most teams land when they adopt Karpathy’s “context engineering” label. They are engineering the context. They are doing it better than Level 1. They are still below the Schmitt trigger threshold, because the context they engineer is structural, not semantic. They provide the AI with documentation and code. They do not provide it with domain understanding.

Level 3 — The Threshold

Defined

This is context engineering done properly. The shift is fundamental: from “generating code” to “managing the AI’s understanding.”

At Level 3, the developer treats the context window as a first-class engineering resource. They write persistent project instructions that set baseline domain knowledge for every session. They create reusable knowledge modules — encoded expertise that loads on demand, not ad hoc prompts reconstructed from memory. They build custom agents for specialised tasks. They use repeatable workflows triggered by name, not typed from scratch.

Most critically, they capture session knowledge. When a session ends, the insights from that session — what was learned, what patterns were discovered, what domain constraints were surfaced — are preserved for the next session. The codebase gets smarter over time. Learning compounds.
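One way to picture session capture is as an append-only log that the next session reads before it starts. The sketch below is an assumption about how such a store might look, not a description of any particular tool: the file name `session_log.jsonl`, the `SessionNote` fields, and the `capture` and `briefing` functions are all hypothetical.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class SessionNote:
    """What one session learned (hypothetical schema)."""
    session: int
    learned: list[str] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)


def capture(log: Path, note: SessionNote) -> None:
    # Append one JSON object per line so earlier sessions are never rewritten.
    with log.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(note)) + "\n")


def briefing(log: Path) -> str:
    # Build the text a new session would load before doing any work.
    if not log.exists():
        return ""
    lines = []
    for raw in log.read_text(encoding="utf-8").splitlines():
        note = json.loads(raw)
        for item in note["learned"] + note["constraints"]:
            lines.append(f"[session {note['session']}] {item}")
    return "\n".join(lines)


# Example run against a throwaway directory.
log_path = Path(tempfile.mkdtemp()) / "session_log.jsonl"
capture(log_path, SessionNote(1, learned=["The domain says 'stay', not 'booking period'"]))
capture(log_path, SessionNote(2, constraints=["Guest names are required"]))
print(briefing(log_path))
```

The design choice that matters is that nothing is reconstructed from memory: session 3 starts with whatever sessions 1 and 2 wrote down.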

This is where the Schmitt trigger fires. Above the threshold. The AI has enough domain context to be semantically faithful, because the operator is systematically managing what the AI knows.

Level 4

Managed

At Level 4, the developer becomes a semantic engineer. They work with ontologies and formal domain models. The models drive code generation — not the reverse. The specification comes first. The code is a rendering of the specification, not the other way around.
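A toy illustration of “the code is a rendering of the specification”: the domain model is plain data, and the validator is derived from it rather than written by hand. The `GUEST_SPEC` structure and the `render_validator` helper are hypothetical names invented for this sketch, and a real ontology would carry far more than a list of required fields.

```python
from typing import Callable

# The specification: data, not code. The structure is an assumption made
# for this example; the entity follows the hotel vocabulary used later on.
GUEST_SPEC = {
    "entity": "Guest",
    "fields": {
        "name": {"required": True},
        "email": {"required": False},
    },
}


def render_validator(spec: dict) -> Callable[[dict], list[str]]:
    """Derive a validator from the spec; change the spec and it changes too."""
    required = [f for f, rules in spec["fields"].items() if rules["required"]]

    def validate(record: dict) -> list[str]:
        # One message per violated rule; an empty list means the record is valid.
        return [
            f"{spec['entity']}.{f} is required"
            for f in required
            if not record.get(f)
        ]

    return validate


validate_guest = render_validator(GUEST_SPEC)
print(validate_guest({"email": "a@example.com"}))
print(validate_guest({"name": "Ada Lovelace"}))
```

Marking `email` as required in `GUEST_SPEC` changes the validator without anyone editing `validate`. That is the direction of travel the paragraph above describes, in miniature.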

Performance tracking becomes objective. The team measures session counts (not calendar time), velocity per session (points landed, not hours worked), and defect rates (invariant violations, semantic drift findings). Calendar time is meaningless for AI work. An AI session can accomplish in forty minutes what would take a human team a week. Measuring in weeks tells you nothing. Measuring in sessions — “this sprint took 8 sessions at 6.2 points per session with 2 defects” — tells you everything you need to improve.
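The arithmetic behind a line like “8 sessions at 6.2 points per session with 2 defects” is simple enough to sketch. The per-session values below are invented so that they average to 6.2; the `Session` record and the `summarise` function are hypothetical names, not part of any tool.

```python
from dataclasses import dataclass


@dataclass
class Session:
    points: float  # points landed in this session
    defects: int   # invariant violations or drift findings traced to it


def summarise(sessions: list[Session]) -> dict:
    if not sessions:
        raise ValueError("at least one session is needed")
    total_points = sum(s.points for s in sessions)
    total_defects = sum(s.defects for s in sessions)
    return {
        "sessions": len(sessions),
        # Velocity is measured per session, never per calendar day.
        "velocity": round(total_points / len(sessions), 1),
        "defects": total_defects,
    }


# Illustrative sprint: eight sessions, values invented for the example.
sprint = [Session(p, d) for p, d in [
    (5, 0), (7, 1), (6, 0), (6.6, 0), (8, 0), (4, 1), (7, 0), (6, 0),
]]
result = summarise(sprint)
print(result)
```

With these inputs the summary reports 8 sessions, a velocity of 6.2, and 2 defects. No date or duration appears anywhere in the record.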

Level 4 developers take a lesson from what the mathematician Kurt Gödel proved in 1931: any consistent formal system expressive enough to describe basic arithmetic contains true statements it cannot prove from within itself. By analogy, the AI cannot validate its own domain reasoning, and the domain model cannot express meta-rules about when it should be extended. The developer who maps these boundaries, who knows exactly where the AI’s competence ends, is more productive than one who fights them. The boundary is where human expertise lives.

Sessions at Level 4 are not disposable chats. They are knowledge artefacts. Each session leaves a record of what was understood, not merely what was done.

Level 5

Optimising

At Level 5, the team achieves what we call mereological mastery — the discipline of maintaining part-whole coherence across every artefact in the system.

Every specification, every piece of code, every test, every document maintains coherent relationships with every other artefact. Change a part, and the whole adjusts. Change the whole, and the parts adapt. The transformation between specification and implementation is bidirectional. The specification is not documentation. It is the application. Every deployed app is a rendering of the specification for a particular audience or context.

The system self-improves. Every interaction with a user, every session with an AI operator, every domain conversation enriches the specification. The specification appreciates. The app depreciates. This is the correct commercial framing: invest in the specification, not the code.

What This Looks Like in Practice

Theory is pleasant. Evidence is persuasive.

We recently onboarded a new operator through an AI-first methodology. Seven sessions. In those seven sessions, they delivered a sixteen-page production website with authentication, analytics, bot protection, and a full ticketed support portal backed by a relational database. The methodology captured session knowledge every time the context was compressed. It enforced commit discipline: explicit file staging, pre-commit audits, post-commit verification. It produced a detailed backlog with definition-of-done criteria for every item. The code review surfaced one brand-hierarchy violation, five security hardening items, and two housekeeping issues. All traceable. All actionable.

This is what Level 3-4 looks like in practice. Not faster typing. Structured compound learning.

We measure our own performance in session counts and velocity, never in calendar time. Our operational rule is explicit: calendar time is forbidden as an internal metric.

This is not an affectation. When your measurement system matches the machine’s operating cadence, you can actually improve. “Two weeks” is noise. “Eight sessions, 6.2 points per session, 2 defects” is signal.

Six Dimensions of Qualification

Maturity is not a single axis. At each level, the AI operator must be qualified across six dimensions.

Epistemic: What does the AI know? Does it have schema awareness, tool knowledge, and — critically — awareness of its own limitations?

Teleological: Why do the rules exist? Not just that an invariant must hold, but the business purpose it protects. An AI that knows “guest names are required” behaves differently from one that knows “guest names are required because the hotel has a legal obligation to identify occupants for safety and regulatory compliance.”

Axiological: What values guide decisions? When two valid approaches conflict, which does the domain prefer? Guest experience over operational convenience. Financial accuracy over processing speed. Data protection over data availability.

Mereological: How do parts relate to wholes? If you delete a parent entity, what happens to its children? If you change a composed part, does the whole’s invariant re-evaluate? These are not academic questions. They are the source of an entire class of defects that no amount of prompt engineering will prevent.

Operational: How to act correctly? Plan before executing. Create exactly what was specified — no more, no less. Use batch operations instead of sequential calls. Explain errors and offer recovery. These are protocols, not preferences.

Phenomenal: What is the felt sense of the domain? “Guest,” not “user.” “Stay,” not “booking period.” “Property,” not “building.” The vocabulary, the empathy, the tone — these are not cosmetic. They are the difference between an AI that operates in a domain and one that operates on it.

Level 1 activates none of these dimensions. Level 3 activates Epistemic and Operational. Level 5 requires all six. Maturity is not one skill. It is six simultaneous qualifications.
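Two of these dimensions translate directly into data structures. In the sketch below, an invariant carries its business purpose alongside its check (teleological), and deleting a parent is forced to say what happens to its children (mereological). All names here (`Invariant`, `Property`, `Stay`, `delete_property`, `violations`) and the policy of blocking deletion while stays exist are assumptions chosen for illustration, reusing the hotel vocabulary above.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Invariant:
    rule: str                      # what must hold
    because: str                   # why it exists (the business purpose)
    check: Callable[[dict], bool]  # how to test it


GUEST_NAME = Invariant(
    rule="guest names are required",
    because="the hotel must identify occupants for safety and regulatory compliance",
    check=lambda guest: bool(str(guest.get("name") or "").strip()),
)


@dataclass
class Stay:
    guest: dict


@dataclass
class Property:
    name: str
    stays: list[Stay] = field(default_factory=list)


def delete_property(portfolio: list[Property], target: Property) -> None:
    # Part-whole rule (assumed policy): a property that still has stays
    # cannot be deleted, so no stay is ever orphaned.
    if target.stays:
        raise ValueError(f"{target.name} still has {len(target.stays)} stay(s)")
    portfolio.remove(target)


def violations(prop: Property, invariant: Invariant) -> list[str]:
    # Re-evaluate the invariant over every part of the whole, and report
    # the purpose together with the rule.
    return [
        f"{invariant.rule}, because {invariant.because}"
        for stay in prop.stays
        if not invariant.check(stay.guest)
    ]


hotel = Property("Harbour View", [Stay({"name": "Ada"}), Stay({"name": " "})])
print(violations(hotel, GUEST_NAME))
```

An operator, human or AI, that receives the `because` text alongside the rule has something to reason from when the rule meets a case nobody anticipated.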

The Commercial Thesis

The industry labelled “context engineering” as the next frontier because it is the first discipline that acknowledges the AI needs more than a prompt. That instinct is correct. The execution is insufficient.

Context engineering is one rung on a five-level ladder. The rungs above it — semantic engineering, mereological mastery — are where AI development becomes an engineering discipline rather than an art form. The gap between Level 2 and Level 5 is not theoretical. It is the difference between a team that compounds knowledge and a team that compounds entropy.

The Consortium for Information & Software Quality estimated the cost of poor software quality in the United States at 2.41 trillion dollars in 2022. That was before 77 million developers gained access to AI code generation tools. Before Cursor reached a 9 billion dollar valuation for an enhanced text editor. Before the industry decided that generating more code, faster, was the same as building better software.

The teams that reach Level 3 will clear the Schmitt trigger threshold. They will build systems where learning compounds, where specifications appreciate, where every session leaves the codebase smarter than it found it.

The teams that stay at Level 1-2 will compound LOSS. They will generate more code, faster, with less understanding. Their test suites will stay green. Their domain fidelity will erode. They will not know it until an accountant opens the books.

The industry is accelerating toward entropy. The solution is not less AI. It is more discipline.

Where Is Your Team?

Five questions. Be honest.

Do you manage your AI’s context window deliberately — or do you paste and hope?
Do you capture what you learned in each session for the next one — or does every conversation start from zero?
Do you track AI performance with objective metrics — or do you rely on “it feels faster”?
Do you work with formal domain models that drive what the AI generates — or does the AI generate first and you verify after?
Does your specification survive beyond the current sprint — or does it evaporate when the ticket closes?

If you paste and hope, and every conversation starts from zero, you are below the Schmitt trigger. That is where LOSS accumulates.

The gap between where most teams are and where they could be is the largest opportunity in software engineering right now. Not because the tools are insufficient. Because the discipline has not caught up with the tooling.

Context engineering is a start. It is not the destination.