Why your ACPU is too high — and how pre-agentic AI fixes it
There is a metric most AI teams are not tracking: how much they are spending on AI per end user, per month. Not the total API bill. Not the cost per request. The fully loaded cost of all AI consumption, divided by the number of users actually benefiting from it.
Call it ACPU — AI Cost Per User. And for most organisations running production AI today, it is quietly out of control.
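As a rough illustration of the definition above, ACPU can be computed as total monthly AI cost divided by the number of users who actually benefit. This is a minimal sketch: the cost buckets, the MonthlyAISpend and acpu names, and the figures are all hypothetical, and a real calculation would use whatever cost categories your organisation attributes to AI.

```python
from dataclasses import dataclass


@dataclass
class MonthlyAISpend:
    """Hypothetical cost buckets for one month, all in the same currency."""
    api_usage: float        # model provider bills (tokens, requests)
    infrastructure: float   # hosting and pipelines that serve AI features
    other: float = 0.0      # anything else attributed to AI consumption

    def total(self) -> float:
        return self.api_usage + self.infrastructure + self.other


def acpu(spend: MonthlyAISpend, active_ai_users: int) -> float:
    """AI Cost Per User: fully loaded monthly AI cost / benefiting users."""
    if active_ai_users <= 0:
        raise ValueError("active_ai_users must be positive")
    return spend.total() / active_ai_users


# Illustrative figures only, not taken from any real deployment.
spend = MonthlyAISpend(api_usage=18_000.0, infrastructure=6_000.0, other=1_000.0)
print(f"ACPU: {acpu(spend, active_ai_users=50_000):.2f} per user per month")
```

The denominator is the judgment call: counting only users who actually touch an AI feature gives a higher, more honest number than dividing by all registered accounts.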
The root cause: using a sledgehammer for everything
The default pattern when building AI-powered products is to reach for the most capable model available: GPT-4o, Claude Opus, Gemini Ultra. These frontier models are genuinely impressive and genuinely expensive, with input costs ranging from $2 to $15 per million tokens and output costs that can reach $60 per million tokens at the high end.
These models earn their price for the right tasks: complex reasoning, nuanced synthesis, multi-step agentic workflows. But a significant portion of AI workloads in production are not those tasks. They are classification, pattern detection, enrichment, summarisation of structured inputs, and contextual tagging: tasks where a well-prompted small model can deliver roughly 90% of the quality for about 1% of the cost.
Mistral 7B, for example, handles enrichment tasks — taking a structured input of behavioural signals and generating a contextual summary — for roughly €20 per million operations. The same volume on a frontier model costs closer to €2,000. The output quality difference for that specific task is negligible. The cost difference is not.
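One way to act on the split described above is to route each request by task type before any model is called. The sketch below is an assumption-laden illustration, not a prescribed design: the task labels mirror the categories named in the text, while ModelTier, LOW_COMPLEXITY_TASKS, and route are hypothetical names.

```python
from enum import Enum


class ModelTier(Enum):
    SMALL = "small"        # e.g. a 7B-class model
    FRONTIER = "frontier"  # e.g. a top-tier hosted model


# Task categories named in the text; the label strings are illustrative.
LOW_COMPLEXITY_TASKS = {
    "classification",
    "pattern_detection",
    "enrichment",
    "structured_summarisation",
    "contextual_tagging",
}


def route(task_type: str) -> ModelTier:
    """Pick a model tier from a task label."""
    if task_type in LOW_COMPLEXITY_TASKS:
        return ModelTier.SMALL
    # Anything not explicitly listed goes to the frontier tier:
    # safer on quality, more expensive per call.
    return ModelTier.FRONTIER


for task in ("enrichment", "agentic_workflow"):
    print(task, "->", route(task).value)
```

Defaulting unknown tasks to the frontier tier keeps quality safe while the low-complexity list is built up from observed workloads.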
What ACPU reveals
When you calculate ACPU honestly, two things usually become visible.
First, a large share of AI spend is concentrated on low-complexity tasks that do not require frontier intelligence. It is like running a racing engine to idle in traffic.
Second, the expensive tasks — the ones that genuinely benefit from frontier reasoning — are often operating without adequate context. The agent making a recommendation does not know the user's behavioural history. The model generating a personalised response is working from a thin, generic profile. So it compensates with more tokens, more reasoning steps, more cost — and still produces a generic result.
Both problems have the same solution.
Pre-agentic AI as a cost and quality strategy
A pre-agentic data strategy uses lightweight, affordable models continuously in the background to build rich behavioural context per user. Every interaction, every signal, every enrichment adds a layer to a compact profile — designed from the start to be summarisable in a few hundred tokens.
When the frontier model eventually acts — recommending, deciding, personalising — it receives a dense, relevant context instead of starting from scratch. It needs fewer tokens to reason well. It produces better outputs. And it costs less per interaction, because the heavy lifting of context-building has already been done cheaply.
This is not a trade-off between quality and cost. It is a compound gain on both dimensions.
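The background profile-building described in this section can be sketched as a small data structure that accumulates enrichment tags and renders them within a size budget. This is a hypothetical illustration: UserProfile, add_enrichment, and to_context are invented names, the tags are made up, and a whitespace word count stands in for a real tokenizer.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class UserProfile:
    """Hypothetical compact profile built from small-model enrichments."""
    user_id: str
    tag_counts: Counter = field(default_factory=Counter)

    def add_enrichment(self, tags: list[str]) -> None:
        """Fold the tags from one enrichment operation into the profile."""
        self.tag_counts.update(tags)

    def to_context(self, max_words: int = 150) -> str:
        """Render the most frequent tags as a short block for a prompt.

        Word count is only a rough proxy for tokens (an assumption here).
        """
        header = f"User {self.user_id} behavioural profile:"
        lines = [header]
        used = len(header.split())
        for tag, count in self.tag_counts.most_common():
            line = f"- {tag} (seen {count}x)"
            cost = len(line.split())
            if used + cost > max_words:
                break  # stop once the budget would be exceeded
            lines.append(line)
            used += cost
        return "\n".join(lines)


profile = UserProfile("u-123")
profile.add_enrichment(["price_sensitive", "mobile_first"])
profile.add_enrichment(["price_sensitive"])
print(profile.to_context())
```

The rendered block is what the frontier model would receive as context, so the budget in to_context is the lever that keeps its input small.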
The numbers make the case
Consider a base scenario: one million enrichment operations per month, each taking a structured behavioural input and generating a contextual tag or summary.
At Mistral 7B rates: approximately €20. At GPT-4o rates: approximately €2,000.
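The comparison above reduces to simple arithmetic, sketched here using only the two per-million figures quoted in the text. The monthly_cost helper and the constant names are hypothetical, and actual per-operation cost depends on how many tokens each operation consumes.

```python
OPERATIONS_PER_MONTH = 1_000_000

# Figures quoted in the text, in euros per million enrichment operations.
SMALL_MODEL_COST_PER_MILLION = 20.0
FRONTIER_COST_PER_MILLION = 2_000.0


def monthly_cost(operations: int, cost_per_million: float) -> float:
    """Scale a per-million-operations price to a monthly volume."""
    return operations / 1_000_000 * cost_per_million


small = monthly_cost(OPERATIONS_PER_MONTH, SMALL_MODEL_COST_PER_MILLION)
frontier = monthly_cost(OPERATIONS_PER_MONTH, FRONTIER_COST_PER_MILLION)

print(f"Small model:    EUR {small:,.0f}")
print(f"Frontier model: EUR {frontier:,.0f}")
print(f"Monthly saving: EUR {frontier - small:,.0f}")
print(f"Cost ratio:     {small / frontier:.0%}")
```

Changing OPERATIONS_PER_MONTH shows how the absolute gap widens linearly with volume while the ratio stays the same.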
The enrichment output — a compact behavioural insight — then feeds the frontier model as context. The frontier model's own consumption drops because it is reasoning over prepared context rather than raw, unstructured data. ACPU falls at both ends of the pipeline.
For product and engineering teams managing AI budgets, this is not a minor optimisation. At scale, it is the difference between sustainable unit economics and a cost structure that breaks as soon as the user base grows.
The strategic implication
Organisations that adopt a pre-agentic approach are not just cutting costs today. They are building the data infrastructure that will make their future agents meaningfully better than competitors who skipped this step.
Cheaper enrichment now. Richer context later. Better agents when it matters.
ACPU is the metric that makes this visible. Once you start tracking it, the architecture decisions become obvious.