Why tokens-per-watt matters more than tokens-per-second

Featured Post | Sustainability



May 4, 2026

Category: Sovereignty | Read time: ~3 minutes


The AI infrastructure market has spent the last three years obsessed with one number: how fast can a system generate tokens? Benchmarks, vendor comparisons, procurement conversations — almost all of it reduces to throughput.
How many tokens-per-second can you squeeze out of a given model on a given card?
That is the 'horsepower' of the AI era, and that analogy is more relevant than it first appears.

Automotive engineers have long understood that horsepower and torque are not the same thing, and that optimising for one at the expense of the other produces a vehicle that performs well in only one set of conditions.
As Jason Fogelson of Kelley Blue Book puts it:
“Torque is more important than horsepower when you first accelerate.
Horsepower is more important than torque when you want to maintain peak performance.”

A high-torque engine generates greater force at lower revs, enabling rapid acceleration from a standing start, whilst high horsepower sustains velocity once you are moving.
The skill is in engineering a platform that delivers both, balanced for the task at hand, but flexible in use and over time.

The same logic maps directly onto AI infrastructure.
Tokens-per-second is horsepower: once a deployment is running at scale and demand is established, throughput is the key metric that sustains performance and keeps latency manageable.

But right now, as governments and enterprises try to accelerate their AI programmes, particularly sovereign deployments that cannot simply plug into hyperscaler-delivered capacity, the binding constraint is not card throughput. It is not about horsepower.

The globally felt constraints are power generation and transmission limits, heat management, water consumption, grid connection queues, and the time and cost of large data centre build-out.
These are exactly the conditions that require torque or, in AI terms, the ability to generate the maximum productive AI output per unit of energy consumed, and to accelerate meaningfully within real-world physical and resource limits rather than ideal benchmark conditions.
The key metric for AI acceleration is therefore not tokens-per-second horsepower, but tokens-per-watt torque.

Bring the two together in a well-engineered AI platform that is efficient at the watt level and fast at scale, and the measure that matters becomes tokens-per-second-per-watt: the only unified metric that measures AI infrastructure performance in the real world, not just on the specification sheet.
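To make the horsepower and torque distinction concrete, here is a minimal Python sketch of how the two metrics diverge once a site is capped by power rather than by hardware. The two systems, their throughput and power figures, and the 10 kW site budget are hypothetical values chosen only to illustrate the arithmetic; they are not measurements of any real product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class System:
    name: str
    tokens_per_second: float  # sustained throughput of one unit
    watts: float              # average power draw of one unit under load


def tokens_per_second_per_watt(system: System) -> float:
    """Throughput divided by power draw (numerically, tokens per joule)."""
    return system.tokens_per_second / system.watts


def throughput_under_power_cap(system: System, site_watts: float) -> float:
    """Total tokens/s a site can deliver when power is the limiting factor."""
    units = int(site_watts // system.watts)  # whole units that fit in the budget
    return units * system.tokens_per_second


# Hypothetical figures, for illustration only.
fast = System("fast", tokens_per_second=3000.0, watts=1000.0)
efficient = System("efficient", tokens_per_second=2000.0, watts=500.0)

site_budget_watts = 10_000.0
for s in (fast, efficient):
    print(
        s.name,
        tokens_per_second_per_watt(s),
        throughput_under_power_cap(s, site_budget_watts),
    )
```

In this made-up case the "fast" unit wins on per-card throughput, but the "efficient" unit delivers more total tokens per second from the same power budget. The sketch ignores cooling overhead, networking and batching effects, all of which would need to be measured in a real deployment.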

Why energy has become the binding constraint

The International Energy Agency’s April 2026 report deserves a careful read from anyone making AI infrastructure decisions. Its analysis shows that electricity demand from data centres grew 17% in 2025, significantly outpacing global electricity demand growth of 3%.
Demand from AI-focused data centres grew faster still, up 50% in a single year.
The IEA’s base case now projects that global data centre electricity consumption will double by 2030, with AI-specific demand tripling.

These are not simply interesting statistics; they describe a physical constraint that is already reshaping investment decisions, procurement cycles and planning approvals across every market where AI compute is being deployed at scale.

The UK provides a sharp illustration of the problem: there are currently 50 gigawatts of data centre projects queued for grid connections, against a current national peak demand of around 45 gigawatts.
UK government departments are publicly contradicting one another about how to account for this in energy projections. Planning approvals are slowing, and grid connection timelines are extending to years rather than months. Our AI revolution is failing before it's even begun, because we're trying to apply horsepower to a torque problem.

For any organisation deploying AI inference at scale, and increasingly for any organisation deploying AI at all, this is no longer simply a background policy concern. It is an operational constraint that shapes what can be built, where, and at what cost.

The problem with buying on throughput

Raw throughput figures are impressive, but they give an incomplete picture in ways that matter enormously in an energy-constrained environment. A system optimised purely for tokens-per-second will, almost by definition, also be optimised to draw the maximum available power.

That may be fine in an environment where power is cheap, abundant and grid-connected, but it is a significant problem in the environments that matter most for sovereign and critical AI workloads: edge deployments, physically constrained facilities, off-grid or island-mode operations, and any context where the carbon or financial cost of electricity is a primary consideration.

The academic community has started to catch up with this. Recent benchmark research measuring energy per token across GPU architectures under production inference conditions has produced findings that complicate earlier assumptions about which hardware is ‘best’.
The research found that efficiency depends substantially on the interaction between GPU characteristics and model size. The card that leads on throughput for large frontier models may not compete on energy per token for the smaller, specialised models that enterprise AI deployments increasingly prefer, or in mixture-of-experts architectures.

Context window length introduces further variation: research published earlier this year demonstrated that tokens-per-watt can vary by a factor of twelve or more across different context window sizes on the same hardware. Choosing infrastructure on throughput alone, without understanding the energy profile of your actual workload, is an increasingly common procurement error. Bigger, in terms of raw throughput, energy draw and indeed cost, is very often not better.

What tokens-per-watt actually measures

Tokens-per-watt is a straightforward concept: for every watt of power consumed, how many output tokens does a system generate?
It captures the productive yield of your energy expenditure rather than just the peak rate of output. In combination with cost-per-inference, it provides a far more complete picture of operational AI economics than throughput alone.
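As a rough illustration of how this could be measured in practice, the Python sketch below integrates sampled power readings over an inference run to get energy, then derives tokens per joule (the same quantity as tokens-per-second per watt) and the electricity cost per million output tokens. The function names, the sample readings and the electricity price are all assumptions made for the example; real figures would come from your own power telemetry and tariff.

```python
def energy_joules(timestamps_s, power_watts):
    """Integrate sampled power over time (trapezoidal rule) to get joules."""
    if len(timestamps_s) != len(power_watts) or len(timestamps_s) < 2:
        raise ValueError("need at least two matching samples")
    total = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        total += 0.5 * (power_watts[i] + power_watts[i - 1]) * dt
    return total


def tokens_per_joule(output_tokens, joules):
    """Output tokens generated per joule of energy consumed."""
    return output_tokens / joules


def energy_cost_per_million_tokens(joules, output_tokens, price_per_kwh):
    """Electricity cost of generating one million output tokens."""
    kwh = joules / 3_600_000.0  # 1 kWh = 3.6 million joules
    return kwh / output_tokens * 1_000_000 * price_per_kwh


# Hypothetical run: three power samples over 20 seconds, 20,000 output tokens.
timestamps = [0.0, 10.0, 20.0]
power = [400.0, 600.0, 400.0]
tokens = 20_000

joules = energy_joules(timestamps, power)
print(joules, tokens_per_joule(tokens, joules))
print(energy_cost_per_million_tokens(joules, tokens, price_per_kwh=0.30))
```

Measuring energy over the whole run, rather than reading a single peak power figure, is what lets the result reflect the actual workload instead of the specification sheet.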

These metrics are gaining traction among analysts and infrastructure planners precisely because they bridge the gap between theoretical maximum AI performance and the energy-constrained reality of deployment.
Some are already extending them further, to ‘tokens-per-watt-per-dollar’, capturing the combined efficiency of energy and cost. Regardless, the direction of travel is becoming clear: organisations increasingly need to account for inference efficiency, not just inference speed, in making their infrastructure decisions.

This matters beyond the pure economic case. For public sector bodies, regulated industries and organisations with sustainability commitments, energy efficiency at the inference layer is becoming a sustainability governance question as much as a financial one. The carbon associated with AI inference is not currently well measured or well disclosed by most providers.
That is changing, however, and organisations that have already built their AI infrastructure around efficient inference will be in a considerably better position when reporting requirements tighten.

Efficiency as core architecture, not afterthought

There has been a tendency to treat energy efficiency as something that can be bolted on to AI infrastructure after the fact, as a reporting exercise rather than a design principle. That approach will become increasingly untenable.

Genuine efficiency at the inference layer requires early architectural decisions rather than after-the-fact reaction to measurements. Which hardware platform is used and how it is configured, how models are selected and routed across a deployment, and how workloads are matched to the right compute tier: all of these shape energy consumption in ways that cannot be recovered through retrospective optimisation.

Academic modelling suggests that intelligent model selection alone (routing queries to appropriately sized models rather than defaulting to the largest available) could reduce AI energy consumption by more than 27%, and that it can be deployed at scale. Informed right-sizing of models will become much more important in the future than we allow it to be today.
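The routing idea can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the method used in the research cited above: the tier names, complexity ceilings, energy-per-token figures and the word-count heuristic are all hypothetical placeholders. A production router would replace the heuristic with a proper classifier and use measured energy figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str
    max_complexity: float    # highest complexity score this tier should handle
    joules_per_token: float  # hypothetical energy per output token


# Hypothetical tiers, ordered from cheapest to most capable.
TIERS = [
    ModelTier("small", 0.3, 0.5),
    ModelTier("medium", 0.7, 2.0),
    ModelTier("large", 1.0, 8.0),
]


def estimate_complexity(query: str) -> float:
    """Crude placeholder: longer queries score as more complex (0.0 to 1.0)."""
    return min(len(query.split()) / 100.0, 1.0)


def route(query: str, tiers=TIERS) -> ModelTier:
    """Pick the first (cheapest) tier whose ceiling covers the estimated score."""
    score = estimate_complexity(query)
    for tier in tiers:
        if score <= tier.max_complexity:
            return tier
    return tiers[-1]  # fall back to the most capable tier


print(route("What is PUE?").name)
print(route(" ".join(["word"] * 50)).name)
```

The design choice worth noting is the ordering: the router tries the cheapest tier first and only escalates when the query appears to need it, rather than defaulting to the largest model and scaling down.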

Axiom Edge’s infrastructure is built around these principles from the ground up.
Our performance benchmarks reflect real-world inference efficiency under production conditions: tokens-per-second-per-watt, not just tokens-per-second.
For us, the constraint of energy is not an inconvenience to be managed as someone else’s problem; it is the design driver that separates infrastructure built for the long term from infrastructure built for the improvement of a vendor’s balance sheet.

We are confident that the market will eventually catch up with this framing, but the organisations that get there first — in procurement, in deployment, and in the questions they ask of their AI infrastructure providers — will have a structural first-mover advantage that compounds and saves them money over time.

The engineers who build the best engines understand that horsepower and torque are not rivals — they are partners. The same is true of tokens-per-second and tokens-per-watt.
The metric that matters is the one that effectively balances both.

Axiom Edge is a sovereign AI inference and cloud provider built for efficiency, security and national deployment. Learn more at axiom-edge.ai


No cookies. No trackers.
No 3rd-party requests.

2026 Axiom Edge