Performance

Every figure we give has a verified source.

We don't claim efficiency — we calculate it, from component specifications and peer-reviewed independent benchmarks. Every number on this page is traceable to a primary source we'll share on request.

Power — why we use maximum design power, not average

"Average power draw" suggests sustainability, but hides the real power consumption.

Quoting average power is an industry sustainability trick, because it's unknowable at design time.
Power draw depends on workload mix, ambient temperature, CPU & RAM utilisation, and storage I/O patterns that vary continuously.
For this reason you can only define average power in situ, during live operations.
Quoting an 'average' allows vendors to present a number that will rarely be achieved in production — while the real number you need to determine your facility requirements, your power contract, and your energy bills, is component design maximum power.

We use maximum component-verified draw throughout.
It's the only figure that's independently verifiable, defensible in procurement, and truly honest about what your facility needs to support.

Configuration                          Max draw (kW)  Racks  vs Autonomy
Autonomy Cloud (42RU, full rack)       23.1           1
Conventional AI estate (NVIDIA H100)   ~117           10     5.1×
Conventional cloud estate (Dell R760)  ~139           8
Combined conventional (AI + Cloud)     ~256           18     11.1×
23.1 kW: Autonomy, component-verified
11.1× lower than the combined conventional estate
~87% lower 5-year power costs
0–50°C ambient range. Standard cooling.

Sources: Component TDP specifications from Qualcomm, AMD, and SoftIron datasheets. Conventional figures from Dell PowerEdge XE9680 and R760 published specifications. Power cost calculated at £0.25/kWh commercial rate, PUE 1.2 (Autonomy) / 1.35 (conventional). Full methodology available on request.
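
The "vs Autonomy" multiples above are each conventional estate's maximum draw divided by the 23.1 kW Autonomy figure. A minimal Python sketch of that arithmetic follows, using only figures quoted on this page. The function names are illustrative. The cost helper assumes continuous operation at design maximum, which is a simplification for this sketch and not the full methodology.

```python
HOURS_PER_YEAR = 8760  # 365 days x 24 h, ignoring leap years

# Maximum design draw in kW, as quoted in the table above.
MAX_DRAW_KW = {
    "Autonomy Cloud": 23.1,
    "Conventional AI estate": 117.0,
    "Conventional cloud estate": 139.0,
    "Combined conventional": 256.0,
}


def draw_multiple(config: str, baseline: str = "Autonomy Cloud") -> float:
    """Return how many times more power `config` draws than `baseline`."""
    return MAX_DRAW_KW[config] / MAX_DRAW_KW[baseline]


def max_power_cost_gbp(draw_kw: float, pue: float,
                       rate_gbp_per_kwh: float, years: int = 5) -> float:
    """Upper-bound energy cost in GBP.

    Assumption of this sketch: the estate draws its design maximum
    continuously, and facility energy = IT draw x PUE x hours.
    """
    return draw_kw * pue * HOURS_PER_YEAR * years * rate_gbp_per_kwh


for name in MAX_DRAW_KW:
    print(f"{name}: {draw_multiple(name):.1f}x the Autonomy draw")

# Autonomy at PUE 1.2 and GBP 0.25/kWh (the rate and PUE quoted in Sources).
print(f"Autonomy 5yr ceiling: GBP {max_power_cost_gbp(23.1, 1.2, 0.25):,.0f}")
```

Because the helper prices the design maximum, it gives a ceiling for budgeting and facility sizing, not a forecast of a metered bill.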

AI inference efficiency — peer-reviewed benchmarks

Benchmarked against NVIDIA. Better on efficiency.
Competitive on throughput.

The Qualcomm AI100 Ultra cards we use by default in Autonomy Cloud are programmable AI accelerators — purpose-built for inference workloads.
In 2025, researchers at UC San Diego's National Research Platform benchmarked them directly against NVIDIA A100, H200, and AMD MI300A hardware across 15 open-source LLMs, measuring throughput (tokens per second) and energy efficiency (tokens per second per watt). Results were published at ACM PEARC '25 and are available at arXiv:2507.00418.

We show the data as it is.
The AI100 Ultra wins on energy efficiency across the majority of models, and wins outright on raw throughput for most models up to ~32B parameters.
Only at 70B+ parameters does the H200 become competitive or pull ahead on throughput, though the AI100 Ultra still matches or beats it on efficiency for some models at that scale.
This pattern is consistent with what you'd expect from a purpose-built inference accelerator versus a general-purpose GPU. Better on most, similar on the rest.
Most importantly, the AI100 Ultra is optimised by design for the energy-constrained, complex deployment environments where Autonomy Cloud operates.
Autonomy Cloud with Qualcomm is specifically built for a life on the AI Edge.

Energy efficiency — tokens per second per watt vs NVIDIA H200

Model (parameter size)  H200 tok/s/W  AI100 tok/s/W  AI100 advantage  Winner
TinyLlama-1.1B          16.04         28.40          1.77×            AI100
CodeGemma-2B            18.37         36.85          2.01×            AI100
StarCoder2-15B          10.57         27.75          2.63×            AI100
Codestral-22B           9.94          11.55          1.16×            AI100
Gemma2-27B              15.08         17.43          1.16×            AI100
Llama3.3-90B Vision     7.46          13.64          1.83×            AI100
Granite-20B             10.21         6.25           0.61×            H200
DeepSeek-70B            11.39         10.32          0.91×            H200
Llama3.3-70B            16.91         13.14          0.78×            H200

tok/s/W calculated from Table 1, Sada et al. (2025). H200 figures use single-GPU configuration. Qualcomm AI100 Ultra figures use minimum required SoC configuration per model.
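
The efficiency metric is throughput divided by power draw, and the "AI100 advantage" column is the AI100 figure divided by the H200 figure. A short Python sketch that recomputes that column from the tok/s/W values in the table; the function and variable names are illustrative:

```python
# (H200 tok/s/W, AI100 Ultra tok/s/W), copied from the table above.
EFFICIENCY = {
    "TinyLlama-1.1B": (16.04, 28.40),
    "CodeGemma-2B": (18.37, 36.85),
    "StarCoder2-15B": (10.57, 27.75),
    "Codestral-22B": (9.94, 11.55),
    "Gemma2-27B": (15.08, 17.43),
    "Llama3.3-90B Vision": (7.46, 13.64),
    "Granite-20B": (10.21, 6.25),
    "DeepSeek-70B": (11.39, 10.32),
    "Llama3.3-70B": (16.91, 13.14),
}


def tokens_per_second_per_watt(tokens_per_second: float, watts: float) -> float:
    """Energy efficiency: throughput divided by power draw."""
    return tokens_per_second / watts


def advantage(h200: float, ai100: float) -> float:
    """AI100 efficiency as a multiple of H200 efficiency (>1 means AI100 leads)."""
    return ai100 / h200


def winner(h200: float, ai100: float) -> str:
    """Name the accelerator with the higher tok/s/W."""
    return "AI100" if ai100 > h200 else "H200"


for model, (h200, ai100) in EFFICIENCY.items():
    print(f"{model:<20} {advantage(h200, ai100):.2f}x  {winner(h200, ai100)}")
```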

Raw throughput — up to ~32B params
AI100 Ultra vs single H200 GPU, tokens per second

Model           H200   AI100  Adv.
TinyLlama-1.1B  3,670  7,347  2.0×
Llama3.2-3B     3,545  7,510  2.1×
Llama3.1-8B     3,103  5,063  1.6×
StarCoder2-15B  3,990  6,260  1.6×
Granite-20B     3,680  6,840  1.9×
Qwen2.5-32B     2,550  4,927  1.9×

70B+ models — throughput & efficiency
A more mixed position...

Llama3.3-70B

Throughput (tok/s):
H200 5,366 vs AI100 4,528
Efficiency (tok/s/W):
H200 wins (16.9 vs 13.1)

DeepSeek-70B

Throughput (tok/s):
AI100 4,528 vs H200 4,333
Efficiency (tok/s/W):
H200 wins (11.4 vs 10.3)

Llama3.3-90B Vision

Throughput (tok/s):
AI100 5,961 vs H200 3,556
Efficiency (tok/s/W):
AI100 wins (13.6 vs 7.5)

What the data shows

Where the AI100 Ultra consistently leads

Energy efficiency (tok/s/W) for most models up to ~32B parameters: a factor of 1.2× to 2.6× versus H200, with Granite-20B (0.61×) the exception in the efficiency table.
Raw throughput for the same model range: typically 1.6× to 2.1× more tokens per second per card.
NB: This is the sweet spot for enterprise LLM inference: coding assistants, RAG pipelines, document analysis, and real-time classification workloads.

For Edge AI:
Optimised models, speed, and power efficiency are everything, and the AI100 Ultra on Autonomy Cloud has no parallel in Edge AI.

Where H200 is still competitive

Raw throughput for 70B-parameter models. The H200's large memory bandwidth and NVLink scaling give it some advantages for latency-sensitive, high-volume processing.

NB: If your primary business workload is serving a single 70B model at maximum throughput, the H200 may be faster — but the efficiency delta at this scale narrows quickly.

At the cloud core:
The H200 can make sense, but for Edge AI it is both undeployable and often inferior.

VM density — like-for-like workload comparison

4,769 Workload VMs delivered by Autonomy (42RU)
206 VMs per kW — Autonomy platform
34 VMs per kW — equivalent Dell R760 estate

Autonomy delivers the same VM workload at 6.1× the VM density per kW of the equivalent Dell R760/VMware estate.

Same workload output. Higher density. Less than one sixth of the power draw per virtual machine.
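
The density figures above can be checked with simple division. The Python sketch below assumes the 206 VMs per kW figure is the 4,769 VMs divided by the 23.1 kW maximum draw quoted in the power section (the page does not state this derivation explicitly, although the numbers agree). The function names are illustrative.

```python
AUTONOMY_VMS = 4769            # workload VMs per 42RU rack, from this page
AUTONOMY_MAX_DRAW_KW = 23.1    # assumption: the max draw from the power table
CONVENTIONAL_VMS_PER_KW = 34   # Dell R760 estate, from this page


def vms_per_kw(vm_count: int, draw_kw: float) -> float:
    """VM density per kW of maximum design power."""
    return vm_count / draw_kw


def watts_per_vm(density_vms_per_kw: float) -> float:
    """Power budget per VM in watts (1 kW = 1000 W)."""
    return 1000.0 / density_vms_per_kw


autonomy_density = vms_per_kw(AUTONOMY_VMS, AUTONOMY_MAX_DRAW_KW)
density_ratio = autonomy_density / CONVENTIONAL_VMS_PER_KW
# Power per VM scales inversely with density.
power_fraction = CONVENTIONAL_VMS_PER_KW / autonomy_density

print(f"Autonomy density: {autonomy_density:.0f} VMs/kW")
print(f"Density ratio:    {density_ratio:.1f}x")
print(f"Watts per VM:     {watts_per_vm(autonomy_density):.1f} W "
      f"vs {watts_per_vm(CONVENTIONAL_VMS_PER_KW):.1f} W")
print(f"Less than one sixth per VM: {power_fraction < 1 / 6}")
```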

Source: Sada et al., "Serving LLMs in HPC Clusters: A Comparative Study of Qualcomm Cloud AI 100 Ultra and High-Performance GPUs," ACM PEARC '25, Columbus OH, July 2025. arXiv:2507.00418v1. tok/s/W figures calculated from Tables 1 and 2. VM density figures calculated from component specifications — full methodology available on request.

5-year total cost of ownership

The full picture.
Every cost included.

The comparison below includes hardware, software licensing, power, facility, and staffing — every cost category that a real deployment incurs. The conventional estate figure uses published VMware VCSP v5.3 list pricing. Nothing is excluded to make either number look better than it is.

Cost element (5yr)                      Autonomy Cloud  Conventional
Hardware CapEx                          Higher          Lower
Software licensing (VMware VCSP stack)  £0              £9.4m+
Power (£0.26/kWh, 5yr)                  ~87% lower      baseline
Facility / colocation (5yr)             ~90% lower      baseline
Staffing (specialist FTE requirement)   ~12 fewer FTE   22 FTE minimum
5-year total (excl. staffing)           ~27% lower      baseline
5-year total (incl. staffing)           ~30% lower      baseline

The conventional estate's software licensing bill alone exceeds the entire Autonomy platform cost. The hardware CapEx gap closes before year three. From year three onwards, Autonomy is cheaper in every recurring cost category simultaneously: power, facility, licensing, and staffing.

Based on component specifications and VMware VCSP v5.3 published list pricing, March 2026. Staffing based on minimum viable FTE for each platform type.
Full working methodology available on request.

The methodology

We show our workings.
Ask us to share.

Every figure on this page has a traceable source.
Contact us for the full comparison document: component specs, pricing basis, assumptions, and our calculations.

Request the full methodology