TCO Calculator Expert

Compare Cloud API, Cloud GPU Rental, and On-Premise deployments with real-time hardware validation and cost breakdown. Built for senior architects making production decisions.

Pricing dataset

Verified October 2025 pricing

How to use

Run the comparison in three passes

  1. Lock the workload first.

    Set queries/month + token mix on the Cloud API tab. The values auto-sync to Cloud GPU and On-Premise so every scenario uses the same demand curve.

  2. Pick the model pair.

    Choosing a provider automatically selects the equivalent open-source model and GPU requirements. Reverse sync buttons keep API ⇄ Cloud GPU ⇄ On-Prem aligned.

  3. Validate TCO + VRAM.

    Use the sticky bar for 3-year totals, then open the VRAM + power widgets to ensure the GPU count/quantization is production-safe before exporting the report.

Quick profiles

Jump-start with a persona checklist

A. Regulated enterprise

LLM-based knowledge assistant, 500K queries/month.

  • Stay on Claude 3.7 Sonnet ↔ Llama 3.3 70B FP8.
  • Toggle Security filter on the architecture page for context.
  • Compare Cloud GPU H100 commit vs dual H200 racks.

B. Industrial co-pilot

Streaming telemetry, 1.2M queries/month.

  • Increase queries + concurrency, pick GPT-4o ↔ DeepSeek 67B.
  • Use GPU auto-suggestion to size 4× H100 or MI300X.
  • Record throughput widget values in the session notes.

C. Air-gapped research pod

80K queries/month, strict sovereignty.

  • Select Air-Gapped scenario on the architecture page first.
  • Switch API tab to Gemini 1.5 Pro ↔ Qwen 32B.
  • Keep On-Prem GPU discount at 0 and bump opex multipliers.

1 USD = 0.92 EUR

Cloud API: €0 (3-year TCO)
Cloud GPU: €0 (3-year TCO)
On-Premise: €0 (3-year TCO)

📊 Default Scenario: Enterprise with Existing Datacenter (500K queries/month)

Profile: Mid-sized enterprise, 500K queries/month, Claude 3.7 Sonnet equivalent (Llama 3.3 70B FP8), 40% GPU discount, industrial power (€0.12/kWh), automated DevOps (0.05 FTE), existing datacenter.

Real pricing (Oct 2025): Cloud API €227K/3yr ($3/$15 per M input/output tokens). Lambda H100 @ $1.85/hr = €91K/3yr (2× H100 80GB). On-premise 2× H200: €44K capex + €39K opex = €83K total. Breakeven at ~18 months.

🎯 Key Insight: The larger/smarter the LLM, the more on-premise wins! Small models (8B): Cloud API best (€10K). Medium models (70B): On-premise wins (€83K vs €91K Cloud GPU). Large models (671B): On-premise dominates with 62-82% savings. At 500K queries/month, self-hosting starts making financial sense. 5yr+ horizon: on-premise wins dramatically.
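
The scenario numbers above can be sanity-checked with a few lines of arithmetic. The sketch below is a simplified reconstruction, not the calculator's actual code: it reuses the token mix (1200 in / 600 out), the $3/$15 per-million-token prices, the $1.85/hr H100 rate and the 0.92 USD→EUR rate from this page, and it ignores egress, storage and discounts, so its totals land slightly under the headline figures.

```python
# Rough reproduction of the default-scenario arithmetic (not the calculator's code).
USD_TO_EUR = 0.92
MONTHS = 36

queries_per_month = 500_000
tokens_in, tokens_out = 1200, 600          # per query, from the Workload panel

# Cloud API: pay per token ($3 input / $15 output per million tokens).
api_monthly_eur = (queries_per_month
                   * (tokens_in * 3 + tokens_out * 15) / 1_000_000
                   * USD_TO_EUR)
api_tco = api_monthly_eur * MONTHS

# Cloud GPU: 2x H100 rented 24/7 at $1.85/hr each.
gpu_tco = 1.85 * 2 * 24 * 365 * 3 * USD_TO_EUR

# On-premise: capex up front plus flat 3-year opex (scenario figures).
onprem_capex, onprem_opex_3yr = 44_000, 39_000
onprem_tco = onprem_capex + onprem_opex_3yr

def breakeven_month(capex, opex_per_month, alternative_per_month):
    """First month where cumulative on-prem spend drops below the alternative."""
    for month in range(1, MONTHS + 1):
        if capex + opex_per_month * month <= alternative_per_month * month:
            return month
    return None

print(f"Cloud API  ~EUR {api_tco:,.0f} / 3yr")
print(f"Cloud GPU  ~EUR {gpu_tco:,.0f} / 3yr")
print(f"On-prem    ~EUR {onprem_tco:,.0f} / 3yr")
# The full calculator adds egress, storage and discount structure, so its
# ~18-month breakeven figure differs from this stripped-down estimate.
print("Breakeven vs API (months):",
      breakeven_month(onprem_capex, onprem_opex_3yr / MONTHS, api_monthly_eur))
```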

Cloud API Configuration

📊 Workload

🔄 Auto-synced with Cloud GPU scenario
Typical RAG: context + question
⚠️ Output tokens cost 2-5× more than input!

💰 Pricing

Typical for RAG with document retrieval

💵 Cost Summary

Input tokens cost: €0
Output tokens cost: €0
Egress bandwidth: €0
Monthly Total: €0
Annual Total: €0
3-Year TCO: €0
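
The summary lines above reduce to a per-token formula. A minimal sketch, assuming prices are quoted per million tokens in USD and using an illustrative $0.09/GB egress rate (the function name and defaults are this example's, not the calculator's internals):

```python
def cloud_api_monthly_eur(queries, tokens_in, tokens_out,
                          price_in_per_m, price_out_per_m,
                          egress_gb=0.0, egress_usd_per_gb=0.09,
                          usd_to_eur=0.92):
    """Monthly Cloud API cost: token charges plus egress, converted to EUR."""
    input_cost = queries * tokens_in / 1_000_000 * price_in_per_m
    output_cost = queries * tokens_out / 1_000_000 * price_out_per_m
    egress_cost = egress_gb * egress_usd_per_gb
    return (input_cost + output_cost + egress_cost) * usd_to_eur

# Default scenario: 500K queries, 1200 in / 600 out, $3/$15 per M tokens.
monthly = cloud_api_monthly_eur(500_000, 1200, 600, 3, 15)
print(f"Monthly ~EUR {monthly:,.0f}, 3-year TCO ~EUR {monthly * 36:,.0f}")
```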

Cloud GPU Configuration

🤖 Model Selection

📊 Total VRAM Required: 106 GB
  • Model weights: 70 GB
  • KV cache: 18 GB
  • Safety margin (20%): 18 GB
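
The breakdown follows a simple rule: model weights at the chosen precision, plus KV cache, plus the safety margin, then enough GPUs to cover the total. A rough sketch, assuming the 20% margin is applied to weights + KV cache (an assumption, not necessarily the calculator's exact rule):

```python
import math

def vram_plan(params_b, bytes_per_param, kv_cache_gb, gpu_vram_gb, margin=0.20):
    """Estimate required VRAM and GPU count: weights + KV cache + safety margin."""
    weights_gb = params_b * bytes_per_param     # 70B at FP8 (1 byte/param) ~= 70 GB
    total_gb = (weights_gb + kv_cache_gb) * (1 + margin)
    return total_gb, math.ceil(total_gb / gpu_vram_gb)

# Llama 3.3 70B in FP8, ~18 GB KV cache, H100 80GB cards.
total, gpus = vram_plan(70, 1.0, 18, 80)
print(f"~{total:.0f} GB total -> {gpus}x H100 80GB")
```

With the default H100 80GB cards this confirms the 2-GPU sizing used in the Hardware section.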

🖥️ Hardware

Auto-filled from provider, or enter custom rate

💾 Storage & Networking

$0.10/GB/month typical

📊 Workload

🔄 Auto-synced with API scenario
1200 input + 600 output = 1800 total
🚀 Performance:
  • Throughput: 480 tokens/sec
  • Max queries/hour: 1,570
  • GPU utilization: 87%
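
These widget values can be cross-checked against the synced workload. The sketch below counts every input and output token against aggregate throughput, which is pessimistic (prefill runs much faster than decode, and the widget's own model weights things differently), so treat it as a lower-bound sanity check rather than a reproduction of the 1,570 figure:

```python
def capacity_check(throughput_tok_s, tokens_per_query, queries_per_month,
                   utilization=0.87):
    """Sustainable queries/hour vs the workload's average hourly demand."""
    sustainable = throughput_tok_s * 3600 * utilization / tokens_per_query
    avg_demand = queries_per_month / 730        # ~730 hours per month
    return sustainable, avg_demand

sustainable, demand = capacity_check(480, 1800, 500_000)
print(f"~{sustainable:,.0f} queries/hour sustainable vs ~{demand:,.0f} average demand")
```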

💵 Cost Summary

GPU rental (24/7): €0
Storage: €0
Egress bandwidth: €0
Monthly Total: €0
Annual Total: €0
3-Year TCO: €0
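
The rental math is a flat 24/7 hourly rate plus storage and egress. A minimal sketch, assuming 500 GB of storage and the $0.10/GB/month rate noted above (both illustrative defaults):

```python
def cloud_gpu_monthly_eur(gpu_count, usd_per_gpu_hr, storage_gb,
                          storage_usd_per_gb=0.10, egress_eur=0.0,
                          hours=730, usd_to_eur=0.92):
    """Monthly cost of renting GPUs 24/7, plus storage and egress."""
    rental = gpu_count * usd_per_gpu_hr * hours
    storage = storage_gb * storage_usd_per_gb
    return (rental + storage) * usd_to_eur + egress_eur

monthly = cloud_gpu_monthly_eur(2, 1.85, 500)   # 2x H100 @ $1.85/hr, 500 GB storage
print(f"Monthly ~EUR {monthly:,.0f}, 3-year ~EUR {monthly * 36:,.0f}")
```

With 2× H100 at $1.85/hr this lands right around the €91K three-year figure quoted in the default scenario.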

On-Premise Configuration

🤖 Model Selection

📊 Total VRAM Required: 106 GB
  • Model weights: 70 GB
  • KV cache: 18 GB
  • Safety margin (20%): 18 GB

🖥️ Hardware Capex

2× H200 = sufficient for Llama 3.3 70B FP8
30-40% typical for multi-GPU enterprise orders
AMD EPYC 9554 or similar
Total Capex: $0
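
Capex is list price minus the negotiated multi-GPU discount, plus the host server. A sketch with illustrative prices (roughly $30K per H200 and $12K for the EPYC host, chosen so the example lands near the scenario's €44K; real quotes will differ):

```python
def onprem_capex_usd(gpu_list_usd, gpu_count, discount, server_usd):
    """Hardware capex: discounted GPUs plus the host server."""
    return gpu_list_usd * gpu_count * (1 - discount) + server_usd

capex_usd = onprem_capex_usd(30_000, 2, 0.40, 12_000)   # 2x H200, 40% discount
print(f"Capex ~${capex_usd:,.0f} (~EUR {capex_usd * 0.92:,.0f})")
```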

⚡ Power & Cooling

1.2 = excellent datacenter, 1.6 = office
Industrial rate (€0.12/kWh) vs residential (€0.18-0.30/kWh)
Monthly Power: €0
Total TDP: 0W × PUE × hours
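
Monthly power is total TDP scaled by PUE, hours and the electricity rate. A minimal sketch, assuming ~700 W per H200 plus ~400 W for the host system (approximate TDP figures, not measured draw):

```python
def monthly_power_eur(tdp_watts, pue, eur_per_kwh, hours=730):
    """Electricity cost: IT load x PUE x hours, converted from W to kWh."""
    return tdp_watts * pue * hours / 1000 * eur_per_kwh

# 2x H200 (~700 W each) + host system (~400 W), PUE 1.2, EUR 0.12/kWh.
print(f"~EUR {monthly_power_eur(2 * 700 + 400, 1.2, 0.12):,.0f} / month")
```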

🔧 Operational Costs (Annual)

0.05 FTE = ~2hr/week (fully automated k8s/vLLM)
Shared DevOps/MLOps team cost allocation
Incremental cost (existing datacenter connectivity)

💵 Cost Summary

Capex (amortized 36 months): €0
Power & cooling: €0
Maintenance: €0
IT staff: €0
ISP: €0
Monthly Total: €0
Annual Total: €0
3-Year TCO: €0
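
The monthly total amortizes capex over 36 months and adds the recurring lines. A sketch reusing the illustrative numbers from the capex and power examples, with assumed maintenance, staffing and ISP figures, so it lands somewhat below the scenario's €83K three-year total (the calculator tracks more opex lines):

```python
def onprem_monthly_eur(capex_eur, power_eur, maintenance_yr,
                       fte_fraction, fte_cost_yr, isp_month, months=36):
    """Monthly on-prem cost: amortized capex + power + maintenance + staff + ISP."""
    return (capex_eur / months + power_eur
            + maintenance_yr / 12
            + fte_fraction * fte_cost_yr / 12
            + isp_month)

# EUR 44K capex, ~EUR 190 power, EUR 2K/yr maintenance, 0.05 FTE @ EUR 90K, EUR 100 ISP.
monthly = onprem_monthly_eur(44_000, 190, 2_000, 0.05, 90_000, 100)
print(f"Monthly ~EUR {monthly:,.0f}, 3-year ~EUR {monthly * 36:,.0f}")
```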

📈 3-Year TCO Comparison

Metric              Cloud API    Cloud GPU    On-Premise
Monthly Cost        €0           €0           €0
3-Year TCO          €0           €0           €0
Breakeven vs API    -            -            -
ROI over 3 years    -            -            -
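
The Breakeven and ROI rows compare each self-managed option against the Cloud API baseline. One common definition is sketched below (savings divided by the option's own TCO); the calculator's exact formula may differ:

```python
def roi_over_period(option_tco, api_tco):
    """Savings vs the Cloud API baseline, relative to the option's own spend."""
    return (api_tco - option_tco) / option_tco

# Default-scenario figures: on-premise EUR 83K vs Cloud API EUR 227K over 3 years.
print(f"On-premise ROI vs API: {roi_over_period(83_000, 227_000):.0%}")
```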

💡 Recommendations

📥 Export Results