Executive Summary
- On-prem deployment with RAG and granular ACLs over internal policies and procedures.
- p95 latency reduced from 1200ms to 220ms with KV cache and stable prompt templates.
- Hallucination rate reduced from 10% to 1.8% on the ground-truth dataset.
- Cost per 1k tokens reduced from €0.024 to €0.006 (Medium profile, 4× GPU).
- Full audit trail (E2E tracing, SIEM), zero data egress.
Before / After
Metric                        Before   After               Improvement
p95 latency                   1200ms   220ms               -82%
Cost/1k tokens                €0.024   €0.006              -75%
Hallucination rate            10%      1.8%                -82%
Internal knowledge coverage   62%      91%                 +47%
Internal adoption             0        1,800 users/month   100%
Values measured on the internal dataset and against the defined SLOs.
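The improvement figures for the first four rows can be reproduced as plain relative changes. The sketch below is only a sanity check on the arithmetic; the function name `relative_change` is an illustrative choice, not part of the project. The adoption row starts from a zero baseline, for which a relative change is undefined, so it is left out.

```python
def relative_change(before: float, after: float) -> float:
    """Relative change in percent; a negative value means a reduction."""
    if before == 0:
        raise ValueError("relative change is undefined for a zero baseline")
    return (after - before) / before * 100.0


# Before/after pairs copied from the table above.
metrics = {
    "p95 latency (ms)": (1200, 220),
    "Cost/1k tokens (EUR)": (0.024, 0.006),
    "Hallucination rate (%)": (10, 1.8),
    "Internal knowledge coverage (%)": (62, 91),
}

for name, (before, after) in metrics.items():
    print(f"{name}: {relative_change(before, after):+.0f}%")
```

Note that the coverage figure is a relative gain (62% to 91% is +47% relative, or +29 percentage points).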
Timeline
W1-2
Assessment
DPIA, SLO, dataset eval, architecture.
Deliverables: Data map, adapted reference architecture, RAG plan
W3-5
Pilot
RAG + PII guardrails, Milvus, vLLM, alerting.
Deliverables: 2M-chunk vector index, eval baseline, alerting
W6
Production
Canary, runbook, training.
Deliverables: SLO enforcement, backup/DR, user training
Decisions & Trade-offs
RAG vs FT
Choice: Controlled RAG with granular ACLs
Alternatives: Fine-tuning a 13B model
Why: Evolving domain, frequent updates
Risks: Wrong ACLs, suboptimal chunking
KPI Impact: Better coverage and factuality
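As a rough illustration of what "controlled RAG with granular ACLs" can mean in practice, the sketch below filters chunks by the caller's groups before ranking them, so a chunk the user may not see never reaches the prompt. Everything here (the `Chunk` type, `allowed_groups`, the in-memory list, the toy cosine ranking) is a hypothetical stand-in and not the project's actual Milvus setup.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    embedding: tuple[float, ...]
    allowed_groups: frozenset[str]  # hypothetical per-chunk ACL


def cosine(a: tuple[float, ...], b: tuple[float, ...]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_embedding, user_groups, chunks, k=3):
    """Apply the ACL filter first, then rank only the visible chunks."""
    visible = [c for c in chunks if c.allowed_groups & user_groups]
    ranked = sorted(
        visible,
        key=lambda c: cosine(query_embedding, c.embedding),
        reverse=True,
    )
    return ranked[:k]


chunks = [
    Chunk("Leave policy", (1.0, 0.0), frozenset({"hr", "all-staff"})),
    Chunk("Treasury limits", (0.9, 0.1), frozenset({"treasury"})),
]
for chunk in retrieve((1.0, 0.0), frozenset({"all-staff"}), chunks):
    print(chunk.text)
```

Filtering before ranking (rather than after) is the design choice that matters: a post-filter can silently return fewer than k results and still leaks restricted chunks into the similarity step.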
Serving
Choice: vLLM with adaptive batching and KV cache
Alternatives: TensorRT-LLM
Why: General throughput with p95 < 300ms
Risks: Overly aggressive batching
Vector DB
Choice: Milvus with per-domain collections and PII masks
Why: Scale and ACL
Risks: Hot partitions
Stack & Architecture
Models
- 13B, INT4-quantized
- Embedding e5-small
Serving
- vLLM + KV cache
Vector
- Milvus
Security
- mTLS
- KMS/HSM
- SIEM feed
- Air-gapped updates
SLO & KPI
Internal chat p95 < 300ms
Achieved: 220ms
Recall@k >= 0.90
Achieved: 0.92
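A minimal sketch of how the two SLOs above could be checked from raw measurements. It assumes a nearest-rank percentile and the usual definition of recall@k (relevant items found in the top k, divided by all relevant items); the function names are illustrative, and the project's actual evaluation tooling is not described in this document.

```python
def p95(latencies_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]


def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant items that appear in the top-k retrieved items."""
    if not relevant:
        raise ValueError("need at least one relevant item")
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)


def slo_met(p95_ms, recall):
    """Thresholds taken from the SLO section: p95 < 300ms, Recall@k >= 0.90."""
    return p95_ms < 300 and recall >= 0.90


# The achieved values reported above satisfy both thresholds.
print(slo_met(220, 0.92))
```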
ROI & Unit Economics
Formula: ROI = (ΔProd + ΔQuality + Risk avoided) − (Capex/amortization period + Opex)
- Cost/1k tokens: -75%
- ΔTCO: 37%
- Reduction in "where is..." support tickets: 34%
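The ROI formula can be written out as a small function. This sketch reads the amortization term as Capex divided by an amortization period in years, which is an assumption about the shorthand in the formula. All amounts in the example call are made-up placeholders chosen only to show the arithmetic; they are not project figures.

```python
def roi(delta_prod, delta_quality, risk_avoided, capex, amortization_years, opex):
    """Annual ROI = benefits minus (amortized capex + opex), same currency unit."""
    if amortization_years <= 0:
        raise ValueError("amortization period must be positive")
    benefits = delta_prod + delta_quality + risk_avoided
    costs = capex / amortization_years + opex
    return benefits - costs


# Hypothetical placeholder amounts in EUR per year, NOT measured values.
example = roi(
    delta_prod=300_000,
    delta_quality=100_000,
    risk_avoided=50_000,
    capex=400_000,
    amortization_years=4,
    opex=150_000,
)
print(example)  # 450000 - (100000 + 150000) = 200000.0
```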
Risks & Mitigations
Risk: Wrong ACLs → Mitigation: Policies tested in staging, audit.
Risk: Hot partitions → Mitigation: Per-domain sharding + balancing.
Lessons learned
- Stable prompt templates → more cache hits → more consistent responses.
- Granular ACLs are essential for banking compliance.
- KV cache has a significant impact on recurring queries.
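The link between stable prompt templates and cache hits can be shown with a toy model. The sketch below is not how vLLM implements its KV cache; it only assumes that cached work is reusable when the leading part of the prompt is byte-identical across requests. `PrefixCache` and both prefix builders are hypothetical names.

```python
import hashlib


class PrefixCache:
    """Toy cache keyed by a hash of the prompt prefix (illustration only)."""

    def __init__(self):
        self._seen = set()
        self.hits = 0
        self.misses = 0

    def lookup(self, prompt_prefix: str) -> bool:
        key = hashlib.sha256(prompt_prefix.encode("utf-8")).hexdigest()
        if key in self._seen:
            self.hits += 1
            return True
        self._seen.add(key)
        self.misses += 1
        return False


TEMPLATE = "You are an internal policy assistant. Answer only from the context.\n"


def stable_prefix(request_id: int) -> str:
    # Identical for every request, so the prefix is reusable.
    return TEMPLATE


def unstable_prefix(request_id: int) -> str:
    # A per-request value at the start changes the prefix every time.
    return f"[request {request_id}] " + TEMPLATE


def run(prefix_builder, n_requests: int = 5) -> PrefixCache:
    cache = PrefixCache()
    for request_id in range(n_requests):
        cache.lookup(prefix_builder(request_id))
    return cache


print(run(stable_prefix).hits, run(unstable_prefix).hits)
```

In this model, anything that varies per request (timestamps, request IDs, user names) belongs after the shared prefix rather than before it.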
Testimonials
"For the first time we can measure costs and quality on a daily basis."
Head of Operations