Executive Summary

  • On-prem deployment with RAG and granular ACLs over internal policies and procedures.
  • p95 latency reduced from 1200ms to 220ms with KV cache and stable prompt templates.
  • Hallucination rate reduced from 10% to 1.8% on the ground-truth dataset.
  • Cost per 1k tokens reduced from €0.024 to €0.006 (Medium profile, 4× GPU).
  • Full audit trail (E2E tracing, SIEM), zero data egress.

Before / After

Metric                        Before    After                Improvement
p95 latency                   1200ms    220ms                -82%
Cost/1k token                 €0.024    €0.006               -75%
Hallucination rate            10%       1.8%                 -82%
Internal knowledge coverage   62%       91%                  +47%
Internal adoption             0         1,800 users/month    100%

Values measured on an internal dataset against the defined SLOs.
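
The Improvement column can be reproduced as a relative change between the Before and After values. The snippet below is a minimal sketch of that arithmetic; the helper name `pct_change` is illustrative and not part of the original report. Internal adoption is left out because a relative change is undefined when the baseline is zero.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    if before == 0:
        raise ValueError("relative change is undefined for a zero baseline")
    return (after - before) / before * 100.0


# Before/After pairs taken from the table above.
rows = {
    "p95 latency (ms)": (1200, 220),
    "Cost/1k token (EUR)": (0.024, 0.006),
    "Hallucination rate (%)": (10, 1.8),
    "Internal knowledge coverage (%)": (62, 91),
}

for name, (before, after) in rows.items():
    print(f"{name}: {pct_change(before, after):+.0f}%")
```

Note that the coverage figure is a relative gain (62% to 91% is 29 percentage points, or about 47% over the baseline).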

Timeline

W1-2

Assessment

DPIA, SLO, dataset eval, architecture.

Deliverable: Data map, Adapted Reference Architecture, RAG Plan

W3-5

Pilot

RAG + PII guardrails, Milvus, vLLM, alerting.

Deliverable: Vector index (2M chunks), Eval baseline, Alerting

W6

Production

Canary, runbook, training.

Deliverable: SLO enforcement, Backup/DR, User training

Decisions & Trade-offs

RAG vs FT

Choice: Controlled RAG with granular ACLs
Alternatives: Fine-tuning a 13B model
Why: Evolving domain, frequent updates
Risks: Incorrect ACLs, suboptimal chunking
KPI Impact: Better coverage and factuality
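
To illustrate what "controlled RAG with granular ACLs" can mean in practice, here is a minimal sketch in which retrieved chunks are filtered by the caller's groups before they reach the prompt. The `Chunk` type, the group names, and the scores are hypothetical; the actual deployment may enforce ACLs inside the vector database instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    allowed_groups: frozenset  # groups permitted to read this chunk (hypothetical)
    score: float               # similarity score from the vector search (made up)


def filter_by_acl(chunks, user_groups, top_k=3):
    """Keep only chunks the user may read, then return the top_k by score."""
    visible = [c for c in chunks if c.allowed_groups & set(user_groups)]
    return sorted(visible, key=lambda c: c.score, reverse=True)[:top_k]


candidates = [
    Chunk("Expense policy v3", frozenset({"all-staff"}), 0.81),
    Chunk("AML escalation procedure", frozenset({"compliance"}), 0.93),
    Chunk("Branch opening hours", frozenset({"all-staff", "retail"}), 0.64),
]

# A user outside the compliance group never sees the AML chunk,
# even though it has the highest similarity score.
for chunk in filter_by_acl(candidates, {"all-staff"}):
    print(chunk.text)
```

Filtering before prompt assembly means an incorrect ACL shows up as a missing or leaked chunk, which is why the ACL risk is listed alongside chunking quality.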

Serving

Choice: vLLM with adaptive batching and KV cache
Alternatives: TensorRT-LLM
Why: General throughput with p95 < 300ms
Risks: Overly aggressive batching

Vector DB

Choice: Milvus with per-domain collections and PII masks
Why: Scale and ACL
Risks: Hot partitions
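
As an illustration of the PII masking step, the sketch below masks e-mail addresses and IBAN-like strings before a chunk is embedded and indexed. The regular expressions are simplified assumptions for demonstration; they are not the project's actual guardrails and would need hardening for production use.

```python
import re

# Simplified patterns, assumed for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
IBAN_RE = re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b")


def mask_pii(text: str) -> str:
    """Replace e-mail addresses and IBAN-like tokens with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = IBAN_RE.sub("[IBAN]", text)
    return text


sample = "Contact mario.rossi@example.com, IBAN IT60X0542811101000000123456"
print(mask_pii(sample))
```

Masking before indexing keeps raw identifiers out of both the stored chunks and any prompt built from them.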

Stack & Architecture

Models

  • 13B model, INT4-quantized
  • Embedding e5-small

Serving

  • vLLM + KV cache

Vector

  • Milvus

Security

  • mTLS
  • KMS/HSM
  • SIEM feed
  • Air-gapped updates

SLO & KPI

Internal chat p95 < 300ms

 Achieved: 220ms

Recall@k >= 0.90

 Achieved: 0.92
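
For readers who want to check the two SLOs on their own logs, here is a minimal sketch of both metrics. It assumes the nearest-rank definition of p95 and the usual definition of Recall@k (relevant items found in the top k, divided by all relevant items); the report does not state which exact definitions were used, and the sample data is made up.

```python
def p95(latencies_ms):
    """95th percentile using the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def recall_at_k(retrieved, relevant, k):
    """Share of relevant items that appear among the first k retrieved items."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("no relevant items for this query")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


# Made-up samples: 1ms..100ms latencies and one labeled query.
latencies = list(range(1, 101))
print("p95 within SLO:", p95(latencies) < 300)
print("recall@3:", recall_at_k(["a", "b", "c", "d"], {"a", "d", "e"}, k=3))
```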

ROI & Unit Economics

Formula: ROI = (ΔProd + ΔQuality + Risk avoided) − (Capex/amortization + Opex)
  • Cost/1k tokens: 75% reduction
  • ΔTCO: 37%
  • Reduction in "where is…" support tickets: 34%
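
The formula can be applied as a simple yearly calculation. The sketch below assumes that the amortization term means Capex divided by an amortization period in years, which the report does not spell out. Every figure in the example call is a hypothetical placeholder, not project data.

```python
def roi(delta_prod, delta_quality, risk_avoided, capex, amortization_years, opex):
    """Yearly ROI = benefits minus (amortized capex + opex), in the same currency."""
    if amortization_years <= 0:
        raise ValueError("amortization_years must be positive")
    benefits = delta_prod + delta_quality + risk_avoided
    costs = capex / amortization_years + opex
    return benefits - costs


# Hypothetical yearly figures in EUR, for illustration only.
example = roi(
    delta_prod=300_000,
    delta_quality=80_000,
    risk_avoided=50_000,
    capex=400_000,
    amortization_years=4,
    opex=150_000,
)
print(f"ROI: EUR {example:,.0f}")
```

Defined this way, ROI is an absolute yearly amount rather than a ratio, which matches the subtraction in the formula.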

Risks & Mitigations

Risk: Wrong ACLs
Mitigation: Policies tested in staging, audit.

Risk: Hot partitions
Mitigation: Per-domain sharding + balancing.

Lessons learned

  • Stable prompt templates → more cache hits → higher response consistency.
  • Granular ACLs are essential for banking compliance.
  • KV cache has a significant impact on recurring queries.
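
The link between stable templates and cache hits can be shown with a small sketch. It assumes the serving layer can reuse KV cache entries across requests that share a prompt prefix, and it uses shared characters as a rough proxy for shared tokens. The template text and function names are hypothetical.

```python
import os

# Fixed instructions placed first so that every request starts identically.
SYSTEM_PREFIX = (
    "You are an internal assistant. Answer only from the provided context.\n"
)


def stable_prompt(context: str, question: str) -> str:
    """Variable content goes after the fixed prefix."""
    return f"{SYSTEM_PREFIX}Context:\n{context}\nQuestion: {question}\n"


def unstable_prompt(context: str, question: str, request_id: str) -> str:
    """Per-request data at the start breaks the shared prefix."""
    return f"[request {request_id}]\n{stable_prompt(context, question)}"


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters two prompts have in common."""
    return len(os.path.commonprefix([a, b]))


s1 = stable_prompt("Leave policy ...", "How many vacation days do I have?")
s2 = stable_prompt("Travel policy ...", "Who approves flights?")
u1 = unstable_prompt("Leave policy ...", "How many vacation days do I have?", "a1")
u2 = unstable_prompt("Travel policy ...", "Who approves flights?", "b2")

print("stable templates share", shared_prefix_len(s1, s2), "characters")
print("unstable templates share", shared_prefix_len(u1, u2), "characters")
```

With the stable template the whole instruction block is shared between requests; with the unstable one the prompts diverge almost immediately, so little of the earlier computation can be reused.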

Testimonials

"For the first time we can measure costs and quality on a daily basis."

 Head of Operations
