Finance Internal assistant Semantic search Production

From 1.2s to 220ms in 6 weeks
Banking Knowledge Assistant

On-prem LLM with controlled RAG, audit trail and predictable TCO

220ms p95 latency (-82%)
-37% ΔTCO
1.8% hallucination rate (-82%)
1.8k users (adoption)


Executive Summary

On-prem deployment with RAG and granular ACLs over internal policies and procedures.
p95 latency reduced from 1200ms to 220ms with KV cache and stable prompt templates.
Hallucination rate reduced from 10% to 1.8% on a ground-truth dataset.
Cost per 1k tokens reduced from €0.024 to €0.006 (Medium profile, 4× GPU).
Full audit trail (E2E tracing, SIEM), zero data egress.

Before / After

Metric                        Before    After               Improvement
p95 latency                   1200ms    220ms               -82%
Cost/1k tokens                €0.024    €0.006              -75%
Hallucination rate            10%       1.8%                -82%
Internal knowledge coverage   62%       91%                 +47%
Internal adoption             -         1,800 users/month   100%

Values measured on an internal dataset and against the defined SLOs.
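The Improvement column can be checked as a relative change against the Before value. The sketch below redoes that arithmetic from the reported figures; the function and dictionary names are illustrative, not part of the case study.

```python
def relative_change(before: float, after: float) -> float:
    """Return the relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Before/after values exactly as reported in the table above.
metrics = {
    "p95 latency (ms)": (1200, 220),
    "Cost per 1k tokens (EUR)": (0.024, 0.006),
    "Hallucination rate (%)": (10, 1.8),
    "Internal knowledge coverage (%)": (62, 91),
}

# Round to whole percentage points, as the table does.
changes = {name: round(relative_change(b, a)) for name, (b, a) in metrics.items()}

for name, pct in changes.items():
    print(f"{name}: {pct:+d}%")
```

Note that the coverage figure is a relative gain (+47%), not the 29 percentage-point difference between 62% and 91%.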

Timeline

W1-2

Assessment

DPIA, SLOs, evaluation dataset, architecture.

Deliverable: Data map, Adapted Reference Architecture, RAG Plan

W3-5

Pilot

RAG + PII guardrails, Milvus, vLLM, alerting.

Deliverable: Vector index (2M chunks), Eval baseline, Alerting

Production

Canary, runbook, training.

Deliverable: SLO enforcement, Backup/DR, User training

Decisions & Trade-offs

RAG vs FT

Choice: Controlled RAG with granular ACLs

Alternatives: Fine-tuning a 13B model

Why: Evolving domain, frequent updates

Risks: Incorrect ACLs, suboptimal chunking

KPI Impact: Better coverage and factuality
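One way to picture "controlled RAG with granular ACLs" is to filter candidate chunks by the caller's group memberships before ranking them by similarity, so restricted text never reaches the prompt. The following is a minimal in-memory sketch under that assumption; `Chunk`, `retrieve`, the group names and the toy 2-dimensional embeddings are all hypothetical and do not describe the production system.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class Chunk:
    text: str
    embedding: tuple          # toy embedding vector (hypothetical)
    allowed_groups: frozenset  # groups permitted to read this chunk


def cosine(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_embedding, user_groups, chunks, k=3):
    """Apply the ACL filter first, then rank only the visible chunks."""
    visible = [c for c in chunks if c.allowed_groups & user_groups]
    ranked = sorted(
        visible,
        key=lambda c: cosine(query_embedding, c.embedding),
        reverse=True,
    )
    return ranked[:k]


CHUNKS = [
    Chunk("Wire transfer limits procedure", (1.0, 0.0), frozenset({"operations"})),
    Chunk("HR leave policy", (0.0, 1.0), frozenset({"hr"})),
    Chunk("AML escalation policy", (0.9, 0.1), frozenset({"compliance"})),
]

query = (1.0, 0.0)
ops_only = retrieve(query, frozenset({"operations"}), CHUNKS)
ops_and_compliance = retrieve(query, frozenset({"operations", "compliance"}), CHUNKS, k=2)

print([c.text for c in ops_only])
print([c.text for c in ops_and_compliance])
```

Filtering before ranking, rather than after, means a wrong ACL can only hide or expose a chunk; it cannot reorder the results a user is entitled to see.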

Serving

Choice: vLLM with adaptive batching and KV cache

Alternatives: TensorRT-LLM

Why: Good general-purpose throughput while keeping p95 < 300ms

Risks: Overly aggressive batching

Vector DB

Choice: Milvus with per-domain collections and PII masks

Why: Scalability and ACL support

Risks: Hot partitions
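The page does not say how the PII masks are implemented. As a rough illustration only, the sketch below assumes masking happens on chunk text before it is embedded and indexed, and covers just two patterns (email addresses and IBAN-shaped strings). The regular expressions and the `mask_pii` helper are hypothetical and far from exhaustive.

```python
import re

# Simplified patterns for illustration; real PII detection needs broader coverage.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
IBAN_RE = re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b")


def mask_pii(text: str) -> str:
    """Replace email addresses and IBAN-shaped strings with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return IBAN_RE.sub("[IBAN]", text)


sample = "Contact mario.rossi@example.com, IBAN IT60X0542811101000000123456."
masked = mask_pii(sample)
print(masked)
```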

Stack & Architecture

Models

13B, INT4-quantized
Embedding: e5-small

Serving

vLLM + KV cache

Vector

Milvus

Security

mTLS
KMS/HSM
SIEM feed
Air-gapped updates


SLO & KPI

Internal chat p95 < 300ms

Achieved: 220ms

Recall@k >= 0.90

Achieved: 0.92
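Both SLOs can be computed from logged data. The sketch below shows one common definition of each: a nearest-rank 95th percentile over request latencies, and Recall@k as the share of relevant documents found in the top k results. The sample latencies and document IDs are made up for illustration, and the case study may use a different percentile method or recall definition.

```python
def p95(latencies_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding issues.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant items that appear among the first k retrieved."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


# Hypothetical samples: 20 latencies from 100ms to 195ms in 5ms steps.
latencies = [100 + 5 * i for i in range(20)]
latency_p95 = p95(latencies)
slo_met = latency_p95 < 300  # SLO from this section: p95 < 300ms

# Hypothetical retrieval result for one query.
recall = recall_at_k(["d1", "d7", "d3", "d9"], {"d1", "d3", "d5"}, k=4)

print(latency_p95, slo_met, round(recall, 2))
```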

ROI & Unit Economics

Formula: ROI = (ΔProd + ΔQuality + Risk avoided) − (Capex/amortization + Opex)

Cost per 1k tokens: -75%
ΔTCO: -37%
Reduction in "where is…" support tickets: 34%
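The ROI formula can be read as yearly benefits minus yearly costs. The sketch below follows that reading and assumes the Capex term means capex spread over an amortization period in years. All input figures are invented placeholders for illustration and are not results from this case study.

```python
def roi(delta_prod, delta_quality, risk_avoided, capex, amortization_years, opex):
    """ROI = (ΔProd + ΔQuality + Risk avoided) - (Capex/amortization + Opex).

    All monetary arguments are per-year amounts in the same currency,
    except `capex`, which is spread over `amortization_years`.
    """
    benefits = delta_prod + delta_quality + risk_avoided
    costs = capex / amortization_years + opex
    return benefits - costs


# Placeholder inputs (hypothetical, EUR per year).
example_roi = roi(
    delta_prod=300_000,
    delta_quality=100_000,
    risk_avoided=50_000,
    capex=400_000,
    amortization_years=4,
    opex=150_000,
)
print(example_roi)
```

With these placeholder inputs the benefits total 450,000 and the yearly costs 250,000, so the function returns 200,000.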

Risks & Mitigations

Risk: Wrong ACLs → Mitigation: Policies tested in staging, audit.

Risk: Hot partitions → Mitigation: Per-domain sharding + balancing.

Lessons learned

Stable prompt templates → more cache hits → more consistent responses.
Granular ACLs are essential for banking compliance.
KV cache has a significant impact on recurring queries.
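The link between stable templates and cache hits is that KV-cache reuse generally depends on requests sharing an identical prompt prefix. The sketch below compares two hypothetical templates: one that keeps fixed instructions first, and one that puts a per-request ID first. It uses shared character-prefix length as a rough proxy, since real serving engines match on tokens rather than characters. The template text and function names are assumptions.

```python
# Fixed instructions that every request shares (hypothetical wording).
STABLE_PREFIX = (
    "You are the bank's internal knowledge assistant. "
    "Answer only from the provided context.\n"
)


def build_stable(question: str, context: str) -> str:
    """Fixed instructions first, variable parts last."""
    return f"{STABLE_PREFIX}Context:\n{context}\nQuestion: {question}\n"


def build_unstable(question: str, context: str, request_id: int) -> str:
    """Per-request data first, which breaks the shared prefix early."""
    return f"[request {request_id}] {STABLE_PREFIX}Context:\n{context}\nQuestion: {question}\n"


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


stable_shared = shared_prefix_len(
    build_stable("What is the wire transfer limit?", "Procedure A"),
    build_stable("How do I escalate an AML alert?", "Policy B"),
)
unstable_shared = shared_prefix_len(
    build_unstable("What is the wire transfer limit?", "Procedure A", 1001),
    build_unstable("How do I escalate an AML alert?", "Policy B", 1002),
)

print(stable_shared, unstable_shared)
```

In the stable variant the whole instruction block is shared across requests, whereas the request ID in the unstable variant ends the common prefix after a dozen characters.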

Testimonials

"For the first time we can measure costs and quality on a daily basis."
Head of Operations

