RAG token explosion and what to do about it
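Retrieval plus generation multiplies token counts, since every retrieved chunk is billed as input on top of the prompt and the generated answer; eval-gated routing to mini-class models is often the largest absolute savings lever.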
Up to 638× spread between most and least expensive compliant routes for identical workloads at the same quality floor (o10 State of Inference Spend 2026).
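As a rough illustration of eval-gated routing, the sketch below picks the cheapest route whose offline eval score clears a quality floor, then prices a single RAG request on it. Everything here is an assumption made for illustration: the `Route` fields, the route names, the per-million-token prices, and the eval scores are hypothetical and do not come from the cited report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    name: str
    usd_per_million_input: float   # hypothetical price
    usd_per_million_output: float  # hypothetical price
    eval_score: float              # hypothetical offline eval pass rate, 0..1


def request_cost(route, prompt_tokens, retrieved_tokens, output_tokens):
    """Cost in USD of one RAG request; retrieved chunks count as input tokens."""
    input_tokens = prompt_tokens + retrieved_tokens
    return (
        input_tokens * route.usd_per_million_input
        + output_tokens * route.usd_per_million_output
    ) / 1_000_000


def cheapest_compliant(routes, quality_floor, prompt_tokens,
                       retrieved_tokens, output_tokens):
    """Return the lowest-cost route whose eval score meets the quality floor."""
    compliant = [r for r in routes if r.eval_score >= quality_floor]
    if not compliant:
        raise ValueError("no route meets the quality floor")
    return min(
        compliant,
        key=lambda r: request_cost(
            r, prompt_tokens, retrieved_tokens, output_tokens
        ),
    )


# Hypothetical route table; replace with measured prices and eval results.
ROUTES = [
    Route("large", 10.0, 30.0, 0.95),
    Route("mini", 0.5, 1.5, 0.91),
    Route("nano", 0.1, 0.4, 0.78),
]

if __name__ == "__main__":
    # 200 prompt tokens, 6,000 retrieved tokens, 300 generated tokens.
    chosen = cheapest_compliant(ROUTES, 0.90, 200, 6000, 300)
    cost = request_cost(chosen, 200, 6000, 300)
    print(f"{chosen.name}: ${cost:.5f} per request")
```

The gate is the `eval_score >= quality_floor` filter: a cheaper route is only eligible once it has passed the same evaluation as the expensive one. Because the retrieved tokens dominate the input in this example, the input price is what drives the cost difference between routes.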