These experiments are making me feel so far behind the curve. Things change fast, but some things are always going to hold true. Gonna be in SF next week on a work trial, so hopefully the gradient of my learning curve can steepen a bit.
So I had a mild shock while iterating on a quantized LLaMA 3B model. The quality of the pretrained model plus inference setup turned out to be roughly equivalent to a gold-plated product running on Opus for compute. Just in terms of the “smell”, this felt equivalent to something I had spent 4 months building.
And I could have greater control over TPS (tokens per second), TTFT (time to first token), and quality (scored with Opus-as-a-judge, though I’m still working on that part).
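For the judge piece, the core loop is small. Here is a minimal sketch assuming the Anthropic Python SDK; the model name and the crude 1–10 rubric are placeholders, not a tested setup:

```python
# Minimal LLM-as-judge sketch using the Anthropic Python SDK.
# The model name and scoring rubric below are placeholders.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def judge(prompt: str, candidate: str) -> int:
    """Ask an Opus-class model to score a local model's answer from 1 to 10."""
    message = client.messages.create(
        model="claude-opus-4-1",  # swap in whatever Opus version is current
        max_tokens=8,
        messages=[{
            "role": "user",
            "content": (
                "Rate the following answer to the prompt on a 1-10 scale for "
                "correctness and helpfulness. Reply with the number only.\n\n"
                f"Prompt: {prompt}\n\nAnswer: {candidate}"
            ),
        }],
    )
    # Assumes the judge complies with "number only"; a real harness would parse defensively.
    return int(message.content[0].text.strip())
```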
I read about the trade-offs between quality, throughput, and latency in Philip Kiely’s Inference Engineering (free book, quite awesome).
I think there’s an underlying compute-economics story here. If we hold useful capability per marginal inference dollar fixed, there’s effectively a production possibility frontier between reasoning quality, latency, and throughput. Perhaps privacy and deployment flexibility belong on it too, though I feel like those are secondary for non-cost-conscious buyers.
As inference costs compress, the optimal architecture shifts away from centralized frontier-only models toward smaller specialized models embedded directly inside workflows. SLMs (small language models) may have lower absolute capability, but they often have much higher capability density, i.e. economically useful intelligence per unit compute.
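To make capability density concrete, here is a back-of-the-envelope sketch. Every number in it is made up for illustration; none of these are benchmarks or real prices:

```python
# Toy capability-density comparison; all figures are hypothetical.
# Capability density here = (task success rate) / (dollars per 1M output tokens).

def capability_density(success_rate: float, usd_per_million_tokens: float) -> float:
    """Economically useful intelligence per inference dollar (toy definition)."""
    return success_rate / usd_per_million_tokens

# Hypothetical figures for a narrow, repetitive task, not measurements.
local_3b = capability_density(success_rate=0.85, usd_per_million_tokens=0.10)
frontier = capability_density(success_rate=0.98, usd_per_million_tokens=75.00)

print(f"local 3B: {local_3b:.2f} success per dollar-per-Mtok")   # 8.50
print(f"frontier: {frontier:.3f} success per dollar-per-Mtok")   # 0.013
# On tasks the small model can handle at all, its capability density can be
# orders of magnitude higher even though its absolute capability is lower.
```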
That changes the deployment equilibrium. Instead of paying frontier-model costs for every interaction, systems increasingly route narrow or repetitive tasks to cheap local models and reserve expensive reasoning for only the hardest queries.
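In code, the routing pattern is roughly the sketch below. The helper functions, the confidence heuristic, and the threshold are all hypothetical stand-ins, not a real implementation:

```python
# Sketch of cheap-first routing: try the local model, escalate on low confidence.
# call_local, call_frontier, and the 0.8 threshold are hypothetical placeholders.

def call_local(prompt: str) -> tuple[str, float]:
    """Stub for a quantized 3B served locally; returns (answer, confidence).
    In practice confidence might come from e.g. mean token logprob."""
    return "local answer", 0.9

def call_frontier(prompt: str) -> str:
    """Stub for an expensive frontier-model API call (e.g. Opus)."""
    return "frontier answer"

def route(prompt: str, threshold: float = 0.8) -> str:
    answer, confidence = call_local(prompt)
    if confidence >= threshold:
        return answer                 # cheap path covers most traffic
    return call_frontier(prompt)      # reserve expensive reasoning for hard queries

print(route("summarize this support ticket"))
```

The interesting design question is the confidence signal: mean token logprob is a cheap first cut, and a judge score like the sketch above is slower but closer to what you actually care about.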