Smaller models, faster results: what 18 months of production AI taught me
Hey there,
I spent the last 18 months building a production data lake that processes meetings through LangGraph agents. Every class, every office hours session, every conversation. All captured and processed to generate insights, extract action items, and create structured notes. The result is essentially an indexed database with a meta-representation of almost every professional interaction in my life.
When I started, I did what most people do: I threw everything at the biggest, most expensive models available. If you're going to build something ambitious, use the best tools, right?
Wrong.
What I discovered was counterintuitive, especially for specific, well-defined tasks. When I needed action items extracted from meeting notes in a precise format that could feed into my task management system, the massive models weren't just overkill. They were often worse. More latency. More cost. And surprisingly, not always more accurate for the specific task at hand.
I ended up settling on much smaller models. Llama 3.1 8B turned out to be remarkably effective for these focused extraction tasks. The key insight: when you know exactly what you need, a smaller model trained or prompted for that specific job often outperforms a general-purpose giant.
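To make that concrete, here's roughly the shape of one of those focused extraction calls. This is a minimal sketch rather than my production pipeline: it assumes a local OpenAI-compatible endpoint (for example, Ollama serving llama3.1:8b), and the three-field schema is purely illustrative.

    import json
    from openai import OpenAI

    # Assumption: a local OpenAI-compatible server (e.g., Ollama) at this URL.
    client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

    PROMPT = """Extract action items from the meeting notes below.
    Return ONLY a JSON array of objects with keys "task", "owner", and "due".
    Use null for anything you cannot determine.

    Notes:
    {notes}
    """

    def extract_action_items(notes: str) -> list[dict]:
        response = client.chat.completions.create(
            model="llama3.1:8b",   # small model, narrow job
            temperature=0,         # deterministic output for structured extraction
            messages=[{"role": "user", "content": PROMPT.format(notes=notes)}],
        )
        # Production code would validate this against the task manager's schema
        # and retry on malformed JSON; omitted here for brevity.
        return json.loads(response.choices[0].message.content)

    sample = "Dana will send the Q3 budget draft by Friday. Colin to book the venue."
    print(extract_action_items(sample))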
NVIDIA'S RESEARCH VALIDATES THIS
Then NVIDIA dropped their Nemotron-Flash paper, and suddenly my empirical observations had rigorous research backing them up.
The paper opens with a simple but important observation: "Most existing SLM designs prioritize parameter reduction to achieve efficiency; however, parameter-efficient models do not necessarily yield proportional latency reductions."
This is the crucial distinction. When we talk about "efficient" models, we usually mean fewer parameters. But parameters aren't what matters in production. Latency and throughput are what matter. A model that's 50% smaller but takes the same time to respond hasn't actually improved your user experience or reduced your costs.
NVIDIA's team asked a different question: "how fast can we make this while maintaining quality?" The results are striking:
- Nemotron-Flash-3B achieves 1.7× lower latency and 6.4× higher throughput compared to Qwen2.5-3B
- Nemotron-Flash-1B delivers 45.6× higher throughput than Qwen3-0.6B while achieving 5.5% higher accuracy
Read that again: the smaller model is both faster AND more accurate.
WHY THIS MATTERS FOR AGENTIC WORKFLOWS
My meeting pipeline is just one example, but the pattern applies everywhere we're using LLM decision-making within our graphs and agentic workflows. When you're building these systems, where models are called repeatedly in loops, making decisions, calling tools, and iterating, latency compounds. A 100ms improvement per call becomes seconds saved per workflow. Seconds become minutes. Minutes become the difference between a system that feels responsive and one that feels broken.
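The back-of-the-envelope math is worth spelling out. These numbers are illustrative, not measurements from my pipeline:

    # Illustrative figures only: how per-call latency savings compound in an agent loop.
    calls_per_workflow = 25          # decisions, tool calls, retries in one graph run
    saved_per_call_s = 0.100         # 100 ms shaved off each model call
    workflows_per_day = 500

    saved_per_workflow_s = calls_per_workflow * saved_per_call_s        # 2.5 s
    saved_per_day_min = saved_per_workflow_s * workflows_per_day / 60   # ~20.8 min
    print(f"{saved_per_workflow_s:.1f} s per workflow, {saved_per_day_min:.1f} min per day")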
And it's not just latency. When you're optimizing for production, you're juggling multiple constraints:
- Speed: How fast can you get a response?
- Reliability: How consistent are the outputs?
- Accuracy: How correct are the results for your specific task?
- Cost: What's this going to cost at scale?
Small, purpose-fit models often win on all four dimensions for well-defined tasks within larger workflows.
THE TECHNICAL SURPRISE
One finding from the paper that surprised me: the conventional wisdom that "deeper is better" doesn't hold when you optimize for latency.
The researchers found that for a given latency budget, there's an optimal balance between model depth and width. Going deeper doesn't always help, and can actively hurt performance in latency-sensitive scenarios. They also discovered that hybrid architectures, combining different kinds of sequence-mixing layers (standard attention, Mamba2, DeltaNet), can outperform pure architectures.
A BROADER MOVEMENT
What excites me most is that this isn't just NVIDIA in a research lab. There's a whole community pushing in this direction: people optimizing models to run on a single 4090, on smaller infrastructure, even on phones. The democratization of AI isn't just about access to APIs. It's about being able to run capable models on hardware you own, at speeds that make real applications possible.
PRACTICAL TAKEAWAYS
If you're building AI workflows today, here's what I take from both the research and my own experience:
1. Don't default to the biggest model you can afford. Profile your actual latency requirements and work backward.
2. Throughput matters as much as accuracy. A model that's 2% less accurate but 10× faster might serve your users better, and cost far less.
3. Match model size to task specificity. General reasoning? Maybe you need something larger. Extracting structured data in a known format? A small, fast model often wins.
4. Test on your actual hardware. The paper emphasizes that optimal architectures vary by deployment target. What works on an H100 might not be optimal for edge deployment. A minimal profiling sketch follows this list.
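Here's the kind of profiling I mean in points 1, 2, and 4. It's a sketch under assumptions: each candidate model sits behind an OpenAI-compatible endpoint (vLLM, Ollama, or similar), and the endpoint URLs, model names, and test prompt are placeholders for your own.

    import time
    from openai import OpenAI

    # Placeholder candidates; point these at whatever you actually serve.
    CANDIDATES = {
        "small-8b": ("http://localhost:11434/v1", "llama3.1:8b"),
        "large-70b": ("http://localhost:8000/v1", "llama3.1:70b"),
    }
    PROMPT = "Extract action items from: 'Dana will send the budget draft by Friday.'"

    def mean_latency(base_url: str, model: str, runs: int = 10) -> float:
        client = OpenAI(base_url=base_url, api_key="not-needed")
        start = time.perf_counter()
        for _ in range(runs):
            client.chat.completions.create(
                model=model,
                max_tokens=128,
                messages=[{"role": "user", "content": PROMPT}],
            )
        return (time.perf_counter() - start) / runs

    for name, (url, model) in CANDIDATES.items():
        print(f"{name}: {mean_latency(url, model):.2f} s/call on this hardware")

Pair the latency numbers with a small eval set for your actual task, and the accuracy-versus-speed tradeoff in point 2 becomes a measurement instead of a guess.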
I wrote up the full story with more details on my blog.
Read the full post: https://colinmcnamara.com/blog/small-language-models-ai-workflows-nemotron-flash/
RESOURCES:
If you want to dig deeper, here are the key resources:
- NVIDIA Nemotron-Flash Paper (arXiv): https://arxiv.org/abs/2511.18890
- Nemotron-Flash-1B on Hugging Face: https://huggingface.co/nvidia/Nemotron-Flash-1B
- Nemotron-Flash-3B on Hugging Face: https://huggingface.co/nvidia/Nemotron-Flash-3B
- Nemotron-Flash-3B-Instruct: https://huggingface.co/nvidia/Nemotron-Flash-3B-Instruct
Colin