
Building AI Agents That Actually Work in Production

There’s a demo version of every AI agent and then there’s the version that has to work at 2am when a founder is stressed about their pitch and your product is the thing they’re counting on.

I’ve been building the AI coaching agent at LEANSTACK for the better part of two years now. When we started, the hardest engineering problem felt like it was going to be the LLM integration itself—the prompting, the context windows, the model selection. It turns out those were the easy parts.

The problems nobody talks about in the demos

The real challenges in production AI systems are operational. They’re the kind of problems that only appear under the weight of real users with real expectations:

Latency compounds anxiety. When someone is working through a high-stakes business problem, a 4-second response feels like abandonment. We had to rethink every part of our streaming and caching strategy before users stopped rage-closing the chat.
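The core of that rework is serving tokens as they arrive instead of buffering the whole completion, so perceived latency becomes time-to-first-token rather than time-to-last. A minimal sketch of the pattern (the `fake_model` generator is a stand-in for a real streaming LLM API, which will have a different interface):

```python
import time

def stream_response(token_source):
    """Relay tokens to the client as they arrive instead of buffering
    the full completion. The user sees progress almost immediately."""
    for token in token_source:
        yield token

def fake_model(tokens, delay=0.01):
    # Hypothetical stand-in for a streaming LLM API.
    for t in tokens:
        time.sleep(delay)
        yield t

start = time.monotonic()
first_token_at = None
out = []
for tok in stream_response(fake_model(["Hello", ",", " world"])):
    if first_token_at is None:
        # Perceived latency: how long until *something* appears.
        first_token_at = time.monotonic() - start
    out.append(tok)
total = time.monotonic() - start
```

Caching partial results and pre-warming connections attack the same number from the other side, but streaming is the change users feel first.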

Token costs are a product problem, not just an engineering one. You can’t just send the full conversation history as context on every turn. But aggressive pruning breaks continuity. We built a summarization layer that preserves the semantic thread without burning tokens on verbatim history.
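The shape of such a layer: keep the last few turns verbatim and fold everything older into a running summary. A sketch under stated assumptions — the `summarize` helper here is a hypothetical placeholder (naive truncation) for what would be an LLM summarization call in production:

```python
def summarize(existing, role, text):
    # Placeholder: in production this would be an LLM call that
    # compresses the turn into the running summary.
    addition = f"{role} said: {text[:40]}"
    return f"{existing}; {addition}" if existing else addition

class ContextManager:
    def __init__(self, keep_verbatim=4):
        self.keep_verbatim = keep_verbatim  # recent turns kept word-for-word
        self.summary = ""                   # compressed older history
        self.turns = []

    def add_turn(self, role, text):
        self.turns.append((role, text))
        # Once history grows past the window, fold the oldest turn
        # into the summary instead of dropping it outright.
        while len(self.turns) > self.keep_verbatim:
            old_role, old_text = self.turns.pop(0)
            self.summary = summarize(self.summary, old_role, old_text)

    def build_prompt(self):
        parts = []
        if self.summary:
            parts.append(f"Summary of earlier conversation: {self.summary}")
        parts.extend(f"{r}: {t}" for r, t in self.turns)
        return "\n".join(parts)
```

The design choice that matters is that the summary is cumulative: old turns are compressed, not deleted, so the semantic thread survives even a long session.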

Evaluation is a discipline, not an afterthought. Early on, we would ship a prompt change and find out a week later that we’d broken something subtle in the coaching logic. The fix wasn’t better testing in the traditional sense—it was building an evaluation harness that let us measure quality across a range of real-world scenarios before shipping.
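The skeleton of such a harness is simple: a set of real-world scenarios, each paired with a property the response must satisfy, and a pass-rate gate that a prompt change has to clear before it ships. A minimal sketch (the scenario names and threshold are illustrative, not our actual suite):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Scenario:
    name: str
    user_input: str
    # A property the agent's response must satisfy. In practice this
    # ranges from keyword checks to LLM-graded rubrics.
    check: Callable[[str], bool]

def run_evals(agent, scenarios, min_pass_rate=0.9):
    """Run every scenario through the agent and gate on the pass rate.
    Returns (ship_ok, per-scenario results)."""
    results = {}
    for s in scenarios:
        response = agent(s.user_input)
        results[s.name] = s.check(response)
    pass_rate = sum(results.values()) / len(results)
    return pass_rate >= min_pass_rate, results
```

The point is the workflow, not the code: the harness runs before a prompt change merges, so a subtle regression in the coaching logic surfaces in minutes instead of a week later.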

Observability is different for AI systems. Your normal APM dashboard tells you a request failed. What it doesn’t tell you is that the agent gave a technically coherent answer that was completely unhelpful for the user’s actual situation. Logging the right things—and building the tooling to review them—is a non-trivial investment.
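Concretely, that means logging a structured record per agent turn — prompt version, latency, token usage, and a review flag — not just an HTTP status. A sketch of the kind of record we mean (field names here are illustrative, and the `log_sink` is just a list standing in for a real log pipeline):

```python
import json
import time

def log_turn(log_sink, *, user_id, prompt_version, user_input,
             response, latency_ms, tokens_used, flagged_for_review=False):
    """Structured record of one agent turn. An APM trace tells you the
    request succeeded; this captures what you need to later judge
    whether the answer was actually useful."""
    record = {
        "ts": time.time(),
        "user_id": user_id,
        "prompt_version": prompt_version,  # ties quality back to a change
        "input": user_input,
        "response": response,
        "latency_ms": latency_ms,
        "tokens": tokens_used,
        "flagged_for_review": flagged_for_review,
    }
    log_sink.append(json.dumps(record))
    return record
```

The `prompt_version` field is the one teams most often skip and most often need: without it, you can’t tie a quality regression back to the change that caused it.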

What I’d tell myself on day one

Start with the failure modes. Before you’re excited about what the agent can do, get clear on what happens when it goes wrong. Design for graceful degradation from the beginning. And invest in evaluation infrastructure earlier than feels comfortable—you’ll thank yourself when you need to make a significant model or prompt change and you can do it with confidence instead of prayer.
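Graceful degradation can be as simple as a retry loop that ends in a safe canned response rather than an error screen. A minimal sketch, assuming a `primary` callable wrapping the model call (the fallback copy and retry count are illustrative):

```python
def answer_with_fallback(primary, fallback_text, retries=2):
    """Try the model call; on repeated failure, degrade gracefully to a
    safe canned response instead of surfacing an error to the user.
    Returns (text, degraded_flag) so callers can log degradations."""
    for attempt in range(retries + 1):
        try:
            return primary(), False
        except Exception:
            if attempt == retries:
                return fallback_text, True
```

Returning the degraded flag matters as much as the fallback itself: it feeds the observability layer, so you know how often users are getting the canned answer.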

The technical scaffolding around an AI agent—the observability, the evals, the cost controls, the UX for complex multi-step workflows—is where most of the engineering work lives. The LLM itself is, surprisingly, the easy part.