Between the pace of AI and our usage signals, a fixed roadmap beyond that simply doesn't make sense anymore.

This meeting runs on two inputs: the first is direct customer feedback, like bugs, UX challenges, and product gaps, submitted through continuous feedback guides in our UI. Though this data is qualitative, we can actually quantify it with Pendo: we can measure how many users hit the same frustration, how many have made the same request, and the account value behind each priority. And when a user is especially engaged, we aim to get them on a call.
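As a rough illustration of how qualitative feedback becomes countable, here is a minimal sketch in plain Python. It does not use Pendo's API. The record layout, the theme labels, and the account values are all hypothetical, made up to show the three measures mentioned above: distinct users per frustration, repeat requests collapsed per user, and the account value behind each theme.

```python
from collections import defaultdict

# Hypothetical feedback records: (user_id, account_id, theme)
feedback = [
    ("u1", "acct_a", "export is slow"),
    ("u2", "acct_a", "export is slow"),
    ("u3", "acct_b", "export is slow"),
    ("u3", "acct_b", "need dark mode"),
    ("u1", "acct_a", "export is slow"),  # repeat from the same user
]

# Hypothetical annual account values, in dollars
account_value = {"acct_a": 50_000, "acct_b": 120_000}


def summarize(feedback, account_value):
    """Count distinct users and total account value behind each theme."""
    users = defaultdict(set)
    accounts = defaultdict(set)
    for user_id, account_id, theme in feedback:
        users[theme].add(user_id)        # sets ignore repeat submissions
        accounts[theme].add(account_id)  # each account is counted once

    rows = [
        {
            "theme": theme,
            "users": len(users[theme]),
            "account_value": sum(account_value[a] for a in accounts[theme]),
        }
        for theme in users
    ]
    # Most widely felt themes first, then by the revenue behind them
    rows.sort(key=lambda r: (r["users"], r["account_value"]), reverse=True)
    return rows


summary = summarize(feedback, account_value)
```

Sorting by distinct users before account value is only one possible weighting. A team could just as reasonably lead with account value.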

The second major input is Issues in Agent Analytics. I’m the Lead PM for Leo, Pendo's AI product assistant. We use Agent Analytics every week to measure Leo the same way we'd expect any of our customers to use it for their own agents. Below, I share two examples of what this looks like in practice.

The guardrail that was failing thousands of user conversations

When Leo launched, we deliberately decided not to support how-to questions, like "How do I create a segment?" or "How do I update my roadmap?" Users would ask, and Leo would deflect them to Pendo's knowledge base.

We put that guardrail in because Leo was hallucinating on those questions. We hadn't connected it to an up-to-date knowledge source yet, and we didn't want to ship wrong answers. This is the reality of making continuous trade-offs when building a roadmap and developing a product.

While we felt we were being responsible, Agent Analytics eventually showed us we were creating frustration.

The issue detector flagged it without us even having to ask: over 60% of all issues in Leo traced back to this single root cause. One problem, spread across thousands of conversations, coming up over and over again. Without Agent Analytics, we would’ve had no way to understand the severity of this gap as we prepared to move from closed beta to a full release. A PM can't make data-backed prioritization decisions without aggregate insights across every user conversation. With this data, we knew we were losing more and more user trust after just one conversation with Leo.

To fix this, we built a dedicated knowledge sub-agent within Leo, similar to its existing sub-agents for tasks like quantitative analysis and guide creation. We connected it to our up-to-date knowledge store, which enabled Leo to answer how-to questions accurately rather than deflect them.

Overnight, this issue disappeared from our issues list. We went from thousands of flagged conversations to zero, with one architectural release.

The users we didn't design for, and how we found them

Leo's original ICP was clear: people who don't know Pendo, don't have time to learn analytics, and want fast answers to simple questions. When we launched Leo this past year, this cohort showed up.

However, a second cohort showed up too, and it was one we hadn't built for: analytics power users. These were Pendo champions who looked at Leo and said: you just put a conversational UI on top of all my data. They immediately started asking complex analytics questions, like cross-app conversion funnel analysis and queries across large datasets. Leo couldn't handle them well. We knew those limitations existed, but we didn't know how often users were hitting them, or which gaps were causing the most friction.

Using Agent Analytics, we built tracked use cases to quantify the split: how often are users asking simpler questions Leo handles well, versus how often are they in territory where prompts are more advanced and the experience breaks down? We categorized by complexity level and by intent: acquisition questions, adoption questions, conversion questions, and so on.

Pendo's Sr. Director of Analytics and I sat down with that data and used it to sequence which capabilities to build first. The best part was that our Agent Analytics data, rather than gut feel or the loudest voice in the room, drove the order in which we’d build. We knew, objectively, how frequently users hit each gap and the impact we could expect from each fix.
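The kind of tally behind that sequencing can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how Agent Analytics works internally. The conversation records, the complexity and intent labels, and the `rank_gaps` helper are hypothetical, and ranking by issue count first is just one reasonable ordering.

```python
from collections import Counter

# Hypothetical conversation records: (complexity, intent, had_issue)
conversations = [
    ("simple", "adoption", False),
    ("simple", "adoption", False),
    ("simple", "acquisition", False),
    ("advanced", "conversion", True),
    ("advanced", "conversion", True),
    ("advanced", "conversion", False),
    ("advanced", "adoption", True),
    ("simple", "conversion", True),
]


def rank_gaps(conversations):
    """Rank (complexity, intent) use cases by how often users hit an issue."""
    totals = Counter()
    issues = Counter()
    for complexity, intent, had_issue in conversations:
        key = (complexity, intent)
        totals[key] += 1
        if had_issue:
            issues[key] += 1

    ranked = [
        {
            "use_case": key,
            "conversations": totals[key],
            "issues": issues[key],
            "issue_rate": issues[key] / totals[key],
        }
        for key in totals
    ]
    # Most frequently hit gaps first; issue rate breaks ties
    ranked.sort(key=lambda r: (r["issues"], r["issue_rate"]), reverse=True)
    return ranked


ranked = rank_gaps(conversations)
```

Keeping both the raw issue count and the issue rate in each row matters: a use case with a 100% failure rate but one conversation may deserve less urgency than one that fails two times out of three at much higher volume.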

In the end, we extended Leo's tooling to handle those more complex quantitative requests that the second cohort was making.

Evals said we were ready, but we needed more than that

Before shipping the tool changes to Leo's thousands of users, we needed to know whether they'd truly improve the experience, or just shift the problem elsewhere. Our evals looked good (green across the board).

The tricky part is that we no longer trust evals alone. In the last year, our team has been continuously building and making improvements to Leo, and we’ve found that you can have everything green in your test automation suite, and then the first real user asks something your tests never anticipated, and the agent fails. Evals are controlled, but users aren’t. Add to this the non-deterministic nature of these models, and we could pass every test yet still fail the first real conversation.

To overcome this barrier to meaningful optimization, we ran an experiment in Agent Analytics before rolling out the quant changes to all our users. We A/B tested the new tooling with 20% of our user base. The result: a 67% drop in issue rate for that cohort, on real users, not synthetic bots. And because it's connected to the broader context of user behavior, we could trace user engagement and AI quality metrics right alongside it.
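For readers who want the arithmetic behind a number like that, here is a minimal sketch. It assumes the drop is measured as a relative change in issue rate against a control group, and it uses a hash-based 20% split as a stand-in for however Agent Analytics actually assigns users. The function names and the counts are hypothetical, with the counts chosen only so the math lands on 67%.

```python
import hashlib


def in_treatment(user_id: str, percent: int = 20) -> bool:
    """Deterministically place roughly `percent`% of users in the treatment group."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < percent


def issue_rate(issues: int, conversations: int) -> float:
    """Share of conversations that were flagged with an issue."""
    return issues / conversations


def relative_drop(control_rate: float, treatment_rate: float) -> float:
    """Fractional reduction in issue rate relative to the control group."""
    return (control_rate - treatment_rate) / control_rate


# Hypothetical counts, not Leo's real numbers
control = issue_rate(issues=300, conversations=1000)   # 0.30
treatment = issue_rate(issues=99, conversations=1000)  # 0.099
drop = relative_drop(control, treatment)               # about 0.67
```

Hashing the user ID keeps each user in the same group across sessions, so one person never sees both versions of the tooling during the experiment.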

This is what gave us the conviction to roll out these significant backend changes more broadly. Rather than relying on overly optimistic evals or a one-off thumbs-up from an internal test user, we have an objective, scalable way to release every new update to Leo with confidence. 

What we've seen since

Leo's new visitor adoption is up 16.1% over the last 60 days alone. More significantly for a PM, our weekly returning visitors are up 61.3%, which tells me Leo is worth something to our audience and isn’t a one-off novelty.

Though retention is climbing, there are still more rocks to move. I’m grateful we have a consistent process in place to continuously prioritize my team's and our engineers' time based on real user signals: find the issue in Agent Analytics, zero in on the root cause, make the right infrastructure change, confirm with an experiment before shipping, then watch our north stars grow.

This loop is how we manage Leo, and it's the same loop I'd recommend to any PM building an AI agent.

Connect behavioral context to any agent you build or buy, then measure if it's working with Pendo for AI agents.