Everyone is building agents right now, but hardly anyone talks about observability, evals, and optimization. That is scary, because these systems behave unpredictably in the real world, and by the time teams notice, they have already lost user trust and have no historical data to understand what caused the problem.

The root issue is that teams treat AI agents like deterministic software when they are actually probabilistic systems that fail in subtle ways, so the hard part is deciding what "failure" even means for your application. An e-commerce recommender that returns slightly suboptimal suggestions may be fine; a medical triage agent that misses a symptom is a different story.

With MUSTIT.AI, Traceloop, LangSmith, or similar platforms you can see the entire reasoning chain, set up evals, and get autonomous optimization (not all of them offer the latter), so your agents become more reliable over time.
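To make the "what does failure even mean" question concrete, here is a minimal sketch in plain Python of a domain-specific eval harness. Everything in it (`run_agent`, `EvalCase`, the sample triage case) is a hypothetical placeholder, not the API of any platform mentioned above. The idea is simply that required facts are hard failures you gate deploys on, while preferred facts are a soft score you only track over time.

```python
# Minimal eval-harness sketch. `run_agent` and the test case below are
# hypothetical placeholders; swap in your own agent and datasets.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    input: str
    required: set[str]   # facts the output MUST contain (hard failure if missing)
    preferred: set[str]  # nice-to-have facts (soft score only)

def run_agent(prompt: str) -> str:
    """Placeholder for your real agent call (LLM, tools, retrieval, etc.)."""
    raise NotImplementedError

def evaluate(agent: Callable[[str], str], cases: list[EvalCase]) -> dict:
    hard_failures, soft_score, n = 0, 0.0, len(cases)
    for case in cases:
        output = agent(case.input).lower()
        # Any missing required fact counts the whole case as a hard failure,
        # e.g. a triage agent that drops a symptom.
        if any(fact not in output for fact in case.required):
            hard_failures += 1
        # Preferred facts only move a soft score, never block anything.
        if case.preferred:
            soft_score += sum(f in output for f in case.preferred) / len(case.preferred)
    return {
        "hard_failure_rate": hard_failures / n,  # gate deploys on this
        "avg_soft_score": soft_score / n,        # track trends, don't block
    }

# Example: one hypothetical triage case with non-negotiable facts.
triage_cases = [
    EvalCase(
        input="65yo, chest pain, numbness in left arm",
        required={"chest pain", "emergency"},
        preferred=set(),
    )
]
# results = evaluate(run_agent, triage_cases)  # once run_agent is real
```

The point of separating the two buckets is that the same harness can score an e-commerce recommender leniently (mostly preferred facts) while treating a missed symptom in triage as a blocking failure.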
prompts · 1 min read · 19.9.2025
Are you using observability, evaluation, and optimization tools for your AI agents?
Source: Original