Article

A new approach to safe agentic AI

By Muppuli Gnanaraj Govindaraj, Harry Keir Hughes, Sravya Reddy Jammula

21 May, 2025
8 min read

Insights

As agentic AI systems become more complex and widely adopted in the enterprise, ensuring their reliability, efficiency, and security has become a critical challenge.
But many existing methodologies for monitoring and evaluating these systems are too simplistic and context-dependent, masking shortcomings in an agent’s performance.
At Infosys, we use a four-level framework to evaluate agents, including infrastructure surveillance, prompt evaluation, performance monitoring, and feedback integration.
Tools are available to speed up this process, including LangSmith, and Galileo, an AI evaluation and observability platform.
This approach isn’t hypothetical. Infosys is using it at our client organizations, reducing false escalations by 38% and increasing output quality by 45%.

Safe AI vs. Fair AI with Cornell University's Professor Aditya Vashistha

Agentic AI systems pursue complex goals and workflows with minimal human supervision. Unlike narrow AI models confined to single tasks, agentic AI exhibits decision-making and adaptive behaviour similar to a human agent.

This greater autonomy brings immense potential for business efficiency and innovation. However, it also introduces risks if the agent behaves unexpectedly.

As agentic AI systems become more complex and widely adopted in the enterprise, ensuring their reliability, efficiency, and security has become a critical challenge.

Effective monitoring and evaluation strategies are needed to detect failures, optimize performance, and improve AI decision-making through human feedback. Without proper oversight, an autonomous agent could stray from intended goals, produce biased or incorrect outputs, or expose security vulnerabilities.

The challenges of monitoring agents

However, monitoring and evaluating these agentic systems isn’t straightforward.

Agents are unpredictable and complex, exhibiting behaviour that can be difficult to assess. This is compounded by the fact that there are no standardized evaluation metrics and benchmarks for agentic behaviour, with many existing methodologies too simplistic or context-dependent. This means that organizations can’t compare results across systems and can inadvertently mask shortcomings in an agent’s performance.

Further, ensuring evaluation covers ethical and safety dimensions is challenging, requiring careful monitoring of outputs and decisions. There are also the integration and operational challenges inherent in complex systems that make it hard to isolate the source of an error or performance issue.

This means that organizations can’t compare results across systems and can inadvertently mask shortcomings in an agent’s performance.

Finally, feedback and improvement loops are difficult to implement – humans are needed to review agent decisions but do this in a timely fashion with input systematically incorporated.

The multi-level framework

At Infosys, we use a four-level framework to evaluate AI agents:

Infrastructure surveillance
Prompt or response evaluation
Performance monitoring
Feedback integration

This provides a structured methodology to observe and control an agent’s behaviour at every level of an AI system. By combining telemetry, logging, automated metrics, and human oversight, the framework ensures continuous improvement and operational transparency in AI-driven workflows.

1. Infrastructure surveillance

The foundation of the framework is a robust observability infrastructure to monitor the running environment of the AI agent. Even the most sophisticated agent is ultimately just software running on computers – thus traditional infrastructure monitoring is the first line of defence.

This layer involves tracking system metrics such as CPU and memory usage, network I/O, request rates, error logs, and other low-level telemetry.

A well-implemented observability system continuously monitors network traffic, request patterns, system health, and resource utilization to detect anomalies. Dashboards display real-time agent logs and system metrics, enabling engineers to spot unusual spikes in activity or suspicious access attempts. If an autonomous agent is deployed via cloud microservices, tools like Prometheus and Grafana can be integrated to visualize these metrics and alert on out-of-bound values.

2. Prompt or response evaluation

At the next level, the focus is on evaluating the content of the agent’s interactions – namely the prompts it receives and the responses or actions it generates.

Ensuring that AI-generated responses are accurate, appropriate, and aligned with expected behaviour is critical for user trust, and for compliance.

This component introduces application-specific evaluation of the agent’s decisions and outputs. Best practice is to log all prompts given to the agent and the agent’s responses. These are then compared against reference standards or metrics.

One effective technique is to compare AI outputs with human-generated outputs for the same or similar inputs. For instance, in deploying an AI customer support agent, organizations can log the AI’s answer to a query and later compare it to how a human agent answered that query (or an ideal answer from a knowledge base). Automated similarity measures like cosine similarity or BLEU scores can then quantify how close the AI’s response is to the human response. This helps in detecting omissions or inaccuracies. Moreover, this evaluation is continuous: The system repeatedly evaluates its performance by examining new AI responses against the accumulating set of human-validated answers.

Ensuring that AI-generated responses are accurate, appropriate, and aligned with expected behaviour is critical for user trust, and for compliance.

An important part of this evaluation layer is identifying malicious or problematic prompts and ensuring the agent’s responses are robust. The monitoring framework flags any input that matches patterns of known prompt injection attacks or disallowed content. If a user tries to prompt the agent to reveal confidential information or produce hate speech, this should trigger an alert or a safe failure.

On the output side, given agents are typically embedded in key business processes, organizations must guarantee response quality and safety. One novel approach is using a Large Language Model (LLM) “as a judge” to evaluate the agent’s outputs (see Figure 1). In other words, a separate AI model (or the agent itself in critique mode) can assess whether a given response is well-formed, helpful, and non-harmful. This ongoing content monitoring is crucial for detecting biases or hallucinations early and retraining or adjusting the agent accordingly.

Figure 1. LLM-as-a-judge as part of the evaluation layer

Source: Infosys RAI Office

3. Performance monitoring

In addition to what the agent does, it is increasingly important to monitor how efficiently it does it to keep performance costs under control.

Performance monitoring is therefore the third pillar of the framework, focusing on timing, throughput, and resource usage at the task level. Optimizing the speed and efficiency of AI-driven workflows is essential for scalability and user satisfaction: An otherwise correct agent could still fail if it is too slow or costly to be practical.

The framework measures execution times for each step the agent takes and for entire task flows. Each agent task is logged with timestamps for key stages: When the task was initiated, when the agent began processing, and when the task completed. From these, organizations can compute durations and latencies for every action an agent makes.

By aggregating this data, bottlenecks can be identified – perhaps a particular sub-module (like a knowledge retrieval step) consistently lags, or certain tools the agent calls are slowing it down. Further, visualizing performance metrics over time using control charts helps establish baselines and detect regressions. For example, if average response time per user query increases significantly after a new agent version, the chart will show a clear deviation, prompting investigation. At Infosys, we also monitor time-to-first response and throughput (tasks per minute) as key indicators of performance.

In summary, the performance monitoring layer treats the agent as a software system whose service level agreements (SLAs) must be met – ensuring it operates within acceptable time and resource limits.

4. Feedback integration

The final and arguably most critical layer of the framework is integrating human feedback to continually refine the agent. Autonomous does not mean unattended – human oversight is a cornerstone of safe agentic AI. In this methodology, we capture explicit feedback from users and human evaluators and feed it back into the improvement cycle.

One approach is to solicit user ratings or critiques after interactions. For instance, after an agent completes a task or answers a question, the user or a moderator might rate whether the outcome was satisfactory. These ratings become another data point in the evaluation repository.

More formally, we can measure what is known as golden instruction adherence – whether the agent followed the key instructions or policies it was supposed to, and exception F-scores – rates at which the agent triggers exceptions or falls back to human intervention.

By tracking how often and in what way the agent deviates from expected behaviour (the “golden path”), organizations can improve the AI’s decision-making over time.

For example, if an AI assistant repeatedly asks for clarification on certain user requests, this might indicate unclear instructions in those scenarios, which means developers should adjust the agent’s prompt or training data.

Human feedback is vital for future-proofing the AI system in face of changing conditions. Real users will inevitably use the agent in unanticipated ways. Monitoring how these real interactions differ from the training scenarios provides insight into where the AI might need adaptation.

Our framework recommends periodic review sessions where human reviewers go through logs of agent decisions, particularly the borderline cases, either flagged by the system or sampled randomly. In these sessions, reviewers can label decisions as correct or flawed and provide explanations. This creates high-quality data for fine-tuning the agent or updating its heuristics.

By tracking how often and in what way the agent deviates from expected behaviour (the “golden path”), organizations can improve the AI’s decision-making over time.

Some modern tools facilitate this loop; for instance, LangSmith, by LangChain, allows developers to collect traces of agent runs and attach feedback or ratings to each run, all within a single platform. Such platforms support LLM-native observability, meaning they are built to handle the nuances of language model outputs and chain-of-thought traces. They let humans search and filter agent runs (find all runs where the agent gave a low-quality answer) and then examine those in detail to debug or improve the logic.

The need for automation

Developing this framework requires hard work, a partner, and ideally establishment of a platform engineering squad to offer disparate teams access to the technology in an automated fashion.

However, there are tools available that can speed up the process of safeguarding agentic AI systems across the four layers of the framework.

LangSmith, a platform designed to support the development, testing, and optimization of LLM applications, provides features for debugging, tracing, and monitoring agentic systems. Using LangSmith, developers can log each agent run with all intermediate steps and then use a dashboard to see aggregate statistics like latency, tokens used, cost, and user feedback over time. This makes it easier to spot outliers – for example, a particular day where the agent’s error rate spiked – and drill down into what went wrong.

Another relevant tool is Galileo, an AI evaluation and observability platform. Galileo’s recently introduced Agentic Evaluations, a tool that offers an end-to-end framework for evaluating AI agent performance, providing visibility into every action across entire workflows. It supports both system-level monitoring and step-by-step analysis, enabling developers to build more reliable and trustworthy agents. Platforms like this often include a guardrail metrics store and modules for prompt evaluation, fine-tuning, and monitoring in production. For example, Galileo’s Monitor module allows teams to set up custom metrics such as a hallucination rate or an accuracy score, and track them in real time as the agent interacts with users. It can automatically flag outputs that have a high likelihood of being hallucinated or harmful, using research-driven metrics, and thus helps in proactively catching errors.

Putting it all into practice

This approach isn’t hypothetical. It’s being used at our client organizations, and to great effect.

A leading global audit firm needed to automate financial audits while ensuring compliance with strict industry regulations. Initially, the multi-agentic AI system faced challenges such as misclassified exceptions and incomplete justifications, which led to unnecessary escalations and reduced trust in AI-driven audits.

By integrating LangSmith for evaluation tracking and Grafana dashboards for real-time monitoring, the firm identified and addressed these gaps. Through comparisons between AI-generated and human-audited reports, along with an LLM-as-a-judge approach, the AI's justification quality improved by 45%, according to experts on the project, and false escalations were reduced by 38%. Additionally, control charts were introduced to monitor adherence to prescribed audit workflows, ensuring Golden Instruction compliance and enhancing overall audit reliability.

In another example, a Fortune 500 company deployed an agentic AI assistant to handle IT service desk requests such as password resets, and troubleshooting. This is a complex, real-world workflow automation scenario. The company implemented a monitoring and evaluation scheme following the Infosys multi-level framework described.

At the infrastructure level, they aggregated logs of all agent actions in their cloud environment. Early on, this helped catch a misconfiguration where the agent was inadvertently making an API call twice for each user request – logs showed an abnormal pattern of duplicate requests, alerting engineers to fix the logic.

At the interaction level, the quality of the agent’s responses was evaluated by comparing them with human IT support responses. For a period, the AI’s answers and the human operators’ answers to similar tickets were collected. Using semantic similarity and manual review, the team identified areas where the AI’s answers were lacking.

For example, the AI often omitted an apology in responses when a user faced an inconvenience, whereas human agents always included a polite apology. This was flagged through content monitoring, and the prompt was adjusted to include an apology where necessary.

They also used an LLM-as-judge approach: for each resolved ticket, another language model rated whether the AI’s resolution was satisfactory or if the user might need follow-up. These ratings surfaced a few cases where the AI gave technically correct but overly brief answers that users found confusing – something the judge model could catch by “imagining” a user’s perspective.

The outcome of this deployment was that the company managed to automate a significant portion of routine IT requests with the agent, while the monitoring framework provided confidence that any decline in performance or unexpected behaviour would be quickly detected and addressed. It goes some way to highlight how multi-level evaluation in a real enterprise setting not only averts failures but also guides the AI to a level of performance and reliability that meets business requirements.

Stability, safety, reliability

AI is increasingly used in key business processes, from logistics to customer service. Agentic AI – one of our top 10 AI imperatives for 2025 – is a progression of generative systems towards autonomous entities that aims to make the workforce more productive and hopefully less stressed. However, this increased autonomy still needs structured oversight.

Monitoring and evaluation must therefore be multi-layered, addressing challenges such as unpredictable AI behaviour and the lack of standardized evaluation metrics.

A structured framework — incorporating infrastructure surveillance, prompt-response monitoring, performance tracking, and human feedback loops — ensures that organizations maintain tight control over AI systems, allowing for continuous refinement and risk mitigation. By integrating these dimensions, AI-driven workflows can achieve stability, safety, and reliability, preventing errors from escalating into business or ethical failures. Doing so will ensure AI continues its march forward, and help businesses ensure that both employees and customer accept the results of these autonomous marvels.

Authors

Muppuli Gnanaraj Govindaraj, Harry Keir Hughes, Sravya Reddy Jammula