Data retention policy

Transcripts of simulated conversations are retained for 13 months, but the associated reports on those simulations are retained indefinitely.

Report components

The report contains three components:

  • Summary
  • Agent performance
  • Conversations

The Summary tab provides an overview of overall agent performance. Review this tab to learn about the issues that were found and how to address them.

The Summary tab of a completed report

The Agent performance tab displays the performance results per agent.

The Agent performance tab of a completed report

The Agent performance details tab of a completed report

The Conversations tab lists the conversation transcripts from the simulation. Dive into a conversation to view the transcript and assessment side by side, making it easy to validate the evaluation and diagnose issues with agent performance.

The Conversations tab of a completed report

The Review conversation window in a completed report

Report scores

Scores on agent performance range from 1 to 10 as follows:

  • Exceptional (9–10): The gold standard. The agent is flawlessly on-character, follows every instruction, and provides deeply tailored, empathetic responses that feel human and professional.
  • Proficient (7–8): Solid, reliable performance. The agent meets most goals and stays polite, though the delivery might occasionally feel a bit generic or slightly "robotic."
  • Functional Failure (5–6): The "Polite Failure" zone. While the agent remains professional in tone, it fails to actually resolve the core issue or misses fundamental technical requirements of the prompt.
  • Poor / Reactive (3–4): Significant breakdown. The agent fails the success criteria, misses the point of the customer's query, and may sound curt or defensive.
  • Unacceptable (1–2): Critical failure. These scores are reserved for agents that are hostile, incoherent, or actively subverting instructions, requiring immediate intervention.

High performance vs. Needs training

Each agent receives a performance rating for every scenario: either "High performance" or "Needs training." These ratings are determined by an LLM. To reach a reasoned conclusion, the LLM analyzes the collective feedback (specifically, the assessment narratives) from all of an agent's conversations involving that scenario.

The Agent performance details window in a report, with a callout to the training needed and high performance metrics

FAQs

Do the performance metrics reveal which agents used which specific tools (like Conversation Assist or Copilot Rewrite)?

Not currently. We know brands want to understand how the use of different agent tools impacts performance, but at this time the performance metrics do not provide insight into which agents used which tools.

Can I download a report?

No, not in this release.

Conversations can be transferred from agent to agent. Which agents are reflected in a report?

The report reflects the name of the agent who was last involved in the conversation.