| Metric | Value |
|---|---|
| Total Simulations | 100 |
| Operational Pass Rate | 100% (zero errors) |
| Behavioral Quality Rate | ~82% |
| Escalation Rate | 0% |
| Workflow Categories | 6 |
| Scenarios Tested | 8 |
| Guardrails Active | 0 of 3 |
This is the baseline simulation run on the EverPass AI customer support agent (Grace). We ran 100 simulations across 8 scenarios (7 scenarios × 12 iterations each, plus 16 iterations for New Customer Signup) in 6 workflow categories: channel changes, device troubleshooting, content availability, password reset, game scheduling, and new customer signup. The agent achieved 100% operational stability (zero errors, zero crashes) and a 100% assertion pass rate across all 100 iterations.
The ~82% behavioral quality rate reflects our analysis of agent behavior across iterations. 18 of 100 iterations exhibited issues: unnecessary API calls in Game Scheduling (8 iterations, each with 2–5 unprompted searches after task completion), imprecise address capture in New Customer Signup (9 iterations), and 1 flagged iteration in the NHL content availability scenario. The 0% escalation rate is expected for this baseline, since none of the scenarios are designed to trigger escalation, but note that all 3 guardrails remain inactive.
| Parameter | Value |
|---|---|
| Simulation Engine | Lorikeet Simulation Platform |
| Total Iterations | 100 (7 scenarios × 12 iterations each, Signup with 16) |
| Batch Structure | 6 batches, grouped by workflow |
| Channel | Chat |
| Language | English |
| Workflows | 6 Response workflows + 1 FAQ catch-all (Reference) |
| Environment | Sandbox |
| Mock Profiles | 6 workflow-specific test customer personas |
| Quality Scoring | Assertion-based + behavioral analysis |
Quality scoring methodology: The 100% assertion pass rate confirms the agent produces correct responses containing expected key terms across all 100 iterations. The behavioral quality rate (~82%) is derived from analysis of actualActions data: tool usage efficiency, unnecessary API calls, and conversation flow patterns. 18 of 100 iterations exhibited behavioral issues (over-calling, imprecise data capture) despite passing assertions.
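The arithmetic behind the ~82% figure can be reproduced from the per-scenario counts in the results tables. The sketch below is illustrative only: it assumes each behavioral issue corresponds to one distinct flagged iteration, and the function name and tuple layout are our own rather than anything from the simulation platform.

```python
# Per-scenario figures copied from the results tables in this report.
# Tuple layout (our own choice): (scenario, iterations, flagged iterations)
SCENARIOS = [
    ("Happy Path (Bills Game)", 12, 0),
    ("Unauthorized Staff", 12, 0),
    ("Device Reset - Happy Path", 12, 0),
    ("World Cup 2026", 12, 0),
    ("NHL (Not Available)", 12, 1),
    ("Password Reset - Happy Path", 12, 0),
    ("Chiefs vs 49ers", 12, 8),
    ("Bar Owner Wants Sunday Ticket", 16, 9),
]


def behavioral_quality_rate(scenarios):
    """Return (total iterations, flagged iterations, clean share in percent)."""
    total = sum(iterations for _, iterations, _ in scenarios)
    flagged = sum(issues for _, _, issues in scenarios)
    return total, flagged, 100.0 * (total - flagged) / total


total, flagged, rate = behavioral_quality_rate(SCENARIOS)
print(f"{flagged} of {total} iterations flagged -> {rate:.0f}% behavioral quality")
```

Under that one-issue-per-iteration assumption, the counts give 18 flagged iterations out of 100, which matches the reported rate.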
| Scenario | Iterations | Assertion Pass | Avg Tool Calls | Behavioral Issues |
|---|---|---|---|---|
| Happy Path (Bills Game) | 12 | 100% | 4.1 | 0 |
| Unauthorized Staff | 12 | 100% | 3.2 | 0 |
| Category Total | 24 | 100% | ~3.7 | 0 |
Both Channel Change scenarios perform well. The Happy Path correctly identifies the account, finds the live Bills game, and changes the channel on TV1, averaging 4.1 tool calls. The Unauthorized Staff scenario correctly checks authorization, discovers the bartender lacks permissions, and denies the request. This is the strongest workflow category.
| Scenario | Iterations | Assertion Pass | Avg Tool Calls | Behavioral Issues |
|---|---|---|---|---|
| Device Reset – Happy Path | 12 | 100% | 3.1 | 0 |
| Category Total | 12 | 100% | ~3.1 | 0 |
Device troubleshooting follows an ideal flow: lookup account → check device status → reset device. The agent correctly identifies TV3 in the Private Room as offline with ERR_STREAM_TIMEOUT, initiates a remote reset, and confirms estimated 30-second recovery. Clean, action-oriented support.
| Scenario | Iterations | Assertion Pass | Avg Tool Calls | Behavioral Issues |
|---|---|---|---|---|
| World Cup 2026 | 12 | 100% | 2.1 | 0 |
| NHL (Not Available) | 12 | 100% | 2.8 | 1 |
| Category Total | 24 | 100% | ~2.5 | 1 |
Content Availability handles both positive and negative cases well. The World Cup scenario confirms upcoming matches (USA vs England, Mexico vs Brazil, Final) and uses KB search to supplement. The NHL scenario correctly identifies that NHL content is not available on EverPass: although the event schedule API returns only FIFA results for an NHL query, the agent interprets this correctly and informs the customer. One of the 12 NHL iterations was flagged for a behavioral issue.
| Scenario | Iterations | Assertion Pass | Avg Tool Calls | Behavioral Issues |
|---|---|---|---|---|
| Password Reset – Happy Path | 12 | 100% | 1.0 | 0 |
| Category Total | 12 | 100% | 1.0 | 0 |
Password Reset is the most efficient workflow — a single tool call (send_password_reset) resolves the customer's issue. The agent collects the email, triggers the reset, and confirms the 24-hour expiry. Clean execution with zero unnecessary steps.
| Scenario | Iterations | Assertion Pass | Avg Tool Calls | Behavioral Issues |
|---|---|---|---|---|
| Chiefs vs 49ers | 12 | 100% | 6.3 | 8 |
| Category Total | 12 | 100% | ~6.3 | 8 |
Game Scheduling passes its assertions across all 12 iterations but exhibits a significant behavioral issue in 8 of 12 iterations (~67%): after completing the primary request in 3 tool calls, the agent makes 2–5 additional unnecessary get_event_schedule calls — speculatively searching for other teams and sports without being asked. This inflates the average tool call count to 6.3 (vs the ideal ~3). The agent needs access to the full game schedule to stop guessing what else might be on.
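One way to flag this pattern automatically is to count get_event_schedule calls that occur after the request has already been completed. The sketch below assumes each iteration's actualActions can be reduced to an ordered list of tool names, and it uses hypothetical tool names (`lookup_account`, `schedule_game`) because the report does not name the other tools in this workflow.

```python
def count_trailing_calls(actions, completion_tool, watched_tool):
    """Count calls to `watched_tool` made after the last `completion_tool` call.

    Returns 0 when the completion tool never appears in the sequence.
    """
    if completion_tool not in actions:
        return 0
    # Index of the last occurrence of the completion tool.
    last_done = len(actions) - 1 - actions[::-1].index(completion_tool)
    return sum(1 for name in actions[last_done + 1:] if name == watched_tool)


# Hypothetical trace: three calls complete the request, then three extra lookups.
trace = [
    "lookup_account",       # hypothetical tool name
    "get_event_schedule",
    "schedule_game",        # hypothetical tool name marking task completion
    "get_event_schedule",
    "get_event_schedule",
    "get_event_schedule",
]
extra = count_trailing_calls(trace, "schedule_game", "get_event_schedule")
print(extra)  # 3
```

Any iteration with a non-zero count would be a candidate for the over-calling flag. The threshold and the choice of completion marker are assumptions to confirm against real traces.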
| Scenario | Iterations | Assertion Pass | Avg Tool Calls | Behavioral Issues |
|---|---|---|---|---|
| Bar Owner Wants Sunday Ticket | 16 | 100% | 5 | 9 |
| Category Total | 16 | 100% | 5 | 9 |
New Customer Signup is the most complex workflow. In every iteration the agent searches the KB for pricing info (3 searches), generates a pricing quote based on Fire Code Occupancy, and submits the signup. All 16 iterations correctly apply the Early Bird NFL 2026 promotion (15% off, $6,374.15 vs $7,499 base). However, 9 of 16 iterations (~56%) capture imprecise venue addresses ('Columbus, Ohio' or 'downtown area' vs full street address). The workflow should enforce complete address collection before submitting signups.
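The quoted promotion price can be checked with a few lines of decimal arithmetic. This is only a verification sketch: the function name is ours, and it assumes the discount applies to the full base price with rounding to whole cents.

```python
from decimal import Decimal, ROUND_HALF_UP


def discounted_price(base, percent_off):
    """Apply a percentage discount and round to whole cents."""
    factor = (Decimal(100) - Decimal(percent_off)) / Decimal(100)
    return (Decimal(base) * factor).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


quote = discounted_price("7499", "15")
print(quote)  # 6374.15
```

Using `Decimal` rather than floats keeps the cents exact, which matters if the same check is later turned into an assertion on quoted prices.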
| Workflow Category | Scenarios | Iterations | Assertion Pass | Avg Tool Calls | Behavioral Issues |
|---|---|---|---|---|---|
| Channel Change | 2 | 24 | 100% | ~3.7 | 0 |
| Device Troubleshooting | 1 | 12 | 100% | ~3.1 | 0 |
| Content Availability | 2 | 24 | 100% | ~2.5 | 1 |
| Password Reset | 1 | 12 | 100% | 1.0 | 0 |
| Game Scheduling | 1 | 12 | 100% | 6.3 | 8 |
| New Customer Signup | 1 | 16 | 100% | 5 | 9 |
| Overall | 8 | 100 | 100% | ~3.4 | 18 |
None of the configured guardrails are active. This means: no protection against hostile customers, no forced escalation when a customer demands a human, and no safety net when the AI offers to escalate but doesn't follow through. These are standard guardrails that must be active before any production deployment.
| Guardrail | Action | Status |
|---|---|---|
| Customer is being hostile | ESCALATE | Inactive |
| Customer wants to talk to a human | ESCALATE | Inactive |
| AI offered escalation but did not | ADD_ACTION | Inactive |
After successfully scheduling the requested game, the agent searches for up to 5 additional teams (Cowboys, Packers, Lakers, Celtics, Warriors) without being asked. This wastes processing time, could confuse customers with unrequested information, and inflates cost. The workflow needs explicit 'task complete' guidance to prevent speculative searching.
18 of 100 iterations exhibit behavioral issues, and 17 of them fall in just two workflows: Game Scheduling (8 iterations with over-calling) and New Customer Signup (9 iterations with imprecise address capture). The remaining one is in the Content Availability NHL scenario. Channel Change, Device Troubleshooting, and Password Reset are clean across all 48 of their combined iterations.
9 of 16 iterations submitted signups with incomplete addresses ('Columbus, Ohio' or 'downtown area') — missing street address, zip code, and other details that were captured in the remaining iterations. The workflow should enforce complete address validation before submission.
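A pre-submission check could look roughly like the sketch below. It assumes the minimum standard is a US-style street, city, two-letter state, and ZIP, which is still an open question for EverPass. The pattern, the function name, and the sample street address are all hypothetical.

```python
import re

# Assumed minimum standard: street number and name, city, two-letter state, ZIP.
# The real requirement has not been confirmed.
ADDRESS_PATTERN = re.compile(
    r"^\s*\d+\s+[^,]+,"        # street number and street name
    r"\s*[^,]+,"               # city
    r"\s*[A-Z]{2}\s+"          # two-letter state code
    r"\d{5}(?:-\d{4})?\s*$"    # ZIP or ZIP+4
)


def is_complete_address(text):
    """Return True when `text` matches the assumed full-address format."""
    return bool(ADDRESS_PATTERN.match(text))


# The first two values are the imprecise captures seen in this run;
# the third is a made-up address used only to show a passing case.
for candidate in ["Columbus, Ohio", "downtown area", "123 Example St, Columbus, OH 43215"]:
    print(candidate, "->", is_complete_address(candidate))
```

A format check like this only catches missing components. It does not confirm that the address exists, so a production workflow might pair it with a follow-up question to the customer or an address verification service.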
Channel Change, Device Troubleshooting, and Password Reset demonstrate clean tool usage, correct authorization checks, and efficient resolution. These three workflow categories are the strongest candidates for early production deployment once guardrails are activated.
Every scenario has assertion-based quality scoring configured (expectedResponse criteria), and all 100 iterations pass. This is a strong foundation — the assertions validate that the agent's responses contain the correct key terms (team names, device locations, pricing data, etc.).
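A key-term assertion of this kind can be sketched in a few lines. The function name, response text, and term list below are hypothetical illustrations, not the platform's actual expectedResponse mechanism.

```python
def missing_terms(response, expected_terms):
    """Return the expected terms that do not appear in the response (case-insensitive)."""
    lowered = response.lower()
    return [term for term in expected_terms if term.lower() not in lowered]


# Hypothetical response text and key terms, for illustration only.
response = (
    "TV3 in the Private Room is offline with ERR_STREAM_TIMEOUT. "
    "A remote reset is underway."
)
expected = ["TV3", "Private Room", "reset"]
print(missing_terms(response, expected))  # []
```

A check like this passes whenever the right terms appear, regardless of how many tool calls the agent made to get there. That is why the behavioral analysis of actualActions is scored separately.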
| # | Issue | Priority | Status |
|---|---|---|---|
| 1 | Activate all 3 guardrails — hostile customer escalation, human-request escalation, escalation safety net | High | Planned |
| 2 | Add address validation to New Customer Signup — enforce complete address before submission | High | Planned |
| 3 | Add edge-case scenarios — frustrated customers, device won't reset, invalid promo codes, multi-device scheduling | Medium | Planned |
| 4 | Add adversarial scenarios — test guardrail effectiveness once activated | Medium | Planned |
| # | Ask | Why It Matters |
|---|---|---|
| 1 | Provide the game schedule / event data feed — the agent needs access to the full schedule so it can answer definitively instead of speculatively searching | This is the root cause of the Game Scheduling over-calling (8 of 12 iterations). Without a complete schedule, the agent guesses what else might be on and makes 2–5 unnecessary API calls. A real data feed eliminates the guesswork entirely. |
| 2 | Authorization escalation flow — when an unauthorized staff member tries to change a channel, should the agent escalate to a manager, or just deny? | The current behavior is deny-only. If bars want a "text your manager for approval" flow, we need to build it. |
| 3 | Top 10–20 real support tickets — from your existing Zendesk queue, with expected resolution paths | Enables us to build realistic simulation scenarios that match actual customer behavior and expand coverage beyond happy paths. |
| 4 | Address validation requirements — what's the minimum acceptable address for a new signup? Full street + city + state + zip, or is city + state enough? | 9 of 16 iterations captured incomplete addresses. We need to know the minimum standard so we can enforce it in the workflow. |
| 5 | Review call this week — 30 minutes to walk through results and align on priorities for the next run | We want to validate this baseline together and align on priorities for Run 2 — guardrail activation, game schedule integration, and edge-case coverage. |
| Milestone | Quality Rate | Key Driver |
|---|---|---|
| April 14 (Baseline) | ~82% | 100% assertion pass, 100 iterations, 8 scenarios, behavioral analysis. |
| Next Run (Target) | 90%+ | Guardrails activated, game schedule data integrated, address validation, edge cases. |
| Production Ready | 95%+ | Full scenario coverage + adversarial testing + production Zendesk integration. |
| Component | Details |
|---|---|
| Workflows (7) | Channel Change, Device Troubleshooting & Reset, Content Availability, Password Reset, Game Scheduling, New Customer Signup (all NL Response) + FAQ catch-all (Reference) |
| Guardrails (3) | Customer is being hostile (ESCALATE), Customer wants to talk to a human (ESCALATE), AI offered escalation but did not (ADD_ACTION) — ALL INACTIVE |
| KB Articles (4) | Device & Hardware Troubleshooting Guide, How Game Scheduling Works, Troubleshooting: Game Not Showing, Network & Connectivity Requirements |
| Mock Profiles (6) | Manager Scheduling NFL Sunday, Authorized Manager – Device Offline, Manager Password Reset, World Cup Availability Question, New Bar Owner – Signup Inquiry, Unauthorized Bartender – Channel Change |
| Simulations (8) | Across 6 workflow categories — see scenario list above |
| Business Context | Grace, EverPass's AI support agent. Covers device troubleshooting, game scheduling, channel changes, password reset, content availability, new customer signup, and NFL Sunday Ticket for commercial venues. |
| Channels | Chat Widget, Voice Line (Grace, Australian feminine), Email, SMS |