Lorikeet
EverPass

Lorikeet AI Agent for EverPass

Simulation Report — Baseline Run
April 14, 2026

Executive Summary

100
Total Simulations
100%
Operational Pass
~82%
Behavioral Quality
0%
Escalation Rate
Metric | Value
Total Simulations | 100
Operational Pass Rate | 100% (zero errors)
Behavioral Quality Rate | ~82%
Escalation Rate | 0%
Workflow Categories | 6
Scenarios Tested | 8
Guardrails Active | 0 of 3

This is the baseline simulation run for the EverPass AI customer support agent (Grace). We ran 100 simulations across 8 scenarios (seven with 12 iterations each, plus New Customer Signup with 16) in 6 workflow categories: channel changes, device troubleshooting, content availability, password reset, game scheduling, and new customer signup. The agent achieved 100% operational stability (zero errors, zero crashes) and a 100% assertion pass rate across all 100 iterations.

The ~82% behavioral quality rate comes from our analysis of agent behavior in each iteration. 18 of the 100 iterations exhibited issues: unnecessary API calls in Game Scheduling (8 iterations, each with 2–5 unprompted searches after task completion), imprecise address capture in New Customer Signup (9 iterations), and 1 flagged iteration in the NHL Content Availability scenario. The 0% escalation rate is expected for this baseline, since none of the scenarios is designed to trigger escalation, but all 3 guardrails remain inactive.

Methodology

Parameter | Value
Simulation Engine | Lorikeet Simulation Platform
Total Iterations | 100 (8 scenarios × ~12 iterations each, Signup with 16)
Batch Structure | 6 batches, grouped by workflow
Channel | Chat
Language | English
Workflows | 6 Response workflows + 1 FAQ catch-all (Reference)
Environment | Sandbox
Mock Profiles | 6 workflow-specific test customer personas
Quality Scoring | Assertion-based + behavioral analysis

Quality scoring methodology: The 100% assertion pass rate confirms that the agent's responses contain the expected key terms in all 100 iterations. The behavioral quality rate (~82%) is derived from analysis of actualActions data: tool usage efficiency, unnecessary API calls, and conversation flow patterns. 18 of 100 iterations exhibited behavioral issues (mostly over-calling and imprecise data capture) despite passing assertions.
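The arithmetic behind the ~82% figure can be sketched as follows. This is a minimal illustration, not the platform's scoring code: the per-scenario counts are copied from the results tables in this report, and the function name and data layout are assumptions made for the example. It also assumes each flagged issue corresponds to one distinct iteration.

```python
# (iterations, flagged behavioral issues) per scenario, copied from the
# results tables in this report. The dict layout is illustrative only.
SCENARIOS = {
    "Channel Change: Happy Path (Bills Game)": (12, 0),
    "Channel Change: Unauthorized Staff": (12, 0),
    "Device Reset: Happy Path": (12, 0),
    "Content Availability: World Cup 2026": (12, 0),
    "Content Availability: NHL (Not Available)": (12, 1),
    "Password Reset: Happy Path": (12, 0),
    "Game Scheduling: Chiefs vs 49ers": (12, 8),
    "New Customer Signup: Bar Owner Wants Sunday Ticket": (16, 9),
}


def behavioral_quality_rate(scenarios):
    """Return (total iterations, flagged iterations, share with no flagged issue)."""
    total = sum(iterations for iterations, _ in scenarios.values())
    flagged = sum(issues for _, issues in scenarios.values())
    return total, flagged, (total - flagged) / total


total, flagged, rate = behavioral_quality_rate(SCENARIOS)
print(f"{flagged} of {total} iterations flagged; quality rate = {rate:.0%}")
```

With the counts above, 18 of 100 iterations are flagged, which gives the 82% rate quoted in this report.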

Results by Workflow Category

1. Channel Change — 2 Scenarios, 24 Iterations

Scenario | Iterations | Assertion Pass | Avg Tool Calls | Behavioral Issues
Happy Path (Bills Game) | 12 | 100% | 4.1 | 0
Unauthorized Staff | 12 | 100% | 3.2 | 0
Category Total | 24 | 100% | ~3.7 | 0

Both Channel Change scenarios perform excellently. The Happy Path correctly identifies the account, finds the live Bills game, and changes the channel on TV1 — all in 4 tool calls. The Unauthorized Staff scenario correctly checks authorization, discovers the bartender lacks permissions, and denies the request. This is the strongest workflow category.

2. Device Troubleshooting & Reset — 1 Scenario, 12 Iterations

Scenario | Iterations | Assertion Pass | Avg Tool Calls | Behavioral Issues
Device Reset – Happy Path | 12 | 100% | 3.1 | 0
Category Total | 12 | 100% | ~3.1 | 0

Device troubleshooting follows an ideal flow: lookup account → check device status → reset device. The agent correctly identifies TV3 in the Private Room as offline with ERR_STREAM_TIMEOUT, initiates a remote reset, and confirms estimated 30-second recovery. Clean, action-oriented support.

3. Content Availability — 2 Scenarios, 24 Iterations

Scenario | Iterations | Assertion Pass | Avg Tool Calls | Behavioral Issues
World Cup 2026 | 12 | 100% | 2.1 | 0
NHL (Not Available) | 12 | 100% | 2.8 | 1
Category Total | 24 | 100% | ~2.5 | 1

Content Availability handles both positive and negative cases well, with 1 behavioral issue flagged across its 24 iterations (in the NHL scenario). The World Cup scenario confirms upcoming matches (USA vs England, Mexico vs Brazil, Final) and uses KB search to supplement. The NHL scenario correctly identifies that NHL content is not available on EverPass: although the event schedule API returns only FIFA results for an NHL query, the agent interprets this correctly and informs the customer.

4. Password Reset — 1 Scenario, 12 Iterations

Scenario | Iterations | Assertion Pass | Avg Tool Calls | Behavioral Issues
Password Reset – Happy Path | 12 | 100% | 1.0 | 0
Category Total | 12 | 100% | 1.0 | 0

Password Reset is the most efficient workflow — a single tool call (send_password_reset) resolves the customer's issue. The agent collects the email, triggers the reset, and confirms the 24-hour expiry. Clean execution with zero unnecessary steps.

5. Game Scheduling — 1 Scenario, 12 Iterations

Scenario | Iterations | Assertion Pass | Avg Tool Calls | Behavioral Issues
Chiefs vs 49ers | 12 | 100% | 6.3 | 8
Category Total | 12 | 100% | ~6.3 | 8

Game Scheduling passes its assertions across all 12 iterations but exhibits a significant behavioral issue in 8 of 12 iterations (~67%): after completing the primary request in 3 tool calls, the agent makes 2–5 additional unnecessary get_event_schedule calls — speculatively searching for other teams and sports without being asked. This inflates the average tool call count to 6.3 (vs the ideal ~3). The agent needs access to the full game schedule to stop guessing what else might be on.
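One way to flag this pattern automatically is to count schedule lookups that occur after the call that completes the task. The sketch below is illustrative only. It assumes each iteration's actualActions can be reduced to an ordered list of tool names, and the names lookup_account and schedule_game are hypothetical placeholders; only get_event_schedule appears in this report.

```python
def trailing_schedule_lookups(actions, completion_tool):
    """Count get_event_schedule calls made after the call that completes the task.

    `actions` is an ordered list of tool names for one iteration (assumed shape).
    `completion_tool` is the name of the call that fulfils the request (hypothetical).
    """
    if completion_tool not in actions:
        return 0
    done_at = actions.index(completion_tool)
    return sum(1 for name in actions[done_at + 1:] if name == "get_event_schedule")


# Hypothetical iteration: 3 calls to finish the request, then 5 speculative lookups.
actions = ["lookup_account", "get_event_schedule", "schedule_game"]
actions += ["get_event_schedule"] * 5

extra = trailing_schedule_lookups(actions, completion_tool="schedule_game")
print(extra)  # 5
```

An iteration would count as over-calling whenever this function returns a value above zero. A check of this kind could also serve as a regression test once 'task complete' guidance is added to the workflow.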

6. New Customer Signup — 1 Scenario, 16 Iterations

Scenario | Iterations | Assertion Pass | Avg Tool Calls | Behavioral Issues
Bar Owner Wants Sunday Ticket | 16 | 100% | 5 | 9
Category Total | 16 | 100% | 5 | 9

New Customer Signup is the most complex workflow. In every iteration the agent searches the KB for pricing info (3 searches), generates a pricing quote based on Fire Code Occupancy, and submits the signup, for 5 tool calls in total. All 16 iterations correctly apply the Early Bird NFL 2026 promotion (15% off, $6,374.15 vs $7,499 base). However, 9 of 16 iterations (~56%) capture imprecise venue addresses ('Columbus, Ohio' or 'downtown area' instead of a full street address). The workflow should enforce complete address collection before submitting signups.
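The quoted promotional price can be checked with a few lines of decimal arithmetic. This is only a verification sketch: the helper name is an assumption, and it does not represent how the pricing tool itself computes quotes.

```python
from decimal import Decimal, ROUND_HALF_UP


def apply_discount(base_price, percent_off):
    """Apply a percentage discount and round to whole cents."""
    factor = (Decimal(100) - Decimal(percent_off)) / Decimal(100)
    discounted = Decimal(base_price) * factor
    return discounted.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# Base price and discount as stated in this report.
quoted = apply_discount("7499", "15")
print(quoted)  # 6374.15
```

Decimal is used instead of float so that the result is exact to the cent.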

Overall Results Summary

Workflow Category | Scenarios | Iterations | Assertion Pass | Avg Tool Calls | Behavioral Issues
Channel Change | 2 | 24 | 100% | ~3.7 | 0
Device Troubleshooting | 1 | 12 | 100% | ~3.1 | 0
Content Availability | 2 | 24 | 100% | ~2.5 | 1
Password Reset | 1 | 12 | 100% | 1.0 | 0
Game Scheduling | 1 | 12 | 100% | 6.3 | 8
New Customer Signup | 1 | 16 | 100% | 5 | 9
Overall | 8 | 100 | 100% | ~3.4 | 18

Key Findings

1. All 3 guardrails are inactive — highest-priority fix.

None of the configured guardrails are active. This means: no protection against hostile customers, no forced escalation when a customer demands a human, and no safety net when the AI offers to escalate but doesn't follow through. These are standard guardrails that must be active before any production deployment.

Guardrail | Action | Status
Customer is being hostile | ESCALATE | Inactive
Customer wants to talk to a human | ESCALATE | Inactive
AI offered escalation but did not | ADD_ACTION | Inactive
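A pre-deployment check on guardrail status could look like the sketch below. The Guardrail dataclass and its field names are assumptions made for illustration and are not the Lorikeet configuration schema. The trigger text, actions, and status values are taken from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guardrail:
    """Illustrative record of one guardrail; not the platform's real schema."""
    trigger: str
    action: str
    active: bool


# Values copied from the guardrail table in this report.
GUARDRAILS = [
    Guardrail("Customer is being hostile", "ESCALATE", active=False),
    Guardrail("Customer wants to talk to a human", "ESCALATE", active=False),
    Guardrail("AI offered escalation but did not", "ADD_ACTION", active=False),
]


def inactive_guardrails(guardrails):
    """Return the triggers of every guardrail that is still switched off."""
    return [g.trigger for g in guardrails if not g.active]


blocking = inactive_guardrails(GUARDRAILS)
active_count = len(GUARDRAILS) - len(blocking)
print(f"Guardrails active: {active_count} of {len(GUARDRAILS)}")
for trigger in blocking:
    print(f"  inactive: {trigger}")
```

A deployment gate would then refuse to proceed while the list of inactive guardrails is non-empty.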

2. Game Scheduling makes up to 5 unnecessary tool calls after task completion.

After successfully scheduling the requested game, the agent makes 2–5 further get_event_schedule calls in 8 of 12 iterations, searching for other teams (Cowboys, Packers, Lakers, Celtics, Warriors) without being asked. This wastes processing time, could confuse customers with unrequested information, and inflates cost. The workflow needs explicit 'task complete' guidance to prevent speculative searching.

3. Behavioral issues are concentrated in two workflows — Game Scheduling and New Customer Signup.

18 of 100 iterations exhibit behavioral issues, and 17 of them fall in just two workflows: Game Scheduling (8 issues from over-calling) and New Customer Signup (9 issues from imprecise address capture). The remaining issue is in the NHL scenario under Content Availability. Channel Change, Device Troubleshooting, and Password Reset are clean across all 48 of their combined iterations.

4. New Customer Signup accepts imprecise venue addresses in ~56% of iterations.

9 of 16 iterations submitted signups with incomplete addresses ('Columbus, Ohio' or 'downtown area') — missing street address, zip code, and other details that were captured in the remaining iterations. The workflow should enforce complete address validation before submission.
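A validation step of this kind might be sketched as below. The minimum address standard is still an open question for EverPass, so the pattern here is an assumption: it requires a street number and name, a city, a two-letter state code, and a 5-digit ZIP. The third test address is made up for the example.

```python
import re

# Assumed standard (not yet confirmed by EverPass):
# "<number> <street>, <city>, <ST> <ZIP>" with an optional ZIP+4 suffix.
ADDRESS_PATTERN = re.compile(
    r"^\s*\d+\s+[^,]+,\s*[^,]+,\s*[A-Z]{2}\s+\d{5}(?:-\d{4})?\s*$"
)


def is_complete_address(text):
    """Return True only if the text matches the assumed full-address shape."""
    return ADDRESS_PATTERN.match(text) is not None


print(is_complete_address("Columbus, Ohio"))                      # False
print(is_complete_address("downtown area"))                       # False
print(is_complete_address("123 Example St, Columbus, OH 43215"))  # True
```

A regular expression only checks the shape of the input. If EverPass needs to confirm that the address exists, a dedicated address verification service would be required.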

5. Channel Change, Device Troubleshooting, and Password Reset are production-ready workflows.

These three workflow categories demonstrate clean tool usage, correct authorization checks, and efficient resolution. They are the strongest candidates for early production deployment once guardrails are activated.

6. 100% assertion pass rate across all scenarios with configured expected responses.

Every scenario has assertion-based quality scoring configured (expectedResponse criteria), and all 100 iterations pass. This is a strong foundation — the assertions validate that the agent's responses contain the correct key terms (team names, device locations, pricing data, etc.).
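A key-term assertion of the kind described here can be sketched as a case-insensitive containment check. The function name and the sample reply are assumptions for illustration. The platform's actual expectedResponse matching may work differently.

```python
def passes_assertion(response, expected_terms):
    """Return True if every expected key term appears in the response (case-insensitive)."""
    text = response.casefold()
    return all(term.casefold() in text for term in expected_terms)


# Hypothetical agent reply for the Password Reset scenario.
reply = "I've sent a password reset link to your email. It expires in 24 hours."

print(passes_assertion(reply, ["password reset", "24 hours"]))  # True
print(passes_assertion(reply, ["TV3"]))                         # False
```

A check like this looks only at the final response text, which is why an iteration can pass its assertion while still making unnecessary tool calls or capturing an incomplete address.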

What We're Fixing (Lorikeet Side)

# | Issue | Priority | Status
1 | Activate all 3 guardrails: hostile customer escalation, human-request escalation, escalation safety net | High | Planned
2 | Add address validation to New Customer Signup: enforce complete address before submission | High | Planned
3 | Add edge-case scenarios: frustrated customers, device won't reset, invalid promo codes, multi-device scheduling | Medium | Planned
4 | Add adversarial scenarios: test guardrail effectiveness once activated | Medium | Planned

What We Need from EverPass

# | Ask | Why It Matters
1 | Provide the game schedule / event data feed. The agent needs access to the full schedule so it can answer definitively instead of speculatively searching. | This is the root cause of the Game Scheduling over-calling (8 of 12 iterations). Without a complete schedule, the agent guesses what else might be on and makes 2–5 unnecessary API calls. A real data feed eliminates the guesswork entirely.
2 | Authorization escalation flow. When an unauthorized staff member tries to change a channel, should the agent escalate to a manager, or just deny? | The current behavior is deny-only. If bars want a "text your manager for approval" flow, we need to build it.
3 | Top 10–20 real support tickets from your existing Zendesk queue, with expected resolution paths. | Enables us to build realistic simulation scenarios that match actual customer behavior and expand coverage beyond happy paths.
4 | Address validation requirements. What's the minimum acceptable address for a new signup? Full street + city + state + zip, or is city + state enough? | 9 of 16 iterations captured incomplete addresses. We need to know the minimum standard so we can enforce it in the workflow.
5 | Review call this week: 30 minutes to walk through results and align on priorities for the next run. | We want to validate this baseline together and align on priorities for Run 2: guardrail activation, game schedule integration, and edge-case coverage.

Projected Improvement Path

Milestone | Quality Rate | Key Driver
April 14 (Baseline) | ~82% | 100% assertion pass, 100 iterations, 8 scenarios. Behavioral analysis reveals over-calling and address issues.
Next Run (Target) | 90%+ | Guardrails activated, game schedule data integrated, address validation, edge-case scenarios added.
Production Ready | 95%+ | Full scenario coverage + adversarial testing + escalation policy alignment + production Zendesk integration.

Configuration Summary

Component | Details
Workflows (7) | Channel Change, Device Troubleshooting & Reset, Content Availability, Password Reset, Game Scheduling, New Customer Signup (all NL Response) + FAQ catch-all (Reference)
Guardrails (3) | Customer is being hostile (ESCALATE), Customer wants to talk to a human (ESCALATE), AI offered escalation but did not (ADD_ACTION). ALL INACTIVE
KB Articles (4) | Device & Hardware Troubleshooting Guide, How Game Scheduling Works, Troubleshooting: Game Not Showing, Network & Connectivity Requirements
Mock Profiles (6) | Manager Scheduling NFL Sunday, Authorized Manager – Device Offline, Manager Password Reset, World Cup Availability Question, New Bar Owner – Signup Inquiry, Unauthorized Bartender – Channel Change
Simulations (8) | Across 6 workflow categories; see scenario list above
Business Context | Grace, EverPass's AI support agent. Covers device troubleshooting, game scheduling, channel changes, password reset, content availability, new customer signup, and NFL Sunday Ticket for commercial venues.
Channels | Chat Widget, Voice Line (Grace, Australian feminine), Email, SMS