Lorikeet AI Agent for EverPass -- Simulation Report

Executive Summary

Metric	Value
Total Simulations	100
Operational Pass Rate	100% (zero errors)
Behavioral Quality Rate	~82%
Escalation Rate	0%
Workflow Categories	6
Scenarios Tested	8
Guardrails Active	0 of 3

This is the baseline simulation run of the EverPass AI customer support agent (Grace). We ran 100 simulations (8 scenarios: 12 iterations each for seven of them and 16 for New Customer Signup) across 6 workflow categories covering channel changes, device troubleshooting, content availability, password reset, game scheduling, and new customer signup. The agent achieved 100% operational stability (zero errors, zero crashes) and a 100% assertion pass rate across all 100 iterations.

The ~82% behavioral quality rate reflects our analysis of agent behavior across iterations. In total, 18 of 100 iterations exhibited issues, almost all of them in two workflows: unnecessary API calls in Game Scheduling (2–5 unprompted searches after task completion) and imprecise address capture in New Customer Signup. The 0% escalation rate is expected for this baseline, since none of the scenarios are designed to trigger escalation. Note, however, that all 3 guardrails remain inactive.

Methodology

Parameter	Value
Simulation Engine	Lorikeet Simulation Platform
Total Iterations	100 (7 scenarios × 12 iterations, plus New Customer Signup with 16)
Batch Structure	6 batches, grouped by workflow
Channel	Chat
Language	English
Workflows	6 Response workflows + 1 FAQ catch-all (Reference)
Environment	Sandbox
Mock Profiles	6 workflow-specific test customer personas
Quality Scoring	Assertion-based + behavioral analysis

Quality scoring methodology: The 100% assertion pass rate confirms the agent produces correct responses containing expected key terms across all 100 iterations. The behavioral quality rate (~82%) is derived from analysis of actualActions data: tool usage efficiency, unnecessary API calls, and conversation flow patterns. Approximately 18 of 100 iterations exhibited behavioral issues (over-calling, imprecise data capture) despite passing assertions.
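
The arithmetic behind the ~82% figure can be reproduced from the per-workflow counts reported in the results tables. The sketch below assumes the rate is simply the share of iterations with no flagged behavioral issue, and that each flagged issue corresponds to exactly one iteration. The dictionary layout and the function name are illustrative and are not part of the Lorikeet platform.

```python
# Per-workflow counts copied from the results tables in this report.
# Layout: workflow -> (iterations, iterations flagged with a behavioral issue)
RESULTS = {
    "Channel Change": (24, 0),
    "Device Troubleshooting": (12, 0),
    "Content Availability": (24, 1),
    "Password Reset": (12, 0),
    "Game Scheduling": (12, 8),
    "New Customer Signup": (16, 9),
}


def behavioral_quality_rate(results):
    """Share of iterations with no flagged behavioral issue (assumed definition)."""
    total = sum(iterations for iterations, _ in results.values())
    flagged = sum(issues for _, issues in results.values())
    return (total - flagged) / total


rate = behavioral_quality_rate(RESULTS)
print(f"{rate:.0%}")  # 82%
```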

Results by Workflow Category

1. Channel Change — 2 Scenarios, 24 Iterations

Scenario	Iterations	Assertion Pass	Avg Tool Calls	Behavioral Issues
Happy Path (Bills Game)	12	100%	4.1	0
Unauthorized Staff	12	100%	3.2	0
Category Total	24	100%	~3.7	0

Both Channel Change scenarios pass every assertion with no behavioral issues. The Happy Path correctly identifies the account, finds the live Bills game, and changes the channel on TV1, averaging 4.1 tool calls. The Unauthorized Staff scenario correctly checks authorization, finds that the bartender lacks permissions, and denies the request. This is the strongest workflow category.

2. Device Troubleshooting & Reset — 1 Scenario, 12 Iterations

Scenario	Iterations	Assertion Pass	Avg Tool Calls	Behavioral Issues
Device Reset – Happy Path	12	100%	3.1	0
Category Total	12	100%	~3.1	0

Device troubleshooting follows an ideal flow: lookup account → check device status → reset device. The agent correctly identifies TV3 in the Private Room as offline with ERR_STREAM_TIMEOUT, initiates a remote reset, and confirms estimated 30-second recovery. Clean, action-oriented support.

3. Content Availability — 2 Scenarios, 24 Iterations

Scenario	Iterations	Assertion Pass	Avg Tool Calls	Behavioral Issues
World Cup 2026	12	100%	2.1	0
NHL (Not Available)	12	100%	2.8	1
Category Total	24	100%	~2.5	1

Content Availability handles both positive and negative cases well. The World Cup scenario confirms upcoming matches (USA vs England, Mexico vs Brazil, Final) and uses KB search to supplement. The NHL scenario correctly identifies that NHL content is not available on EverPass: although the event schedule API returns only FIFA results for an NHL query, the agent interprets this correctly and informs the customer. One of the 12 NHL iterations was flagged for a behavioral issue.

4. Password Reset — 1 Scenario, 12 Iterations

Scenario	Iterations	Assertion Pass	Avg Tool Calls	Behavioral Issues
Password Reset – Happy Path	12	100%	1.0	0
Category Total	12	100%	1.0	0

Password Reset is the most efficient workflow — a single tool call (send_password_reset) resolves the customer's issue. The agent collects the email, triggers the reset, and confirms the 24-hour expiry. Clean execution with zero unnecessary steps.

5. Game Scheduling — 1 Scenario, 12 Iterations

Scenario	Iterations	Assertion Pass	Avg Tool Calls	Behavioral Issues
Chiefs vs 49ers	12	100%	6.3	8
Category Total	12	100%	~6.3	8

Game Scheduling passes its assertions across all 12 iterations but exhibits a significant behavioral issue in 8 of 12 iterations (~67%): after completing the primary request in 3 tool calls, the agent makes 2–5 additional unnecessary get_event_schedule calls — speculatively searching for other teams and sports without being asked. This inflates the average tool call count to 6.3 (vs the ideal ~3). The agent needs access to the full game schedule to stop guessing what else might be on.
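
The over-calling pattern described above could be flagged automatically from each iteration's action log. The following is a minimal sketch, not the platform's actual analysis: the schema of actualActions is not shown in this report, so an ordered list of tool-name strings is assumed, and lookup_account and schedule_game are hypothetical tool names. Only get_event_schedule and the ideal of about 3 calls come from the report.

```python
def count_extra_schedule_calls(tool_calls, ideal_calls=3):
    """Count get_event_schedule calls made after the first `ideal_calls` calls.

    `tool_calls` is an ordered list of tool names for one iteration
    (assumed shape; the real actualActions schema may differ).
    """
    trailing = tool_calls[ideal_calls:]
    return sum(1 for name in trailing if name == "get_event_schedule")


# Hypothetical iteration: 3 calls finish the task, then 5 speculative lookups.
iteration = [
    "lookup_account",       # hypothetical tool name
    "get_event_schedule",
    "schedule_game",        # hypothetical tool name
] + ["get_event_schedule"] * 5

extra = count_extra_schedule_calls(iteration)
print(extra)  # 5
```

Counting only the calls that follow the ideal sequence keeps the legitimate schedule lookup inside the primary request from being flagged.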

6. New Customer Signup — 1 Scenario, 16 Iterations

Scenario	Iterations	Assertion Pass	Avg Tool Calls	Behavioral Issues
Bar Owner Wants Sunday Ticket	16	100%	5	9
Category Total	16	100%	5	9

New Customer Signup is the most complex workflow. In every iteration the agent searches the KB for pricing info (3 searches), generates a pricing quote based on Fire Code Occupancy, and submits the signup. All 16 iterations correctly apply the Early Bird NFL 2026 promotion (15% off, $6,374.15 vs $7,499 base). However, 9 of 16 iterations (~56%) capture imprecise venue addresses ('Columbus, Ohio' or 'downtown area' instead of a full street address). The workflow should enforce complete address collection before submitting signups.
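
The promotional price quoted above can be checked with a few lines of decimal arithmetic. This sketch covers only the discount step. How the $7,499 base is derived from Fire Code Occupancy is not described in this report, and the half-up rounding to whole cents is an assumption.

```python
from decimal import Decimal, ROUND_HALF_UP

BASE_PRICE = Decimal("7499.00")        # base price quoted in the report
EARLY_BIRD_DISCOUNT = Decimal("0.15")  # Early Bird NFL 2026: 15% off


def discounted_price(base, discount):
    """Apply a fractional discount and round to whole cents (assumed rounding)."""
    price = base * (Decimal("1") - discount)
    return price.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


print(discounted_price(BASE_PRICE, EARLY_BIRD_DISCOUNT))  # 6374.15
```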

Overall Results Summary

Workflow Category	Scenarios	Iterations	Assertion Pass	Avg Tool Calls	Behavioral Issues
Channel Change	2	24	100%	~3.7	0
Device Troubleshooting	1	12	100%	~3.1	0
Content Availability	2	24	100%	~2.5	1
Password Reset	1	12	100%	1.0	0
Game Scheduling	1	12	100%	6.3	8
New Customer Signup	1	16	100%	5	9
Overall	8	100	100%	~3.4	18

Key Findings

1. All 3 guardrails are inactive — highest-priority fix.

None of the configured guardrails are active. This means: no protection against hostile customers, no forced escalation when a customer demands a human, and no safety net when the AI offers to escalate but doesn't follow through. These are standard guardrails that must be active before any production deployment.

Guardrail	Action	Status
Customer is being hostile	ESCALATE	Inactive
Customer wants to talk to a human	ESCALATE	Inactive
AI offered escalation but did not	ADD_ACTION	Inactive
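
One way to make this finding hard to miss in future runs is a pre-deployment check over the guardrail configuration. The sketch below mirrors the table as plain data. The Guardrail class and both helper functions are hypothetical and do not represent the Lorikeet configuration API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guardrail:
    trigger: str
    action: str   # "ESCALATE" or "ADD_ACTION", as listed in the table
    active: bool


# Current state as reported in this baseline run.
GUARDRAILS = [
    Guardrail("Customer is being hostile", "ESCALATE", active=False),
    Guardrail("Customer wants to talk to a human", "ESCALATE", active=False),
    Guardrail("AI offered escalation but did not", "ADD_ACTION", active=False),
]


def inactive_guardrails(guardrails):
    """Return the triggers of guardrails that are not active."""
    return [g.trigger for g in guardrails if not g.active]


def ready_for_production(guardrails):
    """Hypothetical gate: every configured guardrail must be active."""
    return not inactive_guardrails(guardrails)


print(len(inactive_guardrails(GUARDRAILS)), "of", len(GUARDRAILS), "inactive")  # 3 of 3 inactive
print(ready_for_production(GUARDRAILS))  # False
```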

2. Game Scheduling makes up to 5 unnecessary tool calls after task completion.

After successfully scheduling the requested game, the agent runs up to 5 additional searches for other teams (Cowboys, Packers, Lakers, Celtics, Warriors) without being asked. This happens in 8 of 12 iterations. It wastes processing time, could confuse customers with unrequested information, and inflates cost. The workflow needs explicit 'task complete' guidance to prevent speculative searching.

3. Behavioral issues are concentrated in two workflows — Game Scheduling and New Customer Signup.

18 of 100 iterations exhibit behavioral issues, and 17 of those fall in just two workflows: Game Scheduling (8 issues from over-calling) and New Customer Signup (9 issues from imprecise address capture). The remaining issue is in the Content Availability NHL scenario. Channel Change, Device Troubleshooting, and Password Reset are clean across all 48 of their combined iterations.

4. New Customer Signup accepts imprecise venue addresses in ~56% of iterations.

9 of 16 iterations submitted signups with incomplete addresses ('Columbus, Ohio' or 'downtown area') — missing street address, zip code, and other details that were captured in the remaining iterations. The workflow should enforce complete address validation before submission.
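
A completeness check of this kind could run before the signup is submitted. The sketch below assumes the strictest option under discussion (street number and name, city, two-letter state, and ZIP) and a comma-separated US format. The minimum standard is still an open question for EverPass, and the pattern, the function name, and the sample street address are all illustrative.

```python
import re

# Assumed minimum: street number + street name, city, 2-letter state, 5-digit ZIP.
# The real standard has not been agreed yet.
ADDRESS_PATTERN = re.compile(
    r"^\s*\d+\s+[^,]+,"          # street number and name
    r"\s*[^,]+,"                 # city
    r"\s*[A-Z]{2}\s+"            # state abbreviation
    r"\d{5}(?:-\d{4})?\s*$"      # ZIP or ZIP+4
)


def is_complete_address(address):
    """True if the address matches the assumed 'street, city, ST ZIP' shape."""
    return ADDRESS_PATTERN.match(address) is not None


print(is_complete_address("Columbus, Ohio"))                   # False
print(is_complete_address("downtown area"))                    # False
print(is_complete_address("123 Main St, Columbus, OH 43215"))  # True (made-up address)
```

A pattern like this only checks shape, not whether the address exists. A production workflow would more likely have the agent ask a follow-up question for the missing parts or call an address verification service.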

5. Channel Change, Device Troubleshooting, and Password Reset are the closest to production-ready.

These three workflow categories demonstrate clean tool usage, correct authorization checks, and efficient resolution. They are the strongest candidates for early production deployment once guardrails are activated.

6. 100% assertion pass rate across all scenarios with configured expected responses.

Every scenario has assertion-based quality scoring configured (expectedResponse criteria), and all 100 iterations pass. This is a strong foundation — the assertions validate that the agent's responses contain the correct key terms (team names, device locations, pricing data, etc.).
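
To illustrate what a key-term assertion checks, and why it can pass while behavioral issues go unnoticed, here is a minimal sketch. It assumes simple case-insensitive substring matching. The platform's actual expectedResponse evaluation may work differently, and the sample response and term lists are hypothetical.

```python
def passes_assertion(response, expected_terms):
    """True if every expected key term appears in the response (case-insensitive)."""
    text = response.lower()
    return all(term.lower() in text for term in expected_terms)


# Hypothetical response and terms, loosely modelled on the Device Reset scenario.
response = "TV3 in the Private Room is offline. I've started a remote reset."
print(passes_assertion(response, ["TV3", "Private Room", "reset"]))  # True
print(passes_assertion(response, ["TV1"]))                           # False
```

A check like this looks only at the final response text, so extra tool calls or an incomplete address captured along the way would not cause a failure.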

What We're Fixing (Lorikeet Side)

#	Issue	Priority	Status
1	Activate all 3 guardrails — hostile customer escalation, human-request escalation, escalation safety net	High	Planned
2	Add address validation to New Customer Signup — enforce complete address before submission	High	Planned
3	Add edge-case scenarios — frustrated customers, device won't reset, invalid promo codes, multi-device scheduling	Medium	Planned
4	Add adversarial scenarios — test guardrail effectiveness once activated	Medium	Planned

What We Need from EverPass

#	Ask	Why It Matters
1	Provide the game schedule / event data feed — the agent needs access to the full schedule so it can answer definitively instead of speculatively searching	This is the root cause of the Game Scheduling over-calling (8 of 12 iterations). Without a complete schedule, the agent guesses what else might be on and makes 2–5 unnecessary API calls. A real data feed eliminates the guesswork entirely.
2	Authorization escalation flow — when an unauthorized staff member tries to change a channel, should the agent escalate to a manager, or just deny?	The current behavior is deny-only. If bars want a "text your manager for approval" flow, we need to build it.
3	Top 10–20 real support tickets — from your existing Zendesk queue, with expected resolution paths	Enables us to build realistic simulation scenarios that match actual customer behavior and expand coverage beyond happy paths.
4	Address validation requirements — what's the minimum acceptable address for a new signup? Full street + city + state + zip, or is city + state enough?	9 of 16 iterations captured incomplete addresses. We need to know the minimum standard so we can enforce it in the workflow.
5	Review call this week — 30 minutes to walk through results and align on priorities for the next run	We want to validate this baseline together and align on priorities for Run 2 — guardrail activation, game schedule integration, and edge-case coverage.

Projected Improvement Path

Baseline · Apr 14: ~82%. 100% assertion pass, 100 iterations, 8 scenarios. Behavioral analysis reveals over-calling and address issues.

Next Run · Target: 90%+. Guardrails activated, game schedule data integrated, address validation, edge-case scenarios added.

Production Ready: 95%+. Full scenario coverage + adversarial testing + escalation policy alignment + production Zendesk integration.

Milestone	Quality Rate	Key Driver
April 14 (Baseline)	~82%	100% assertion pass, 100 iterations, 8 scenarios, behavioral analysis.
Next Run (Target)	90%+	Guardrails activated, game schedule data integrated, address validation, edge cases.
Production Ready	95%+	Full scenario coverage + adversarial testing + production Zendesk integration.

Configuration Summary

Component	Details
Workflows (7)	Channel Change, Device Troubleshooting & Reset, Content Availability, Password Reset, Game Scheduling, New Customer Signup (all NL Response) + FAQ catch-all (Reference)
Guardrails (3)	Customer is being hostile (ESCALATE), Customer wants to talk to a human (ESCALATE), AI offered escalation but did not (ADD_ACTION) — ALL INACTIVE
KB Articles (4)	Device & Hardware Troubleshooting Guide, How Game Scheduling Works, Troubleshooting: Game Not Showing, Network & Connectivity Requirements
Mock Profiles (6)	Manager Scheduling NFL Sunday, Authorized Manager – Device Offline, Manager Password Reset, World Cup Availability Question, New Bar Owner – Signup Inquiry, Unauthorized Bartender – Channel Change
Simulations (8)	Across 6 workflow categories — see scenario list above
Business Context	Grace, EverPass's AI support agent. Covers device troubleshooting, game scheduling, channel changes, password reset, content availability, new customer signup, and NFL Sunday Ticket for commercial venues.
Channels	Chat Widget, Voice Line (Grace, Australian feminine), Email, SMS