| Test Utterance | Topic (Trivial) | Actions (Noisy) | Agent Response | Outcome (Flaky) |
|---|---|---|---|---|
| Which opportunity should I focus on? | | | Acme Q1 Expansion, $500K, closes in 3 days... | Pass |
| Create a follow-up task for Burlington | | | Task created with a follow-up for Burlington... | Pass |
| List custom objects in this org | | | Your org has the following custom objects... | Pass |
| Show open GitHub issues for our repo | | | Found 4 open issues: #32 Fix SOQL injection... | Fail |
| Response was correct: agent returned real GitHub data. LLM judge said "no match." You can't see why. | | | | |
| How many Apex classes do we have? | | | Your org contains 64 Apex classes... | Pass |
| Remember: I prefer bullet points | | | Got it! I've saved your preference for bullet... | Fail |
| Agent stored the preference correctly. Judge failed it: phrasing didn't match expected text exactly. | | | | |
| Read the NDA file attached to Acme | | | The Acme NDA was signed on Jan 15, valid for... | Pass |
| Generate a diagram of custom objects | | | Sorry, I wasn't able to generate the diagram... | Pass |
| Response was wrong: agent failed to generate the diagram. LLM judge passed it anyway. | | | | |
| Schedule a weekly opp check | | | Scheduled! Every Monday at 9 AM I'll check... | Pass |
| Send me a notification about it | | | Notification sent successfully to your... | Fail |