How to Audit Your AI Chatbot's Accuracy
Your chatbot vendor says 80% accuracy. Your customers say it's useless. Here's how to test the real number yourself.
Vendor Accuracy Numbers Are Marketing
Every AI support vendor claims high accuracy. "90% resolution rate." "85% customer satisfaction." "95% correct responses." These numbers come from their best deployments, measured generously, in ideal conditions.
Your results will be different. The only way to know how well your chatbot actually performs is to test it with your real customer messages.
The 100-Ticket Audit
Pull your last 100 customer messages (from email, chat, or whatever channel). Don't cherry-pick. Take the most recent 100 in order. This is your test set.
For each message, run it through your AI chatbot and record: what the AI responded with, what the correct response should have been, and whether the AI's response was correct, partially correct, or wrong.
"Correct" means the customer would have gotten their problem solved without human intervention. "Partially correct" means the AI understood the topic but gave an incomplete or slightly off response. "Wrong" means the AI misunderstood the question or gave a harmful answer.
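If you want a lightweight way to keep these records consistent, a small CSV log works fine. This is a minimal sketch, assuming a simple file-based workflow; the field names and verdict labels are illustrative, not from any vendor tool:

```python
import csv
import os

# Hypothetical audit-log sketch; the field names and verdict labels
# are assumptions for illustration, not from any particular vendor tool.
FIELDS = ["message", "ai_response", "expected_response", "verdict"]
VERDICTS = {"correct", "partial", "wrong", "no_response"}

def record_result(path, message, ai_response, expected_response, verdict):
    """Append one audited ticket to a CSV log, writing a header on first use."""
    if verdict not in VERDICTS:
        raise ValueError(f"verdict must be one of {sorted(VERDICTS)}")
    write_header = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(FIELDS)
        writer.writerow([message, ai_response, expected_response, verdict])
```

A spreadsheet does the same job; the point is that every test message gets the same four fields recorded the same way.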
Scoring
Count your results:
- Correct: the AI nailed it
- Partially correct: right topic, wrong details
- Wrong: misunderstood entirely
- No response/confusion: the AI admitted it didn't know or asked the customer to rephrase
Your accuracy rate is (correct / total). Your "safe" rate is (correct + no response) / total, since admitting uncertainty is better than being confidently wrong.
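The arithmetic is simple enough to sanity-check in a few lines. A minimal sketch, assuming verdict counts from a 100-ticket audit (the example numbers are made up):

```python
def score_audit(counts):
    """Compute accuracy and 'safe' rate from verdict counts.

    counts: dict with keys 'correct', 'partial', 'wrong', 'no_response'.
    Accuracy counts only fully correct answers; the safe rate also
    credits the AI for admitting it didn't know.
    """
    total = sum(counts.values())
    accuracy = counts["correct"] / total
    safe_rate = (counts["correct"] + counts["no_response"]) / total
    return accuracy, safe_rate

# Illustrative example: 72 correct, 10 partial, 12 wrong, 6 "I don't know"
acc, safe = score_audit({"correct": 72, "partial": 10, "wrong": 12, "no_response": 6})
# acc = 0.72, safe = 0.78
```

Note that partially correct answers count toward neither number: if the customer still needed a human, the ticket wasn't resolved.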
Benchmarks: below 70% accuracy, the chatbot is hurting more than helping. 70-80% is mediocre. 80-90% is solid. Above 90% is excellent. Most production chatbots land between 60% and 80% on honest testing.
What to Look For in Failures
Group the wrong answers by category. You'll find patterns:
Topic confusion. The AI confuses billing questions with account questions, or returns with exchanges. This usually means the underlying categories or knowledge base articles overlap.
Hallucinated details. The AI invents specific information (wrong prices, wrong policies, made-up features). This is the most dangerous failure mode. One confidently wrong answer can cost you a customer or, in the Air Canada case, an $812 payout.
Missed context. The AI gives a generic answer when the customer's message implied specific context. "I want to return my order" gets a general return policy answer instead of pulling up the specific order.
Edge cases. Unusual requests, multi-part questions, or questions that combine two topics. AI struggles with "I want to return item A and get a different size for item B" because it's two intents in one message.
Fix the Biggest Gaps First
Rank your failure categories by volume. If 15 of your 100 test messages were billing questions and the AI got 10 of them wrong, that's your biggest opportunity.
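Tallying and ranking the failure categories is a one-liner once your audit records a category per ticket. A minimal sketch, assuming you've tagged each failure with a category label (the categories and data here are illustrative):

```python
from collections import Counter

# Illustrative audit rows: (category, verdict) pairs from the failure review.
failures = [
    ("billing", "wrong"), ("billing", "wrong"), ("returns", "partial"),
    ("billing", "wrong"), ("shipping", "wrong"), ("returns", "wrong"),
]

# Count only outright wrong answers, then rank categories by volume.
wrong_by_category = Counter(cat for cat, verdict in failures if verdict == "wrong")
for category, count in wrong_by_category.most_common():
    print(category, count)
# The category at the top of the list is your biggest opportunity.
```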
For knowledge-base chatbots: improve the billing articles, add more FAQ entries, and make sure there's no contradictory information.
For classification-based tools: check if the billing intents are configured correctly and if the automated responses match what customers actually need.
Repeat Monthly
Your customer questions change over time. New features create new questions. Pricing changes create new billing inquiries. Seasonal shifts change what people ask about.
Run the 100-ticket audit monthly. Track your accuracy over time. If it drops, investigate which categories degraded. This 30-minute monthly exercise is worth more than any vendor dashboard.
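Tracking the trend can be as simple as a dict of month-to-accuracy entries and a check for drops. A minimal sketch; the 5-point threshold and the example history are assumptions, not a standard:

```python
# Illustrative monthly audit history (month -> accuracy rate).
monthly_accuracy = {
    "2024-01": 0.78,
    "2024-02": 0.81,
    "2024-03": 0.72,  # something degraded here
}

def flag_drops(history, threshold=0.05):
    """Return months where accuracy fell by more than `threshold`
    relative to the previous month's audit."""
    months = sorted(history)
    return [
        month for prev, month in zip(months, months[1:])
        if history[prev] - history[month] > threshold
    ]

print(flag_drops(monthly_accuracy))  # ['2024-03']
```

When a month gets flagged, re-run the category grouping from your failure review to see which topics degraded.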
When to Fire Your Chatbot
If, after two months of tuning, your accuracy is still below 70% on honest testing, the tool isn't right for your support volume or complexity. Some businesses have support queries that AI just doesn't handle well yet. Complex B2B technical support, highly regulated industries, and situations requiring lots of account-specific context are common trouble spots.
Better to have no chatbot than a bad one. A bad chatbot actively drives customers away. No chatbot just means slower responses.