How to Audit Your AI Chatbot's Accuracy
Your chatbot vendor says 80% accuracy. Your customers say it's useless. Here's how to test the real number yourself.
Vendor Accuracy Numbers Are Marketing
Every AI support vendor claims high accuracy. "90% resolution rate." "85% customer satisfaction." "95% correct responses." These numbers come from their best deployments, measured generously, in ideal conditions.
Your results will be different. The only way to know how well your chatbot actually performs is to test it with your real customer messages.
The 100-Ticket Audit
Pull your last 100 customer messages (from email, chat, or whatever channel). Don't cherry-pick. Take the most recent 100 in order. This is your test set.
For each message, run it through your AI chatbot and record: what the AI responded with, what the correct response should have been, and whether the AI's response was correct, partially correct, or wrong.
"Correct" means the customer would have gotten their problem solved without human intervention. "Partially correct" means the AI understood the topic but gave an incomplete or slightly off response. "Wrong" means the AI misunderstood the question or gave a harmful answer.
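If you want a lightweight way to keep these records consistent, a small CSV log works fine. This is a minimal sketch, assuming a simple file-based workflow; the field names and verdict labels are illustrative, not from any vendor tool:

```python
import csv
import os

# Hypothetical audit-log sketch; the field names and verdict labels
# are assumptions for illustration, not from any particular vendor tool.
FIELDS = ["message", "ai_response", "expected_response", "verdict"]
VERDICTS = {"correct", "partial", "wrong", "no_response"}

def record_result(path, message, ai_response, expected_response, verdict):
    """Append one audited ticket to a CSV log, writing a header on first use."""
    if verdict not in VERDICTS:
        raise ValueError(f"verdict must be one of {sorted(VERDICTS)}")
    write_header = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(FIELDS)
        writer.writerow([message, ai_response, expected_response, verdict])
```

A spreadsheet does the same job; the point is that every test message gets the same four fields recorded the same way.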
Scoring
Count your results:
- Correct: the AI nailed it
- Partially correct: right topic, wrong details
- Wrong: misunderstood entirely
- No response/confusion: the AI admitted it didn't know or asked the customer to rephrase
Your accuracy rate is (correct / total). Your "safe" rate is (correct + no response) / total, since admitting uncertainty is better than being confidently wrong.
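The arithmetic is simple enough to sanity-check in a few lines. A minimal sketch, assuming verdict counts from a 100-ticket audit (the example numbers are made up):

```python
def score_audit(counts):
    """Compute accuracy and 'safe' rate from verdict counts.

    counts: dict with keys 'correct', 'partial', 'wrong', 'no_response'.
    Accuracy counts only fully correct answers; the safe rate also
    credits the AI for admitting it didn't know.
    """
    total = sum(counts.values())
    accuracy = counts["correct"] / total
    safe_rate = (counts["correct"] + counts["no_response"]) / total
    return accuracy, safe_rate

# Illustrative example: 72 correct, 10 partial, 12 wrong, 6 "I don't know"
acc, safe = score_audit({"correct": 72, "partial": 10, "wrong": 12, "no_response": 6})
# acc = 0.72, safe = 0.78
```

Note that partially correct answers count toward neither number: if the customer still needed a human, the ticket wasn't resolved.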
Benchmarks: below 70% accuracy, the chatbot is hurting more than helping. 70-80% is mediocre. 80-90% is solid. Above 90% is excellent. Most production chatbots land between 60% and 80% on honest testing.
What to Look For in Failures
Group the wrong answers by category. You'll find patterns:
Topic confusion. The AI confuses billing questions with account questions, or returns with exchanges. This usually means the underlying categories or knowledge base articles overlap.
Hallucinated details. The AI invents specific information (wrong prices, wrong policies, made-up features). This is the most dangerous failure mode. One confidently wrong answer can cost you a customer or, in the Air Canada case, an $812 payout.
Missed context. The AI gives a generic answer when the customer's message implied specific context. "I want to return my order" gets a general return policy answer instead of pulling up the specific order.
Edge cases. Unusual requests, multi-part questions, or questions that combine two topics. AI struggles with "I want to return item A and get a different size for item B" because it's two intents in one message.
Fix the Biggest Gaps First
Rank your failure categories by volume. If 15 of your 100 test messages were billing questions and the AI got 10 of them wrong, that's your biggest opportunity.
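Tallying and ranking the failure categories is a one-liner once your audit records a category per ticket. A minimal sketch, assuming you've tagged each failure with a category label (the categories and data here are illustrative):

```python
from collections import Counter

# Illustrative audit rows: (category, verdict) pairs from the failure review.
failures = [
    ("billing", "wrong"), ("billing", "wrong"), ("returns", "partial"),
    ("billing", "wrong"), ("shipping", "wrong"), ("returns", "wrong"),
]

# Count only outright wrong answers, then rank categories by volume.
wrong_by_category = Counter(cat for cat, verdict in failures if verdict == "wrong")
for category, count in wrong_by_category.most_common():
    print(category, count)
# The category at the top of the list is your biggest opportunity.
```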
For knowledge-base chatbots: improve the billing articles, add more FAQ entries, and make sure there's no contradictory information.
For classification-based tools: check if the billing intents are configured correctly and if the automated responses match what customers actually need.
Repeat Monthly
Your customer questions change over time. New features create new questions. Pricing changes create new billing inquiries. Seasonal shifts change what people ask about.
Run the 100-ticket audit monthly. Track your accuracy over time. If it drops, investigate which categories degraded. This 30-minute monthly exercise is worth more than any vendor dashboard.
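Tracking the trend can be as simple as a dict of month-to-accuracy entries and a check for drops. A minimal sketch; the 5-point threshold and the example history are assumptions, not a standard:

```python
# Illustrative monthly audit history (month -> accuracy rate).
monthly_accuracy = {
    "2024-01": 0.78,
    "2024-02": 0.81,
    "2024-03": 0.72,  # something degraded here
}

def flag_drops(history, threshold=0.05):
    """Return months where accuracy fell by more than `threshold`
    relative to the previous month's audit."""
    months = sorted(history)
    return [
        month for prev, month in zip(months, months[1:])
        if history[prev] - history[month] > threshold
    ]

print(flag_drops(monthly_accuracy))  # ['2024-03']
```

When a month gets flagged, re-run the category grouping from your failure review to see which topics degraded.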
When to Fire Your Chatbot
If, after two months of tuning, your accuracy is still below 70% on honest testing, the tool isn't right for your support volume or complexity. Some businesses have support queries that AI just doesn't handle well yet. Complex B2B technical support, highly regulated industries, and situations requiring lots of account-specific context are common trouble spots.
Better to have no chatbot than a bad one. A bad chatbot actively drives customers away. No chatbot just means slower responses.