AI & Technology · 7 min read

79% Prefer Humans: The AI Support Quality Gap

Most customers still prefer human support, and AI customer service fails at four times the rate of AI in other domains. But some companies make AI support work. Here's what separates them.


The Stats Paint a Rough Picture

79% of customers prefer human support over AI (SurveyMonkey, 2025). AI-powered customer service fails at 4x the rate of AI in other domains (Qualtrics, 2025). 75% of consumers report frustration with AI customer service (Glance, 2025). In a 2023 Gartner survey, only 8% of customers had actually used a chatbot in their most recent support interaction.

These numbers suggest AI support is a disaster. But the picture is more nuanced than that. Some AI support implementations work well. Most fail because of how they're deployed, not which model they use.

Why Most AI Support Fails

The dominant approach in 2024-2025 was to take a large language model, point it at your help center, and let it generate answers. This is what Intercom Fin, Zendesk AI, and most "AI support" products do.

It fails for predictable reasons.

LLMs generate plausible text, not verified answers. When a customer asks about your refund policy and the help center article is slightly ambiguous, the LLM picks an interpretation and states it confidently. Sometimes that interpretation is wrong. The Air Canada case (where a chatbot invented a bereavement fare policy) is the famous example, but smaller versions of this happen constantly.

Support requires actions, not just answers. A customer doesn't want to read about your refund policy. They want a refund. An LLM can explain how refunds work but can't process one. This creates the frustrating experience of a bot that "understands" your problem but can't actually help.

Context is missing. The AI doesn't know that this particular customer has complained twice before, or that their order was already delayed, or that they're on a premium plan with different terms. Without context, responses feel generic.

Escalation paths are broken. When the AI can't help, customers need to reach a human quickly. Many implementations make this difficult (to keep automation metrics high), which transforms a minor issue into a major frustration.

What the Top 25% Do Differently

The companies where AI support actually works share a few patterns:

They constrain what AI does. Instead of letting an LLM generate freeform answers to any question, they use AI for classification first. The AI identifies what the customer wants (refund, order status, password reset, bug report), then triggers a pre-defined action for that intent. No freeform generation means no hallucinated answers.

They set confidence thresholds. If the AI isn't confident about the customer's intent (below 85-90% confidence), it routes to a human immediately instead of guessing (see the sketch after these patterns). Fast human response beats wrong AI response.

They make escalation easy. A visible, one-click path to a human in every interaction. Not hidden behind "was this helpful? no? try rephrasing your question." If the customer wants a human, they get one.

They separate simple from complex. Order status, business hours, password resets, basic FAQ: AI handles these. Billing disputes, technical issues, complaints, anything involving money or emotions: human handles. Clear boundaries, no gray zone.

They measure satisfaction on AI interactions specifically. Not just overall CSAT. They track satisfaction for AI-resolved conversations separately. If AI satisfaction drops below a threshold, they adjust.
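Taken together, the first four patterns are just a dispatch function. Here's a minimal sketch in Python; the classifier interface, intent names, and the 0.88 threshold are illustrative assumptions, not any particular vendor's API:

from dataclasses import dataclass
from typing import Callable

CONFIDENCE_THRESHOLD = 0.88  # below this, stop guessing and hand off

HUMAN_ONLY = {"billing_dispute", "complaint", "technical_issue"}  # money, emotions

# Pre-defined actions per intent: no freeform generation, so nothing to hallucinate.
ACTIONS: dict[str, Callable[[str], str]] = {
    "order_status": lambda cid: f"Sent order status to customer {cid}",
    "password_reset": lambda cid: f"Emailed reset link to customer {cid}",
}

@dataclass
class Classification:
    intent: str
    confidence: float

def escalate(cid: str, reason: str) -> str:
    # Always-visible, one-click handoff, never buried behind retry loops.
    return f"Routed customer {cid} to a human ({reason})"

def handle(message: str, cid: str, classify: Callable[[str], Classification]) -> str:
    result = classify(message)
    if result.confidence < CONFIDENCE_THRESHOLD:
        return escalate(cid, "low confidence")
    if result.intent in HUMAN_ONLY or result.intent not in ACTIONS:
        return escalate(cid, f"'{result.intent}' is handled by humans")
    return ACTIONS[result.intent](cid)

# Example with a stub classifier:
stub = lambda msg: Classification("password_reset", 0.95)
print(handle("I can't log in", "cust_17", stub))

The point of the structure: every branch either runs a pre-built action or hands off to a person. There is no path where the system invents an answer.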

Classification-First vs Generation-First

This is the fundamental split in the AI support market right now.

Generation-first tools (Intercom Fin, Zendesk AI) use LLMs to read your docs and generate answers. They're flexible and can handle a wide range of questions. But they hallucinate, they're expensive per resolution ($0.99-$1.50), and accuracy depends on your knowledge base quality.

Classification-first tools (like Supp) use purpose-built ML models to identify intent, then trigger pre-set actions. They're faster (100-200ms vs 1-3 seconds), cheaper ($0.20-$0.30 vs $0.99-$1.50 per resolution), and can't hallucinate because they don't generate freeform text. But they don't handle open-ended questions well.

For most small teams, classification-first is the safer bet. 60-70% of support messages have clear intents that map to known actions. The remaining 30-40% go to humans. Total cost is lower, accuracy is higher, and there's zero risk of the AI telling a customer something false.
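To see how the economics play out, here's a back-of-envelope comparison using the per-resolution figures above. The $6.00 fully loaded cost per human-handled ticket and the 65% automation rate are assumptions for illustration, not numbers from the studies cited:

HUMAN_COST = 6.00   # assumed fully loaded cost per human-handled ticket
AUTOMATION = 0.65   # midpoint of the 60-70% of messages with clear intents

def blended(ai_cost: float) -> float:
    """Average cost per ticket when AUTOMATION of tickets resolve via AI."""
    return AUTOMATION * ai_cost + (1 - AUTOMATION) * HUMAN_COST

print(f"classification-first: ${blended(0.25):.2f} per ticket")  # ~$2.26
print(f"generation-first:     ${blended(1.25):.2f} per ticket")  # ~$2.91

Under these assumptions, that's roughly $650 per month in savings at 1,000 tickets, before counting the cost of cleaning up wrong answers.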

What Good Looks Like

A customer messages: "I need to return the shoes I ordered last week."

Bad AI: Generates a paragraph about the return policy, includes some details that might not apply to this specific order, suggests the customer visit a help page.

Good AI: Classifies as "return request." Pulls up the order. Confirms: "I see your order #4521 for the running shoes delivered March 10. Would you like to start a return?" Customer confirms. Return label is emailed automatically. Total time: 30 seconds.

The difference is that the good AI didn't generate an answer. It identified the intent and executed a workflow. Faster, cheaper, and correct.
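Here's what that flow looks like as a workflow rather than a text generator. Everything below, including the order-lookup stub and the function names, is illustrative:

ORDERS = {  # stand-in for a real order-lookup API
    "cust_42": {"id": 4521, "item": "running shoes", "delivered": "March 10"},
}

def send_return_label(order_id: int) -> None:
    print(f"[email] Return label queued for order #{order_id}")

def handle_return(cid: str, confirm) -> str:
    order = ORDERS.get(cid)
    if order is None:
        return "No recent order found. Routing you to a human agent."
    # Confirm against real order data instead of paraphrasing a policy page.
    question = (f"I see your order #{order['id']} for the {order['item']} "
                f"delivered {order['delivered']}. Would you like to start a return?")
    if not confirm(question):
        return "No problem. Routing you to a human agent."
    send_return_label(order["id"])  # pre-built action, not generated text
    return f"Done! A return label for order #{order['id']} is on its way."

print(handle_return("cust_42", confirm=lambda q: True))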

See How Supp Is Different

$5 in free credits. No credit card required. Set up in under 15 minutes.
