🧐 Naive Bayes Chatbot: Step-by-Step Example


What is Naive Bayes?

Naive Bayes is a simple but powerful machine learning algorithm used for classification tasks. It's called "naive" because it assumes that all features (in our case, words) are independent of each other given the category, which isn't strictly true in real language but works surprisingly well!

🎯 How it Works for Chatbots:

1. Training Phase: The model learns from examples of user messages and their correct categories (intents)

2. Prediction Phase: When a new message comes in, it calculates the probability that the message belongs to each category

3. Decision: It picks the category with the highest probability

Formula:

P(category | words) = P(words | category) × P(category) / P(words)

In simple terms: "What's the chance this message belongs to this category, given the words it contains?"
Why it's perfect for chatbots:
• Fast and efficient for text classification
• Works well with small datasets
• Easy to understand and implement
• Handles new/unseen words gracefully with smoothing
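Before walking through the numbers, here is a minimal sketch (in Python) of how that formula becomes a decision rule. The priors and word_probs values below are purely illustrative placeholders, not numbers taken from any real training data:

# Minimal sketch of the Naive Bayes decision rule (illustrative numbers only).
priors = {"greeting": 0.6, "order_pizza": 0.2, "order_burger": 0.2}  # P(category)
word_probs = {                                                       # P(word | category)
    "greeting":     {"hi": 0.20, "pizza": 0.07},
    "order_pizza":  {"hi": 0.06, "pizza": 0.12},
    "order_burger": {"hi": 0.06, "pizza": 0.06},
}

def predict(words):
    scores = {}
    for category, prior in priors.items():
        score = prior                              # start with P(category)
        for w in words:
            # multiply in P(word | category); unseen words get a tiny default
            score *= word_probs[category].get(w, 0.01)
        scores[category] = score                   # proportional to P(category | words)
    return max(scores, key=scores.get)             # pick the highest score

print(predict(["hi"]))   # greeting

Notice that P(words) from the formula is skipped: it is the same for every category, so it never changes which category wins.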

Let's See It in Action!

Now let's walk through a complete example to see how Naive Bayes predicts the intent of a user message step by step.

Input Sentence

"hi can I order pizza"

Training Data

data = [
    ("hi", "greeting"),
    ("hello", "greeting"),
    ("hi there", "greeting"),
    ("can I order a pizza", "order_pizza"),
    ("I'd like a burger", "order_burger"),
]
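As a quick sketch (assuming simple lowercase whitespace tokenization, so "I'd" stays one token), the frequency table below can be built from this data with a couple of Counters:

from collections import Counter, defaultdict

# Count how often each word appears in each category.
counts = defaultdict(Counter)
for sentence, category in data:
    counts[category].update(sentence.lower().split())

print(counts["greeting"])       # Counter({'hi': 2, 'hello': 1, 'there': 1})
print(counts["order_pizza"])    # Counter({'can': 1, 'i': 1, 'order': 1, 'a': 1, 'pizza': 1})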

Word Frequency Table (Before Smoothing)

Word        Greeting    Order_Pizza    Order_Burger
hi          2           0              0
hello       1           0              0
there       1           0              0
can         0           1              0
i           0           1              0
order       0           1              0
a           0           1              1
pizza       0           1              0
i'd         0           0              1
like        0           0              1
burger      0           0              1

What is Laplace Smoothing? 🤔

Problem: What happens if a word appears in our test sentence but NEVER appears in a training category?

Without smoothing: P(word | category) = 0/total = 0
This makes the entire probability = 0 (since we multiply all probabilities together)

Solution: Add +1 to every word count (even if it's 0) to avoid zero probabilities!

📊 Let's Count Everything:

🔤 All unique words in our vocabulary:

["hi", "hello", "there", "can", "i", "order", "a", "pizza", "i'd", "like", "burger"] → That's 11 unique words total

📝 Words that appear in "greeting" category:

Training sentences for "greeting":
• "hi" → 1 word
• "hello" → 1 word
• "hi there" → 2 words (hi + there)

Total = 1 + 1 + 2 = 4 words
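Both counts are easy to double-check in code. This sketch assumes the data list and tokenization from the earlier snippets are still in scope:

# Verify the vocabulary size and the "greeting" word total.
vocabulary = {w for sentence, _ in data for w in sentence.lower().split()}
greeting_words = [w for sentence, category in data if category == "greeting"
                  for w in sentence.lower().split()]

print(len(vocabulary))       # 11 unique words
print(len(greeting_words))   # 4 words across the "greeting" sentences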

Step-by-Step Calculation for "greeting"

🧮 Laplace Smoothing Formula:

P(word | category) = (word_count_in_category + 1) / (total_words_in_category + vocabulary_size)

For the "greeting" category:
• total_words_in_category = 4
• vocabulary_size = 11
• denominator = 4 + 11 = 15
Why add vocabulary_size to denominator?
Since we add +1 to EVERY word in our vocabulary (11 words), we need to add +11 to the denominator to keep the probabilities valid (they must sum to 1).
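As a sketch, the formula translates directly into a small helper (reusing the counts and vocabulary built in the earlier snippets; the name smoothed_prob is just illustrative):

def smoothed_prob(word, category):
    word_count = counts[category][word]              # Counter returns 0 for unseen words
    total_words = sum(counts[category].values())     # e.g. 4 for "greeting"
    return (word_count + 1) / (total_words + len(vocabulary))

print(round(smoothed_prob("can", "greeting"), 3))    # 0.067 -> (0 + 1) / (4 + 11)

Even "can", which never appears in a greeting sentence, now gets a small non-zero probability instead of 0.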

Input words:

["hi", "can", "i", "order", "pizza"]

Word-by-word Probabilities with Smoothing:

🔍 Let's break down "hi" step by step:

Looking at our frequency table:
• "hi" appears 2 times in the "greeting" category
• With Laplace smoothing: (2 + 1) = 3
• Denominator: (4 + 11) = 15
• P(hi | greeting) = 3/15 = 0.200
Why "hi" = 2 + 1?
• Original count: "hi" appears 2 times in greeting sentences
• Laplace smoothing: Add +1 to avoid zero probabilities
• Final count: 2 + 1 = 3
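The same helper reproduces the whole row of input-word probabilities used in the next step (again a sketch reusing smoothed_prob from above):

for w in ["hi", "can", "i", "order", "pizza"]:
    print(w, round(smoothed_prob(w, "greeting"), 3))
# hi 0.2, then 0.067 for each of the other four words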

Multiply all probabilities (assuming independence):

Total P(words | greeting)
= 0.200 × 0.067 × 0.067 × 0.067 × 0.067
= 0.200 × 0.0000202
≈ 0.00000403

→ A very small probability.
So even though "hi" is a strong match for greeting, the other words don't appear in any training sentences for greeting, so the overall probability becomes very low.
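The multiplication itself is a one-liner (again a sketch reusing smoothed_prob). Using the exact fraction 1/15 instead of the rounded 0.067 gives a slightly smaller result, about 0.0000040:

import math

input_words = ["hi", "can", "i", "order", "pizza"]
likelihood = math.prod(smoothed_prob(w, "greeting") for w in input_words)
print(likelihood)   # ~3.95e-06, i.e. P(words | greeting) is tiny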

Quick Summary: Why Laplace Smoothing?

✅ With Laplace Smoothing:
Every word gets at least a small non-zero probability, so a single unseen word can't wipe out the whole calculation.

❌ Without Laplace Smoothing:
One word that never appears in a category's training sentences gives P(word | category) = 0, which forces the entire product, and therefore the category's score, to 0.

Real-world analogy: It's like giving everyone a "participation trophy" of +1, so nobody gets completely ignored, even if they never showed up to practice! 🏆

Final Prediction

→ The model predicts: order_pizza 🍕
Because words like "order", "pizza", and "can" are strongly associated with that category.
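To tie everything together, here is a compact, self-contained end-to-end sketch of the whole walkthrough. It uses the same whitespace tokenization, takes P(category) as the fraction of training sentences in each category, and works with log-probabilities to avoid multiplying many tiny numbers (the naming here is illustrative, not a specific library's API):

import math
from collections import Counter, defaultdict

data = [
    ("hi", "greeting"),
    ("hello", "greeting"),
    ("hi there", "greeting"),
    ("can I order a pizza", "order_pizza"),
    ("I'd like a burger", "order_burger"),
]

# Word counts per category, shared vocabulary, and sentence counts per category.
counts = defaultdict(Counter)
for sentence, category in data:
    counts[category].update(sentence.lower().split())
vocabulary = {w for sentence, _ in data for w in sentence.lower().split()}
category_sentences = Counter(category for _, category in data)

def predict(sentence):
    words = sentence.lower().split()
    scores = {}
    for category in counts:
        total_words = sum(counts[category].values())
        # log P(category): fraction of training sentences with this intent
        score = math.log(category_sentences[category] / len(data))
        for w in words:
            # log P(word | category) with Laplace smoothing
            score += math.log((counts[category][w] + 1) / (total_words + len(vocabulary)))
        scores[category] = score
    return max(scores, key=scores.get)

print(predict("hi can I order pizza"))   # order_pizza

Working in log space picks the same winner as multiplying the raw probabilities, because the logarithm is monotonically increasing.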