🧐 Naive Bayes Chatbot: Step-by-Step Example


What is Naive Bayes?

Naive Bayes is a simple but powerful machine learning algorithm used for classification tasks. It's called "naive" because it assumes that all features (in our case, words) are independent of each other given the category, which isn't strictly true in real language but works surprisingly well!

🎯 How it Works for Chatbots:

1. Training Phase: The model learns from examples of user messages and their correct categories (intents)

2. Prediction Phase: When a new message comes in, it calculates the probability that the message belongs to each category

3. Decision: It picks the category with the highest probability

Formula:

P(category | words) = P(words | category) × P(category) / P(words)

In simple terms: "What's the chance this message belongs to this category, given the words it contains?"
Why it's perfect for chatbots:
• Fast and efficient for text classification
• Works well with small datasets
• Easy to understand and implement
• Handles new/unseen words gracefully with smoothing
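Before walking through the numbers, here is a minimal sketch (in Python) of how that formula becomes a decision rule. The priors and word_probs values below are purely illustrative placeholders, not numbers taken from any real training data:

# Minimal sketch of the Naive Bayes decision rule (illustrative numbers only).
priors = {"greeting": 0.6, "order_pizza": 0.2, "order_burger": 0.2}  # P(category)
word_probs = {                                                       # P(word | category)
    "greeting":     {"hi": 0.20, "pizza": 0.07},
    "order_pizza":  {"hi": 0.06, "pizza": 0.12},
    "order_burger": {"hi": 0.06, "pizza": 0.06},
}

def predict(words):
    scores = {}
    for category, prior in priors.items():
        score = prior                              # start with P(category)
        for w in words:
            # multiply in P(word | category); unseen words get a tiny default
            score *= word_probs[category].get(w, 0.01)
        scores[category] = score                   # proportional to P(category | words)
    return max(scores, key=scores.get)             # pick the highest score

print(predict(["hi"]))   # greeting

Notice that P(words) from the formula is skipped: it is the same for every category, so it never changes which category wins.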

Let's See It in Action!

Now let's walk through a complete example to see how Naive Bayes predicts the intent of a user message step by step.

Input Sentence

"hi can I order pizza"

Training Data

data = [
    ("hi", "greeting"),
    ("hello", "greeting"),
    ("hi there", "greeting"),
    ("can I order a pizza", "order_pizza"),
    ("I'd like a burger", "order_burger"),
]
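As a quick sketch (assuming simple lowercase whitespace tokenization, so "I'd" stays one token), the frequency table below can be built from this data with a couple of Counters:

from collections import Counter, defaultdict

# Count how often each word appears in each category.
counts = defaultdict(Counter)
for sentence, category in data:
    counts[category].update(sentence.lower().split())

print(counts["greeting"])       # Counter({'hi': 2, 'hello': 1, 'there': 1})
print(counts["order_pizza"])    # Counter({'can': 1, 'i': 1, 'order': 1, 'a': 1, 'pizza': 1})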

Word Frequency Table (Before Smoothing)

Word        Greeting    Order_Pizza    Order_Burger
hi          2           0              0
hello       1           0              0
there       1           0              0
can         0           1              0
i           0           1              0
order       0           1              0
a           0           1              1
pizza       0           1              0
i'd         0           0              1
like        0           0              1
burger      0           0              1

What is Laplace Smoothing? 🤔

Problem: What happens if a word appears in our test sentence but NEVER appears in a training category?

Without smoothing: P(word | category) = 0/total = 0
This makes the entire probability = 0 (since we multiply all probabilities together)

Solution: Add +1 to every word count (even if it's 0) to avoid zero probabilities!

📊 Let's Count Everything:

🔤 All unique words in our vocabulary:

["hi", "hello", "there", "can", "i", "order", "a", "pizza", "i'd", "like", "burger"] → That's 11 unique words total

📝 Words that appear in "greeting" category:

Training sentences for "greeting":
• "hi" → 1 word
• "hello" → 1 word
• "hi there" → 2 words (hi + there)

Total = 1 + 1 + 2 = 4 words
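Both counts are easy to double-check in code. This sketch assumes the data list and tokenization from the earlier snippets are still in scope:

# Verify the vocabulary size and the "greeting" word total.
vocabulary = {w for sentence, _ in data for w in sentence.lower().split()}
greeting_words = [w for sentence, category in data if category == "greeting"
                  for w in sentence.lower().split()]

print(len(vocabulary))       # 11 unique words
print(len(greeting_words))   # 4 words across the "greeting" sentences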

Step-by-Step Calculation for "greeting"

🧮 Laplace Smoothing Formula:

P(word | category) = (word_count_in_category + 1) / (total_words_in_category + vocabulary_size)

For the "greeting" category:
• total_words_in_category = 4
• vocabulary_size = 11
• denominator = 4 + 11 = 15
Why add vocabulary_size to denominator?
Since we add +1 to EVERY word in our vocabulary (11 words), we need to add +11 to the denominator to keep the probabilities valid (they must sum to 1).
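As a sketch, the formula translates directly into a small helper (reusing the counts and vocabulary built in the earlier snippets; the name smoothed_prob is just illustrative):

def smoothed_prob(word, category):
    word_count = counts[category][word]              # Counter returns 0 for unseen words
    total_words = sum(counts[category].values())     # e.g. 4 for "greeting"
    return (word_count + 1) / (total_words + len(vocabulary))

print(round(smoothed_prob("can", "greeting"), 3))    # 0.067 -> (0 + 1) / (4 + 11)

Even "can", which never appears in a greeting sentence, now gets a small non-zero probability instead of 0.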

Input words:

["hi", "can", "i", "order", "pizza"]

Word-by-word Probabilities with Smoothing:

🔍 Let's break down "hi" step by step:

Looking at our frequency table:
• "hi" appears 2 times in the "greeting" category
• With Laplace smoothing: (2 + 1) = 3
• Denominator: (4 + 11) = 15
• P(hi | greeting) = 3/15 = 0.200
Why "hi" = 2 + 1?
• Original count: "hi" appears 2 times in greeting sentences
• Laplace smoothing: Add +1 to avoid zero probabilities
• Final count: 2 + 1 = 3
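The same helper reproduces the whole row of input-word probabilities used in the next step (again a sketch reusing smoothed_prob from above):

for w in ["hi", "can", "i", "order", "pizza"]:
    print(w, round(smoothed_prob(w, "greeting"), 3))
# hi 0.2, then 0.067 for each of the other four words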

Multiply all probabilities (assuming independence):

Total P(words | greeting)
= 0.200 × 0.067 × 0.067 × 0.067 × 0.067
= 0.200 × 0.0000202
≈ 0.00000403

→ A very small probability.
So even though "hi" is a strong match for greeting, the other words don't appear in any training sentences for greeting, so the overall probability becomes very low.
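The multiplication itself is a one-liner (again a sketch reusing smoothed_prob). Using the exact fraction 1/15 instead of the rounded 0.067 gives a slightly smaller result, about 0.0000040:

import math

input_words = ["hi", "can", "i", "order", "pizza"]
likelihood = math.prod(smoothed_prob(w, "greeting") for w in input_words)
print(likelihood)   # ~3.95e-06, i.e. P(words | greeting) is tiny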

Quick Summary: Why Laplace Smoothing?

✅ With Laplace Smoothing:
Every word gets at least a small non-zero probability, so a single unseen word can't wipe out the whole calculation.

❌ Without Laplace Smoothing:
One word that never appears in a category's training sentences gives P(word | category) = 0, which forces the entire product, and therefore the category's score, to 0.

Real-world analogy: It's like giving everyone a "participation trophy" of +1, so nobody gets completely ignored, even if they never showed up to practice! 🏆

Final Prediction

→ The model predicts: order_pizza 🍕
Because words like "order", "pizza", and "can" are strongly associated with that category.
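To tie everything together, here is a compact, self-contained end-to-end sketch of the whole walkthrough. It uses the same whitespace tokenization, takes P(category) as the fraction of training sentences in each category, and works with log-probabilities to avoid multiplying many tiny numbers (the naming here is illustrative, not a specific library's API):

import math
from collections import Counter, defaultdict

data = [
    ("hi", "greeting"),
    ("hello", "greeting"),
    ("hi there", "greeting"),
    ("can I order a pizza", "order_pizza"),
    ("I'd like a burger", "order_burger"),
]

# Word counts per category, shared vocabulary, and sentence counts per category.
counts = defaultdict(Counter)
for sentence, category in data:
    counts[category].update(sentence.lower().split())
vocabulary = {w for sentence, _ in data for w in sentence.lower().split()}
category_sentences = Counter(category for _, category in data)

def predict(sentence):
    words = sentence.lower().split()
    scores = {}
    for category in counts:
        total_words = sum(counts[category].values())
        # log P(category): fraction of training sentences with this intent
        score = math.log(category_sentences[category] / len(data))
        for w in words:
            # log P(word | category) with Laplace smoothing
            score += math.log((counts[category][w] + 1) / (total_words + len(vocabulary)))
        scores[category] = score
    return max(scores, key=scores.get)

print(predict("hi can I order pizza"))   # order_pizza

Working in log space picks the same winner as multiplying the raw probabilities, because the logarithm is monotonically increasing.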