Naive Bayes is a simple but powerful machine learning algorithm used for classification tasks. It's called "naive" because it assumes that all features (words in our case) are independent of each other, which isn't always true in real language but works surprisingly well!
🎯 How it Works for Chatbots:
1. Training Phase: The model learns from examples of user messages and their correct categories (intents)
2. Prediction Phase: When a new message comes in, it calculates the probability that the message belongs to each category
3. Decision: It picks the category with the highest probability
Formula: P(category | words) = P(words | category) × P(category) / P(words)
In simple terms: "What's the chance this message belongs to this category, given the words it contains?"
Why it's perfect for chatbots:
• Fast and efficient for text classification
• Works well with small datasets
• Easy to understand and implement
• Handles new/unseen words gracefully with smoothing
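If you want to see this working end to end before we do the math by hand, scikit-learn bundles the whole algorithm. Here's a minimal sketch (assuming scikit-learn is installed; its built-in tokenizer differs slightly from the hand-worked tables below, but the prediction should still come out as order_pizza):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Training phase: learn word counts for each intent
messages = ["hi", "hello", "hi there", "can I order a pizza", "I'd like a burger"]
intents  = ["greeting", "greeting", "greeting", "order_pizza", "order_burger"]

# MultinomialNB applies Laplace smoothing by default (alpha=1)
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(messages, intents)

# Prediction + decision: pick the intent with the highest probability
print(model.predict(["hi can I order pizza"]))   # ['order_pizza']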
Now let's walk through a complete example to see how Naive Bayes predicts the intent of a user message step by step.
Test message: "hi can I order pizza"

Training data (each example pairs a user message with its intent):

data = [
    ("hi", "greeting"),
    ("hello", "greeting"),
    ("hi there", "greeting"),
    ("can I order a pizza", "order_pizza"),
    ("I'd like a burger", "order_burger")
]
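The word-frequency table below can be built in a couple of lines. This sketch uses the data list above and a simple lowercase-and-split tokenizer (the same tokenization the hand-worked example assumes):

from collections import Counter, defaultdict

counts = defaultdict(Counter)          # counts[intent][word] = how often word appears in that intent
for message, intent in data:
    counts[intent].update(message.lower().split())

print(dict(counts["greeting"]))        # {'hi': 2, 'hello': 1, 'there': 1}
print(dict(counts["order_burger"]))    # {"i'd": 1, 'like': 1, 'a': 1, 'burger': 1}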
Word counts per category:

Word   | Greeting | Order_Pizza | Order_Burger |
hi     | 2        | 0           | 0            |
hello  | 1        | 0           | 0            |
there  | 1        | 0           | 0            |
can    | 0        | 1           | 0            |
i      | 0        | 1           | 0            |
order  | 0        | 1           | 0            |
a      | 0        | 1           | 1            |
pizza  | 0        | 1           | 0            |
i'd    | 0        | 0           | 1            |
like   | 0        | 0           | 1            |
burger | 0        | 0           | 1            |
Problem: What happens if a word appears in our test sentence but NEVER appears in a training category?
Without smoothing: P(word | category) = 0/total = 0
This makes the entire probability = 0 (since we multiply all probabilities together)
Solution: Add +1 to every word count (even if it's 0) to avoid zero probabilities!
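Here's the problem in plain numbers, using the greeting category for our test sentence (the smoothed denominators are explained step by step below):

# Without smoothing: "can", "i", "order", "pizza" never appear in greeting examples
p_unsmoothed = (2/4) * (0/4) * (0/4) * (0/4) * (0/4)
print(p_unsmoothed)   # 0.0 -- one zero factor wipes out the whole product

# With Laplace smoothing: every factor stays above zero
p_smoothed = (3/15) * (1/15) * (1/15) * (1/15) * (1/15)
print(p_smoothed)     # ≈ 3.95e-06 -- tiny, but not zero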
📊 Let's Count Everything:
🔤 All unique words in our vocabulary:
["hi", "hello", "there", "can", "i", "order", "a", "pizza", "i'd", "like", "burger"]
→ That's 11 unique words total
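As a quick sanity check, the vocabulary can be collected straight from the data list (same lowercase-and-split tokenization as the counting sketch above):

vocabulary = set()
for message, intent in data:
    vocabulary.update(message.lower().split())

print(len(vocabulary))      # 11
print(sorted(vocabulary))   # ['a', 'burger', 'can', 'hello', 'hi', 'i', "i'd", 'like', 'order', 'pizza', 'there']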
📝 Words that appear in "greeting" category:
Training sentences for "greeting":
- "hi" → 1 word
- "hello" → 1 word
- "hi there" → 2 words (hi + there)
Total = 1 + 1 + 2 = 4 words
🧮 Laplace Smoothing Formula:
P(word | category) = (word_count_in_category + 1) / (total_words_in_category + vocabulary_size)
For "greeting" category:
- total_words_in_category = 4
- vocabulary_size = 11
- denominator = 4 + 11 = 15
Why add vocabulary_size to denominator?
Since we add +1 to EVERY word in our vocabulary (11 words), we need to add +11 to the denominator to keep the probabilities valid (they must sum to 1).
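That formula fits in a small helper function. This sketch builds on the counts and vocabulary snippets above:

def laplace_prob(word, category):
    """P(word | category) with Laplace (add-one) smoothing."""
    word_count = counts[category][word]              # 0 if the word was never seen in this category
    total_words = sum(counts[category].values())     # e.g. 4 for "greeting"
    return (word_count + 1) / (total_words + len(vocabulary))

print(laplace_prob("hi", "greeting"))      # 3/15 = 0.2
print(laplace_prob("pizza", "greeting"))   # 1/15 ≈ 0.067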
["hi", "can", "i", "order", "pizza"]
🔍 Let's break down "hi" step by step:
Looking at our frequency table:
- "hi" appears 2 times in "greeting" category
- With Laplace smoothing: (2 + 1) = 3
- Denominator: (4 + 11) = 15
- P(hi | greeting) = 3/15 = 0.200
Why "hi" = 2 + 1?
• Original count: "hi" appears 2 times in greeting sentences
• Laplace smoothing: Add +1 to avoid zero probabilities
• Final count: 2 + 1 = 3
- hi: (2 + 1) / 15 = 3 / 15 = 0.200 ← appears 2 times in "greeting"
- can: (0 + 1) / 15 = 1 / 15 ≈ 0.067 ← never appears in "greeting"
- i: (0 + 1) / 15 = 1 / 15 ≈ 0.067 ← never appears in "greeting"
- order: (0 + 1) / 15 = 1 / 15 ≈ 0.067 ← never appears in "greeting"
- pizza: (0 + 1) / 15 = 1 / 15 ≈ 0.067 ← never appears in "greeting"
Total P(words | greeting) =
0.200 × 0.067 × 0.067 × 0.067 × 0.067
= 0.200 × 0.00002016 ≈ 0.00000403
→ Very small probability
So even though "hi" is a strong match for greeting
, the other words don't appear in any training sentences for greeting
, so the overall probability becomes very low.
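Here's that product in code, reusing laplace_prob from above (math.prod needs Python 3.8+). The exact result is ≈ 0.00000395; the 0.00000403 above comes from rounding 1/15 to 0.067 before multiplying:

import math

test_words = ["hi", "can", "i", "order", "pizza"]
p_words_given_greeting = math.prod(laplace_prob(w, "greeting") for w in test_words)
print(p_words_given_greeting)   # ≈ 3.95e-06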
✅ With Laplace Smoothing:
- ✓ No word gets 0 probability
- ✓ Model can handle new/unseen words
- ✓ More robust predictions
❌ Without Laplace Smoothing:
- ✗ Any unseen word = 0 probability
- ✗ Entire sentence probability becomes 0
- ✗ Model fails on new data
Real-world analogy: It's like giving everyone a "participation trophy" of +1, so nobody gets completely ignored, even if they never showed up to practice! 🏆
→ The model predicts: order_pizza 🍕
Running the same smoothed calculation for the other categories, order_pizza scores highest, because words like "order", "pizza", and "can" are strongly associated with that category.
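To watch that decision happen, here's a sketch that runs the same smoothed calculation for every category, multiplies in the prior P(category) from the training data, and picks the winner (it reuses data, test_words, math, and laplace_prob from the snippets above):

categories = ["greeting", "order_pizza", "order_burger"]

scores = {}
for category in categories:
    # Prior: what fraction of training examples have this intent?
    prior = sum(1 for _, intent in data if intent == category) / len(data)
    # Likelihood: product of smoothed P(word | category) over the test words
    likelihood = math.prod(laplace_prob(w, category) for w in test_words)
    scores[category] = prior * likelihood

for category, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{category}: {score:.2e}")

# order_pizza:  3.05e-06  <- highest, so this is the prediction
# greeting:     2.37e-06
# order_burger: 2.63e-07

Even though greeting has the biggest prior (3 of the 5 training examples), the strong matches on "can", "order", and "pizza" push order_pizza ahead.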