Naive Bayes is a simple but powerful machine learning algorithm used for classification tasks.
It's called "naive" because it assumes that all features (words, in our case) are independent of each other, which isn't always true in real language but works surprisingly well!
🎯 How it Works for Chatbots:
1. Training Phase: The model learns from examples of user messages and their correct categories (intents)
2. Prediction Phase: When a new message comes in, it calculates the probability that the message belongs to each category
3. Decision: It picks the category with the highest probability
Why it's perfect for chatbots:
• Fast and efficient for text classification
• Works well with small datasets
• Easy to understand and implement
• Handles new/unseen words gracefully with smoothing
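Here is a minimal sketch of this train-then-predict loop. The article doesn't tie itself to any particular library, so treat the use of scikit-learn's `CountVectorizer` and `MultinomialNB` below as just one convenient way to try the idea out:

```python
# A minimal sketch of the training and prediction phases using scikit-learn.
# (The library choice is an assumption; any Naive Bayes implementation works.)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Training phase: example messages and their intents
messages = ["hi", "hello", "hi there", "can I order a pizza", "I'd like a burger"]
intents = ["greeting", "greeting", "greeting", "order_pizza", "order_burger"]

model = make_pipeline(CountVectorizer(), MultinomialNB())  # alpha=1.0 -> Laplace smoothing
model.fit(messages, intents)

# Prediction phase + decision: score a new message and pick the most probable intent
new_message = ["hi can I order pizza"]
print(model.predict(new_message))        # ['order_pizza']
print(model.predict_proba(new_message))  # probability for every intent
```

Note that scikit-learn's defaults differ slightly from the hand calculation that follows: its tokenizer drops one-letter words like "i" and "a", and it estimates class priors from the training data rather than assuming they are equal, so the exact percentages won't match even though the predicted intent does.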
Now let's walk through a complete example to see how Naive Bayes predicts the intent of a user message, step by step. Our test message:
"hi can I order pizza"
```python
# Training data: (message, intent) pairs
data = [
    ("hi", "greeting"),
    ("hello", "greeting"),
    ("hi there", "greeting"),
    ("can I order a pizza", "order_pizza"),
    ("I'd like a burger", "order_burger"),
]
```
| Word | Greeting | Order_Pizza | Order_Burger |
|--------|----------|-------------|--------------|
| hi | 2 | 0 | 0 |
| hello | 1 | 0 | 0 |
| there | 1 | 0 | 0 |
| can | 0 | 1 | 0 |
| i | 0 | 1 | 0 |
| order | 0 | 1 | 0 |
| a | 0 | 1 | 1 |
| pizza | 0 | 1 | 0 |
| i'd | 0 | 0 | 1 |
| like | 0 | 0 | 1 |
| burger | 0 | 0 | 1 |
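The table above is nothing more than per-category word counts, so it can be rebuilt in a few lines of plain Python (a sketch that reuses the `data` list from above):

```python
from collections import Counter, defaultdict

# Count how often each (lowercased) word appears in each category
word_counts = defaultdict(Counter)
for sentence, intent in data:
    word_counts[intent].update(sentence.lower().split())

print(word_counts["greeting"])      # Counter({'hi': 2, 'hello': 1, 'there': 1})
print(word_counts["order_burger"])  # Counter({"i'd": 1, 'like': 1, 'a': 1, 'burger': 1})
```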
Problem: What happens if a word appears in our test sentence but NEVER appears in a training category?
Without smoothing: \(P(\text{word} \mid \text{category}) = \frac{0}{\text{total}} = 0\)
This makes the entire probability = 0 (since we multiply all probabilities together).
Solution: Add +1 to every word count (even if it's 0) to avoid zero probabilities!
📊 Let's Count Everything:
🔤 All unique words in our vocabulary:
["hi", "hello", "there", "can", "i", "order", "a", "pizza", "i'd", "like", "burger"]
→ That's 11 unique words total
🔢 Total words in each category:
- "greeting" total words: 4 (hi, hello, hi, there)
- "order_pizza" total words: 5 (can, I, order, a, pizza)
- "order_burger" total words: 4 (I'd, like, a, burger)
Why add vocabulary_size to the denominator?
Since we add +1 to EVERY word in our vocabulary (11 words), we need to add +11 to the denominator to keep the probabilities valid (they must sum to 1).
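Putting the formula into code makes the +1 / +11 bookkeeping concrete. This is a sketch that continues from the counting snippet above (it reuses `data` and `word_counts`):

```python
# Laplace-smoothed likelihood: (count + 1) / (category_total + vocabulary_size)
vocabulary = {word for sentence, _ in data for word in sentence.lower().split()}
vocab_size = len(vocabulary)  # 11
category_totals = {cat: sum(counts.values()) for cat, counts in word_counts.items()}

def p_word_given_category(word, category):
    count = word_counts[category][word]  # 0 if the word never appeared in this category
    return (count + 1) / (category_totals[category] + vocab_size)

print(p_word_given_category("hi", "greeting"))     # (2 + 1) / (4 + 11) = 0.200
print(p_word_given_category("pizza", "greeting"))  # (0 + 1) / (4 + 11) ≈ 0.067
```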
["hi", "can", "i", "order", "pizza"]
🔍 Let's break down "hi" step by step:
Why "hi" = 2 + 1?
โข Original count: "hi" appears 2 times in greeting sentences
โข Laplace smoothing: Add +1 to avoid zero probabilities
โข Final count: 2 + 1 = 3
\(P(\text{word} \mid \text{greeting})\), with denominator = 4 + 11 = 15:
- hi: \(\frac{2 + 1}{15} = \frac{3}{15} = \mathbf{0.200}\) ← appears 2 times in "greeting"
- can: \(\frac{0 + 1}{15} = \frac{1}{15} \approx \mathbf{0.067}\) ← never appears in "greeting"
- i: \(\frac{0 + 1}{15} = \frac{1}{15} \approx \mathbf{0.067}\) ← never appears in "greeting"
- order: \(\frac{0 + 1}{15} = \frac{1}{15} \approx \mathbf{0.067}\) ← never appears in "greeting"
- pizza: \(\frac{0 + 1}{15} = \frac{1}{15} \approx \mathbf{0.067}\) ← never appears in "greeting"
So even though "hi" is a strong match for greeting, the other words don't appear in any greeting training sentence, so the overall likelihood is the product \(0.200 \times 0.067^4 \approx 0.000004\), which is very low.
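Continuing the sketch from above, the same product can be checked with `math.prod`:

```python
import math

test_words = "hi can i order pizza".split()
greeting_score = math.prod(p_word_given_category(w, "greeting") for w in test_words)
print(greeting_score)  # ≈ 0.00000395, i.e. roughly 0.000004
```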
\(P(\text{word} \mid \text{order\_burger})\), with denominator = 4 + 11 = 15:
- hi: \(\frac{0 + 1}{15} = \frac{1}{15} \approx \mathbf{0.067}\)
- can: \(\frac{0 + 1}{15} = \frac{1}{15} \approx \mathbf{0.067}\)
- i: \(\frac{0 + 1}{15} = \frac{1}{15} \approx \mathbf{0.067}\) (Note: "i" is not in "I'd like a burger")
- order: \(\frac{0 + 1}{15} = \frac{1}{15} \approx \mathbf{0.067}\)
- pizza: \(\frac{0 + 1}{15} = \frac{1}{15} \approx \mathbf{0.067}\)
Multiplying these five values gives roughly \(0.067^5 \approx 0.0000013\).
\(P(\text{word} \mid \text{order\_pizza})\), with denominator = 5 + 11 = 16:
- hi: \(\frac{0 + 1}{16} = \frac{1}{16} = \mathbf{0.0625}\)
- can: \(\frac{1 + 1}{16} = \frac{2}{16} = \mathbf{0.125}\)
- i: \(\frac{1 + 1}{16} = \frac{2}{16} = \mathbf{0.125}\)
- order: \(\frac{1 + 1}{16} = \frac{2}{16} = \mathbf{0.125}\)
- pizza: \(\frac{1 + 1}{16} = \frac{2}{16} = \mathbf{0.125}\)
✏️ Now You Try: Find the Final Probability for order_pizza
Use the same input: "hi can I order pizza"
Hints:
- Total words in order_pizza = 5
- Vocabulary size = 11
- Denominator = 5 + 11 = 16
- P(word|order_pizza) = (count + 1) / 16
- Use word counts from the training sentence: "can I order a pizza"
What is the final probability for order_pizza?
Check your answer
It is \(0.0625 \times 0.125^4 \approx 0.00001526\)! This is much higher than the greeting likelihood (≈ 0.000004) and the burger likelihood (≈ 0.0000013).
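Continuing the earlier sketch, all three raw likelihoods can be verified in one loop:

```python
import math

test_words = "hi can i order pizza".split()
for category in ["greeting", "order_pizza", "order_burger"]:
    score = math.prod(p_word_given_category(w, category) for w in test_words)
    print(f"{category}: {score:.8f}")
# Output (rounded):
#   greeting:     0.00000395
#   order_pizza:  0.00001526
#   order_burger: 0.00000132
```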
✅ With Laplace Smoothing:
- ✅ No word gets 0 probability
- ✅ Model can handle new/unseen words
- ✅ More robust predictions
❌ Without Laplace Smoothing:
- ❌ Any unseen word = 0 probability
- ❌ Entire sentence probability becomes 0
- ❌ Model fails on new data
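To see that failure mode concretely, here is the same calculation with the +1 removed (again continuing the sketch above, which defined `word_counts` and `category_totals`):

```python
def raw_likelihood(word, category):
    # No +1 and no vocabulary_size: unseen words get probability 0
    return word_counts[category][word] / category_totals[category]

score = 1.0
for word in "hi can i order pizza".split():
    score *= raw_likelihood(word, "greeting")
print(score)  # 0.0, because "can", "i", "order", "pizza" never appear in greeting examples
```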
Real-world analogy: It's like giving everyone a "participation trophy" of +1, so nobody gets completely ignored, even if they never showed up to practice! 🏆
Now that we have the raw scores (likelihoods), we need to turn them into percentages. This process is called normalization.
1. Apply Prior Probabilities
Since we have 3 possible categories (Greeting, Order_Pizza, Order_Burger), this walkthrough uses a uniform "prior" of 1/3 (or 0.3333) for each, so every likelihood from the previous step is multiplied by 1/3. (In practice, priors are usually estimated from the training data, here 3/5, 1/5, and 1/5, but a uniform prior keeps the arithmetic simple and cancels out during normalization anyway.)
Multiplying each likelihood by 1/3 gives: Greeting ≈ 0.0000013169, Order_Pizza ≈ 0.0000050863, Order_Burger ≈ 0.0000004390.
2. Divide by Total (Normalization)
Divide each value by the total sum to get the final probability percentage:
- Greeting: 0.0000013169 ÷ 0.0000068421 ≈ 19.2%
- Order_Pizza: 0.0000050863 ÷ 0.0000068421 ≈ 74.3%
- Order_Burger: 0.0000004390 ÷ 0.0000068421 ≈ 6.4%
✅ The model predicts: order_pizza (74.3%) 🍕
The model is 74.3% confident that the message "hi can I order pizza" belongs to the order_pizza category, based on the word frequencies it learned during training!
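The whole prior-plus-normalization step, continuing the same sketch:

```python
import math

def likelihood(message, category):
    return math.prod(p_word_given_category(w, category) for w in message.lower().split())

message = "hi can I order pizza"
prior = 1 / 3  # uniform prior over the three intents, as assumed in this walkthrough
scores = {cat: prior * likelihood(message, cat) for cat in word_counts}
total = sum(scores.values())
for cat, score in scores.items():
    print(f"{cat}: {score / total:.1%}")
# greeting: 19.2%, order_pizza: 74.3%, order_burger: 6.4%
```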
Final Percentage Breakdown:
• 🍕 Order Pizza: 74.3%
• 👋 Greeting: 19.2%
• 🍔 Order Burger: 6.4%
📚 Learn More AI Concepts
If you enjoyed this, check out our guide on TF-IDF Explained to see another way computers understand text!