🧠 Naive Bayes Explained: Step-by-Step Chatbot Intent Prediction

Machine Learning Tutorial
8 min read
AI, Chatbot, Probability

What is Naive Bayes?

Naive Bayes is a simple but powerful machine learning algorithm used for classification tasks. It's called "naive" because it assumes that all features (the words in a message, in our case) are independent of each other given the category, which isn't always true in real language but works surprisingly well!

🎯 How it Works for Chatbots:

1. Training Phase: The model learns from examples of user messages and their correct categories (intents)

2. Prediction Phase: When a new message comes in, it calculates the probability that the message belongs to each category

3. Decision: It picks the category with the highest probability
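
To make these three phases concrete, here is a minimal sketch of the same loop using scikit-learn's CountVectorizer and MultinomialNB (an assumed setup, not part of this tutorial's hand calculation; scikit-learn's default tokenization and priors differ slightly from the worked example below, so its numbers won't match exactly):

```python
# Minimal train -> predict -> decide loop with scikit-learn (assumed dependency).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

messages = ["hi", "hello", "hi there", "can I order a pizza", "I'd like a burger"]
intents  = ["greeting", "greeting", "greeting", "order_pizza", "order_burger"]

# 1. Training phase: learn word counts per intent
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(messages, intents)

# 2. Prediction phase: probability of each intent for a new message
probs = model.predict_proba(["hi can I order pizza"])[0]
print(dict(zip(model.classes_, probs.round(3))))

# 3. Decision: pick the intent with the highest probability
print("Predicted intent:", model.predict(["hi can I order pizza"])[0])
```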

Bayes' Theorem Formula
\[P(\text{category} \mid \text{words}) = \frac{P(\text{words} \mid \text{category}) \times P(\text{category})}{P(\text{words})}\]
In simple terms: "What's the chance this message belongs to this category, given the words it contains?"
Why it's perfect for chatbots:
• Fast and efficient for text classification
• Works well with small datasets
• Easy to understand and implement
• Handles new/unseen words gracefully with smoothing

Let's See It in Action!

Now let's walk through a complete example to see how Naive Bayes predicts the intent of a user message step by step.

Input Sentence

"hi can I order pizza"

Training Data

data = [
    ("hi", "greeting"),
    ("hello", "greeting"),
    ("hi there", "greeting"),
    ("can I order a pizza", "order_pizza"),
    ("I'd like a burger", "order_burger"),
]

Word Frequency Table (Before Smoothing)

Word      Greeting   Order_Pizza   Order_Burger
hi           2            0             0
hello        1            0             0
there        1            0             0
can          0            1             0
i            0            1             0
order        0            1             0
a            0            1             1
pizza        0            1             0
i'd          0            0             1
like         0            0             1
burger       0            0             1
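
If you want to reproduce this table yourself, here is a short sketch (plain Python, with lowercasing and whitespace tokenization assumed):

```python
# Rebuild the word frequency table from the training data.
from collections import Counter, defaultdict

data = [("hi", "greeting"), ("hello", "greeting"), ("hi there", "greeting"),
        ("can I order a pizza", "order_pizza"), ("I'd like a burger", "order_burger")]

counts = defaultdict(Counter)            # counts[category][word] = frequency
for sentence, category in data:
    counts[category].update(sentence.lower().split())

for category, words in counts.items():
    print(category, dict(words))
# greeting {'hi': 2, 'hello': 1, 'there': 1}
# order_pizza {'can': 1, 'i': 1, 'order': 1, 'a': 1, 'pizza': 1}
# order_burger {"i'd": 1, 'like': 1, 'a': 1, 'burger': 1}
```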

What is Laplace Smoothing? 🤔

Problem: What happens if a word appears in our test sentence but NEVER appears in a training category?

Without smoothing: \(P(\text{word} \mid \text{category}) = \frac{0}{\text{total}} = 0\)
This makes the entire probability = 0 (since we multiply all probabilities together)

Solution: Add +1 to every word count (even if it's 0) to avoid zero probabilities!

📊 Let's Count Everything:

🔤 All unique words in our vocabulary:

["hi", "hello", "there", "can", "i", "order", "a", "pizza", "i'd", "like", "burger"] → That's 11 unique words total

📝 Total words in each category:

- "greeting" total words: 4 (hi, hello, hi, there)
- "order_pizza" total words: 5 (can, I, order, a, pizza)
- "order_burger" total words: 4 (I'd, like, a, burger)
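
The same tokenization gives these counts in code (a self-contained check, reusing the data list from above):

```python
# Count the vocabulary size and the total number of words per category.
data = [("hi", "greeting"), ("hello", "greeting"), ("hi there", "greeting"),
        ("can I order a pizza", "order_pizza"), ("I'd like a burger", "order_burger")]

tokens = {}                               # category -> list of tokens
for sentence, category in data:
    tokens.setdefault(category, []).extend(sentence.lower().split())

vocabulary = {word for words in tokens.values() for word in words}
print("vocabulary size:", len(vocabulary))                    # 11
print({category: len(words) for category, words in tokens.items()})
# {'greeting': 4, 'order_pizza': 5, 'order_burger': 4}
```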

Step-by-Step Calculation for "greeting"

🧮 Laplace Smoothing Formula
\[P(\text{word} \mid \text{category}) = \frac{\text{count}(\text{word in category}) + 1}{\text{total words in category} + |\text{vocabulary}|}\]
For "greeting" category:
• total_words_in_category = 4
• vocabulary_size = 11
• denominator = 4 + 11 = 15
Why add vocabulary_size to denominator?
Since we add +1 to EVERY word in our vocabulary (11 words), we need to add +11 to the denominator to keep the probabilities valid (they must sum to 1).
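
In code, the smoothed probability is a one-liner. Here is a hypothetical helper (the name smoothed_prob and the hard-coded counts are just for illustration, copied from the frequency table above):

```python
# Laplace-smoothed P(word | category), mirroring the formula above.
VOCAB_SIZE = 11
counts = {"greeting":     {"hi": 2, "hello": 1, "there": 1},
          "order_pizza":  {"can": 1, "i": 1, "order": 1, "a": 1, "pizza": 1},
          "order_burger": {"i'd": 1, "like": 1, "a": 1, "burger": 1}}
totals = {"greeting": 4, "order_pizza": 5, "order_burger": 4}

def smoothed_prob(word, category):
    # (count of word in category + 1) / (total words in category + vocabulary size)
    return (counts[category].get(word, 0) + 1) / (totals[category] + VOCAB_SIZE)

print(smoothed_prob("hi", "greeting"))      # (2 + 1) / (4 + 11) = 0.2
print(smoothed_prob("pizza", "greeting"))   # (0 + 1) / (4 + 11) ≈ 0.067
```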

Input words:

["hi", "can", "i", "order", "pizza"]

Word-by-word Probabilities with Smoothing:

🔍 Let's break down "hi" step by step:

Looking at our frequency table:
• "hi" appears 2 times in "greeting" category
• With Laplace smoothing: (2 + 1) = 3
• Denominator: (4 + 11) = 15
\[P(\text{hi} \mid \text{greeting}) = \frac{2 + 1}{4 + 11} = \frac{3}{15} = 0.200\]
Why "hi" = 2 + 1?
• Original count: "hi" appears 2 times in greeting sentences
• Laplace smoothing: Add +1 to avoid zero probabilities
• Final count: 2 + 1 = 3

Multiply all probabilities (assuming independence):

Final Probability Calculation
\[\begin{align} P(\text{words} \mid \text{greeting}) &= P(\text{hi}) \times P(\text{can}) \times P(\text{i}) \times P(\text{order}) \times P(\text{pizza}) \\ &= \tfrac{3}{15} \times \tfrac{1}{15} \times \tfrac{1}{15} \times \tfrac{1}{15} \times \tfrac{1}{15} \\ &\approx 0.200 \times 0.067 \times 0.067 \times 0.067 \times 0.067 \\ &\approx \mathbf{0.00000395} \end{align}\]
→ Very small probability
So even though "hi" is a strong match for greeting, the other four words never appear in any greeting training sentence, so each contributes only the smoothed minimum of 1/15 and the overall probability ends up very low (~0.000004).
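
To see where the ~0.000004 comes from without rounding error, you can multiply the exact fractions (a quick check using Python's fractions module):

```python
# Exact product of the smoothed probabilities for the "greeting" category.
from fractions import Fraction

p_greeting = Fraction(3, 15) * Fraction(1, 15) ** 4   # hi, then can/i/order/pizza
print(p_greeting, float(p_greeting))                   # 1/253125 ≈ 3.95e-06
```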

Step-by-Step Calculation for "order_burger"

For "order_burger" category:
• total_words_in_category = 4
• vocabulary_size = 11
• denominator = 4 + 11 = 15
None of the five input words (hi, can, i, order, pizza) appears in the order_burger training sentence, so each one gets the smoothed minimum of (0 + 1) / 15 ≈ 0.067.
Final Probability for Order_Burger
\[\begin{align} P(\text{words} \mid \text{order\_burger}) &= \tfrac{1}{15} \times \tfrac{1}{15} \times \tfrac{1}{15} \times \tfrac{1}{15} \times \tfrac{1}{15} \\ &\approx 0.067 \times 0.067 \times 0.067 \times 0.067 \times 0.067 \\ &\approx \mathbf{0.00000132} \end{align}\]

Step-by-Step Calculation for "order_pizza"

For "order_pizza" category:
• total_words_in_category = 5
• vocabulary_size = 11
• denominator = 5 + 11 = 16
The words "can", "i", "order", and "pizza" each appear once in the order_pizza training sentence, giving (1 + 1) / 16 = 0.125 each, while "hi" never appears, giving (0 + 1) / 16 = 0.0625.
Final Probability for Order_Pizza
\[\begin{align} P(\text{words} \mid \text{order\_pizza}) &= \tfrac{1}{16} \times \tfrac{2}{16} \times \tfrac{2}{16} \times \tfrac{2}{16} \times \tfrac{2}{16} \\ &= 0.0625 \times 0.125 \times 0.125 \times 0.125 \times 0.125 \\ &\approx \mathbf{0.00001526} \end{align}\]
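
All three likelihoods follow the same recipe, so a small function can reproduce them in one go (a self-contained sketch; the likelihood name and hard-coded counts are just for illustration):

```python
# Likelihood of a message under each category, with Laplace smoothing.
from math import prod

VOCAB_SIZE = 11
counts = {"greeting":     {"hi": 2, "hello": 1, "there": 1},
          "order_pizza":  {"can": 1, "i": 1, "order": 1, "a": 1, "pizza": 1},
          "order_burger": {"i'd": 1, "like": 1, "a": 1, "burger": 1}}
totals = {"greeting": 4, "order_pizza": 5, "order_burger": 4}

def likelihood(words, category):
    return prod((counts[category].get(word, 0) + 1) / (totals[category] + VOCAB_SIZE)
                for word in words)

message = "hi can i order pizza".split()
for category in counts:
    print(category, likelihood(message, category))
# greeting     ≈ 0.00000395
# order_pizza  ≈ 0.00001526
# order_burger ≈ 0.00000132
```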

โœ๏ธ Now You Try: Find the Probability for order_pizza

Use the same input: "hi can I order pizza"

Hints:
• The denominator for order_pizza is 5 + 11 = 16.
• The words "can", "i", "order", and "pizza" each appear once in this category; "hi" does not appear at all.

What is the final probability for order_pizza?

Check your answer
It is 0.00001526! This is much higher than the greeting category (0.00000395) and the burger category (0.00000132).

Quick Summary: Why Laplace Smoothing?

✅ With Laplace Smoothing: every word keeps a small non-zero probability, so an unseen word only lowers the score instead of wiping it out.

❌ Without Laplace Smoothing: a single word that never appeared in a category's training data drives the whole product to 0, no matter how well the other words match.

Real-world analogy: It's like giving everyone a "participation trophy" of +1, so nobody gets completely ignored, even if they never showed up to practice! 🏆

Normalizing the Results (The Final Step)

Now that we have the raw scores (likelihoods), we need to turn them into percentages. This process is called normalization.

1. Apply Prior Probabilities

Since we have 3 possible categories (Greeting, Order_Pizza, Order_Burger), this example uses a uniform "prior" of 1/3 (or 0.3333) for each. (Many Naive Bayes implementations instead estimate priors from the class frequencies in the training data, which here would be 3/5, 1/5, and 1/5; we keep them uniform so the arithmetic stays simple.)

• Greeting: 0.00000395 × 0.3333 = 0.0000013167
• Order_Pizza: 0.00001526 × 0.3333 = 0.0000050867
• Order_Burger: 0.00000132 × 0.3333 = 0.0000004400

Total Sum: 0.0000013167 + 0.0000050867 + 0.0000004400 = 0.0000068434

2. Divide by Total (Normalization)

Divide each value by the total sum to get the final probability percentage. For example: Order_Pizza = 0.0000050867 / 0.0000068434 ≈ 0.743 → 74.3%.

Final Prediction

→ The model predicts: order_pizza (74.3%) 🍕
The model is 74.3% confident that the message "hi can I order pizza" belongs to the order_pizza category, based on the word frequencies it learned during training!

Final Percentage Breakdown:
• 🍕 Order Pizza: 74.3%
• 👋 Greeting: 19.2%
• 🍔 Order Burger: 6.4%
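
Here is the same normalization step in code (a short sketch using the likelihoods computed earlier and the uniform 1/3 priors):

```python
# Apply uniform priors and normalize the raw scores into percentages.
likelihoods = {"greeting": 3.95e-06, "order_pizza": 1.526e-05, "order_burger": 1.32e-06}
prior = 1 / 3

scores = {category: lik * prior for category, lik in likelihoods.items()}
total = sum(scores.values())
posteriors = {category: score / total for category, score in scores.items()}

for category, p in sorted(posteriors.items(), key=lambda item: -item[1]):
    print(f"{category}: {p:.1%}")
# order_pizza: 74.3%
# greeting: 19.2%
# order_burger: 6.4%
```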

🔗 Learn More AI Concepts

If you enjoyed this, check out our guide on TF-IDF Explained to see another way computers understand text!