Naive Bayes: Text Classification Example

Text Classification with Naive Bayes and Laplace Smoothing

Introduction to Text Classification

  • The video introduces a text classification example using the Naive Bayes algorithm, with a focus on Laplace smoothing, and contrasts it with the familiar categorical-data setting.

Dataset Overview

  • The dataset consists of Chinese and Japanese documents, where words serve as features. In contrast to typical categorical data (e.g., Outlook, humidity), here the actual words are the features.

Class Definitions

  • Two classes are defined: 'C' for Chinese documents and 'J' for Japanese documents. The goal is to classify a test example based on these classes.

Probability Calculations

  • To determine the class of a test document containing both Chinese and Japanese words, initial probabilities P(C) and P(J) are calculated based on document counts.

Prior Probabilities

  • The prior probability P(C) is calculated as 3/4 (three Chinese documents out of four total), while P(J) is 1/4 (one Japanese document).
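
As a quick check, the priors follow directly from the document counts (a minimal sketch, assuming the four-document training set described above):

```python
# Priors from document counts: 3 Chinese (C) and 1 Japanese (J) document.
n_c, n_j = 3, 1
n_total = n_c + n_j

p_c = n_c / n_total  # P(C) = 3/4
p_j = n_j / n_total  # P(J) = 1/4
print(p_c, p_j)  # 0.75 0.25
```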

Feature Probability Calculation for Class C

  • To calculate P("Chinese" | C):
  • Count occurrences of "Chinese" in class C: five instances.
  • Apply Laplace smoothing by adding one to the count.

Total Words in Class C

  • There are eight total words across all three Chinese documents. The smoothed denominator adds the vocabulary size (the number of unique words across all documents) to this total.
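
Putting these counts together, the add-one estimate works out to 3/7. The sketch below assumes a vocabulary of 6 unique words across all documents, which is consistent with the 3/7 value used later:

```python
count_chinese_c = 5   # occurrences of "Chinese" in class C
total_words_c = 8     # total words across the three C documents
vocab_size = 6        # assumed: unique words across all documents

# Laplace (add-one) smoothing: (count + 1) / (total + |V|)
p_chinese_given_c = (count_chinese_c + 1) / (total_words_c + vocab_size)
print(p_chinese_given_c)  # 6/14 = 3/7 ≈ 0.4286
```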

Feature Probability Calculation for Class J

  • To calculate P("Chinese" | J):
  • "Chinese" occurs once in the single available Japanese document.
  • Total word count across all Japanese documents is three.
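
Under the same assumptions (vocabulary size 6), the smoothed estimate for class J comes out to 2/9:

```python
count_chinese_j = 1   # "Chinese" occurs once in the J document
total_words_j = 3     # total words in the single J document
vocab_size = 6        # assumed: unique words across all documents

p_chinese_given_j = (count_chinese_j + 1) / (total_words_j + vocab_size)
print(p_chinese_given_j)  # 2/9 ≈ 0.2222
```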

Finalizing Probabilities

  • After applying Laplace smoothing:
  • Calculate probabilities such as P("Tokyo" | C). Since "Tokyo" does not occur in any Chinese document, its raw count is zero, and smoothing is needed so a single unseen word does not force the whole probability to zero.

Importance of Laplace Smoothing

  • Laplace smoothing ensures that even unseen features yield non-zero probabilities, allowing meaningful comparisons between different feature probabilities without skewing results due to zero values.
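
The effect is easy to see for the unseen word "Tokyo" in class C (a sketch, reusing the counts assumed above):

```python
count_tokyo_c = 0  # "Tokyo" never appears in any C document
total_words_c = 8  # total words across the three C documents
vocab_size = 6     # assumed: unique words across all documents

p_unsmoothed = count_tokyo_c / total_words_c  # 0.0 -> zeroes the whole product
p_smoothed = (count_tokyo_c + 1) / (total_words_c + vocab_size)  # 1/14
print(p_unsmoothed, p_smoothed)
```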

Conclusion Steps

  • After calculating the necessary probabilities for each word in the test document, including P("Tokyo" | J), these values are multiplied together with the prior probabilities to arrive at a final classification decision.

Understanding Probabilities in Document Classification

Calculating Probabilities for Document D5

  • The discussion turns to computing the posterior P(C | D5) for test document D5, which requires conditional probabilities such as P(Chinese | C).
  • For the first word, P(Chinese | C) = 3/7. This value is raised to the power of 3 for the three occurrences of "Chinese", while P(Tokyo | C) and P(Japan | C) are each 1/14.
  • The same procedure yields P(J | D5), using the corresponding conditional probabilities P(Chinese | J) and those of the remaining words.
  • For class J, each smoothed word probability is 2/9, with P(Chinese | J) again raised to the power of 3. Substituting these into the formula shows that the probability for class C given D5 is higher, so the document is predicted to be Chinese.
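
The final comparison can be sketched end to end. The test document is assumed here to be D5 = "Chinese Chinese Chinese Tokyo Japan", which matches the exponents and values quoted above:

```python
from fractions import Fraction as F

# Unnormalized posteriors for D5 = "Chinese Chinese Chinese Tokyo Japan"
# (exact arithmetic with fractions to avoid rounding noise).
score_c = F(3, 4) * F(3, 7) ** 3 * F(1, 14) * F(1, 14)  # prior * likelihoods for C
score_j = F(1, 4) * F(2, 9) ** 3 * F(2, 9) * F(2, 9)    # prior * likelihoods for J

print(float(score_c))  # ≈ 0.0003
print(float(score_j))  # ≈ 0.0001
print("C" if score_c > score_j else "J")  # predicted class: C
```

Because the scores are only compared, the normalizing constant P(D5) can be ignored.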

Summary Insights

  • The calculations involve a systematic approach to determining how often certain terms appear within specific contexts (C or J).
  • Each step builds upon previous calculations, emphasizing the importance of understanding conditional probabilities in document classification tasks.
  • Final predictions rely heavily on these computed values, showcasing their significance in determining outcomes based on input data.

Video description

In this video, I explain the workings of the Naive Bayes algorithm using a text classification example. This channel is part of CSEdu4All, an educational initiative that aims to make computer science education accessible to all! We believe that everyone has the right to good education, and geographical and political boundaries should not be a barrier to obtaining knowledge and information. We hope that you will join and support us in this endeavor!

Help us spread computer science knowledge to everyone around the world! Please support the channel and CSEdu4All by hitting "LIKE" and the "SUBSCRIBE" button. Your support encourages us to create more accessible computer science educational content.

  • Patreon: https://www.patreon.com/csedu4all
  • GoFundMe: https://www.gofundme.com/f/csedu4all

Find more interesting courses and videos on our website:

  • Website: https://csedu4all.org/

Find and connect with us on social media:

  • Facebook: https://www.facebook.com/csedu4all
  • LinkedIn: https://www.linkedin.com/in/arti-ramesh01/