Naive Bayes Classifier for Spam-Ham Analysis

Naive Bayes Email Classification

Summary: The Naive Bayes Email Classification project classifies emails as either legitimate (ham) or spam using Naive Bayes classifiers. The project covers preprocessing the email data (cleaning and vectorizing the text), training three Naive Bayes variants, evaluating their accuracy, and visualizing the results with confusion matrices.

Description: This project focuses on email classification, a common task in email filtering systems where emails are automatically categorized as legitimate or spam. The dataset used contains both legitimate and spam emails, with features such as sender, recipient, subject, date, and content. After loading and preprocessing the data, including cleaning and tokenizing the text, the emails are vectorized using the Bag of Words (BoW) representation. Three different Naive Bayes classifiers (BernoulliNB, ComplementNB, and MultinomialNB) are trained on the vectorized data to build classification models.

Key Features:

Data Loading and Preprocessing: The project begins by loading the email data and cleaning it to remove unnecessary information such as headers and metadata. The text content of the emails is tokenized and processed to remove stop words and non-English words.
Vectorization: The cleaned email text is then vectorized using the CountVectorizer from scikit-learn, which converts the text into numerical features using the Bag of Words approach.
Model Training and Evaluation: Three different Naive Bayes classifiers (BernoulliNB, ComplementNB, and MultinomialNB) are trained on the vectorized email data. Each model's performance is evaluated by its accuracy score on held-out test data.
Visualization of Results: Confusion matrices are generated and visualized for each model to provide insights into their performance in terms of true positive, true negative, false positive, and false negative predictions.
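The steps listed above can be sketched end to end as follows. This is a minimal sketch under stated assumptions, not the project's actual code: the toy corpus, the clean_email helper, the 75/25 split, and the use of scikit-learn's built-in stop word list are all illustrative choices, and the filter for non-English words is omitted.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, ComplementNB, MultinomialNB

# Hypothetical toy corpus standing in for the real dataset (1 = spam, 0 = ham).
spam = [
    "Win a FREE prize now!!!", "Cheap loans, click here",
    "Claim your free cash reward", "Free offer: click to win money",
    "Urgent: cheap prize offer", "Click now for free money",
]
ham = [
    "Meeting agenda for Monday morning", "Please review the attached project report",
    "Lunch tomorrow with the design team?", "Quarterly budget report attached",
    "Can we reschedule the project meeting", "Notes from the team meeting on Monday",
]
texts = spam + ham
labels = [1] * len(spam) + [0] * len(ham)


def clean_email(text):
    """Hypothetical helper: lowercase, keep alphabetic tokens, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(t for t in tokens if t not in ENGLISH_STOP_WORDS)


cleaned = [clean_email(t) for t in texts]
X_train_txt, X_test_txt, y_train, y_test = train_test_split(
    cleaned, labels, test_size=0.25, random_state=0, stratify=labels
)

# Fit the vocabulary on the training split only, then reuse it for the test split.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train_txt)
X_test = vectorizer.transform(X_test_txt)

results = {}
for model in (BernoulliNB(), ComplementNB(), MultinomialNB()):
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    # Rows are true classes, columns are predicted classes, ordered [ham, spam].
    cm = confusion_matrix(y_test, pred, labels=[0, 1])
    results[type(model).__name__] = (accuracy_score(y_test, pred), cm)

for name, (acc, cm) in results.items():
    print(f"{name}: accuracy={acc:.2%}")
    print(cm)
```

On a corpus this small the printed accuracies say nothing about real performance; the point is only the shape of the workflow. To plot a matrix rather than print it, scikit-learn's ConfusionMatrixDisplay.from_predictions(y_test, pred) is one option, assuming matplotlib is installed.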

Results:

Bernoulli Naive Bayes: Achieves an accuracy of approximately 92.14% on the test data.
Complement Naive Bayes: Achieves an accuracy of approximately 89.40% on the test data.
Multinomial Naive Bayes: Achieves an accuracy of approximately 91.31% on the test data.
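
Accuracy is tied to the four confusion-matrix cells by a simple ratio: correct predictions divided by all predictions. The sketch below shows that arithmetic. The counts are hypothetical and chosen only for illustration; they are not the project's results, and accuracy_from_counts is an assumed helper name.

```python
def accuracy_from_counts(tp, tn, fp, fn):
    """Accuracy = (true positives + true negatives) / all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical counts for illustration only (spam is the positive class).
tp, tn, fp, fn = 40, 50, 4, 6

acc = accuracy_from_counts(tp, tn, fp, fn)
precision = tp / (tp + fp)  # share of emails flagged as spam that really are spam
recall = tp / (tp + fn)     # share of actual spam that was caught

print(f"accuracy={acc:.2%} precision={precision:.2%} recall={recall:.2%}")
```

Precision and recall are included because accuracy alone does not show whether errors are false positives (legitimate mail flagged as spam) or false negatives (spam that gets through).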

Conclusion: The Naive Bayes Email Classification project shows that Naive Bayes classifiers are effective at separating legitimate emails from spam. All three variants reach test accuracies between roughly 89% and 92%, with Bernoulli Naive Bayes performing best (92.14%), ahead of Multinomial Naive Bayes (91.31%) and Complement Naive Bayes (89.40%). The project highlights the importance of preprocessing techniques and model selection in email classification tasks. Possible enhancements include experimenting with different vectorization methods, exploring ensemble methods, and fine-tuning hyperparameters to improve classification performance. These directions offer opportunities for future research and development in email filtering systems.
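
As one possible way to try a different vectorization method and tune hyperparameters, the sketch below combines TF-IDF features with a cross-validated grid search over the smoothing parameter alpha. This is an assumption about how such an experiment could look, not something the project reports having done; the toy corpus and the parameter grid are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus (1 = spam, 0 = ham); replace with the real cleaned emails.
texts = [
    "win a free prize now", "cheap loans click here",
    "claim your free cash reward", "free offer click to win money",
    "urgent cheap prize offer", "click now for free money",
    "meeting agenda for monday morning", "please review the attached project report",
    "lunch tomorrow with the design team", "quarterly budget report attached",
    "can we reschedule the project meeting", "notes from the team meeting on monday",
]
labels = [1] * 6 + [0] * 6

# The pipeline refits the vectorizer inside each cross-validation fold.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])

# Illustrative grid: unigrams vs. unigrams plus bigrams, and three smoothing values.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "nb__alpha": [0.1, 0.5, 1.0],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)
print(f"mean cross-validated accuracy: {search.best_score_:.2%}")
```

Wrapping the vectorizer and classifier in a single Pipeline keeps vocabulary fitting inside each training fold, which avoids leaking test-fold words into the model during the search.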