Email remains one of the most widely used communication channels in the digital era. However, misuse in the form of spam—unwanted, often fraudulent or aggressively promotional messages—poses a significant threat. According to Statista (2024), over 45% of global email traffic consists of spam messages. Spam can lead to distractions, financial losses, and security risks such as phishing and malware attacks (Statista, 2024). Hence, an automated system is necessary to detect and filter spam efficiently.
How can we build a machine learning model that accurately classifies email messages as spam or ham (non-spam)?
The goal is to develop a robust classification model capable of separating spam emails from ham emails with high accuracy and strong generalization performance. The specific objectives are:
- Develop a baseline classification model using Logistic Regression.
- Implement a more complex and potentially higher-performing model using Random Forest.
- Create a prediction function to perform inference on new, unseen email text.
- Compare the performance of both models to determine the optimal solution.
The dataset used for this project includes two main columns:
- `text`: The content of the email.
- `label`: The category of the email, either `spam` or `ham`.
- Label Encoding: Converted textual labels ('spam', 'ham') into numerical values (1 for spam, 0 for ham).
- Missing Value Check: Ensured the dataset contained no null values.
- Text Length Analysis: Added a `length` column to analyze the distribution of email lengths.
- WordCloud & Histogram Visualization: Used to explore keyword frequency and message structure for each class.
- Downsampling: Since the dataset was imbalanced (ham outnumbered spam), downsampling was performed on the majority class to create a balanced dataset for fair model training.
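The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration using pandas, not the project's actual code: the toy rows, the `label_num` column name, and the variable names are assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy rows standing in for the real dataset (illustration only).
df = pd.DataFrame({
    "text": [
        "Win a free prize now",
        "Meeting at 10am tomorrow",
        "Lunch today?",
        "Claim your reward, click here",
        "Please review the attached report",
        "See you at the gym",
    ],
    "label": ["spam", "ham", "ham", "spam", "ham", "ham"],
})

# Label encoding: 1 for spam, 0 for ham ("label_num" is an assumed column name).
df["label_num"] = df["label"].map({"ham": 0, "spam": 1})

# Missing value check: total number of null cells in the frame.
n_missing = int(df.isnull().sum().sum())

# Text length analysis: character count of each email.
df["length"] = df["text"].str.len()

# Downsampling: shrink the majority class (ham) to the size of the minority class (spam).
spam = df[df["label_num"] == 1]
ham = df[df["label_num"] == 0]
ham_down = ham.sample(n=len(spam), random_state=42)

# Recombine and shuffle so the classes are interleaved.
balanced = (
    pd.concat([spam, ham_down])
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)
```

Downsampling discards some ham messages, so it trades data volume for class balance. On a small dataset, class weighting may be worth considering as an alternative.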
Logistic Regression (baseline):
- ✅ Pros: Simple, fast to train, easy to interpret.
- ❌ Cons: May underperform on complex or non-linear feature relationships.
- 🔧 Key Parameters: `solver=lbfgs`, `random_state=42` (all other settings left at their defaults).
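A Logistic Regression baseline with these parameters might look like the sketch below. It assumes scikit-learn and a TF-IDF vectorizer. The vectorization step is not described in this document, so `TfidfVectorizer` is a stand-in, and the toy corpus is hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; 1 = spam, 0 = ham.
texts = [
    "Win a free prize now",
    "Claim your cash reward today",
    "Free entry, click here to win",
    "Meeting moved to 10am tomorrow",
    "Can you review the attached report?",
    "Lunch at noon?",
]
labels = [1, 1, 1, 0, 0, 0]

# Assumption: raw text is turned into TF-IDF features before classification.
logreg_model = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(solver="lbfgs", random_state=42),
)
logreg_model.fit(texts, labels)

# One row per message; columns are ordered as [P(ham), P(spam)].
probabilities = logreg_model.predict_proba(["Click here to claim your free prize"])
```

Wrapping the vectorizer and classifier in one pipeline means the same text transformation is applied at training and inference time.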
Random Forest:
- ✅ Pros: High accuracy, relatively robust to overfitting, handles non-linear relationships well.
- ❌ Cons: Slower training time, less interpretable.
- 🔧 Key Parameters: `n_estimators=100`, `random_state=42`.
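A Random Forest with these parameters could be set up as in the following sketch. As before, scikit-learn is assumed, the TF-IDF step is an assumption rather than something this document specifies, and the toy corpus is hypothetical.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; 1 = spam, 0 = ham.
texts = [
    "Win a free prize now",
    "Claim your cash reward today",
    "Free entry, click here to win",
    "Meeting moved to 10am tomorrow",
    "Can you review the attached report?",
    "Lunch at noon?",
]
labels = [1, 1, 1, 0, 0, 0]

# Assumption: TF-IDF features feed the forest; the source does not name the vectorizer.
rf_model = make_pipeline(
    TfidfVectorizer(),
    RandomForestClassifier(n_estimators=100, random_state=42),
)
rf_model.fit(texts, labels)

forest = rf_model.named_steps["randomforestclassifier"]
vectorizer = rf_model.named_steps["tfidfvectorizer"]

# Terms in column order, so they line up with feature_importances_.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Five terms the forest relied on most (only loosely meaningful on toy data).
top_terms = sorted(zip(forest.feature_importances_, terms), reverse=True)[:5]
```

Feature importances give a partial view into an otherwise hard-to-read ensemble, which softens the interpretability drawback somewhat.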
| Model | Accuracy |
|---|---|
| Random Forest | 95% |
| Logistic Regression | 94% |
Random Forest was selected as the best-performing model. Despite slightly longer training times, it achieved the higher accuracy (95% versus 94% for Logistic Regression) and demonstrated stable performance in identifying spam emails.
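The comparison and the prediction function could be wired together roughly as shown here. This is a sketch under assumptions: the toy corpus, the TF-IDF step, the 75/25 split, and the helper name `predict_email` are all illustrative, and accuracies on such a tiny sample say nothing about the figures in the table.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus (8 spam, 8 ham); 1 = spam, 0 = ham.
spam = [
    "Win a free prize now", "Claim your cash reward today",
    "Free entry, click here to win", "You have won a free holiday",
    "Cheap loans approved instantly", "Click now for a free gift card",
    "Urgent: claim your prize money", "Exclusive offer, win cash today",
]
ham = [
    "Meeting moved to 10am tomorrow", "Can you review the attached report?",
    "Lunch at noon?", "See you at the gym tonight",
    "Thanks for sending the notes", "Are we still on for dinner?",
    "The report is due on Friday", "Please call me when you are free",
]
texts = spam + ham
labels = [1] * len(spam) + [0] * len(ham)

# Stratified split keeps the spam/ham ratio the same in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

models = {
    "Logistic Regression": make_pipeline(
        TfidfVectorizer(), LogisticRegression(solver="lbfgs", random_state=42)
    ),
    "Random Forest": make_pipeline(
        TfidfVectorizer(), RandomForestClassifier(n_estimators=100, random_state=42)
    ),
}

# Fit each model and record its accuracy on the held-out messages.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

best_name = max(scores, key=scores.get)


def predict_email(model, text):
    """Return 'spam' or 'ham' for a single raw email string."""
    return "spam" if model.predict([text])[0] == 1 else "ham"


result = predict_email(models[best_name], "Claim your free prize today")
```

Passing the fitted pipeline into `predict_email` keeps the helper independent of which model won the comparison.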
- Accuracy: Measures the overall correctness of the model.

  \[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]

- Confusion Matrix: Provides a breakdown of correct and incorrect predictions per class (Spam = 1, Ham = 0).
- Classification Report: Includes:
  - Precision: Proportion of positive identifications that were actually correct.
  - Recall: Proportion of actual positives that were correctly identified.
  - F1-Score: Harmonic mean of precision and recall.
These metrics ensure that the model not only achieves high accuracy but is also reliable in detecting spam with minimal false positives.
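As a worked illustration of these metrics, the sketch below computes them with scikit-learn on a small set of hand-written labels. The `y_true` and `y_pred` values are made up for the example and are not results from this project.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Made-up labels for illustration; 1 = spam, 0 = ham.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

# Here TP = 3, FN = 1, TN = 5, FP = 1, so accuracy = (3 + 5) / 10 = 0.8.
accuracy = accuracy_score(y_true, y_pred)

# Rows are true classes and columns are predicted classes, in the order [ham, spam]:
# [[TN, FP], [FN, TP]] = [[5, 1], [1, 3]]
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

# Spam is the positive class (pos_label=1 by default).
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3 / 4
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3 / 4
f1 = f1_score(y_true, y_pred)                # harmonic mean of 0.75 and 0.75

# Per-class precision, recall, and F1 in a single text table.
report = classification_report(y_true, y_pred, target_names=["ham", "spam"])
print(report)
```

In this example one ham message is flagged as spam (the false positive), which is exactly the kind of error that precision exposes and accuracy alone can hide.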