A machine learning project focused on predicting movie popularity using both Regression and Multiclass Classification techniques.
The project explores how movie metadata such as genres, production companies, release dates, vote counts, and financial information influence audience popularity.
Built as part of the Pattern Recognition / Machine Learning Projects 2026 coursework.
This project is divided into two milestones:
Milestone 1 (regression): predict the continuous numerical value of movie popularity.
Milestone 2 (classification): classify movies into one of four popularity categories:
- Very Low
- Low
- Medium
- High
- Build a complete ML pipeline from raw data to prediction
- Apply preprocessing and feature engineering techniques
- Experiment with multiple regression and classification models
- Perform feature selection and hyperparameter tuning
- Evaluate models using proper metrics and visualizations
- Save trained models and preprocessing steps for unseen test datasets
The dataset contains movie metadata such as:
- Genres
- Production Companies
- Production Countries
- Spoken Languages
- Budget
- Revenue
- Vote Count
- Release Date
- Overview
- Posters / Backdrops / Homepage
- Adult Flag
- Runtime-related metadata
- Titles & Original Titles
The preprocessing stage was designed to transform noisy real-world movie metadata into a clean machine-learning-ready dataset.
- Removed duplicate rows
- Removed rows with missing target values
Created additional informative features such as:
- `is_title_changed`
- `title_length`
- `overview_length`
- `has_backdrop`
- `has_homepage`
- `has_poster`
- `has_tagline`
Extracted `release_day`, `release_month`, and `release_year` from the original release date.
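A minimal pandas sketch of the engineered features and date parts described above. The raw column names (`title`, `original_title`, `overview`, `homepage`, `release_date`) and the toy rows are assumptions for illustration, not the project's actual code.

```python
import pandas as pd

# Toy rows; the raw column names are assumed, not taken from the dataset.
df = pd.DataFrame({
    "title": ["City of God", "Alien"],
    "original_title": ["Cidade de Deus", "Alien"],
    "overview": ["Two boys grow up in a violent neighbourhood.", None],
    "homepage": [None, "https://example.com/alien"],
    "release_date": ["2002-08-30", "1979-05-25"],
})

# Flag titles that differ from the original-language title.
df["is_title_changed"] = (df["title"] != df["original_title"]).astype(int)

# Text lengths; missing text counts as length 0.
df["title_length"] = df["title"].fillna("").str.len()
df["overview_length"] = df["overview"].fillna("").str.len()

# Presence flag: 1 when the field is filled in, 0 when it is missing.
df["has_homepage"] = df["homepage"].notna().astype(int)

# Split the release date into day, month, and year.
dates = pd.to_datetime(df["release_date"], errors="coerce")
df["release_day"] = dates.dt.day
df["release_month"] = dates.dt.month
df["release_year"] = dates.dt.year
```

The other presence flags (`has_backdrop`, `has_poster`, `has_tagline`) would follow the same `notna()` pattern on their own columns.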
Processed:
- genres
- production companies
- countries
- languages
by extracting:
- first category
- number of categories
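The first-category and category-count extraction above could look like the sketch below. It assumes each cell holds a JSON-encoded list of objects with a `name` key; the real dataset may store these columns differently, and `parse_names` is a hypothetical helper.

```python
import json

import pandas as pd


def parse_names(cell):
    """Return the 'name' values from a JSON-encoded list of objects (assumed format)."""
    if not isinstance(cell, str):
        return []
    try:
        items = json.loads(cell)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    return [item["name"] for item in items if isinstance(item, dict) and "name" in item]


df = pd.DataFrame({
    "genres": [
        '[{"id": 18, "name": "Drama"}, {"id": 80, "name": "Crime"}]',
        '[{"id": 27, "name": "Horror"}]',
        None,
    ]
})

names = df["genres"].apply(parse_names)

# First listed category, with a placeholder when the list is empty.
df["genres_first"] = names.apply(lambda xs: xs[0] if xs else "Unknown")

# Number of categories attached to the movie.
df["genres_count"] = names.apply(len)
```

The same two derived columns can be built for production companies, countries, and languages by swapping the column name.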
- Numerical → Mean Imputation
- Categorical → Mode Imputation
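A small sketch of mean and mode imputation, assuming the fill values are learned on the training data and reused on unseen data. The column names and values are illustrative only.

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({
    "budget": [100.0, np.nan, 300.0],
    "original_language": ["en", "en", None],
})

# Learn the fill values from the training data only.
fill_values = {
    "budget": train["budget"].mean(),                          # numerical: mean
    "original_language": train["original_language"].mode().iloc[0],  # categorical: mode
}

train_filled = train.fillna(fill_values)

# Unseen data reuses the stored values instead of recomputing them.
unseen = pd.DataFrame({
    "budget": [np.nan, 50.0],
    "original_language": [None, "fr"],
})
unseen_filled = unseen.fillna(fill_values)
```

Storing `fill_values` alongside the model is what makes the same imputation reproducible at test time.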
Applied log1p() transformation on:
- budget
- revenue
- vote_count
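For reference, `log1p` computes log(1 + x), which stays defined at zero and compresses the long right tail typical of money and count columns. A quick sketch on made-up values (the `_log` suffix is only for illustration; the project may overwrite the columns in place):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "budget": [0, 1_000, 250_000_000],
    "revenue": [0, 5_000, 900_000_000],
    "vote_count": [0, 12, 30_000],
})

# log1p(0) is 0, so zero budgets and zero votes need no special handling.
for col in ["budget", "revenue", "vote_count"]:
    df[col + "_log"] = np.log1p(df[col])

# expm1 is the inverse, useful for mapping values back to the raw scale.
recovered_budget = np.expm1(df["budget_log"])
```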
Encoded high-cardinality categorical features using frequency encoding.
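Frequency encoding replaces each category with how often it appears in the training data. The sketch below uses relative frequencies and maps unseen categories to 0; whether the project used raw counts or proportions is not stated, so treat both choices as assumptions.

```python
import pandas as pd

train = pd.DataFrame({"company": ["A", "A", "B", "C"]})
test = pd.DataFrame({"company": ["A", "Z"]})

# Share of training rows that belong to each category.
freq = train["company"].value_counts(normalize=True)

train["company_freq"] = train["company"].map(freq)

# Categories never seen during training get a frequency of 0.
test["company_freq"] = test["company"].map(freq).fillna(0.0)
```

This keeps the feature as a single numeric column no matter how many distinct companies exist, which is the main reason to prefer it over one-hot encoding here.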
Applied Winsorization using:
- 1st percentile
- 99th percentile
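Winsorization at the 1st and 99th percentiles clips extreme values to those bounds rather than dropping the rows. A minimal NumPy sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Mostly well-behaved values plus two extreme outliers.
values = np.append(rng.normal(size=1000), [50.0, -50.0])

# Bounds come from the data itself (in practice, from the training split).
low, high = np.percentile(values, [1, 99])

# Values outside [low, high] are pulled back to the nearest bound.
clipped = np.clip(values, low, high)
```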
Applied:
- Z-score Standardization
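Z-score standardization rescales each feature to zero mean and unit variance. The sketch below uses scikit-learn's `StandardScaler`, which is an assumption about the tooling; the key point is that the scaler is fitted on training data and only applied to test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[2.0, 40.0]])

# Learn per-column mean and standard deviation from the training set.
scaler = StandardScaler().fit(X_train)

X_train_scaled = scaler.transform(X_train)

# The test set is scaled with the training statistics, not its own.
X_test_scaled = scaler.transform(X_test)
```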
Used:
- `SelectKBest(f_regression)` for Regression
- `SelectKBest(f_classif)` for Classification
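A short sketch of both selectors on synthetic data. The values of `k` and the data are made up for illustration; the project's actual `k` is not stated here.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif, f_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

# Regression target driven by columns 0 and 2.
y_reg = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

# Class label driven by column 1.
y_cls = (X[:, 1] > 0).astype(int)

# f_regression scores each feature against a continuous target.
reg_selector = SelectKBest(score_func=f_regression, k=2).fit(X, y_reg)

# f_classif scores each feature with an ANOVA F-test against class labels.
cls_selector = SelectKBest(score_func=f_classif, k=1).fit(X, y_cls)

reg_idx = reg_selector.get_support(indices=True)
cls_idx = cls_selector.get_support(indices=True)
X_reg_selected = reg_selector.transform(X)
```

Both score functions are univariate, so they rank features one at a time and will not detect features that only matter in combination.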
Random Forest Regressor: ensemble-based model capable of handling nonlinear relationships and reducing overfitting.
Decision Tree Regressor: tree-based regression model with interpretable decision paths.
| Model | MAE | RMSE | R² Score |
|---|---|---|---|
| Random Forest | 0.5200 | 6.4667 | 0.4539 |
| Decision Tree | 0.5267 | 7.3917 | 0.2865 |
Best Regression Model: Random Forest Regressor
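A comparison like the one in the table can be produced with a loop over the two regressors. This is a sketch on synthetic data (`make_friedman1`), so the numbers it prints will not match the table; the hyperparameters are placeholders.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic nonlinear data standing in for the movie features.
X, y = make_friedman1(n_samples=400, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
}

results = {}
for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    results[name] = {
        "MAE": mean_absolute_error(y_test, pred),
        "RMSE": np.sqrt(mean_squared_error(y_test, pred)),
        "R2": r2_score(y_test, pred),
    }

for name, scores in results.items():
    print(name, {k: round(float(v), 4) for k, v in scores.items()})
```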
- Logistic Regression
- Linear SVC
- Decision Tree Classifier
- Tuned Decision Tree
- Random Forest Classifier
- Tuned Random Forest
Applied:
- `GridSearchCV`
- `RandomizedSearchCV`
- Cross Validation
Optimized parameters such as:
- `max_depth`
- `min_samples_leaf`
- `n_estimators`
Evaluation metric:
f1_weighted
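A hedged sketch of tuning a Random Forest classifier with `GridSearchCV` and the `f1_weighted` scorer. The grid values, fold count, and synthetic four-class data are assumptions, not the project's actual search space.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic four-class problem standing in for the popularity categories.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

# Placeholder grid over the parameters named above.
param_grid = {
    "max_depth": [3, None],
    "min_samples_leaf": [1, 5],
    "n_estimators": [25, 50],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    scoring="f1_weighted",  # F1 per class, averaged by class support
    cv=3,                   # 3-fold cross validation
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 4))
```

`RandomizedSearchCV` takes the same estimator and scorer but samples `n_iter` combinations from `param_distributions`, which is cheaper when the grid is large.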
Best Classification Model: Random Forest Classifier
Accuracy: 82.59%
The project includes several visual analysis plots:
- Distribution of Popularity
- Correlation Heatmaps
- Actual vs Predicted
- Residual Plots
- Error Distribution
- Feature Importance
- Confusion Matrices
- Accuracy Comparison
- Training Time Comparison
- Testing Time Comparison
- MAE
- RMSE
- R² Score
- Accuracy
- Classification Report
- Confusion Matrix
- F1 Weighted Score
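The classification metrics listed above map directly onto `sklearn.metrics`. The labels and predictions below are hand-made for illustration.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
)

labels = ["Very Low", "Low", "Medium", "High"]
y_true = ["Low", "Low", "Medium", "High", "Very Low", "High"]
y_pred = ["Low", "Medium", "Medium", "High", "Very Low", "Low"]

# Fraction of exact matches.
acc = accuracy_score(y_true, y_pred)

# F1 per class, weighted by how many true samples each class has.
f1_w = f1_score(y_true, y_pred, average="weighted")

# Rows are true classes, columns are predicted classes, in `labels` order.
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Per-class precision, recall, and F1 as a printable text table.
report = classification_report(y_true, y_pred, labels=labels)

print(f"accuracy={acc:.4f} f1_weighted={f1_w:.4f}")
print(cm)
print(report)
```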
The project saves:
- trained models
- preprocessing objects
- feature selection steps
- scaling parameters
using `pickle`.
Saved files include:
- `best_classification_model.pkl`
- `stepsForPreprocessing.pkl`
- `ms1_regression_data.pkl`
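A minimal sketch of saving and reloading preprocessing objects with `pickle`. The dictionary layout (`fill_values`, `scaler`) is hypothetical; only the file name comes from the list above, and the file is written to a temporary directory so the example leaves nothing behind.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])

# Hypothetical bundle of fitted preprocessing steps.
steps = {
    "fill_values": {"budget": 2.0},
    "scaler": StandardScaler().fit(X_train),
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "stepsForPreprocessing.pkl"

    # Serialize the fitted objects to disk.
    with open(path, "wb") as f:
        pickle.dump(steps, f)

    # Load them back, as a test script would.
    with open(path, "rb") as f:
        loaded = pickle.load(f)

# The reloaded scaler behaves like the original one.
restored = loaded["scaler"].transform(X_train)
```

Pickle files should only be loaded from trusted sources, and they are generally safest to load with the same scikit-learn version that wrote them.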
The dataset is too large to be uploaded directly to GitHub.
Download the datasets from the following links:
- Download MS1 Dataset
- Download MS2 Dataset
Install the dependencies, then run the scripts:
- `pip install pandas numpy scikit-learn matplotlib seaborn`
- `python ms1.py`
- `python ms2.py`
- `python testscript.py`
The test script automatically:
- loads preprocessing steps
- loads saved models
- preprocesses unseen test data
- predicts outputs
- prints evaluation metrics
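The flow above can be sketched as follows. Everything here is a stand-in: the `steps` dictionary, the `preprocess` helper, the two feature columns, and the three toy labels are assumptions, and the fitted objects are built in memory instead of being loaded from the `.pkl` files.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

FEATURES = ["budget", "vote_count"]


def preprocess(df, steps):
    """Apply already-fitted steps to a raw frame; nothing is refitted here."""
    X = df[FEATURES].fillna(steps["fill_values"])  # stored mean imputation
    X = np.log1p(X)                                # same log transform as training
    return steps["scaler"].transform(X)            # stored scaling parameters


# Stand-in for what a training script would have fitted and saved.
train = pd.DataFrame({
    "budget": [1e6, 5e7, np.nan, 2e8, 3e5, 9e7],
    "vote_count": [10, 900, 40, 12000, 3, 2500],
    "label": ["Low", "Medium", "Low", "High", "Low", "Medium"],
})
steps = {"fill_values": train[FEATURES].mean().to_dict()}
steps["scaler"] = StandardScaler().fit(
    np.log1p(train[FEATURES].fillna(steps["fill_values"]))
)
model = RandomForestClassifier(n_estimators=25, random_state=0)
model.fit(preprocess(train, steps), train["label"])

# Unseen data goes through the same steps before prediction.
unseen = pd.DataFrame({
    "budget": [np.nan, 1.5e8],
    "vote_count": [25, 8000],
    "label": ["Low", "High"],
})
predictions = model.predict(preprocess(unseen, steps))
print("accuracy:", accuracy_score(unseen["label"], predictions))
```

Keeping all fitted state in one `steps` object is what lets the unseen data be transformed exactly like the training data.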
Movie-Popularity-Prediction/
│
├── train_data.csv
├── unseenTestSample.csv
│
├── ms1.py
├── ms2.py
├── testscript.py
│
├── best_classification_model.pkl
├── stepsForPreprocessing.pkl
├── ms1_regression_data.pkl
│
├── README.md
└── report.pdf
Initial assumptions suggested that movie budget alone determines popularity.
The analysis revealed that:
- vote count
- release year
- audience engagement
- metadata categories
have stronger predictive power than financial information alone.
The project demonstrated that movie popularity is a complex nonlinear problem best handled using ensemble learning techniques.
- Python
- Pandas
- NumPy
- Scikit-learn
- Matplotlib
- Seaborn
- Pickle
CS_1
Pattern Recognition / Machine Learning Projects 2026
This project successfully built an end-to-end machine learning pipeline capable of handling real-world movie metadata for both regression and classification tasks.
The best models in both tasks were Random Forests: the regressor reached an R² of 0.4539 and the classifier reached 82.59% accuracy. Saved preprocessing steps and serialized models let the same pipeline run on unseen test data.
The dataset turned out to be less about “big budget = big popularity” and more like a cinematic ecosystem where timing, audience engagement, and metadata quietly pull strings behind the curtain 🎥✨