AlaaH0ssam/Movie-Popularity-Analysis

🎬 Movie Popularity Prediction

Pattern Recognition Project 2026 | Team CS_1

A machine learning project focused on predicting movie popularity using both Regression and Multiclass Classification techniques.
The project explores how movie metadata such as genres, production companies, release dates, vote counts, and financial information influence audience popularity.

Built as part of the Pattern Recognition / Machine Learning Projects 2026 coursework.


📌 Project Overview

This project is divided into two milestones:

📈 Milestone 1 → Regression

Predict the continuous numerical value of movie popularity.

🎯 Milestone 2 → Classification

Classify movies into one of four popularity categories:

  • Very Low
  • Low
  • Medium
  • High

🧠 Main Objectives

  • Build a complete ML pipeline from raw data to prediction
  • Apply preprocessing and feature engineering techniques
  • Experiment with multiple regression and classification models
  • Perform feature selection and hyperparameter tuning
  • Evaluate models using proper metrics and visualizations
  • Save trained models and preprocessing steps for unseen test datasets

📂 Dataset Features

The dataset contains movie metadata such as:

  • Genres
  • Production Companies
  • Production Countries
  • Spoken Languages
  • Budget
  • Revenue
  • Vote Count
  • Release Date
  • Overview
  • Posters / Backdrops / Homepage
  • Adult Flag
  • Runtime-related metadata
  • Titles & Original Titles

⚙️ Preprocessing Pipeline

The preprocessing stage was designed to transform noisy real-world movie metadata into a clean machine-learning-ready dataset.

🔹 Data Cleaning

  • Removed duplicate rows
  • Removed rows with missing target values
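The two cleaning steps above can be sketched with pandas as follows. This is a minimal sketch, not the project's code: the helper name `clean_rows`, the target column name `popularity`, and the toy values are assumptions.

```python
import pandas as pd


def clean_rows(df: pd.DataFrame, target: str = "popularity") -> pd.DataFrame:
    """Drop exact duplicate rows, then rows whose target value is missing."""
    df = df.drop_duplicates()
    df = df.dropna(subset=[target])
    return df.reset_index(drop=True)


# Toy frame (made-up values) with one duplicate row and one missing target.
toy = pd.DataFrame({
    "title": ["A", "A", "B", "C"],
    "popularity": [1.5, 1.5, None, 3.0],
})
cleaned = clean_rows(toy)
print(cleaned)
```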

🔹 Feature Engineering

Created additional informative features such as:

  • is_title_changed
  • title_length
  • overview_length
  • has_backdrop
  • has_homepage
  • has_poster
  • has_tagline
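One plausible way to derive the features listed above is shown below. The feature names come from the list; the source column names (`title`, `original_title`, `overview`, `backdrop_path`, `homepage`, `poster_path`, `tagline`) and the exact definitions (character counts, non-null flags) are assumptions, since the README does not spell them out.

```python
import pandas as pd


def add_metadata_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add title/overview lengths and presence flags (assumed definitions)."""
    out = df.copy()
    out["is_title_changed"] = (out["title"] != out["original_title"]).astype(int)
    out["title_length"] = out["title"].fillna("").str.len()
    out["overview_length"] = out["overview"].fillna("").str.len()
    # A flag is 1 when the source field is present, 0 when it is missing.
    flag_sources = {
        "has_backdrop": "backdrop_path",
        "has_homepage": "homepage",
        "has_poster": "poster_path",
        "has_tagline": "tagline",
    }
    for flag, source in flag_sources.items():
        out[flag] = out[source].notna().astype(int)
    return out


# Toy rows with made-up values.
toy = pd.DataFrame({
    "title": ["Long Road", "Echo"],
    "original_title": ["Der lange Weg", "Echo"],
    "overview": ["A short synopsis.", None],
    "backdrop_path": ["/b.jpg", None],
    "homepage": [None, "https://example.com"],
    "poster_path": ["/p.jpg", "/q.jpg"],
    "tagline": [None, None],
})
features = add_metadata_features(toy)
```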

🔹 Temporal Feature Extraction

Extracted:

  • release_day
  • release_month
  • release_year

from the original release date.
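A small sketch of this extraction, assuming the date lives in a column named `release_date` and is stored as ISO-style strings (both are assumptions). Unparseable dates are coerced to missing values so they can be imputed later.

```python
import pandas as pd


def add_release_parts(df: pd.DataFrame, date_col: str = "release_date") -> pd.DataFrame:
    """Split a date column into day, month, and year columns."""
    out = df.copy()
    dates = pd.to_datetime(out[date_col], errors="coerce")  # bad dates become NaT
    out["release_day"] = dates.dt.day
    out["release_month"] = dates.dt.month
    out["release_year"] = dates.dt.year
    return out


toy = pd.DataFrame({"release_date": ["2019-07-04", "not a date"]})
parts = add_release_parts(toy)
```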

🔹 Multi-label Handling

Processed:

  • genres
  • production companies
  • countries
  • languages

by extracting:

  • first category
  • number of categories
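The "first category plus count" idea can be illustrated as below. The raw storage format is not stated in this README, so the sketch assumes comma-separated strings; the helper name `first_and_count` is hypothetical. If the columns hold JSON or list-like values instead, only the parsing line would change.

```python
import pandas as pd


def first_and_count(series: pd.Series, sep: str = ","):
    """Return (first label, number of labels) for a delimited multi-label column."""
    lists = series.fillna("").apply(
        lambda s: [part.strip() for part in s.split(sep) if part.strip()]
    )
    first = lists.apply(lambda items: items[0] if items else None)
    count = lists.apply(len)
    return first, count


genres = pd.Series(["Action, Thriller", "Drama", None])
first_genre, n_genres = first_and_count(genres)
```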

🔹 Missing Values Handling

  • Numerical → Mean Imputation
  • Categorical → Mode Imputation
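A minimal sketch of mean and mode imputation, assuming the fill values are learned on the training split and then reused on other data. The helper names and the toy columns (`runtime`, `original_language`) are hypothetical.

```python
import pandas as pd


def fit_imputation(train: pd.DataFrame) -> dict:
    """Learn one fill value per column: mean for numeric, mode otherwise."""
    fills = {}
    for col in train.columns:
        if pd.api.types.is_numeric_dtype(train[col]):
            fills[col] = train[col].mean()
        else:
            # Assumes the column has at least one non-missing value.
            fills[col] = train[col].mode(dropna=True).iloc[0]
    return fills


def apply_imputation(df: pd.DataFrame, fills: dict) -> pd.DataFrame:
    """Fill missing values with the statistics learned on the training data."""
    return df.fillna(value=fills)


train = pd.DataFrame({
    "runtime": [90.0, float("nan"), 110.0],
    "original_language": ["en", "en", None],
})
unseen = pd.DataFrame({
    "runtime": [float("nan"), 95.0],
    "original_language": [None, "fr"],
})
fills = fit_imputation(train)
filled = apply_imputation(unseen, fills)
```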

🔹 Skewness Correction

Applied log1p() transformation on:

  • budget
  • revenue
  • vote_count
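For illustration, the transformation might be applied as follows. `np.log1p(x)` computes log(1 + x), so zero budgets or revenues map to 0 instead of negative infinity. The helper name and toy values are assumptions.

```python
import numpy as np
import pandas as pd

SKEWED_COLUMNS = ["budget", "revenue", "vote_count"]


def log_transform(df: pd.DataFrame, cols=SKEWED_COLUMNS) -> pd.DataFrame:
    """Apply log(1 + x) to the right-skewed, non-negative columns."""
    out = df.copy()
    out[cols] = np.log1p(out[cols])
    return out


toy = pd.DataFrame({
    "budget": [0.0, 1_000_000.0],
    "revenue": [0.0, 5_000_000.0],
    "vote_count": [0.0, 99.0],
})
logged = log_transform(toy)
```

`np.expm1` inverts the transform if values are needed on the original scale.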

🔹 Frequency Encoding

Encoded high-cardinality categorical features using frequency encoding.
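Frequency encoding replaces each category with how often it occurs in the training data. The sketch below uses relative frequencies and maps unseen categories to 0; whether the project uses relative frequencies or raw counts is not stated, so treat both choices, and the helper names, as assumptions.

```python
import pandas as pd


def fit_frequency_map(train_col: pd.Series) -> dict:
    """Map each category to its share of the training rows."""
    return train_col.value_counts(normalize=True).to_dict()


def apply_frequency_map(col: pd.Series, freq_map: dict) -> pd.Series:
    """Encode a column; categories never seen in training get 0.0."""
    return col.map(freq_map).fillna(0.0)


train_countries = pd.Series(["US", "US", "FR", "US"])
freq_map = fit_frequency_map(train_countries)
encoded = apply_frequency_map(pd.Series(["FR", "JP"]), freq_map)
```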

🔹 Outlier Treatment

Applied Winsorization using:

  • 1st percentile
  • 99th percentile
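Winsorization at these percentiles clips values below the 1st percentile up to it and values above the 99th percentile down to it. A minimal NumPy sketch, with hypothetical helper names and made-up data:

```python
import numpy as np


def fit_winsor_bounds(values, lower_pct=1, upper_pct=99):
    """Compute the clipping bounds from training values."""
    low, high = np.percentile(values, [lower_pct, upper_pct])
    return low, high


def winsorize(values, low, high):
    """Clip values into [low, high] without dropping any rows."""
    return np.clip(values, low, high)


values = np.arange(101, dtype=float)  # 0, 1, ..., 100
low, high = fit_winsor_bounds(values)
clipped = winsorize(values, low, high)
```

Unlike removing outliers, this keeps every row, which matters when the same bounds must later be applied to unseen test data.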

🔹 Feature Scaling

Applied:

  • Z-score Standardization
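Z-score standardization rescales each feature to zero mean and unit variance. The sketch below uses scikit-learn's `StandardScaler`, which is an assumption about the implementation; the arrays are made-up. The scaler is fitted on training data only and then reused for test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[2.0, 40.0]])

scaler = StandardScaler().fit(X_train)     # learns per-column mean and std
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)   # uses the training statistics
```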

🔹 Feature Selection

Used:

  • SelectKBest(f_regression) for Regression
  • SelectKBest(f_classif) for Classification
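A short sketch of both selectors on synthetic data. The values of `k` and the data are assumptions for illustration; the README does not state how many features were kept.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif, f_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

# Regression target driven by features 0 and 2; class label driven by feature 1.
y_reg = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)
y_cls = (X[:, 1] > 0).astype(int)

reg_selector = SelectKBest(score_func=f_regression, k=2).fit(X, y_reg)
cls_selector = SelectKBest(score_func=f_classif, k=1).fit(X, y_cls)

reg_kept = reg_selector.get_support(indices=True)  # indices of kept columns
cls_kept = cls_selector.get_support(indices=True)
X_reg_selected = reg_selector.transform(X)
```

Both score functions are univariate F-tests, so they rank features one at a time and do not account for interactions between features.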

📊 Milestone 1 → Regression Models

Models Used

🌲 Random Forest Regressor

Ensemble of decision trees that captures nonlinear relationships and is less prone to overfitting than a single tree.

🌳 Decision Tree Regressor

Tree-based regression model with interpretable decision paths.


📈 Regression Results

| Model | MAE | RMSE | R² Score |
|---|---|---|---|
| Random Forest | 0.5200 | 6.4667 | 0.4539 |
| Decision Tree | 0.5267 | 7.3917 | 0.2865 |

✅ Best Regression Model: Random Forest Regressor


🎯 Milestone 2 → Classification Models

Models Used

  • Logistic Regression
  • Linear SVC
  • Decision Tree Classifier
  • Tuned Decision Tree
  • Random Forest Classifier
  • Tuned Random Forest

🛠 Hyperparameter Tuning

Applied:

  • GridSearchCV
  • RandomizedSearchCV
  • Cross Validation

Optimized parameters such as:

  • max_depth
  • min_samples_leaf
  • n_estimators

Evaluation metric:

  • f1_weighted
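The tuning setup described above might look like the following sketch. The parameter names and the `f1_weighted` metric come from this README; the grid values, the number of folds, and the synthetic four-class data are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic four-class problem standing in for the popularity categories.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=4, random_state=0
)

param_grid = {
    "max_depth": [4, 8],
    "min_samples_leaf": [1, 5],
    "n_estimators": [25, 50],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="f1_weighted",  # weighted F1 averaged over the CV folds
    cv=3,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

`RandomizedSearchCV` takes `param_distributions` and `n_iter` instead of `param_grid`, and samples a fixed number of combinations rather than trying all of them.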

🏆 Best Classification Model

✅ Random Forest Classifier
Accuracy: 82.59%


📉 Visualizations

The project includes several visual analysis plots:

  • Distribution of Popularity
  • Correlation Heatmaps
  • Actual vs Predicted
  • Residual Plots
  • Error Distribution
  • Feature Importance
  • Confusion Matrices
  • Accuracy Comparison
  • Training Time Comparison
  • Testing Time Comparison

🧪 Model Evaluation Metrics

Regression

  • MAE
  • RMSE
  • R² Score

Classification

  • Accuracy
  • Classification Report
  • Confusion Matrix
  • F1 Weighted Score
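The metrics listed above can all be computed with `sklearn.metrics`. The sketch below uses tiny made-up predictions, not project results.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    mean_absolute_error,
    mean_squared_error,
    r2_score,
)

# Regression: three exact predictions and one large miss.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 6.0])
mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)

# Classification: one "Medium" movie is misclassified as "Low".
labels = ["Very Low", "Low", "Medium", "High"]
c_true = ["Very Low", "Low", "Medium", "High", "High"]
c_pred = ["Very Low", "Low", "Low", "High", "High"]
accuracy = accuracy_score(c_true, c_pred)
f1_weighted = f1_score(c_true, c_pred, average="weighted", zero_division=0)
cm = confusion_matrix(c_true, c_pred, labels=labels)  # rows: true, cols: predicted
report = classification_report(c_true, c_pred, labels=labels, zero_division=0)
print(report)
```

In the regression toy case the single large miss gives an MAE of 0.5 but an RMSE of 1.0, because RMSE weights large errors more heavily. A few badly mispredicted movies are one possible explanation for an RMSE far above the MAE.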

💾 Model Persistence

The project saves:

  • trained models
  • preprocessing objects
  • feature selection steps
  • scaling parameters

using:

pickle

Saved files include:

  • best_classification_model.pkl
  • stepsForPreprocessing.pkl
  • ms1_regression_data.pkl
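A minimal sketch of saving and restoring preprocessing state with `pickle`. The file name matches the list above, but the dictionary keys and values are hypothetical; the real file may be organized differently.

```python
import os
import pickle
import tempfile

# Hypothetical contents: the real stepsForPreprocessing.pkl may store other keys.
steps = {
    "fill_values": {"budget": 12.3, "original_language": "en"},
    "winsor_bounds": {"budget": (0.0, 19.1)},
    "selected_features": ["vote_count", "release_year"],
}


def save_object(obj, path):
    """Serialize any picklable object to disk."""
    with open(path, "wb") as fh:
        pickle.dump(obj, fh)


def load_object(path):
    """Load a pickled object; only do this with files you trust."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


path = os.path.join(tempfile.mkdtemp(), "stepsForPreprocessing.pkl")
save_object(steps, path)
restored = load_object(path)
```

Storing the fitted statistics, rather than recomputing them on test data, keeps the preprocessing of unseen data identical to what the model saw during training.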

📥 Dataset

The dataset is too large to be uploaded directly to GitHub.

Download the dataset from the following link:

📁 Milestone 1 Dataset

🔗 Download MS1 Dataset

📁 Milestone 2 Dataset

🔗 Download MS2 Dataset


🚀 Running The Project

1️⃣ Install Dependencies

pip install pandas numpy scikit-learn matplotlib seaborn

2️⃣ Run Milestone 1 (Regression)

python ms1.py

3️⃣ Run Milestone 2 (Classification)

python ms2.py

4️⃣ Run Test Script On Unseen Data

python testscript.py

The script automatically:

  • loads preprocessing steps
  • loads saved models
  • preprocesses unseen test data
  • predicts outputs
  • prints evaluation metrics
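The flow listed above (load, preprocess, predict, evaluate) is sketched end to end below on toy data. It is not the contents of `testscript.py`: the file names, the single `scaler` preprocessing step, the decision tree, and the data are all stand-ins chosen to keep the example small.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Training side (stand-in for what a training script would save).
X_train = np.array([[0.0], [1.0], [10.0], [11.0]])
y_train = np.array(["Low", "Low", "High", "High"])
scaler = StandardScaler().fit(X_train)
model = DecisionTreeClassifier(random_state=0).fit(scaler.transform(X_train), y_train)

workdir = tempfile.mkdtemp()
steps_path = os.path.join(workdir, "steps_demo.pkl")  # hypothetical file name
model_path = os.path.join(workdir, "model_demo.pkl")  # hypothetical file name
with open(steps_path, "wb") as fh:
    pickle.dump({"scaler": scaler}, fh)
with open(model_path, "wb") as fh:
    pickle.dump(model, fh)


# Test side: load saved objects, preprocess unseen rows, predict, score.
def evaluate_unseen(steps_path, model_path, X_new, y_new):
    with open(steps_path, "rb") as fh:
        steps = pickle.load(fh)
    with open(model_path, "rb") as fh:
        clf = pickle.load(fh)
    X_ready = steps["scaler"].transform(X_new)  # reuse training statistics
    predictions = clf.predict(X_ready)
    return predictions, accuracy_score(y_new, predictions)


X_new = np.array([[0.5], [10.5]])
y_new = np.array(["Low", "High"])
predictions, accuracy = evaluate_unseen(steps_path, model_path, X_new, y_new)
print(f"Accuracy on unseen rows: {accuracy:.2f}")
```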

📁 Suggested Project Structure

Movie-Popularity-Prediction/
│
├── train_data.csv
├── unseenTestSample.csv
│
├── ms1.py
├── ms2.py
├── testscript.py
│
├── best_classification_model.pkl
├── stepsForPreprocessing.pkl
├── ms1_regression_data.pkl
│
├── README.md
└── report.pdf

🔍 Key Insights

🧠 The initial assumption was that a movie's budget alone determines its popularity.

📊 The analysis revealed that:

  • vote count
  • release year
  • audience engagement
  • metadata categories

have stronger predictive power than financial information alone.

These results suggest that movie popularity is a nonlinear problem: the Random Forest ensemble was the best-performing model in both milestones.


🧰 Technologies Used

  • Python
  • Pandas
  • NumPy
  • Scikit-learn
  • Matplotlib
  • Seaborn
  • Pickle

👥 Team Information

Team ID

CS_1

Course

Pattern Recognition / Machine Learning Projects 2026


🌟 Final Conclusion

This project successfully built an end-to-end machine learning pipeline capable of handling real-world movie metadata for both regression and classification tasks.

Random Forest models performed best in both milestones: the regressor reached an R² of 0.4539 and the classifier an accuracy of 82.59%. Because the preprocessing steps and trained models are serialized, the same pipeline can be applied unchanged to unseen test data.

The dataset turned out to be less about “big budget = big popularity” and more like a cinematic ecosystem where timing, audience engagement, and metadata quietly pull strings behind the curtain 🎥✨
