Machine learning classification model for predicting sales lead quality. Achieves 81.06% ROC AUC and 84.74% recall with a tuned XGBoost classifier on 7,420 B2B IT sales leads.
Predicts lead quality (High Potential vs Low Potential) from 6 features including geographic location, product ID, lead source, and delivery mode. Addresses class imbalance (38.4% / 61.6%), high-cardinality categoricals (26 sources, 18 locations), and missing data (24.4% in Mobile field).
Final Model: XGBoost Classifier (tuned)
- Test: Accuracy = 72.44%, Recall = 84.74%, ROC AUC = 81.06%
- Cross-validation: F1-Score = 0.630 +/- 0.020
- Features: 8 engineered (derived from Location, Product_ID, Source, Delivery_Mode, and Created_Month)
- Decision threshold: 0.40 (selected for F1-Score balance)
See Complete_Data_Analysis_Report.md for full methodology and model selection rationale.
pip install -r requirements.txt
pip install -r requirements-dev.txt  # For development

import joblib

model = joblib.load('models/final_xgb_model.pkl')
predictions = model.predict(X_new)  # Requires 8 engineered features
probabilities = model.predict_proba(X_new)[:, 1]  # For lead scoring

Open notebooks/PRCL-0019 Sales Effectiveness.ipynb for the complete data preparation, modeling, and evaluation workflow.
| Property | Details |
|---|---|
| Source | FicZon Inc. Sales Database (April-November 2018) |
| Samples | 7,420 leads (7,422 original, 2 duplicates removed) |
| Features | 9 attributes (6 retained + 3 PII removed) |
| Split | 5,936 train / 1,484 test (stratified 80/20) |
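The stratified 80/20 split above could be reproduced with scikit-learn; a minimal sketch, assuming X and y are the engineered feature matrix and Lead_Category target from the notebook:

```python
from sklearn.model_selection import train_test_split

# Stratify on the target so the 38.4% / 61.6% class ratio is preserved
# in both partitions (5,936 train / 1,484 test on 7,420 leads).
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.20,
    stratify=y,
    random_state=42,  # assumed seed; the notebook may use a different value
)
```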
See Exploratory_Data_Analysis_Report.md for detailed feature analysis.
Core directories:
- data/raw/ - Original dataset (project_sales.csv from MySQL)
- notebooks/ - Full analysis pipeline (PRCL-0019 Sales Effectiveness.ipynb)
- src/ - Reusable modules (utils, statistical_analysis, model_evaluation)
- models/ - Trained model artifacts (final_xgb_model.pkl)
- reports/ - Analysis reports (Complete, EDA, Business Insights, Business Recommendations)
- results/figures/ - Visualization gallery
Import pattern used: The notebook imports functions from src/ modules using:
from src.utils import memory_usage, dataframe_memory_usage, garbage_collection
from src.statistical_analysis import normality_test_with_skew_kurt, spearman_correlation
from src.model_evaluation import evaluate_model, threshold_analysis, cross_validation_analysis_table

Running analysis: The notebook contains the full ML pipeline. Execute cells sequentially for:
- Data loading from MySQL and cleaning
- Statistical analysis (normality tests, correlation, VIF)
- Feature engineering (frequency encoding, temporal features; see the encoding sketch after this list)
- Model comparison (XGBoost, CatBoost, LightGBM, ensembles)
- Hyperparameter tuning (GridSearchCV)
- Threshold analysis and business insights
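A minimal sketch of the frequency-encoding and temporal-feature steps listed above; train_df/test_df and the Created_Date column are assumed names, and the notebook's implementation may differ:

```python
import pandas as pd

# Frequency encoding: map each high-cardinality category (26 sources,
# 18 locations) to its relative frequency in the training data.
location_freq = train_df['Location'].value_counts(normalize=True)
train_df['Location_Freq'] = train_df['Location'].map(location_freq)
test_df['Location_Freq'] = test_df['Location'].map(location_freq).fillna(0)

# Temporal features derived from the lead creation timestamp.
train_df['Created_Date'] = pd.to_datetime(train_df['Created_Date'])
train_df['Created_Month'] = train_df['Created_Date'].dt.month
train_df['Created_Hour'] = train_df['Created_Date'].dt.hour
```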
Base model evaluation:
from src.model_evaluation import evaluate_model
metrics = evaluate_model(model, X_train, y_train, X_test, y_test)
# Returns: 17 metrics including accuracy, precision, recall, F1, ROC AUC, MCC, Cohen's Kappa

Cross-validation analysis:
from src.model_evaluation import cross_validation_analysis_table
cv_results = cross_validation_analysis_table(
model=model,
X_train=X_train,
y_train=y_train,
cv_folds=5,
scoring_metric='f1'
)

Threshold analysis:
from src.model_evaluation import threshold_analysis
threshold_df = threshold_analysis(
model=model,
X_test=X_test,
y_test=y_test,
thresholds=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
)

Normality testing:
from src.statistical_analysis import normality_test_with_skew_kurt
normal_df, not_normal_df = normality_test_with_skew_kurt(df)
# Uses Shapiro-Wilk (n<=5000) or Kolmogorov-Smirnov (n>5000)

Multicollinearity detection:
from src.statistical_analysis import calculate_vif
vif_data, high_vif_features = calculate_vif(
data,
exclude_target='Lead_Category',
multicollinearity_threshold=8.0
)
# Returns VIF scores and features exceeding threshold

Spearman correlation:
from src.statistical_analysis import spearman_correlation
corr_matrix, high_corr_pairs = spearman_correlation(
data,
non_normal_cols=['Product_ID', 'Created_Hour', 'Created_Month'],
exclude_target='Lead_Category',
multicollinearity_threshold=0.80
)

Loading the final model:
import joblib
model = joblib.load('models/final_xgb_model.pkl')
# Predict class
predictions = model.predict(X_new)
# Get probability scores for lead prioritization
lead_scores = model.predict_proba(X_new)[:, 1]
# Apply selected threshold (0.40)
high_potential = lead_scores >= 0.40

Model selection criteria (weighted):
- Recall (capture High Potential leads) - 40%
- ROC AUC (overall discrimination) - 30%
- F1-Score (precision-recall balance) - 20%
- Cross-validation stability - 10%
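A minimal sketch of how these weights could be combined into a single composite score for ranking candidate models; the metric values shown are placeholders for illustration, not figures from the report:

```python
# Weights mirror the selection criteria above; inputs are assumed to be 0-1 scaled.
WEIGHTS = {'recall': 0.40, 'roc_auc': 0.30, 'f1': 0.20, 'cv_stability': 0.10}

def composite_score(metrics: dict) -> float:
    """Weighted sum of the four selection metrics."""
    return sum(weight * metrics[name] for name, weight in WEIGHTS.items())

# Placeholder inputs for illustration only.
candidate = {'recall': 0.85, 'roc_auc': 0.81, 'f1': 0.67, 'cv_stability': 0.90}
print(f"Composite score: {composite_score(candidate):.3f}")
```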
Why XGBoost over CatBoost:
- XGBoost achieved 84.74% recall vs CatBoost 56.32% (required for business objective)
- CatBoost had better accuracy (73.05% vs 72.44%) but significant recall drop after tuning (56.32% vs 84.74%)
- XGBoost: ROC AUC 81.06%, CV F1 0.630 +/- 0.020 (stable)
- Trade-off: accept 0.61 percentage points lower accuracy for 28.42 points higher recall
Feature importance insights:
- Location frequency: 32.27% (Bangalore contributes 44.5% of High Potential leads)
- Product_ID: 25.34% (Higher IDs 16-28 correlate with better lead quality)
- Delivery_Mode_5: 14.10% (Low conversion 24.7%, negative indicator)
- Created_Month: 7.57% (Temporal patterns: Q4 shows 45%+ conversion)
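Assuming the saved artifact exposes scikit-learn's feature_importances_ attribute (as a fitted XGBClassifier does), the per-feature contributions above can be inspected as follows; feature_names is an assumed list of the engineered training columns:

```python
import joblib
import pandas as pd

model = joblib.load('models/final_xgb_model.pkl')

# Rank the engineered features by importance; feature_names must match the
# column order of the matrix the model was trained on.
importances = pd.Series(model.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
```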
Threshold selection (0.40):
- Achieves F1-Score of 67.27%
- Achieves 84.74% recall (captures 484 of 571 High Potential leads)
- Precision: 55.77% (acceptable false positive rate for lead scoring application)
- Standard 0.50 threshold would miss 89 additional high-value leads
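The 0.40 vs 0.50 trade-off can be checked directly from the held-out probabilities; a small sketch assuming X_test and y_test come from the notebook's stratified split:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.asarray(y_test)
lead_scores = model.predict_proba(X_test)[:, 1]

for threshold in (0.40, 0.50):
    preds = (lead_scores >= threshold).astype(int)
    captured = int(((preds == 1) & (y_true == 1)).sum())
    print(f"threshold={threshold:.2f}  "
          f"recall={recall_score(y_true, preds):.4f}  "
          f"precision={precision_score(y_true, preds):.4f}  "
          f"captured={captured}/{int((y_true == 1).sum())}")
```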
Detailed analysis in reports/:
- Complete_Data_Analysis_Report.md - Full methodology, statistical analysis, and model evaluation
- Exploratory_Data_Analysis_Report.md - Feature analysis, distributions, and correlations
- Sales_Effectiveness_Insights.md - Business insights and strategic recommendations
- GALLERY.md - Visualizations
Current inefficiencies:
- Low-potential leads (61.6% of volume) consume the same sales resources as high-potential leads
- 20.7% junk leads (1,536) waste 384 hours annually
- Manual categorization creates inconsistent prioritization
Model-driven improvements:
- Automated lead scoring provides 4-tier prioritization system
- 45% reduction in junk lead processing
- 23% increase in sales team productivity
- Estimated annual savings: $142,000 + revenue gains: $380,000
Implementation recommendations:
- Deploy model with 0.40 threshold for binary classification
- Implement tiered routing: Score ≥0.70 → senior reps, 0.40-0.69 → standard reps (see the routing sketch after this list)
- Reallocate marketing budget: Increase Bangalore +50%, eliminate US Website channel
- Scale referral programs (89.4% conversion) from 3.79% to 10% of volume
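A minimal sketch of the tiered routing rule referenced above; the tier labels and the handling of scores below 0.40 are assumptions, since the report specifies only the ≥0.70 and 0.40-0.69 bands:

```python
def route_lead(score: float) -> str:
    """Map a lead score from predict_proba to a routing tier."""
    if score >= 0.70:
        return "senior_rep"       # highest-priority leads
    if score >= 0.40:
        return "standard_rep"     # above the selected 0.40 decision threshold
    return "deprioritize"         # assumed handling for scores below threshold

lead_scores = model.predict_proba(X_new)[:, 1]
assignments = [route_lead(score) for score in lead_scores]
```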
# Format code
black .
isort .
# Lint code
flake8
# Format notebooks
nbqa black notebooks/
# Run pre-commit hooks
pre-commit run --all-files

Configured hooks:
- black (88-char lines)
- isort (black-compatible, src first-party)
- flake8 (max-complexity: 10, ignore: E203, W503, E501)
- nbqa-black (notebooks)
- Validation (YAML, JSON, trailing whitespace)
jupyter notebook "notebooks/PRCL-0019 Sales Effectiveness.ipynb"

- MIT License - Copyright (c) 2025 Dhanesh B. B.
- GitHub: https://github.com/dhaneshbb
- Project: DataMites PM-PR-0019