Data Scientist & Machine Learning Engineer
I design data-centric, explainable, and interactive AI systems.
All Repositories • LinkedIn • Email
- Data Scientist & ML Engineer with a Physics background and hands-on experience across:
  - Time series & forecasting
  - NLP and text classification
  - Computer vision & recommendation systems
  - Algorithmic trading & risk modeling
- Comfortable with the full ML lifecycle:
  - Problem framing → data generation/collection → feature engineering → modeling → evaluation → deployment (Streamlit/FastAPI dashboards).
- Strong focus on:
  - Synthetic data & robustness
  - Interactive dashboards & decision-support tools
  - LLMs, RAG & hybrid AI architectures
  - Explainability & human-centered AI
- I enjoy writing deep, structured explanations; many repos are part code, part essay.
Languages & Tools
- Languages: Python, SQL, MQL5, a bit of Solidity
- ML / DL: PyTorch, TensorFlow/Keras, scikit-learn, XGBoost, LightGBM
- Time Series & Forecasting: ARIMA/SARIMA, Prophet, LSTMs, rolling windows
- NLP: BERT, TF-IDF, classic ML (LogReg, Naive Bayes, SVM)
- Apps & Services: Streamlit, FastAPI, Plotly, SQLite/SQLAlchemy
- MLOps-ish / Analysis: SHAP, feature importance, evaluation frameworks, synthetic data benchmarks
Domains & Topics
- Synthetic & tabular data quality
- Forecasting & scenario simulation
- Recommender systems
- Smart contract risk analytics
- Interactive data storytelling
- LLMs & Retrieval-Augmented Generation (RAG)
- Algorithmic trading & financial modeling
I organize my projects into a few clusters so you can jump directly to what interests you:
- Synthetic Data, Data Realism & Anomaly Detection
- Forecasting, Dashboards & Interactive ML
- LLMs, RAG & Hybrid AI Systems
- Core ML, Deep Learning & Portfolio Projects
- Explainability, Thinking Like a Data Scientist & AI Philosophy
- Smart Contracts, Security & Risk Analytics
Synthetic Data, Data Realism & Anomaly Detection
Generating realistic tables, probing data “authenticity”, and stress-testing models.
These projects focus on fidelity, coverage, privacy, and utility of synthetic data, plus anomaly detection in tabular domains.
- Autocurator-Synthetic-Data-Benchmark
  A benchmarking toolkit for synthetic tabular data generators:
  - Compares different models (GANs, VAEs, copulas, etc.)
  - Evaluates distribution fidelity, feature coverage, privacy leakage, and downstream ML utility
  - Produces visual reports (PCA, correlations, histograms) to understand where generators succeed or fail
  Goal: make it easier to choose the right synthetic data approach for a business use case.
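A minimal sketch of the kind of per-column fidelity check such a toolkit runs; the two DataFrames and their columns are illustrative stand-ins, not the repo's code:

```python
# Per-column distribution fidelity via the two-sample Kolmogorov-Smirnov test.
# `real` and `synth` are stand-in DataFrames with matching numeric columns.
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
real = pd.DataFrame({"age": rng.normal(40, 10, 1000),
                     "income": rng.lognormal(10, 0.5, 1000)})
synth = pd.DataFrame({"age": rng.normal(41, 12, 1000),
                      "income": rng.lognormal(10, 0.6, 1000)})

for col in real.columns:
    stat, p = ks_2samp(real[col], synth[col])
    print(f"{col}: KS={stat:.3f} (p={p:.3f})")  # smaller KS -> closer marginals
```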
- Synthetic-Data-Artist
  Deep dive into Gaussian Copula vs VAE for tabular data:
  - Side-by-side comparison of marginal and joint distributions
  - PCA visualizations of real vs synthetic embeddings
  - Correlation matrix similarity, pair plots, and coverage analysis
  Think of it as a “microscope” for synthetic tabular data.
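A small sketch of the shared-PCA view described above; the data here is a random stand-in for real vs generated tables:

```python
# Project real and synthetic rows into the same PCA space fitted on real data.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
real = rng.multivariate_normal([0, 0, 0], np.eye(3), 500)
synth = real + rng.normal(0, 0.3, real.shape)  # stand-in for generator output

scaler = StandardScaler().fit(real)
pca = PCA(n_components=2).fit(scaler.transform(real))
r2 = pca.transform(scaler.transform(real))
s2 = pca.transform(scaler.transform(synth))

plt.scatter(r2[:, 0], r2[:, 1], s=8, alpha=0.4, label="real")
plt.scatter(s2[:, 0], s2[:, 1], s=8, alpha=0.4, label="synthetic")
plt.legend()
plt.title("Real vs synthetic in shared PCA space")
plt.show()
```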
- Anomaly-Detection
  End-to-end anomaly detection on synthetic transactions/sales:
  - Data generation with realistic “weird” patterns injected
  - Uses Isolation Forest, Local Outlier Factor, and classic statistical methods
  - Visual diagnostics and confusion-matrix-style evaluations
  Useful for building intuition about anomalies in finance/ops data.
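A compact sketch of the two core detectors on injected outliers; the data and contamination rate are illustrative assumptions:

```python
# Isolation Forest vs Local Outlier Factor on a toy 2-D dataset
# with 20 deliberately injected outliers.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (980, 2)),
               rng.normal(6, 0.5, (20, 2))])  # injected anomalies

iso = IsolationForest(contamination=0.02, random_state=0).fit(X)
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.02)

iso_flags = iso.predict(X) == -1       # -1 marks anomalies
lof_flags = lof.fit_predict(X) == -1
print(f"IsolationForest flagged {iso_flags.sum()}, LOF flagged {lof_flags.sum()}")
```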
- Market-Basket-Analysis
  Retail-style synthetic purchase data:
  - Apriori & FP-Growth frequent itemset mining
  - Association rules with support, confidence, and lift
  - Exportable rules + quick visual summaries
  Foundation for recommendation, cross-sell, and promo design.
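One common way to run this pipeline is with mlxtend (an assumption here; the baskets and thresholds are illustrative):

```python
# Frequent itemsets -> association rules with support, confidence, and lift.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

baskets = [["bread", "milk"], ["bread", "butter"],
           ["milk", "butter", "bread"], ["milk"]]
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(baskets).transform(baskets), columns=te.columns_)

frequent = apriori(onehot, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="lift", min_threshold=1.0)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```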
- Sales-Data-Analysis
  Lightweight but complete:
  - Synthetic sales dataset generation
  - Cleaning, aggregation, KPI dashboards
  - Time-based trend analysis and segmentation
  Great for explaining analytics pipelines to non-technical stakeholders.
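A sketch of the KPI-aggregation step at the heart of such a pipeline; the column names (`order_date`, `customer_id`, `revenue`) are illustrative assumptions:

```python
# Monthly KPI rollup: revenue, order count, unique customers, AOV.
import pandas as pd

sales = pd.DataFrame({
    "order_date": pd.to_datetime(["2024-01-03", "2024-01-20",
                                  "2024-02-05", "2024-02-18"]),
    "customer_id": [1, 2, 1, 3],
    "revenue": [120.0, 80.0, 200.0, 50.0],
})

monthly = sales.groupby(sales["order_date"].dt.to_period("M")).agg(
    revenue=("revenue", "sum"),
    orders=("revenue", "size"),
    customers=("customer_id", "nunique"),
)
monthly["aov"] = monthly["revenue"] / monthly["orders"]  # average order value
print(monthly)
```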
- Missing-Data-Doctor
  Toolkit for missingness profiling & imputation:
  - Visual missingness maps and patterns (by column, row, time)
  - Simple and advanced imputation strategies
  - Before/after comparisons for ML performance
  Focus: understanding how missing data distorts models.
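A minimal sketch of the before/after comparison idea, assuming injected missingness and a few standard imputers (the dataset and strategies are illustrative):

```python
# Compare imputation strategies by downstream model performance.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_regression(n_samples=500, n_features=8, noise=10, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.15] = np.nan  # inject 15% missingness

for name, imputer in [("mean", SimpleImputer(strategy="mean")),
                      ("median", SimpleImputer(strategy="median")),
                      ("knn", KNNImputer())]:
    pipe = make_pipeline(imputer, RandomForestRegressor(random_state=0))
    score = cross_val_score(pipe, X, y, cv=3).mean()  # R^2 by default
    print(f"{name} imputation -> R^2 ≈ {score:.3f}")
```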
- Noise-Injection-Techniques
  Experiments on robustness via controlled noise:
  - Add noise to tabular features/labels during training
  - Explore how different noise types affect generalization
  - PyTorch-based training loops and results visualization
  Bridge between data augmentation and robustness in non-vision domains.
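A sketch of the core trick: corrupt the features inside the training step only (the model, optimizer, and noise level are assumptions):

```python
# Gaussian feature-noise injection inside a single PyTorch training step.
import torch

def train_step(model, batch_x, batch_y, optimizer, loss_fn, noise_std=0.1):
    model.train()
    # Corrupt inputs only at train time; evaluation sees clean data.
    noisy_x = batch_x + noise_std * torch.randn_like(batch_x)
    optimizer.zero_grad()
    loss = loss_fn(model(noisy_x), batch_y)
    loss.backward()
    optimizer.step()
    return loss.item()
```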
Forecasting, Dashboards & Interactive ML
Treat forecasting and analytics as interactive tools, not static reports.
These projects focus on Streamlit dashboards, scenario analysis, and business-friendly UIs.
- Forecast-Factory
  Forecasting & simulation app:
  - Streamlit UI to upload time series (sales, traffic, revenue)
  - Uses Prophet (and/or other models) for forecasting with confidence intervals
  - Lets users run “what if we change X?” simulations on key drivers
  Designed for business teams to explore future scenarios without touching code.
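The forecasting core boils down to a few Prophet calls; the `ds`/`y` columns are Prophet's required schema, and the data below is a stand-in for an uploaded series:

```python
# Fit Prophet and forecast 30 days ahead with confidence intervals.
import pandas as pd
from prophet import Prophet

df = pd.DataFrame({
    "ds": pd.date_range("2023-01-01", periods=365, freq="D"),
    "y": range(365),  # stand-in for uploaded sales/traffic/revenue
})

m = Prophet()
m.fit(df)
future = m.make_future_dataframe(periods=30)
forecast = m.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())
```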
- Market-IQ
  BI-style web app:
  - Ingests transactional/sales-like data
  - Computes core KPIs (revenue, retention, AOV, etc.)
  - Time-series charts, comparisons, and exportable reports
  Acts like a focused analytics tool for small/medium businesses.
- Data-Storytelling-Dashboard
  End-to-end narrative dashboard:
  - E-commerce style dataset with customers, orders, and products
  - KPIs, cohort analysis, and retention curves
  - Visuals + narrative “takeaways” to interpret the charts
  Focuses on storytelling, not just plotting.
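A sketch of the cohort-retention computation behind such curves; the orders table below is an illustrative assumption:

```python
# Cohort retention: share of each first-purchase cohort still active N months later.
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3],
    "order_date": pd.to_datetime(["2024-01-05", "2024-02-10", "2024-01-20",
                                  "2024-03-02", "2024-02-15"]),
})

orders["order_month"] = orders["order_date"].dt.to_period("M")
orders["cohort"] = orders.groupby("customer_id")["order_month"].transform("min")
orders["period"] = (orders["order_month"] - orders["cohort"]).apply(lambda d: d.n)

counts = orders.groupby(["cohort", "period"])["customer_id"].nunique().unstack(fill_value=0)
retention = counts.div(counts[0], axis=0)  # normalize by cohort size
print(retention)
```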
- Beyond-Charts-Interactive-Storytelling
  Code + essay:
  - RFM segmentation, cohort tracking, user lifecycle
  - Interactive views that adapt to user selections
  - Conceptual guide on how to build narrative dashboards
  For people who want to turn dashboards into decision tools.
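A minimal sketch of the RFM-scoring piece (the transactions and the 1–3 rank scale are illustrative; real datasets usually use quintiles via pd.qcut):

```python
# RFM: score customers by recency, frequency, and monetary value.
import pandas as pd

tx = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "order_date": pd.to_datetime(["2024-03-01", "2024-03-20", "2024-01-15",
                                  "2024-02-10", "2024-03-05", "2024-03-25"]),
    "amount": [50, 70, 30, 20, 40, 65],
})

now = tx["order_date"].max() + pd.Timedelta(days=1)
rfm = tx.groupby("customer_id").agg(
    recency=("order_date", lambda d: (now - d.max()).days),
    frequency=("order_date", "count"),
    monetary=("amount", "sum"),
)
# Lower recency is better, so it ranks descending; the others rank ascending.
for col, ascending in [("recency", False), ("frequency", True), ("monetary", True)]:
    rfm[col[0].upper()] = rfm[col].rank(ascending=ascending, method="first").astype(int)
print(rfm)
```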
- AI-Report-Factory
  Automated reporting:
  - Input: structured data + configuration
  - Output: KPIs, visualizations, and narrative sections in Markdown/HTML
  - Uses templating to make the reporting repeatable
  Ideal for recurring reports that still need a “human-readable” style.
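The templating idea in miniature, using Jinja2 as one possible engine (the template and KPI values are illustrative assumptions, not the repo's):

```python
# Render a repeatable Markdown report from data + a template.
from jinja2 import Template

template = Template(
    "# Monthly Report\n\n"
    "Revenue: {{ revenue }}\n\n"
    "{% for line in takeaways %}- {{ line }}\n{% endfor %}"
)
report_md = template.render(revenue="$12,400",
                            takeaways=["Revenue up 8% MoM", "Churn flat"])
print(report_md)
```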
- AI-Personal-Study-Tracker
  Productivity & study analytics:
  - Streamlit interface for logging study sessions
  - SQLite backend for persistence
  - ML model (RandomForestRegressor) to predict productivity and surface patterns
  An example of feeding ML insights back to the user as personal feedback.
- Demand-Forecasting
  Classic time-series pipeline:
  - Synthetic demand & seasonality
  - ARIMA/SARIMA modeling workflow
  - Forecast evaluation and plots
  Template for demand planning and inventory decisions.
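A sketch of the SARIMA step with statsmodels; the (p, d, q)(P, D, Q, s) orders are illustrative and would normally come from ACF/PACF plots or a grid search:

```python
# Fit a SARIMA model on synthetic seasonal demand and forecast 6 months ahead.
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

idx = pd.date_range("2021-01-01", periods=48, freq="MS")
rng = np.random.default_rng(0)
demand = pd.Series(100 + 10 * np.sin(2 * np.pi * idx.month / 12)
                   + rng.normal(0, 3, 48), index=idx)

model = SARIMAX(demand, order=(1, 1, 1), seasonal_order=(1, 0, 1, 12))
fit = model.fit(disp=False)
print(fit.get_forecast(steps=6).summary_frame())  # mean + confidence intervals
```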
- ML-Playground-Autodetect
  AutoML playground:
  - Streamlit UI where you upload a dataset
  - Automatically detects classification vs regression
  - Builds sensible ML pipelines + evaluation
  Useful for teaching and quick sanity checks.
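The auto-detection step can be as simple as a dtype-and-cardinality heuristic; the 20-class threshold below is an illustrative assumption:

```python
# Guess the ML task from the target column's dtype and cardinality.
import pandas as pd

def detect_task(target: pd.Series, max_classes: int = 20) -> str:
    if target.dtype == object or str(target.dtype) == "category":
        return "classification"
    if target.nunique() <= max_classes and (target.dropna() % 1 == 0).all():
        return "classification"  # few distinct integer-like values
    return "regression"

print(detect_task(pd.Series([0, 1, 1, 0])))          # classification
print(detect_task(pd.Series([3.2, 7.9, 5.1, 6.4])))  # regression
```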
LLMs, RAG & Hybrid AI Systems
Building explainable, data-grounded LLM systems with retrieval & graphs.
- Graph-RAG-Engine
  A more structured take on RAG:
  - Vector search (FAISS) for semantic retrieval
  - Knowledge graph to add structure and relationships
  - FastAPI backend and optional Streamlit front-end
  - Emphasis on traceability and explaining why an answer was given
  Great for recommendation, research assistants, or domain-specific QA.
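A sketch of the FAISS retrieval step; the embeddings here are random stand-ins, where the real system would embed documents with a sentence encoder first:

```python
# Exact nearest-neighbor search over document embeddings with FAISS.
import numpy as np
import faiss

dim = 384
rng = np.random.default_rng(0)
doc_vecs = rng.normal(size=(1000, dim)).astype("float32")

index = faiss.IndexFlatL2(dim)  # exact L2 search; IVF/HNSW indexes scale further
index.add(doc_vecs)

query = rng.normal(size=(1, dim)).astype("float32")
distances, ids = index.search(query, 5)
print(ids[0])  # indices of the 5 nearest documents, mapped back to text later
```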
- Designing-Hybrid-AI-Systems
  Conceptual + practical:
  - How to combine vector search, knowledge graphs, and LLMs
  - Design patterns for hybrid intelligence
  - Notes on failure modes and interpretability
  A “systems thinking” view for building LLM-powered apps.
- RAG-vs-Fine-Tuning
  Decision framework:
  - When to use RAG, when to fine-tune, when to do both
  - Cost, latency, maintenance, and data constraints
  - Includes examples and architectural diagrams (where applicable)
  Helpful for teams deciding how to productionize LLMs.
Core ML, Deep Learning & Portfolio Projects
Classic ML projects done with clean structure and clear evaluation.
- Stock-LSTM-Forecasting
  Time-series forecasting with LSTMs:
  - Data preparation with sliding windows
  - PyTorch LSTM architecture
  - Loss curves + forecast vs actual plots
  Use case: financial time series, sensor data, or demand.
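A minimal sketch of the sliding-window prep plus an LSTM forecaster; the window size and layer sizes are illustrative assumptions:

```python
# Sliding windows over a univariate series feeding a one-step LSTM forecaster.
import numpy as np
import torch
import torch.nn as nn

def make_windows(series: np.ndarray, window: int = 30):
    X = np.stack([series[i : i + window] for i in range(len(series) - window)])
    y = series[window:]
    return (torch.tensor(X, dtype=torch.float32).unsqueeze(-1),
            torch.tensor(y, dtype=torch.float32))

class LSTMForecaster(nn.Module):
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                          # x: (batch, window, 1)
        out, _ = self.lstm(x)
        return self.head(out[:, -1]).squeeze(-1)   # predict the next value

X, y = make_windows(np.sin(np.linspace(0, 20, 500)))
model = LSTMForecaster()
print(model(X[:8]).shape)  # torch.Size([8])
```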
- Image-Captioning-CNN-LSTM
  Computer vision + language:
  - Pretrained ResNet as image encoder
  - LSTM decoder generating captions word by word
  - BLEU score and qualitative examples
  Classic example of multimodal ML.
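A sketch of the encoder half: a pretrained ResNet with its classifier head removed, yielding one feature vector per image for the LSTM decoder (the batch here is random stand-in data):

```python
# Pretrained ResNet-18 as an image encoder: drop the final fc layer.
import torch
import torch.nn as nn
from torchvision import models

resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
encoder = nn.Sequential(*list(resnet.children())[:-1])  # ends at avgpool
encoder.eval()

with torch.no_grad():
    images = torch.randn(4, 3, 224, 224)   # stand-in batch of images
    features = encoder(images).flatten(1)  # (4, 512) embeddings for the decoder
print(features.shape)
```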
- Sentiment-Analysis-BERT
  Transformer-based text classification:
  - Fine-tuning BERT on sentiment data (tweets)
  - Training/evaluation pipeline
  - Confusion matrix, ROC curves, and example predictions
  Template for other classification tasks with BERT-style models.
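The fine-tuning setup in miniature with Hugging Face Transformers; the checkpoint and the two example tweets are illustrative (weights download on first run):

```python
# BERT with a classification head: one forward pass yields the training loss.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

batch = tokenizer(["great movie!", "terrible service"],
                  padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])
out = model(**batch, labels=labels)  # out.loss feeds the optimizer when fine-tuning
print(out.loss.item(), out.logits.shape)
```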
- Sentiment-Analysis-NLP
  Classical ML for text:
  - Tokenization, stopword removal, lemmatization
  - TF-IDF vectorization
  - Models: Logistic Regression, Naive Bayes, Random Forest
  Shows how far you can go with “non-deep” NLP.
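The whole classical pipeline fits in a few lines; the four example texts stand in for a labeled corpus:

```python
# TF-IDF features into Logistic Regression as a single sklearn pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["loved it", "awful and boring", "fantastic acting", "waste of time"]
labels = [1, 0, 1, 0]

clf = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["pretty fantastic"]))
```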
- Movie-Recommendation-System
  Hybrid recommender:
  - Content-based filtering (TF-IDF + cosine similarity)
  - Collaborative filtering via matrix factorization
  - Evaluation with ranking metrics and examples
  Useful base for product/content recommendations.
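A sketch of the content-based half; the titles and plot descriptions are illustrative stand-ins:

```python
# Content-based recommendations: TF-IDF over plots + cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

titles = ["Space Opera", "Galactic War", "Romantic Paris", "Love in Rome"]
plots = [
    "spaceships battle across the galaxy",
    "an epic war between galactic empires",
    "two lovers meet in Paris",
    "a romance blossoms in Rome",
]

tfidf = TfidfVectorizer().fit_transform(plots)
sim = cosine_similarity(tfidf)
best = sim[0].argsort()[::-1][1]  # most similar to title 0, excluding itself
print(f"Because you watched {titles[0]!r}: {titles[best]!r}")
```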
- LSTM-Time-Series-Forecasting
  Generic LSTM-forecast template:
  - Works on many univariate series
  - Clear code structure and visualizations
  Good starting point for experimenting with sequence models.
- Handwritten-Digit-GAN
  Generative modeling:
  - DCGAN on MNIST
  - Training loop, generated samples, and latent space interpolation
  Intro to generative models in vision.
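The generator half of a DCGAN for 28×28 digits, as a minimal sketch (layer sizes and latent dimension are illustrative assumptions):

```python
# DCGAN-style generator: latent vector -> 28x28 single-channel image.
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, z_dim: int = 100):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(z_dim, 128, 7, 1, 0),  # -> 7x7
            nn.BatchNorm2d(128), nn.ReLU(True),
            nn.ConvTranspose2d(128, 64, 4, 2, 1),     # -> 14x14
            nn.BatchNorm2d(64), nn.ReLU(True),
            nn.ConvTranspose2d(64, 1, 4, 2, 1),       # -> 28x28
            nn.Tanh(),
        )

    def forward(self, z):
        return self.net(z.view(z.size(0), -1, 1, 1))

z = torch.randn(16, 100)
print(Generator()(z).shape)  # torch.Size([16, 1, 28, 28])
```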
Explainability, Thinking Like a Data Scientist & AI Philosophy
Code + essays about how we think about and interpret AI systems.
- Shap-Mini
  Minimal SHAP explainability demo:
  - Tabular ML model (tree-based)
  - Global and local SHAP plots
  - Good for teaching “why did the model decide this?”
  Brings explainability down to a small, digestible example.
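The whole global/local workflow in a few lines; the California housing data is a stand-in for the repo's dataset:

```python
# SHAP on a tree model: per-row attributions plus a global summary plot.
import shap
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor

X, y = fetch_california_housing(return_X_y=True, as_frame=True)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X.iloc[:200])  # local attributions per row
shap.summary_plot(shap_values, X.iloc[:200])       # global view of feature impact
```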
- Think-Like-a-Data-Scientist
  Long-form essay:
  - Framing questions, hypotheses, and experiments
  - How to move from raw data → insight → action
  - Balancing rigor with storytelling
  More about mindset than code.
- Forecasting-The-Future-of-Forecasting
  Strategic perspective:
  - How forecasting tools shape decisions
  - Interactive foresight vs static point estimates
  - Reflections on feedback loops and reflexive systems
- The-Future-of-Interactive-ML
  Why interactivity matters:
  - Benefits of human-in-the-loop ML
  - Examples with Streamlit-style interfaces
  - How UI/UX changes modeling choices
- Algorithmic-Empath-Human-Fallibility
  Ethics & “algorithmic empathy”:
  - Modeling human mistakes, uncertainty, and disagreement
  - Thinking beyond accuracy: fairness, robustness, trust
  Explores what it means for algorithms to “understand” humans.
- Measuring-The-Soul-of-Data
  Philosophy of data realism:
  - What makes data feel “alive” or “authentic”
  - Exploring relationships, diversity, and subtle patterns
  - Especially in the context of synthetic vs organic data
- Quiet-Machines-Minimalist-AI
  Minimalist AI:
  - Preference for quiet, non-intrusive, human-respecting AI systems
  - Thoughts on attention, overload, and calm technology
Smart Contracts, Security & Risk Analytics
Applying ML & static analysis ideas to smart contracts and risk.
- Python-Solidity-Feature-Engineering
  Feature extraction for Solidity contracts:
  - Parse Solidity code using Python tooling
  - Extract structural & semantic features (complexity, patterns)
  - Basis for risk models, security classification, or audit support
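A minimal sketch of the idea using plain regex over Solidity source; the feature set and patterns are illustrative assumptions, not the repo's parser:

```python
# Regex-based structural features from Solidity source for downstream risk models.
import re

def extract_features(source: str) -> dict:
    return {
        "n_functions": len(re.findall(r"\bfunction\s+\w+", source)),
        "n_payable": len(re.findall(r"\bpayable\b", source)),
        "n_external_calls": len(re.findall(r"\.call\s*\(|\.delegatecall\s*\(", source)),
        "uses_tx_origin": bool(re.search(r"\btx\.origin\b", source)),
        "loc": source.count("\n") + 1,
    }

sample = "contract Wallet { function pay() public payable { tx.origin; } }"
print(extract_features(sample))
```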
- Smart-Contract-Risk-Analyzer
  Static analysis plus heuristics:
  - Identify risky patterns in smart contracts
  - Compute risk scores or categories
  - Designed as a step toward automated security triage
If you’re just browsing and want a quick sense of my work, start here:
- Graph-RAG-Engine
- Forecast-Factory
- Synthetic-Data-Artist
- Data-Storytelling-Dashboard
- Sentiment-Analysis-BERT
- Think-Like-a-Data-Scientist
“AI is not just about models; it’s about systems that solve real problems for real people.”






