A comprehensive machine learning framework for analyzing cryptocurrency trading bot performance and on-chain metrics using multiple data providers.
ML_driven_on-chain_metrics/
├── .env                         # API keys (secure storage)
├── .gitignore                   # Protects sensitive data
├── README.md                    # This file
├── requirements.txt             # Python dependencies
├── example_usage.py             # Quick start examples
├── ideas.txt                    # Project roadmap and ML strategies
├── notebooks/
│   └── Initial_EDA.ipynb        # Main analysis notebook
├── src/
│   ├── __init__.py
│   ├── pipeline.py              # Automated data collection
│   └── data_providers/          # Multi-provider API clients
│       ├── __init__.py
│       ├── base.py              # Abstract base class
│       ├── dune.py              # Dune Analytics client
│       ├── hyperliquid.py       # Hyperliquid DEX client
│       └── factory.py           # Provider factory and manager
└── data/                        # Local data storage (auto-created)
    ├── raw/                     # Raw API data
    ├── processed/               # Processed datasets
    └── cache/                   # Cached responses
# Clone the repository
git clone <your-repo-url>
cd ML_driven_on-chain_metrics
# Install dependencies
pip install dune-client requests pandas python-dotenv plotly scikit-learn
# Optional: pip install xgboost lightgbm ta-lib schedule
# Set up your API keys
echo "DUNE_API_KEY=your_dune_api_key_here" > .envfrom src.data_providers import setup_providers
# Setup all providers
manager = setup_providers()
print(f"Active providers: {manager.get_active_providers()}")
# Get Dune data
dune = manager.get_provider('dune')
bot_data = dune.get_bot_volume_data()
# Get Hyperliquid data
hyperliquid = manager.get_provider('hyperliquid')
eth_data = hyperliquid.get_market_data('ETH', '1h')

# Run the bundled quick-start examples
python example_usage.py

Dune Analytics provider (src/data_providers/dune.py)
- Purpose: On-chain analytics and custom SQL queries
- Features: Query caching, rate limiting, trading bot metrics
- Auth: Requires DUNE_API_KEY in .env
Key Methods:
dune.get_query_result(query_id) # Execute any Dune query
dune.get_bot_volume_data() # Your specific bot data
dune.clear_cache()                       # Clear query cache

Hyperliquid provider (src/data_providers/hyperliquid.py)
- Purpose: DEX trading data and market information
- Features: OHLCV data, funding rates, order book, user trading history
- Auth: Public endpoints (no API key required for market data)
Key Methods:
hyperliquid.get_market_data('ETH', '1h') # OHLCV candlestick data
hyperliquid.get_funding_rates('ETH') # Funding rate history
hyperliquid.get_recent_trades('ETH') # Recent trade data
hyperliquid.get_user_fills(user_address)  # User trading history

Adding a new provider:
- Create a provider class inheriting from BaseDataProvider
- Implement the required methods: _get_auth_headers(), get_market_data(), validate_connection()
- Register it with the factory: DataProviderFactory.register_provider('name', YourProvider)
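As an illustration of those three steps, here is a hedged sketch of a minimal provider. The import paths and the BaseDataProvider method signatures are assumptions (check src/data_providers/base.py for the real contract); the Binance endpoints are only an example target, not something this repo ships.

# Hypothetical provider sketch: adapt the signatures to the real BaseDataProvider.
import requests
import pandas as pd
from src.data_providers.base import BaseDataProvider        # assumed import path
from src.data_providers.factory import DataProviderFactory  # assumed import path

class BinanceProvider(BaseDataProvider):
    BASE_URL = "https://api.binance.com"

    def _get_auth_headers(self):
        # Public market data needs no auth; return API-key headers here if required.
        return {}

    def get_market_data(self, symbol, interval="1h"):
        # Fetch OHLCV candles and return them as a DataFrame.
        resp = requests.get(
            f"{self.BASE_URL}/api/v3/klines",
            params={"symbol": f"{symbol}USDT", "interval": interval, "limit": 500},
            headers=self._get_auth_headers(),
            timeout=10,
        )
        resp.raise_for_status()
        cols = ["open_time", "open", "high", "low", "close", "volume"]
        return pd.DataFrame([row[:6] for row in resp.json()], columns=cols)

    def validate_connection(self):
        # Lightweight health check against a public endpoint.
        return requests.get(f"{self.BASE_URL}/api/v3/ping", timeout=5).ok

# Register under an arbitrary name so manager.get_provider('binance') can find it.
DataProviderFactory.register_provider("binance", BinanceProvider)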
from src.pipeline import run_data_collection
summary = run_data_collection()

from src.pipeline import start_automated_collection
start_automated_collection(interval_minutes=60)  # Collect every hour

- Automatic data collection from all active providers
- Local storage in Parquet format for efficiency
- Data consolidation and basic preprocessing
- Collection logging and statistics
- Rate limit compliance across all providers
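If you would rather drive collection yourself instead of calling start_automated_collection, a rough equivalent using the optional schedule package (listed in the dependencies) might look like the sketch below. This is illustrative only and not how the pipeline is implemented internally.

# Illustrative alternative to start_automated_collection() using `schedule`.
import time
import schedule
from src.pipeline import run_data_collection

schedule.every(60).minutes.do(run_data_collection)  # collect every hour

while True:
    schedule.run_pending()
    time.sleep(30)  # check for due jobs twice a minute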
- Feature Engineering: Price movements, volume ratios, technical indicators
- Classification: Profitable vs unprofitable trading periods
- Visualization: Interactive Plotly charts and dashboards
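A minimal sketch of that workflow is shown below. It assumes the OHLCV DataFrame from the quick-start example (eth_data) has 'close' and 'volume' columns; both the features and the next-bar "profitable period" label are placeholders, not the project's actual definitions.

# Illustrative feature engineering + classification; column names and the
# label definition are assumptions for the sake of the example.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def make_features(df):
    out = pd.DataFrame(index=df.index)
    out["return_1"] = df["close"].pct_change()                             # price movement
    out["volume_ratio"] = df["volume"] / df["volume"].rolling(24).mean()   # volume vs 24-bar average
    out["volatility"] = out["return_1"].rolling(24).std()                  # simple technical indicator
    out["label"] = (df["close"].shift(-1) > df["close"]).astype(int)       # next bar up = 1
    return out.dropna()

feats = make_features(eth_data)  # eth_data from the quick-start example above
X, y = feats.drop(columns="label"), feats["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=False, test_size=0.2)  # keep temporal order
clf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)
print(f"Hold-out accuracy: {clf.score(X_test, y_test):.2f}")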
# Core ML
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Advanced ML (optional)
import xgboost as xgb
import lightgbm as lgb
# Technical Analysis
import ta
# Time Series
from statsmodels.tsa.arima.model import ARIMA

- raw/dune/: Raw blockchain data from Dune Analytics
- raw/hyperliquid/: Raw DEX data from Hyperliquid
- raw/backup/: Critical dataset backups
- processed/daily/: Daily aggregated metrics
- processed/hourly/: Hourly features for real-time analysis
- processed/features/: ML-ready feature datasets
- cache/: Temporary processing files
- cache/api_responses/: Cached API calls (1-hour TTL)
- models/: Trained ML models and scalers
- exports/: Clean datasets for sharing
- metadata/: Data schemas and quality reports
- Raw: {source}_{dataset}_{YYYYMMDD_HHMMSS}.parquet
- Processed: {feature_type}_{timeframe}_{YYYYMMDD}.parquet
- Models: {model_type}_{version}_{YYYYMMDD}.pkl
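For example, a raw Dune pull saved at 09:30 on 5 January 2024 would land at data/raw/dune/dune_bot_volume_20240105_093000.parquet. A small helper along these lines keeps names consistent; it is an illustrative sketch, not the pipeline's actual code.

# Illustrative helper that follows the raw-file naming convention above.
from datetime import datetime
from pathlib import Path
import pandas as pd

def save_raw(df, source, dataset, root="data/raw"):
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    path = Path(root) / source / f"{source}_{dataset}_{stamp}.parquet"
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_parquet(path)  # requires pyarrow or fastparquet to be installed
    return path

# e.g. save_raw(bot_data, "dune", "bot_volume")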
- Volume Anomaly Detection - Identify unusual trading patterns
- Cross-Exchange Arbitrage - Price difference signals
- Funding Rate Momentum - Perpetual futures funding trends
- Whale Movement Detection - Large transaction analysis
- Market Sentiment Analysis - Social + on-chain signals
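As a concrete starting point for the first idea, unusual volume can be flagged with a rolling z-score. This is a hedged sketch, not the project's implemented strategy; window and threshold are arbitrary.

# Simple rolling z-score anomaly flag on volume; illustrative only.
import pandas as pd

def flag_volume_anomalies(volume, window=48, threshold=3.0):
    mean = volume.rolling(window).mean()
    std = volume.rolling(window).std()
    zscore = (volume - mean) / std
    return zscore.abs() > threshold  # True where volume deviates strongly from recent levels

# e.g. anomalies = flag_volume_anomalies(eth_data["volume"])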
- Automatic rate limiting with exponential backoff
- Connection health monitoring and automatic retries
- Comprehensive logging for debugging and monitoring
- Efficient storage using Parquet format
- Automatic caching to reduce API calls
- Data versioning with timestamps
- Memory optimization for large datasets
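The retry behaviour lives inside the providers, but for reference, exponential backoff around an HTTP call typically looks like the generic sketch below. It is illustrative only, not the code in base.py.

# Generic exponential-backoff retry for HTTP calls; the providers have their own version.
import time
import requests

def get_with_backoff(url, max_retries=5, base_delay=1.0, **kwargs):
    for attempt in range(max_retries):
        try:
            resp = requests.get(url, timeout=10, **kwargs)
            if resp.status_code != 429:  # not rate-limited
                resp.raise_for_status()
                return resp
        except requests.RequestException:
            if attempt == max_retries - 1:
                raise
        time.sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, 8s, ...
    raise RuntimeError(f"Giving up on {url} after {max_retries} attempts")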
- Environment variables for API keys
- Comprehensive .gitignore prevents key exposure
- No hardcoded credentials in any code files
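In your own scripts, keys can be read from .env with python-dotenv (already in the dependency list), for example:

# Load API keys from .env instead of hardcoding them.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env in the project root
dune_key = os.getenv("DUNE_API_KEY")
if dune_key is None:
    raise RuntimeError("DUNE_API_KEY is not set; add it to .env")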
import plotly.graph_objects as go
from plotly.subplots import make_subplots
# Create sophisticated trading dashboards
fig = make_subplots(
rows=3, cols=1,
subplot_titles=['Volume Analysis', 'Price Action', 'Bot Performance'],
shared_xaxes=True
)

- Stacked area charts for volume composition
- Candlestick charts for price analysis
- Scatter plots for correlation analysis
- Heatmaps for performance metrics
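As one example, a standalone candlestick figure for the quick-start ETH data might look like this; the column names are assumptions about the returned DataFrame.

# Candlestick chart for price analysis; assumes columns named
# 'open_time', 'open', 'high', 'low', 'close'.
import plotly.graph_objects as go

candles = go.Figure(
    go.Candlestick(
        x=eth_data["open_time"],
        open=eth_data["open"],
        high=eth_data["high"],
        low=eth_data["low"],
        close=eth_data["close"],
    )
)
candles.update_layout(title="ETH 1h price action", xaxis_rangeslider_visible=False)
candles.show()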
Use notebooks/Initial_EDA.ipynb for:
- Initial data analysis
- Feature engineering experiments
- Model prototyping
- Visualization development
Move stable code to src/ modules:
- New data providers → src/data_providers/
- ML models → src/models/ (create as needed)
- Utilities → src/utils/ (create as needed)
# Test provider connections
manager = setup_providers()
health = manager.test_all_connections()
# Validate data quality
pipeline = DataPipeline()
stats = pipeline.get_collection_stats()

dune-client>=1.2.0
requests>=2.28.0
pandas>=1.5.0
python-dotenv>=0.19.0
plotly>=5.11.0
scikit-learn>=1.1.0
xgboost>=1.6.0 # Gradient boosting
lightgbm>=3.3.0 # Fast gradient boosting
ta>=0.10.0 # Technical analysis
schedule>=1.2.0 # Job scheduling
statsmodels>=0.13.0 # Time series analysis
- Add new data providers following the BaseDataProvider pattern
- Enhance ML models with new features and algorithms
- Improve data pipeline with better processing and storage
- Add comprehensive tests for reliability
- Complete provider implementations for other exchanges (Binance, Coinbase, etc.)
- Advanced ML pipelines with automated model training and evaluation
- Real-time data streaming for live trading signals
- Web dashboard for monitoring and visualization
- Backtesting framework for strategy validation
- Dune Analytics API Documentation
- Hyperliquid API Documentation
- Plotly Python Documentation
- Scikit-learn User Guide
Phase 1: Planning & Data Collection
- Define key objectives, target coins, and success metrics (e.g., identifying coins that survive pump-and-dumps).
- Identify and connect to data sources: on-chain analytics (Dune Analytics, Glassnode), price data (CoinGecko, CryptoDataDownload), social sentiment scraping/APIs (X/Twitter, Reddit), and bot trading volume or exchange volume spikes.
- Build initial data ingestion scripts and store raw data locally or in a database.
Phase 2: Feature Engineering & Labeling
- Preprocess and clean raw data; handle missing values.
- Develop and compute composite on-chain, price, and sentiment features with appropriate time windows and aggregations.
- Create labels for supervised learning (e.g., pump-and-dump survival, price movement classification).
- Explore data visualization for feature understanding and correlation.
Phase 3: Model Development & Validation
- Train baseline models (logistic regression, random forest) with cross-validation that respects temporal order.
- Analyse feature importances and refine the feature set.
- Experiment with advanced models (gradient boosting, LSTM) for sequential pattern recognition.
- Validate with backtesting on holdout periods, focusing on real-world profitability and false positive rates.
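For the temporally ordered cross-validation mentioned above, scikit-learn's TimeSeriesSplit keeps every training fold strictly before its validation fold. The sketch below assumes a feature matrix X and labels y that are already sorted by time (for example, the ones built in the feature-engineering example earlier).

# Cross-validation that respects temporal order: train on the past, validate on the future.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

cv = TimeSeriesSplit(n_splits=5)
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(f"Per-fold AUC: {scores.round(3)}, mean: {scores.mean():.3f}")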
Phase 4: Deployment & Automation
- Build a pipeline to retrain models regularly and update features.
- Develop an alert/dashboard system highlighting buy signals and warnings from model predictions.
- Automate data collection, prediction generation, and notifications (e.g., email, message).
- Monitor model performance and drift; set up logging and error handling.