**Case Study 2: Project LANTERN**
Automating quarterly earnings report collection and analysis for Dow Jones 30 companies.
FinFlow is an automated data pipeline that streamlines the collection of quarterly earnings reports from Dow 30 companies. The system programmatically discovers investor relations pages, identifies the latest earnings reports, downloads and parses them, extracts metadata, and stores the results in cloud storage.
- Automated IR Page Discovery: Programmatically finds investor relations pages for all Dow 30 companies
- Smart Report Detection: Identifies the latest quarterly earnings reports using keywords
- Multi-format Parsing: Extracts text, tables, and charts from PDFs, HTML, and other formats
- Cloud Storage Integration: Stores raw and parsed data in Google Cloud Storage
- Airflow Orchestration: Manages the entire workflow with Apache Airflow for reliability and scheduling
```
┌─────────────────┐
│   Dow 30 List   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐      ┌──────────────────┐
│  Find IR Pages  │─────▶│   IR Page URLs   │
└────────┬────────┘      └──────────────────┘
         │
         ▼
┌─────────────────┐      ┌──────────────────┐
│  Find Reports   │─────▶│   Report URLs    │
└────────┬────────┘      └──────────────────┘
         │
         ▼
┌─────────────────┐      ┌──────────────────┐
│ Download & Parse│─────▶│   Parsed Data    │
└────────┬────────┘      └──────────────────┘
         │
         ▼
┌─────────────────┐      ┌──────────────────┐
│Extract Metadata │─────▶│ Structured Data  │
└────────┬────────┘      └──────────────────┘
         │
         ▼
┌─────────────────┐
│  Upload to GCS  │
└─────────────────┘
```
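The linear data flow in the diagram above can be sketched as plain Python. The function bodies here are hypothetical placeholders, not the actual implementations under `src/` (in production, each stage runs as an Airflow task):

```python
"""Sketch of the five pipeline stages shown in the diagram.

The function names mirror the diagram boxes; the bodies are illustrative
placeholders only. The bucket name and URL patterns are assumptions.
"""

def find_ir_pages(companies):
    # Stage 1: map each ticker to a (hypothetical) investor-relations URL.
    return {t: f"https://example.com/{t.lower()}/investor-relations" for t in companies}

def find_reports(ir_pages):
    # Stage 2: map each IR page to its latest report URL (placeholder).
    return {t: url + "/latest-earnings.pdf" for t, url in ir_pages.items()}

def download_and_parse(report_urls):
    # Stage 3: download and parse each report (placeholder content).
    return {t: {"url": url, "text": "..."} for t, url in report_urls.items()}

def extract_metadata(parsed):
    # Stage 4: attach structured metadata to each parsed document.
    return {t: {**doc, "ticker": t} for t, doc in parsed.items()}

def upload_to_gcs(structured):
    # Stage 5: in the real pipeline this writes to GCS; here we just
    # return the would-be object prefixes.
    return [f"gs://your-bucket/{t}/" for t in structured]

def run_pipeline(companies):
    # The linear dependency chain from the diagram.
    return upload_to_gcs(
        extract_metadata(download_and_parse(find_reports(find_ir_pages(companies))))
    )
```

In the Airflow DAG, the same chain is expressed as task dependencies, so each stage is retried and scheduled independently.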
- Docker Desktop
- Docker Compose
- Python 3.12+
- Google Cloud Platform account (for GCS)
- Git
- Clone the repository

  ```bash
  git clone https://github.com/yourusername/finflow.git
  cd finflow
  ```

- Build and start services

  ```bash
  docker-compose build
  docker-compose up -d
  ```

- Access the Airflow UI
  - Open a browser: http://localhost:8080
  - Username: your username
  - Password: your password
- Navigate to http://localhost:8080
- Find the `dow30_pipeline` DAG
- Toggle it to "Active" (if paused)
- Click the "Play" button to trigger a manual run
```
finflow/
├── dags/                         # Airflow DAG definitions
│   └── dow30_pipeline.py         # Main pipeline orchestration
├── src/                          # Core pipeline logic
│   ├── find_ir_pages.py          # IR page discovery
│   ├── find_latest_reports.py    # Report identification
│   ├── upload_to_cloud.py        # Cloud storage upload
│   ├── parsers/                  # Data parsing modules
│   │   └── docling_or_fallback_parser.py   # Main parsing logic with Docling
│   └── representations/          # Data structure and metadata
│       ├── metadata_builder.py   # Metadata extraction and construction
│       └── metadata_storage_formats.py     # Metadata schemas and formats
├── config/                       # Configuration files
│   └── dow30_companies.json      # Dow 30 reference list
├── data/                         # Local data storage
│   ├── raw/                      # Raw downloaded files
│   └── parsed/                   # Parsed outputs with metadata
├── logs/                         # Airflow logs
├── docker-compose.yml            # Docker services configuration
├── .airflow                      # Airflow schedule
├── .env                          # Environment variables
├── requirements.txt              # Python dependencies
└── README.md                     # This file
```
```
# Database
AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@postgres/airflow

# Celery
AIRFLOW__CELERY__RESULT_BACKEND=db+postgresql://airflow:airflow@postgres/airflow
AIRFLOW__CELERY__BROKER_URL=redis://:@redis:6379/0

# Admin credentials
_AIRFLOW_WWW_USER_USERNAME=admin
_AIRFLOW_WWW_USER_PASSWORD=admin
```
- Create a GCS bucket:

  ```bash
  gsutil mb gs://your-finflow-bucket
  ```

- Create a service account with the Storage Admin role
- Download the JSON key and save it as `gcp-key.json`
- `apache-airflow==2.10.2` - Workflow orchestration
- `requests` - HTTP requests
- `beautifulsoup4` - HTML parsing
- `pandas` - Data manipulation
- `python-dotenv` - Environment management
- `google-cloud-storage` - GCS integration
- PostgreSQL 15 - Airflow metadata database
- Redis 7 - Celery message broker
- Apache Airflow 2.10.2 - Workflow orchestration
The core parsing module that handles document processing:
- **Primary Method**: Uses Docling for advanced document understanding
- **Output**: Structured data with confidence scores and parsing metadata
Programmatically discovers investor relations pages by:
- Analyzing company websites
- Looking for common IR page patterns
- Using keywords
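The keyword-matching step can be sketched as a pure function over fetched HTML. This is a simplified stand-in for `src/find_ir_pages.py`, using only the standard library; the keyword list is an illustrative assumption:

```python
"""Sketch of keyword-based IR-link discovery on a company homepage.

Scans anchor tags in HTML and returns hrefs whose text or URL matches
common IR-page patterns. Keywords here are assumed, not the project's.
"""
from html.parser import HTMLParser

IR_KEYWORDS = ("investor relations", "investors", "shareholder", "/ir/")

class LinkCollector(HTMLParser):
    """Collects (href, link text) pairs from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, " ".join(self._text).strip()))
            self._href = None

def find_ir_links(html: str) -> list[str]:
    # Keep links whose visible text or URL contains an IR keyword.
    parser = LinkCollector()
    parser.feed(html)
    return [href for href, text in parser.links
            if any(k in (text + " " + href).lower() for k in IR_KEYWORDS)]
```

For example, `find_ir_links('<a href="/investors">Investor Relations</a>')` returns `["/investors"]`.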
Identifies the most recent quarterly earnings by:
- Scanning IR pages for report links
- Filtering by publication date
- Matching keywords like "quarterly results", "earnings release"
Downloads reports and extracts content using:
- Docling Parser: Advanced document parsing with layout understanding
- Fallback Parsers: Alternative extraction methods for edge cases
Extracts:
- Text content
- Financial tables
- Charts and images
- Document structure
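The primary/fallback pattern used by `docling_or_fallback_parser.py` might look roughly like this. The Docling call is hedged: the exact API names are assumptions, and the fallback shown is a bare-bones HTML-to-text pass using only the standard library:

```python
"""Sketch of a parse-with-fallback chain.

Tries a layout-aware Docling parse first (API names assumed); on any
failure, falls back to naive text extraction and a lower confidence.
"""
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects all non-empty text nodes from an HTML document."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        if data.strip():
            self.parts.append(data.strip())

def parse_document(source: str) -> dict:
    try:
        # Primary: Docling layout-aware parsing (call names are assumed).
        from docling.document_converter import DocumentConverter
        doc = DocumentConverter().convert(source).document
        return {"text": doc.export_to_markdown(), "method": "docling", "confidence": 0.9}
    except Exception:
        # Fallback: naive extraction for edge cases, flagged with a
        # lower confidence score in the parsing metadata.
        extractor = TextExtractor()
        extractor.feed(source)
        return {"text": "\n".join(extractor.parts), "method": "fallback", "confidence": 0.5}
```

Recording the method and confidence alongside the text lets downstream steps filter or re-parse low-confidence documents.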
Builds structured metadata using metadata_builder.py:
- Company information (ticker, name, sector)
- Report metadata (quarter, year, publication date)
- Parsing metadata (method used, confidence score)
Stores in standardized formats defined by metadata_storage_formats.py:
- JSON, nmd, txt
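A metadata record covering the three categories above might be sketched as a dataclass with JSON serialization. The field names here are illustrative assumptions, not the actual schema in `metadata_storage_formats.py`:

```python
"""Sketch of the metadata record assembled by metadata_builder.py and
serialized per metadata_storage_formats.py. Field names are assumed."""
import json
from dataclasses import dataclass, asdict

@dataclass
class ReportMetadata:
    # Company information
    ticker: str
    name: str
    sector: str
    # Report metadata
    quarter: str           # e.g. "Q2"
    year: int
    publication_date: str  # ISO date string
    # Parsing metadata
    parse_method: str      # e.g. "docling" or "fallback"
    confidence: float

    def to_json(self) -> str:
        # The JSON storage format; nmd/txt would be separate writers.
        return json.dumps(asdict(self), indent=2)

meta = ReportMetadata("AAPL", "Apple Inc.", "Technology",
                      "Q2", 2025, "2025-05-01", "docling", 0.9)
```

A dataclass keeps the schema in one place, so the JSON and text writers cannot drift apart on field names.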
Organizes and uploads to GCS:
gs://your-bucket/
βββ AAPL/
β βββ 2025Q2/
β β βββ raw data
β β βββ parsed data
β β βββ metadata
β βββ ...
βββ ...
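The prefix scheme above can be captured in a small helper. `object_path()` is a pure function over the ticker/quarter layout; `upload_file()` shows the standard `google-cloud-storage` call but the bucket name and helper names are placeholders, not the project's actual API:

```python
"""Sketch of the GCS layout shown above.

object_path() builds the ticker/quarter prefix; upload_file() wires it
to the google-cloud-storage client (not exercised here).
"""

def object_path(ticker: str, year: int, quarter: int,
                kind: str, filename: str) -> str:
    # kind is one of "raw", "parsed", "metadata", mirroring the tree.
    return f"{ticker}/{year}Q{quarter}/{kind}/{filename}"

def upload_file(bucket_name: str, local_path: str, ticker: str,
                year: int, quarter: int, kind: str, filename: str) -> None:
    # Requires GOOGLE_APPLICATION_CREDENTIALS pointing at gcp-key.json.
    from google.cloud import storage
    bucket = storage.Client().bucket(bucket_name)
    blob = bucket.blob(object_path(ticker, year, quarter, kind, filename))
    blob.upload_from_filename(local_path)
```

For example, `object_path("AAPL", 2025, 2, "raw", "report.pdf")` yields `AAPL/2025Q2/raw/report.pdf`, matching the tree above.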
The full reflection report for Team 2 - FinFlow Project is available here:
Download Team_2_reflection.docx
This document includes the project reflection, challenges, and future extensions.
- Codelabs Documentation: View Codelabs Guide
- Demo Video: Watch Project Demo