**Case Study 2: Project LANTERN**
Automating quarterly earnings report collection and analysis for Dow Jones 30 companies.
FinFlow is an automated data pipeline that streamlines the collection of quarterly earnings reports from Dow 30 companies. The system programmatically discovers investor relations pages, identifies the latest earnings reports, downloads and parses them, extracts metadata, and stores the results in cloud storage.
- Automated IR Page Discovery: Programmatically finds investor relations pages for all Dow 30 companies
- Smart Report Detection: Identifies the latest quarterly earnings reports using keywords
- Multi-format Parsing: Extracts text, tables, and charts from PDFs, HTML, and other formats
- Cloud Storage Integration: Stores raw and parsed data in Google Cloud Storage
- Airflow Orchestration: Manages the entire workflow with Apache Airflow for reliability and scheduling
```
┌─────────────────┐
│   Dow 30 List   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐      ┌──────────────────┐
│  Find IR Pages  │─────▶│   IR Page URLs   │
└────────┬────────┘      └──────────────────┘
         │
         ▼
┌─────────────────┐      ┌──────────────────┐
│  Find Reports   │─────▶│   Report URLs    │
└────────┬────────┘      └──────────────────┘
         │
         ▼
┌─────────────────┐      ┌──────────────────┐
│ Download & Parse│─────▶│   Parsed Data    │
└────────┬────────┘      └──────────────────┘
         │
         ▼
┌─────────────────┐      ┌──────────────────┐
│Extract Metadata │─────▶│ Structured Data  │
└────────┬────────┘      └──────────────────┘
         │
         ▼
┌─────────────────┐
│  Upload to GCS  │
└─────────────────┘
```
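The linear data flow in the diagram above can be sketched as plain Python. The function bodies here are hypothetical placeholders, not the actual implementations under `src/` (in production, each stage runs as an Airflow task):

```python
"""Sketch of the five pipeline stages shown in the diagram.

The function names mirror the diagram boxes; the bodies are illustrative
placeholders only. The bucket name and URL patterns are assumptions.
"""

def find_ir_pages(companies):
    # Stage 1: map each ticker to a (hypothetical) investor-relations URL.
    return {t: f"https://example.com/{t.lower()}/investor-relations" for t in companies}

def find_reports(ir_pages):
    # Stage 2: map each IR page to its latest report URL (placeholder).
    return {t: url + "/latest-earnings.pdf" for t, url in ir_pages.items()}

def download_and_parse(report_urls):
    # Stage 3: download and parse each report (placeholder content).
    return {t: {"url": url, "text": "..."} for t, url in report_urls.items()}

def extract_metadata(parsed):
    # Stage 4: attach structured metadata to each parsed document.
    return {t: {**doc, "ticker": t} for t, doc in parsed.items()}

def upload_to_gcs(structured):
    # Stage 5: in the real pipeline this writes to GCS; here we just
    # return the would-be object prefixes.
    return [f"gs://your-bucket/{t}/" for t in structured]

def run_pipeline(companies):
    # The linear dependency chain from the diagram.
    return upload_to_gcs(
        extract_metadata(download_and_parse(find_reports(find_ir_pages(companies))))
    )
```

In the Airflow DAG, the same chain is expressed as task dependencies, so each stage is retried and scheduled independently.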
- Docker Desktop
- Docker Compose
- Python 3.12+
- Google Cloud Platform account (for GCS)
- Git
- Clone the repository

  ```bash
  git clone https://github.com/yourusername/finflow.git
  cd finflow
  ```

- Build and start services

  ```bash
  docker-compose build
  docker-compose up -d
  ```

- Access the Airflow UI
  - Open a browser: http://localhost:8080
  - Username: your username
  - Password: your password
- Navigate to http://localhost:8080
- Find the `dow30_pipeline` DAG
- Toggle it to "Active" (if paused)
- Click the "Play" button to trigger a manual run
```
finflow/
├── dags/                         # Airflow DAG definitions
│   └── dow30_pipeline.py         # Main pipeline orchestration
├── src/                          # Core pipeline logic
│   ├── find_ir_pages.py          # IR page discovery
│   ├── find_latest_reports.py    # Report identification
│   ├── upload_to_cloud.py        # Cloud storage upload
│   ├── parsers/                  # Data parsing modules
│   │   └── docling_or_fallback_parser.py   # Main parsing logic with Docling
│   └── representations/          # Data structure and metadata
│       ├── metadata_builder.py   # Metadata extraction and construction
│       └── metadata_storage_formats.py     # Metadata schemas and formats
├── config/                       # Configuration files
│   └── dow30_companies.json      # Dow 30 reference list
├── data/                         # Local data storage
│   ├── raw/                      # Raw downloaded files
│   └── parsed/                   # Parsed outputs with metadata
├── logs/                         # Airflow logs
├── docker-compose.yml            # Docker services configuration
├── .airflow                      # Airflow schedule
├── .env                          # Environment variables
├── requirements.txt              # Python dependencies
└── README.md                     # This file
```
```
# Database
AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@postgres/airflow

# Celery
AIRFLOW__CELERY__RESULT_BACKEND=db+postgresql://airflow:airflow@postgres/airflow
AIRFLOW__CELERY__BROKER_URL=redis://:@redis:6379/0

# Admin credentials
_AIRFLOW_WWW_USER_USERNAME=admin
_AIRFLOW_WWW_USER_PASSWORD=admin
```
- Create a GCS bucket:

  ```bash
  gsutil mb gs://your-finflow-bucket
  ```

- Create a service account with the Storage Admin role
- Download the JSON key and save it as `gcp-key.json`
- `apache-airflow==2.10.2` - Workflow orchestration
- `requests` - HTTP requests
- `beautifulsoup4` - HTML parsing
- `pandas` - Data manipulation
- `python-dotenv` - Environment management
- `google-cloud-storage` - GCS integration
- PostgreSQL 15 - Airflow metadata database
- Redis 7 - Celery message broker
- Apache Airflow 2.10.2 - Workflow orchestration
The core parsing module that handles document processing:
- **Primary Method**: Uses Docling for advanced document understanding
- **Output**: Structured data with confidence scores and parsing metadata
Programmatically discovers investor relations pages by:
- Analyzing company websites
- Looking for common IR page patterns
- Using keywords
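The keyword-matching step can be sketched as a pure function over fetched HTML. This is a simplified stand-in for `src/find_ir_pages.py`, using only the standard library; the keyword list is an illustrative assumption:

```python
"""Sketch of keyword-based IR-link discovery on a company homepage.

Scans anchor tags in HTML and returns hrefs whose text or URL matches
common IR-page patterns. Keywords here are assumed, not the project's.
"""
from html.parser import HTMLParser

IR_KEYWORDS = ("investor relations", "investors", "shareholder", "/ir/")

class LinkCollector(HTMLParser):
    """Collects (href, link text) pairs from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, " ".join(self._text).strip()))
            self._href = None

def find_ir_links(html: str) -> list[str]:
    # Keep links whose visible text or URL contains an IR keyword.
    parser = LinkCollector()
    parser.feed(html)
    return [href for href, text in parser.links
            if any(k in (text + " " + href).lower() for k in IR_KEYWORDS)]
```

For example, `find_ir_links('<a href="/investors">Investor Relations</a>')` returns `["/investors"]`.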
Identifies the most recent quarterly earnings by:
- Scanning IR pages for report links
- Filtering by publication date
- Matching keywords like "quarterly results", "earnings release"
Downloads reports and extracts content using:
- Docling Parser: Advanced document parsing with layout understanding
- Fallback Parsers: Alternative extraction methods for edge cases
Extracts:
- Text content
- Financial tables
- Charts and images
- Document structure
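The primary/fallback pattern used by `docling_or_fallback_parser.py` might look roughly like this. The Docling call is hedged: the exact API names are assumptions, and the fallback shown is a bare-bones HTML-to-text pass using only the standard library:

```python
"""Sketch of a parse-with-fallback chain.

Tries a layout-aware Docling parse first (API names assumed); on any
failure, falls back to naive text extraction and a lower confidence.
"""
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects all non-empty text nodes from an HTML document."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        if data.strip():
            self.parts.append(data.strip())

def parse_document(source: str) -> dict:
    try:
        # Primary: Docling layout-aware parsing (call names are assumed).
        from docling.document_converter import DocumentConverter
        doc = DocumentConverter().convert(source).document
        return {"text": doc.export_to_markdown(), "method": "docling", "confidence": 0.9}
    except Exception:
        # Fallback: naive extraction for edge cases, flagged with a
        # lower confidence score in the parsing metadata.
        extractor = TextExtractor()
        extractor.feed(source)
        return {"text": "\n".join(extractor.parts), "method": "fallback", "confidence": 0.5}
```

Recording the method and confidence alongside the text lets downstream steps filter or re-parse low-confidence documents.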
Builds structured metadata using metadata_builder.py:
- Company information (ticker, name, sector)
- Report metadata (quarter, year, publication date)
- Parsing metadata (method used, confidence score)
Stores in standardized formats defined by metadata_storage_formats.py:
- JSON, nmd, txt
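A metadata record covering the three categories above might be sketched as a dataclass with JSON serialization. The field names here are illustrative assumptions, not the actual schema in `metadata_storage_formats.py`:

```python
"""Sketch of the metadata record assembled by metadata_builder.py and
serialized per metadata_storage_formats.py. Field names are assumed."""
import json
from dataclasses import dataclass, asdict

@dataclass
class ReportMetadata:
    # Company information
    ticker: str
    name: str
    sector: str
    # Report metadata
    quarter: str           # e.g. "Q2"
    year: int
    publication_date: str  # ISO date string
    # Parsing metadata
    parse_method: str      # e.g. "docling" or "fallback"
    confidence: float

    def to_json(self) -> str:
        # The JSON storage format; nmd/txt would be separate writers.
        return json.dumps(asdict(self), indent=2)

meta = ReportMetadata("AAPL", "Apple Inc.", "Technology",
                      "Q2", 2025, "2025-05-01", "docling", 0.9)
```

A dataclass keeps the schema in one place, so the JSON and text writers cannot drift apart on field names.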
Organizes and uploads to GCS:
gs://your-bucket/
βββ AAPL/
β βββ 2025Q2/
β β βββ raw data
β β βββ parsed data
β β βββ metadata
β βββ ...
βββ ...
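The prefix scheme above can be captured in a small helper. `object_path()` is a pure function over the ticker/quarter layout; `upload_file()` shows the standard `google-cloud-storage` call but the bucket name and helper names are placeholders, not the project's actual API:

```python
"""Sketch of the GCS layout shown above.

object_path() builds the ticker/quarter prefix; upload_file() wires it
to the google-cloud-storage client (not exercised here).
"""

def object_path(ticker: str, year: int, quarter: int,
                kind: str, filename: str) -> str:
    # kind is one of "raw", "parsed", "metadata", mirroring the tree.
    return f"{ticker}/{year}Q{quarter}/{kind}/{filename}"

def upload_file(bucket_name: str, local_path: str, ticker: str,
                year: int, quarter: int, kind: str, filename: str) -> None:
    # Requires GOOGLE_APPLICATION_CREDENTIALS pointing at gcp-key.json.
    from google.cloud import storage
    bucket = storage.Client().bucket(bucket_name)
    blob = bucket.blob(object_path(ticker, year, quarter, kind, filename))
    blob.upload_from_filename(local_path)
```

For example, `object_path("AAPL", 2025, 2, "raw", "report.pdf")` yields `AAPL/2025Q2/raw/report.pdf`, matching the tree above.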
The full reflection report for Team 2 - FinFlow Project is available here:
Download Team_2_reflection.docx
This document includes the project reflection, challenges, and future extensions.
- Codelabs Documentation: View Codelabs Guide
- Demo Video: Watch Project Demo