
Human-Gechi/Premierleaguewebscraping


premierleaguewebscraping

πŸ•ΈοΈ Web Scraping Pipeline with Selenium & Spreadsheet Export

📌 Project Overview

This project automates the end-to-end process of discovering web pages via keyword search, extracting tabular data from those pages using Selenium, and delivering the results as structured spreadsheets via email. Each spreadsheet reflects a single webpage, with each table saved on a separate sheet, clearly labeled by the section header found just above it on the original page.


🔄 Workflow

1. 🔍 Keyword-Based URL Discovery

  • Selenium is used to search the target website for pages containing specific keywords.
  • All matching page links are saved to selenium_scraped.txt.
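
The discovery step could be sketched roughly as below. This is a minimal illustration, not the repository's actual code: the function names `filter_links`, `collect_hrefs`, and `save_links` are hypothetical, and matching keywords against the link URL (rather than the page text) is an assumption. `collect_hrefs` needs Selenium and a local Chrome install; the other two functions use only the standard library.

```python
from urllib.parse import urljoin


def filter_links(base_url, hrefs, keywords):
    """Return absolute, de-duplicated links whose URL contains any keyword."""
    lowered = [k.lower() for k in keywords]
    seen, matches = set(), []
    for href in hrefs:
        if not href:  # anchors without an href yield None
            continue
        url = urljoin(base_url, href)  # resolve relative links
        if url not in seen and any(k in url.lower() for k in lowered):
            seen.add(url)
            matches.append(url)
    return matches


def collect_hrefs(start_url):
    """Collect every anchor href on a page with Selenium (hypothetical helper)."""
    # Imported lazily so the pure-Python helpers work without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        driver.get(start_url)
        anchors = driver.find_elements(By.TAG_NAME, "a")
        return [a.get_attribute("href") for a in anchors]
    finally:
        driver.quit()


def save_links(urls, path):
    """Write one URL per line to the links file."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(urls) + "\n")
```

Keeping the filtering separate from the browser session makes the keyword logic easy to check without launching Chrome.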

2. 🌐 Webpage Table Extraction

  • The scraper reads each URL from selenium_scraped.txt.
  • Selenium opens each page to ensure all dynamic content is fully rendered.
  • All HTML tables are located.
  • For each table, the nearest header just above it (e.g., <h2>) is used to name the corresponding spreadsheet sheet/tab.
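
One way to pair each table with the heading above it is to walk the rendered HTML in document order and remember the last heading seen. The sketch below assumes the HTML comes from Selenium's `driver.page_source`; the class `TableHeadingParser` and the function `headings_for_tables` are hypothetical names, and only the standard library is used.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class TableHeadingParser(HTMLParser):
    """Record, for each top-level <table>, the most recent heading before it."""

    def __init__(self):
        super().__init__()
        self.last_heading = None  # text of the latest closed heading
        self._buffer = None       # collects text while inside a heading
        self._table_depth = 0     # > 0 while inside a table
        self.table_names = []     # one entry per top-level table, in page order

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self._buffer = []
        elif tag == "table":
            if self._table_depth == 0:  # ignore nested tables
                self.table_names.append(self.last_heading)
            self._table_depth += 1

    def handle_endtag(self, tag):
        if tag in HEADINGS and self._buffer is not None:
            # Collapse runs of whitespace inside the heading text.
            self.last_heading = " ".join("".join(self._buffer).split())
            self._buffer = None
        elif tag == "table" and self._table_depth > 0:
            self._table_depth -= 1

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)


def headings_for_tables(html):
    """Return the heading text (or None) for each top-level table in the HTML."""
    parser = TableHeadingParser()
    parser.feed(html)
    parser.close()
    return parser.table_names
```

A table with no heading before it gets `None`, so the caller can fall back to a default sheet name.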

3. 📊 Spreadsheet Generation

  • Each webpage becomes one Excel spreadsheet.
  • Spreadsheet names are generated from the part of the URL after /comps/. For example, 'https://fbref.com/en/comps/season/1956-1957' yields the spreadsheet name season-1956-1957.
  • Each spreadsheet contains:
    • Multiple sheets (one per table)
    • Sheet names based on the section heading above each table
    • Clean formatting via pandas
  • The original URL is optionally included in the spreadsheet metadata or first sheet for traceability.
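
The naming rules above might look something like the following sketch. Both function names are hypothetical. The sheet-name cleanup is an added assumption based on Excel's own limits (at most 31 characters, none of `[ ] : * ? / \`, and names unique regardless of case); the README does not say how the project handles these cases.

```python
import re
from urllib.parse import urlparse


def spreadsheet_name(url):
    """Build a file-friendly name from the URL path after /comps/."""
    path = urlparse(url).path
    _, sep, tail = path.partition("/comps/")
    if not sep:
        raise ValueError(f"no /comps/ segment in {url!r}")
    return tail.strip("/").replace("/", "-")


def sheet_name(heading, used):
    """Make a heading safe and unique as an Excel sheet name.

    `used` is a set of lowercased names already taken in this workbook.
    """
    # Replace characters Excel rejects, then collapse whitespace.
    cleaned = re.sub(r"[\[\]:*?/\\]", " ", heading or "")
    name = " ".join(cleaned.split()) or "Sheet"
    name = name[:31]
    candidate, n = name, 1
    while candidate.lower() in used:
        n += 1
        suffix = f" ({n})"
        candidate = name[: 31 - len(suffix)] + suffix
    used.add(candidate.lower())
    return candidate
```

For example, `spreadsheet_name("https://fbref.com/en/comps/season/1956-1957")` gives `season-1956-1957`, matching the example above. The resulting names can then be passed as `sheet_name` to pandas' `DataFrame.to_excel` when writing each table.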

4. 📤 Email Delivery

  • Once all spreadsheets are generated, they are automatically emailed to designated recipients.
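
The README does not say how the email is sent, so the sketch below is only one possible approach, using Python's standard `email` and `smtplib` modules. The function names, the SMTP-over-SSL transport, and all parameters are assumptions for illustration.

```python
import mimetypes
from email.message import EmailMessage
from pathlib import Path


def build_message(sender, recipients, subject, body, attachments):
    """Build an email with each spreadsheet attached as a file."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg["Subject"] = subject
    msg.set_content(body)
    for path in map(Path, attachments):
        # Fall back to a generic binary type if the extension is unknown.
        ctype, _ = mimetypes.guess_type(path.name)
        maintype, subtype = (ctype or "application/octet-stream").split("/", 1)
        msg.add_attachment(
            path.read_bytes(),
            maintype=maintype,
            subtype=subtype,
            filename=path.name,
        )
    return msg


def send_message(msg, host, port, username, password):
    """Send the message over SMTP with SSL (assumed transport)."""
    import smtplib

    with smtplib.SMTP_SSL(host, port) as server:
        server.login(username, password)
        server.send_message(msg)
```

Building the message separately from sending it lets the attachments be inspected before any network connection is made.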

🧰 Technologies Used

  • Selenium – For automated web navigation and keyword-based link extraction
  • Pandas – For table parsing and cleanup
  • Google API services – For handling sheets
  • urllib – For parsing links and sending requests
