
Release Page Summarizer AI

A Python-based web scraper and summarization tool designed to extract, filter, and analyze data from OpenShift release stream pages. It integrates with JIRA and GitHub, supporting both issue-ID-based and username-based data retrieval and summarization, supports environment-based configuration, and includes automated testing and CI workflows.


Key Features

  • Automated Web Scraping: Extracts metadata and release information from OpenShift release page URLs, JIRA URLs, and/or GitHub URLs, and correlates related release data under matching projects.
  • Summary Generation: Creates structured summaries of release data using LLM (Local Mistral or Google Gemini).
  • Configurable Data Sources: Supports multiple backends, including GitHub (PRs, commits) and JIRA (summary and description fields of all JIRA artifact types). Fetches publicly available data only.
  • Secure Environment Management: Credentials and configuration managed via .env.
  • Test Coverage: Includes unit tests for core logic and controllers.
  • CI/CD Integration: GitHub Actions used for continuous integration and scheduled execution.
  • AI-Powered Issue Triage: Automatic issue labeling and assessment using AI to streamline issue management.
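
The correlation step groups scraped items under matching projects. As a rough illustration of the idea (not the tool's actual implementation), JIRA issue IDs and GitHub references can be bucketed by project key:

```python
from collections import defaultdict

def project_key(issue_id):
    """Extract the project key from a JIRA issue ID, e.g. 'OCPBUGS-123' -> 'OCPBUGS'."""
    return issue_id.split("-", 1)[0]

def correlate(jira_ids, github_prs):
    """Group JIRA IDs and GitHub PRs by project.

    Hypothetical sketch: github_prs are dicts with 'project' and 'url' fields;
    the real tool derives project membership from scraped metadata instead.
    """
    grouped = defaultdict(lambda: {"jira": [], "github": []})
    for issue_id in jira_ids:
        grouped[project_key(issue_id)]["jira"].append(issue_id)
    for pr in github_prs:
        grouped[pr["project"]]["github"].append(pr["url"])
    return dict(grouped)
```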

Getting Started

Prerequisites

  • Python 3.8+
  • pip package manager
  • Ollama (for local LLM) or Google API Key (for Gemini)

Environment Setup

  1. Set up required environment variables:
# Required for GitHub access
export GH_API_TOKEN=your_github_api_token_here

# Required only if using Gemini as LLM
export GOOGLE_API_KEY=your_gemini_token_here

Get your GitHub API token here: https://docs.github.com/en/rest/authentication/authenticating-to-the-rest-api

Get your Gemini API key here: https://ai.google.dev/gemini-api/docs/api-key
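
A quick preflight check can catch missing tokens before a run. A minimal sketch (the tool itself may report missing credentials differently):

```python
import os

def check_env(use_gemini=False):
    """Return the names of required environment variables that are not set."""
    required = ["GH_API_TOKEN"]
    if use_gemini:
        required.append("GOOGLE_API_KEY")
    return [name for name in required if not os.environ.get(name)]
```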

  2. Install dependencies and set up the environment:
sh setup.sh
  3. Start the LLM server (if using local Mistral):
sh start_llm.sh

For detailed LLM setup instructions, refer to LLM_SETUP.md.

Usage

The tool provides three main commands:

1. Scrape Data

# Scrape from a URL
python main.py scrape --url <release_page_url>

# Scrape from JIRA
python main.py scrape --issue-ids "ISSUE-1,ISSUE-2" --jira-server "https://jira.example.com"

# Scrape from GitHub with authentication
python main.py scrape --url <release_page_url> --github-token <token> --github-server "https://github.com"

# Enable filtering while scraping
python main.py scrape --url <url> --filter-on

2. Correlate Data

# Correlate previously scraped data
python main.py correlate

3. Generate Summaries

# Generate summaries from correlated data
python main.py summarize --url <url>

# Generate summaries with specific data sources
python main.py summarize --url <url> --issue-ids "ISSUE-1,ISSUE-2" --github-token <token>
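
The three commands are typically run in sequence: scrape, then correlate, then summarize. A sketch that assembles those invocations (build_pipeline is a hypothetical helper, not part of the tool's CLI):

```python
def build_pipeline(url, issue_ids=None):
    """Assemble the scrape -> correlate -> summarize command sequence."""
    scrape = ["python", "main.py", "scrape", "--url", url]
    if issue_ids:
        scrape += ["--issue-ids", ",".join(issue_ids)]
    return [
        scrape,
        ["python", "main.py", "correlate"],
        ["python", "main.py", "summarize", "--url", url],
    ]
```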

Common Options

  • --filter-on: Enable filtering of data based on configured rules
  • --url: URL to scrape data from
  • --issue-ids: Comma-separated list of JIRA issue IDs
  • --jira-usernames: Comma-separated list of JIRA usernames to fetch data for
  • --jira-server: JIRA server URL
  • --jira-username: JIRA username (optional)
  • --jira-password: JIRA password (optional)
  • --github-server: GitHub server URL
  • --github-token: GitHub API token
  • --github-username: GitHub username (optional)
  • --github-password: GitHub password (optional)
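
Options like --issue-ids and --jira-usernames take comma-separated values. A sketch of how such a value can be split into a list (the tool's actual parsing may differ, e.g. in whitespace handling):

```python
def parse_csv_option(raw):
    """Split a comma-separated option value into a clean list of tokens."""
    return [token.strip() for token in raw.split(",") if token.strip()]
```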

Output

Generated summaries and data will be stored in the following locations:

  • Scraped data: data/
  • Summaries: data/summaries/

Configuration

  • To disable summary generation, set SUMMARIZE_ENABLED=False in your environment.
  • For LLM configuration, refer to LLM_SETUP.md.
  • Additional configuration options can be found in the config/ directory.
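
A sketch of how a boolean flag like SUMMARIZE_ENABLED can be read from the environment (the tool's actual parsing and default may differ):

```python
import os

def env_flag(name, default=True):
    """Interpret an environment variable as a boolean flag."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() not in ("false", "0", "no", "off")
```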

Issue Management

The repository uses AI-powered issue triage to automatically categorize and label new issues:

Issue Labels

  • bug: Bug reports and software issues
  • enhancement: Feature requests and improvements
  • question: General questions and help requests
  • documentation: Documentation-related issues
  • dependencies: Package and dependency-related issues
  • security: Security-related concerns

When a new issue is created:

  1. AI automatically analyzes the issue content
  2. Appropriate labels are applied based on the analysis
  3. A comment is added with the AI assessment
  4. Maintainers are notified for review

The AI assessment helps prioritize and route issues to the appropriate team members more efficiently.
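
A rough illustration of the labeling idea, using a keyword heuristic in place of the actual AI analysis (the label names come from the list above; the matching rules are hypothetical):

```python
LABEL_KEYWORDS = {
    "bug": ("error", "crash", "traceback", "broken"),
    "enhancement": ("feature", "improve", "support for"),
    "question": ("how do i", "how to", "?"),
    "documentation": ("readme", "docs", "documentation"),
    "dependencies": ("dependency", "requirements.txt", "upgrade"),
    "security": ("vulnerability", "cve", "token leak"),
}

def suggest_labels(issue_text):
    """Suggest labels whose keywords appear in the issue text."""
    text = issue_text.lower()
    return sorted(
        label
        for label, keywords in LABEL_KEYWORDS.items()
        if any(keyword in text for keyword in keywords)
    )
```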

