A Python-based web scraper and summarization tool for extracting, filtering, and analyzing data from OpenShift release stream pages. The tool integrates with JIRA and GitHub, supporting both issue-ID-based and username-based data retrieval and summarization. It supports environment-based configuration and includes automated testing and CI workflows.
- Automated Web Scraping: Extracts metadata and release information from OpenShift release page, JIRA, and/or GitHub URLs, and correlates related release data under matching projects.
- Summary Generation: Creates structured summaries of release data using an LLM (local Mistral or Google Gemini).
- Configurable Data Sources: Supports multiple backends including GitHub (PRs, Commits) and JIRA (summary and description fields of all JIRA Artifacts). Fetches publicly available data only.
- Secure Environment Management: Credentials and configuration managed via `.env`.
- Test Coverage: Includes unit tests for core logic and controllers.
- CI/CD Integration: GitHub Actions used for continuous integration and scheduled execution.
- AI-Powered Issue Triage: Automatic issue labeling and assessment using AI to streamline issue management.
- Python 3.8+
- pip package manager
- Ollama (for local LLM) or Google API Key (for Gemini)
- Set up the required environment variables:

```sh
# Required for GitHub access
export GH_API_TOKEN=your_github_api_token_here

# Required only if using Gemini as the LLM
export GOOGLE_API_KEY=your_gemini_token_here
```

Get your GitHub API token here: https://docs.github.com/en/rest/authentication/authenticating-to-the-rest-api
Get your Gemini API key here: https://ai.google.dev/gemini-api/docs/api-key
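Before running the tool, you can sanity-check that these variables are set. A minimal sketch (the helper below is illustrative and not part of the tool itself; only the variable names come from this README):

```python
import os

def missing_env_vars(use_gemini: bool = False) -> list:
    """Return the names of required environment variables that are not set."""
    required = ["GH_API_TOKEN"]            # always needed for GitHub access
    if use_gemini:
        required.append("GOOGLE_API_KEY")  # only needed for the Gemini backend
    return [name for name in required if not os.environ.get(name)]

if __name__ == "__main__":
    missing = missing_env_vars(use_gemini=True)
    if missing:
        print("Missing environment variables:", ", ".join(missing))
```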
- Install dependencies and set up the environment:

```sh
sh setup.sh
```

- Start the LLM server (if using local Mistral):

```sh
sh start_llm.sh
```

For detailed LLM setup instructions, refer to LLM_SETUP.md.
The tool provides three main commands:
```sh
# Scrape from a URL
python main.py scrape --url <release_page_url>

# Scrape from JIRA
python main.py scrape --issue-ids "ISSUE-1,ISSUE-2" --jira-server "https://jira.example.com"

# Scrape from GitHub with authentication
python main.py scrape --url <release_page_url> --github-token <token> --github-server "https://github.com"

# Enable filtering while scraping
python main.py scrape --url <url> --filter-on

# Correlate previously scraped data
python main.py correlate

# Generate summaries from correlated data
python main.py summarize --url <url>

# Generate summaries with specific data sources
python main.py summarize --url <url> --issue-ids "ISSUE-1,ISSUE-2" --github-token <token>
```

Available options:

- `--filter-on`: Enable filtering of data based on configured rules
- `--url`: URL to scrape data from
- `--issue-ids`: Comma-separated list of JIRA issue IDs
- `--jira-usernames`: Comma-separated list of JIRA usernames to fetch data for
- `--jira-server`: JIRA server URL
- `--jira-username`: JIRA username (optional)
- `--jira-password`: JIRA password (optional)
- `--github-server`: GitHub server URL
- `--github-token`: GitHub API token
- `--github-username`: GitHub username (optional)
- `--github-password`: GitHub password (optional)
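As a sketch of how a comma-separated value such as `--issue-ids "ISSUE-1,ISSUE-2"` is typically consumed, the hypothetical helper below splits and trims the list (illustrative only; this is not the tool's actual parser):

```python
def parse_issue_ids(raw: str) -> list:
    """Split a comma-separated string of issue IDs into a clean list."""
    # Drop empty fragments and surrounding whitespace around each ID.
    return [part.strip() for part in raw.split(",") if part.strip()]

print(parse_issue_ids("ISSUE-1, ISSUE-2"))
```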
Generated summaries and data will be stored in the following locations:
- Scraped data: `data/`
- Summaries: `data/summaries/`

- To disable summary generation, set `SUMMARIZE_ENABLED=False` in your environment.
- For LLM configuration, refer to LLM_SETUP.md.
- Additional configuration options can be found in the `config/` directory.
The repository uses AI-powered issue triage to automatically categorize and label new issues:
- `bug`: Bug reports and software issues
- `enhancement`: Feature requests and improvements
- `question`: General questions and help requests
- `documentation`: Documentation-related issues
- `dependencies`: Package and dependency-related issues
- `security`: Security-related concerns
When a new issue is created:
- AI automatically analyzes the issue content
- Appropriate labels are applied based on the analysis
- A comment is added with the AI assessment
- Maintainers are notified for review
The AI assessment helps prioritize and route issues to the appropriate team members more efficiently.
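To make the label taxonomy concrete, here is a toy classifier that maps issue text to one of the labels listed above. Plain keyword matching stands in for the real AI analysis; the keyword sets and the fallback label are assumptions made for this sketch:

```python
# Keyword lookup standing in for the AI triage model (illustrative only).
LABEL_KEYWORDS = {
    "bug": ("crash", "error", "broken", "traceback"),
    "enhancement": ("feature request", "improve", "would be nice"),
    "documentation": ("readme", "docs", "typo"),
    "dependencies": ("upgrade", "requirements", "package version"),
    "security": ("vulnerability", "cve", "exploit"),
}

def suggest_label(issue_text: str) -> str:
    """Return the first matching label, or "question" when nothing matches."""
    lowered = issue_text.lower()
    for label, keywords in LABEL_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return label
    return "question"  # fallback bucket for general questions
```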