HCG is a standalone HeartBioPortal module for scraping cardiovascular guideline PDFs and converting them into structured JSON artifacts.
The repository ships the current ACC/AHA extraction corpus and now includes a dataset-aware scraper/sync pipeline for:
- `acc_aha`: ACC guideline discovery on acc.org, with browser-backed PDF resolution for JACC-hosted files.
- `esc`: ESC guideline discovery on escardio.org, with article-first capture from the linked Oxford Academic guideline pages.
- `src/hcg`: Python package with the scraper, OpenAI extractor, release builder, schemas, and CLI.
- `data/acc_aha/source_pdfs`: ACC/AHA guideline PDFs and methodology PDFs used for the current extraction run.
- `data/acc_aha/openai_outputs`: Raw page-level JSON and aggregated document JSON from the current OpenAI run.
- `data/acc_aha/manual_gene_review.json`: Human-reviewed title-to-gene mappings used by the release builder.
- `data/acc_aha/releases/heartbioportal_guideline_json_release_2026-03-16`: Current HeartBioPortal handoff artifact.
- `data/reference/gene_names.json`: Canonical gene reference used during normalization.
- `data/esc`: ESC dataset workspace for scraped PDFs, scraper manifests, and extracted outputs.
- `docs/project_audit.md`: Current project audit and remaining caveats.
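To illustrate how a human-reviewed title-to-gene mapping like `data/acc_aha/manual_gene_review.json` could feed into a release build, here is a minimal sketch. The flat `{title: [gene symbols]}` JSON shape and the helper name `apply_manual_review` are assumptions for illustration, not the repository's actual schema or API.

```python
import json
from pathlib import Path


def apply_manual_review(documents: list[dict], review_path: Path) -> list[dict]:
    """Overlay human-reviewed gene lists onto extracted documents.

    Assumes a flat {guideline title: [gene symbols]} mapping; the real
    file's schema may differ.
    """
    reviewed = json.loads(review_path.read_text())
    for doc in documents:
        genes = reviewed.get(doc.get("title", ""))
        if genes is not None:
            # The reviewed mapping wins over any auto-normalized genes.
            doc["genes"] = sorted(set(genes))
    return documents
```

Documents whose titles are absent from the review file are left untouched, which is why unreviewed documents remain a curation caveat.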
- The ACC/AHA raw page set is complete and currently has 0 remaining page-level extraction errors.
- The current ACC/AHA release contains 37 document JSON files.
- The scraper/sync workflow is in place for both `acc_aha` and `esc`.
- The remaining content caveat is curation quality for the 16 ACC/AHA auto-normalized documents that do not yet have full manual review.
```
python3 -m venv .venv
source .venv/bin/activate
pip install -e .[dev]
```
```
playwright install chromium
```

On Ubuntu or other minimal Linux hosts, you may also need:

```
sudo .venv/bin/playwright install --with-deps chromium
```

`pdf2image` requires Poppler on the host system:

- macOS: `brew install poppler`
- Ubuntu/Debian: `sudo apt-get install poppler-utils`
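Because a missing system dependency otherwise surfaces only mid-run, a quick preflight check can save a failed extraction. A minimal standard-library sketch (the helper name `tool_available` is ours; `pdftoppm` is the Poppler binary that `pdf2image` shells out to for rasterization):

```python
import shutil


def tool_available(name: str) -> bool:
    """Return True if an executable with this name is resolvable on PATH."""
    return shutil.which(name) is not None


# Poppler check: pdf2image rasterizes PDF pages via pdftoppm.
if not tool_available("pdftoppm"):
    print("Poppler not found; install poppler (macOS) or poppler-utils (Debian/Ubuntu).")
```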
Scrape both upstream sources and download any missing PDFs:
```
hcg scrape
```

Inspect live discovery without downloading files:

```
hcg scrape --datasets esc --limit 5 --dry-run
```

Run the end-to-end update flow. This scrapes the source sites, downloads missing PDFs, extracts newly downloaded pages to JSON, aggregates outputs, and rebuilds the ACC/AHA release if that dataset changed:

```
OPENAI_API_KEY=... hcg sync
```

Target a specific dataset:

```
OPENAI_API_KEY=... hcg sync --datasets esc --model gpt-5-mini
```

For ACC/AHA updates on a desktop session, prefer the visible browser mode because JACC can block headless Chromium with a Cloudflare verification page:

```
OPENAI_API_KEY=... hcg sync --datasets acc_aha --model gpt-5-mini --show-browser
```

Extract pages directly for a single dataset:

```
hcg extract --dataset acc_aha --api-key "$OPENAI_API_KEY"
```

Rerun only stored error pages:

```
OPENAI_API_KEY=... hcg extract --rerun-error-pages
```

Build the ACC/AHA release from raw outputs:

```
hcg build-release
```

Without installing the package:
```
PYTHONPATH=src python -m hcg sync --datasets all --model gpt-5-mini
```

Run the test suite and quick CLI checks:

```
pytest
python -m hcg scrape --datasets esc --limit 1 --dry-run
python -m hcg build-release
```

- ACC scraping uses Playwright because the ACC site links out to JACC-hosted documents that are not reliably downloadable through plain HTTP requests.
- If Playwright Chromium is not installed, ACC scraping now fails with a direct instruction to run `.venv/bin/playwright install chromium`.
- JACC can still block automated access behind a Cloudflare verification page, even in a visible browser. When that happens, the ACC scraper records the item as `blocked` in the manifest and continues instead of hanging.
- ESC scraping now ignores ESC declaration-of-interest attachments, follows the linked journal article, and renders the article page to PDF for extraction.
- Existing ESC PDFs that look like declaration-of-interest reports are treated as stale and replaced on the next `hcg scrape` or `hcg sync` run.
- Scraper logs are written to `data/<dataset>/scraper.log`.
- Scraper manifests are written to `data/<dataset>/scraper_manifest.json`.
- `hcg sync` and `hcg extract` now fail immediately with a clear error if `OPENAI_API_KEY` is not set.
- `hcg sync` extracts any tracked PDFs that are still missing JSON outputs, even if those PDFs were downloaded in an earlier run.
- `hcg sync` does not redownload PDFs that already exist locally and match the upstream scraper catalog.
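Because blocked JACC items are recorded in the manifest rather than raised as errors, it is worth scanning for them after a run. A sketch, assuming the manifest is a JSON list of entries with a `status` field (the actual manifest schema may differ):

```python
import json
from pathlib import Path


def blocked_items(manifest_path: Path) -> list[dict]:
    """Return manifest entries marked blocked.

    Assumed schema: a JSON list of objects such as
    {"url": ..., "status": "downloaded" | "blocked" | ...}.
    """
    entries = json.loads(manifest_path.read_text())
    return [e for e in entries if e.get("status") == "blocked"]


# Example: blocked = blocked_items(Path("data/acc_aha/scraper_manifest.json"))
```

Any entries it returns are candidates for a retry with `--show-browser` on a desktop session.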
The repository is intentionally data-heavy because it ships the exact inputs and outputs used for the current HeartBioPortal ACC/AHA guideline JSON release.
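For consumers of the handoff artifact, loading the release is a plain directory walk. A sketch, assuming the release directory holds the document JSON files at its top level (the helper name `load_release` is ours):

```python
import json
from pathlib import Path


def load_release(release_dir: Path) -> dict[str, dict]:
    """Load every document JSON in a release directory, keyed by filename."""
    return {
        p.name: json.loads(p.read_text())
        for p in sorted(release_dir.glob("*.json"))
    }


# Example:
# docs = load_release(Path(
#     "data/acc_aha/releases/heartbioportal_guideline_json_release_2026-03-16"))
```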