A free, zero-infrastructure PDF accessibility scanner.
Point it at any website and it will crawl the site for up to an hour, discover most of its PDFs, and flag common accessibility issues. It then posts a public report of the findings, all through a GitHub issue.
No servers to deploy, no accounts to configure, no software to install.
How it works in three steps:
- Fill out the web form — it creates a GitHub issue titled `SCAN: https://…` with a single click.
- GitHub Actions crawls the site for up to one hour, analyses every PDF it finds, and posts the full results as a comment on that issue.
- The issue is automatically closed once the report is ready. The report is public. Reopen the issue any time to re-run the scan.
- Crawls a website for PDF (and other document) files using Scrapy.
- Maintains a YAML manifest (`reports/manifest.yaml`) with every discovered file's URL, MD5 hash, and accessibility results. Files whose MD5 hash has not changed since the last run are skipped automatically.
- Analyses each pending PDF for the following accessibility issues (based on WCAG 2.x / EN 301 549):

  | Check | WCAG SC | Description |
  |---|---|---|
  | TaggedTest | – | Is the document tagged? |
  | EmptyTextTest | 1.4.5 | Does it contain real text (not just images)? |
  | ProtectedTest | – | Is it protected against assistive technologies? |
  | TitleTest | 2.4.2 | Does it have a title with `DisplayDocTitle` set? |
  | LanguageTest | 3.1.1 | Does it have a valid default language? |
  | BookmarksTest | 2.4.1 | For documents > 20 pages, does it have bookmarks? |
  | ImageAltTextTest | 1.1.1 | Do all Figure structure elements carry alternate text? |

- Generates reports in Markdown and JSON.
- Deletes the PDF files after analysis to keep the repository small; only the YAML manifest is committed.
The automated checks above are a first step, not a complete accessibility audit.
| Level | What it tells you |
|---|---|
| All checks pass | The document meets a basic set of machine-testable criteria. |
| veraPDF pass | The document also conforms to PDF/A or PDF/UA as verified by the leading open-source conformance checker (run separately). |
| Manual review complete | The document has been tested by a person using assistive technology — the only way to confirm true accessibility. |
Passing every automated check is good. Passing veraPDF as well is better. Manual testing is still required and will always be required. Automated tools cannot evaluate reading order, meaningful link text, appropriate use of heading levels, table header associations, or the accessibility of form fields, among other criteria.
- PDF Accessibility Checklist – Canada.ca — a practical, human-centred checklist for evaluating PDF accessibility.
- Tagged PDF Q&A – PDF Association — authoritative answers on PDF tagging from the PDF standards body.
- PDF Accessibility (pdfa11y) – Drupal.org — a Drupal module that validates PDFs for accessibility before they are published on Drupal sites. Its checks (tagging, title, language, alternate text for figures) inform the approach taken by this project. We do not duplicate the module but draw on the same underlying standards (WCAG 2.x / PDF/UA) to complement it.
How does this tool compare to commercial PDF accessibility checkers such as Clarity by CommonLook?
This project is a free, open-source proof-of-concept built on simplA11yPDFCrawler (MIT) and veraPDF. It is not intended to replace commercial tools that offer deeper validation, remediation workflows, or dedicated support. Commercial tools typically cover a wider range of checks, integrate with document-authoring workflows, and carry vendor support agreements. This tool is best suited for a quick, no-cost, first-pass audit across many PDFs at once.
How accurate is the tool at detecting PDF accessibility problems?
The automated checks cover a specific subset of machine-detectable criteria (tagging, language, title, bookmarks, image alt text, encryption — see What it does above). They can surface common structural issues quickly and at scale, but they cannot replace a manual review. Automated tools cannot evaluate reading order, meaningful link text, appropriate heading levels, table header associations, or the accessibility of form fields, among other criteria. A passing result means the document meets this basic set of machine-testable criteria — it does not certify full accessibility. Running veraPDF separately provides a more rigorous conformance check against PDF/A and PDF/UA standards.
Can the tool scan multiple domains?
Not in its current form — each scan request targets a single domain. The
architecture (GitHub Actions + YAML manifest) could be extended to support
multi-domain batching, and submitting multiple separate SCAN: issues is the
simplest workaround today. This project is not set up to replace commercial
providers, but wider-scale scanning is a potential future direction.
Can WCAG or Section 508 conformance levels be configured?
Not yet. The checks are mapped to specific WCAG 2.x success criteria (see the table in What it does), but there is no option to restrict or expand checks by conformance level. The tool exposes what the underlying open-source checkers provide; targeted conformance-level filtering is a potential future enhancement.
Why are some PDFs on a site not found?
PDFs hosted on a different domain from the one being crawled may not be discovered. The spider follows links within the target domain; PDF links that point to an external domain are not followed in the current implementation.
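The same-domain restriction can be illustrated with a tiny link filter. This is a hedged sketch of the general idea only, not the project's actual Scrapy spider code:

```python
from urllib.parse import urlparse

def should_follow(link, allowed_domain):
    """Follow only links whose host matches the crawled domain
    (or one of its subdomains)."""
    host = urlparse(link).netloc.lower()
    return host == allowed_domain or host.endswith("." + allowed_domain)
```

Under this rule, a PDF hosted on a CDN or partner domain is never visited, which is why such files do not appear in the report.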
Open the PDF Crawler form, enter a URL, and click Submit Crawl Request. You will be taken to GitHub with the issue title pre-filled as `SCAN: https://…` — just click Submit new issue to start the crawl.
The `SCAN:` prefix triggers the Crawl Site for PDFs workflow automatically.
The workflow will post a comment when the crawl starts and another comment with
the full accessibility report links when analysis is complete.
Note: Issues are processed when opened or reopened. Editing the issue body will not re-trigger a scan, so there is no risk of accidental recurring scans. The legacy `PDF-CRAWL:` prefix is still accepted for backward compatibility.
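This issue-title trigger can be expressed roughly as the following GitHub Actions condition. It is a simplified sketch; the actual `crawl.yml` may differ:

```yaml
on:
  issues:
    types: [opened, reopened]   # edits do not re-trigger a scan

jobs:
  crawl:
    if: >-
      startsWith(github.event.issue.title, 'SCAN:') ||
      startsWith(github.event.issue.title, 'PDF-CRAWL:')
    runs-on: ubuntu-latest
```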
If a crawl fails (the issue is labelled scan-failed), you can restart it by
closing and then reopening the issue. The crawler will pick up the
reopened event and start a fresh crawl.
Occasionally the analysis step does not start after a successful crawl. This
can happen when many issues are opened at the same time and GitHub's
workflow_run event queue becomes saturated. The scan then appears stuck with
a scan-in-progress label indefinitely.
The 4 – Rescue Abandoned Scans workflow detects these stuck issues automatically:
- It runs every day at 06:00 UTC.
- It can also be triggered manually via Actions → 4 – Rescue Abandoned Scans → Run workflow.
- Any issue that has been in `scan-in-progress` for more than 3 hours without a `scan-complete` or `scan-failed` label is marked `scan-failed`, and a comment is posted explaining what happened and how to retry.
- Issues that carry both `scan-in-progress` and `scan-complete` (a stale label left over from an earlier run) are silently tidied up.
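The rescue logic amounts to a simple age check on labelled issues. The sketch below is illustrative only; the real workflow queries the GitHub API for label timestamps:

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(hours=3)

def is_abandoned(labels, labelled_at, now=None):
    """True if scan-in-progress has been set for more than 3 hours
    with no terminal (complete/failed) label present."""
    now = now or datetime.now(timezone.utc)
    terminal = {"scan-complete", "scan-failed"}
    return ("scan-in-progress" in labels
            and not terminal & set(labels)
            and now - labelled_at > STALE_AFTER)
```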
| Label | Meaning |
|---|---|
| `scan-in-progress` | Crawl or analysis is currently running |
| `scan-failed` | The crawl workflow failed; reopen the issue to retry |
| `scan-complete` | Analysis finished and reports have been generated |
Issues are automatically closed once the accessibility report is posted.
Open a new issue and set the title to:
```
SCAN: https://example.com
```
Go to Actions → 1 – Crawl Site for PDFs → Run workflow and enter the URL you want to crawl.
Once the crawl finishes, the 2 – Analyse PDFs for Accessibility workflow starts automatically. You can also trigger it manually.
Large sites or sites that serve large PDFs can cause the crawl job to time out (the hard limit is 75 minutes). Use the options below to keep jobs within that budget.
Add a `Number:` line to the body of your `SCAN:` issue to cap the maximum number of pages (URLs) the spider will visit:

```
SCAN: https://example.com
Number: 200
```
The default is 2,500 pages. For sites that are large or slow, start with a lower value such as 200–500 and increase it on subsequent scans once you have a feel for the site's size.
Tip: The web form on the PDF Crawler page has a Max pages field that inserts the `Number:` line for you.
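Parsing the cap out of an issue body is straightforward. This is a hedged sketch rather than the project's exact code; the 2,500-page default comes from the text above:

```python
import re

DEFAULT_MAX_PAGES = 2500

def max_pages_from_body(body):
    """Read an optional 'Number: N' line from a SCAN: issue body."""
    match = re.search(r"^Number:\s*(\d+)\s*$", body or "", re.MULTILINE)
    return int(match.group(1)) if match else DEFAULT_MAX_PAGES
```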
When triggering the workflow manually (Actions → 1 – Crawl Site for PDFs → Run workflow) you can set:
| Input | Default | Description |
|---|---|---|
| `max_pages` | `2500` | Maximum number of URLs/pages to visit |
| `timeout` | `3600` | Maximum crawl time in seconds (1 hour) |
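These inputs correspond roughly to a `workflow_dispatch` block like the following. This is a sketch; the actual `crawl.yml` may name or type things differently:

```yaml
on:
  workflow_dispatch:
    inputs:
      max_pages:
        description: "Maximum number of URLs/pages to visit"
        default: "2500"
      timeout:
        description: "Maximum crawl time in seconds"
        default: "3600"
```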
If the job is cancelled because it exceeded the 75-minute limit:
- The workflow automatically halves the page cap (minimum: 100 pages) and writes the new `Number:` value back into the issue body.
- A comment is posted on the issue explaining what happened and showing the new cap.
- Close and reopen the issue to retry the crawl with the smaller batch.
Repeat this process until the crawl completes within the time limit.
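The retry cap follows a simple halving rule, illustrated below (an assumption about the exact arithmetic, consistent with the behaviour described above):

```python
def next_page_cap(current, floor=100):
    """Halve the page cap after a timeout, never dropping below the floor."""
    return max(floor, current // 2)
```

Starting from the default of 2,500 pages, successive timeouts would step the cap down through 1,250, 625, and so on, until it bottoms out at 100.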
- Start small – use `Number: 100` for an initial probe, then increase.
- Check the workflow logs – the Run PDF crawler step shows how many pages were visited and how many PDFs were found before the timeout.
- Large PDFs slow analysis – even a crawl of 50 pages can time out during the Analyse PDFs step if the individual files are very large. The `scan-failed` label signals an analysis failure; reopen the issue to retry.
- Sequential queue – if multiple scans are queued, use Actions → 3 – Process Scan Queue to run them one at a time instead of triggering them all simultaneously.
| Workflow | File | Trigger |
|---|---|---|
| Crawl Site for PDFs | `.github/workflows/crawl.yml` | Manual dispatch, or issue opened/reopened with a `SCAN:` title (legacy: `PDF-CRAWL:`) |
| Analyse PDFs for Accessibility | `.github/workflows/analyse.yml` | After crawl succeeds, or manual dispatch |
| Process Scan Queue | `.github/workflows/process_scan_queue.yml` | Manual dispatch |
| Rescue Abandoned Scans | `.github/workflows/rescue_abandoned_scans.yml` | Daily at 06:00 UTC, or manual dispatch |
| File | Description |
|---|---|
| `reports/manifest.yaml` | YAML tracking file – one entry per PDF |
| `reports/report.md` | Human-readable Markdown report |
| `reports/report.json` | Machine-readable JSON report |
See reports/README.md for the full manifest schema.
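A manifest entry might look something like the following. The field names here are purely illustrative; see reports/README.md for the authoritative schema:

```yaml
- url: https://example.com/files/report.pdf
  md5: 9e107d9d372bb6826bd81d3542a419d6
  analysed: 2024-05-01
  results:
    TaggedTest: pass
    TitleTest: fail
```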
```bash
# Clone the repo
git clone https://github.com/mgifford/pdf-crawler.git
cd pdf-crawler

# Set up a Python virtual environment
python3 -m venv env
source env/bin/activate

# Install dependencies
pip install -r requirements.txt

# Crawl a site (runs for up to 1 hour by default)
python scripts/crawl.py --url https://example.com

# Analyse the downloaded PDFs
python scripts/pdf_analyser.py

# Generate reports
python scripts/generate_report.py
```

```
pdf-crawler/
├── .github/
│   └── workflows/
│       ├── crawl.yml            # Step 1: crawl a site for PDFs
│       └── analyse.yml          # Step 2: analyse PDFs for accessibility
├── docs/
│   └── index.html               # GitHub Pages submission form
├── scripts/
│   ├── pdf_spider.py            # Scrapy spider (downloads PDF files)
│   ├── crawl.py                 # Crawl wrapper + manifest update
│   ├── manifest.py              # YAML manifest management (MD5 dedup)
│   ├── pdf_analyser.py          # Accessibility checks (pikepdf-based)
│   └── generate_report.py       # Markdown + JSON report generator
├── reports/
│   ├── README.md                # Manifest schema docs
│   ├── manifest.yaml            # ← committed; grows over time
│   ├── report.md                # ← committed; regenerated each run
│   └── report.json              # ← committed; regenerated each run
├── requirements.txt             # Python dependencies
└── README.md
```
This section documents all AI tools used in this project. Transparency about AI involvement is a core commitment — see SUSTAINABILITY.md for the full AI usage policy.
The following LLMs were used during development and are the only AI tools known to have been applied to this repository:
| LLM / tool | Provider | Used for |
|---|---|---|
| GitHub Copilot (GPT-4-class) | GitHub / OpenAI | Code suggestions, CI workflow improvements, PR support |
| GPT-4-class models via Copilot Chat | GitHub / OpenAI | Content drafting, structural editing, documentation |
| Claude (Anthropic) | Anthropic via GitHub Copilot Coding Agent | Automated issue resolution and code changes via the GitHub Copilot coding agent |
Each use involved human review and editing before the output was merged.
No AI runs automatically at runtime. When a crawl or analysis job executes, all processing is performed by deterministic Python scripts (`pdf_spider.py`, `pdf_analyser.py`, `generate_report.py`). No LLM is called during a scan.
No browser-based AI is enabled. The docs/index.html submission form is
a static HTML page with no runtime AI features. Browser built-in AI APIs (if
supported by the visitor's browser) are not activated by this page. Any future
use of browser AI would require explicit user opt-in per the AI usage policy in
SUSTAINABILITY.md.
- Accessibility checks are based on simplA11yPDFCrawler by SIP Luxembourg (MIT licence).
- Architecture inspired by mgifford/open-scans.
Building on simplA11yPDFCrawler, this project is released under the MIT licence.