
pdf-crawler

A free, zero-infrastructure PDF accessibility scanner. Point it at any website and it will crawl the site (a process that can take up to an hour), discover most of the PDFs it links to, flag common accessibility issues, and post a public report of the findings, all through a GitHub issue.
No servers to deploy, no accounts to configure, no software to install.

How it works in three steps:

  1. Fill out the web form — it creates a GitHub issue titled SCAN: https://… with a single click.
  2. GitHub Actions crawls the site for up to one hour, analyses every PDF it finds, and posts the full results as a comment on that issue.
  3. The issue is automatically closed once the report is ready. The report is public. Reopen the issue any time to re-run the scan.

What it does

  1. Crawls a website for PDF (and other document) files using Scrapy.

  2. Maintains a YAML manifest (reports/manifest.yaml) with every discovered file's URL, MD5 hash, and accessibility results. Files whose MD5 hash has not changed since the last run are skipped automatically.

  3. Analyses each pending PDF for the following accessibility issues (based on WCAG 2.x / EN 301 549):

    | Check | WCAG SC | Description |
    |-------|---------|-------------|
    | TaggedTest | – | Is the document tagged? |
    | EmptyTextTest | 1.4.5 | Does it contain real text (not just images)? |
    | ProtectedTest | – | Is it protected against assistive technologies? |
    | TitleTest | 2.4.2 | Does it have a title with DisplayDocTitle set? |
    | LanguageTest | 3.1.1 | Does it have a valid default language? |
    | BookmarksTest | 2.4.1 | For documents > 20 pages, does it have bookmarks? |
    | ImageAltTextTest | 1.1.1 | Do all Figure structure elements carry alternate text? |
  4. Generates reports in Markdown and JSON.

  5. Deletes the PDF files after analysis to keep the repository small; only the YAML manifest is committed.
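The MD5-based skip logic from step 2 can be sketched in a few lines. The function name and manifest shape below are illustrative only, not the project's actual `manifest.py` API:

```python
import hashlib

def file_md5(data: bytes) -> str:
    """Hex MD5 digest of a file's bytes, used purely for change detection."""
    return hashlib.md5(data).hexdigest()

def needs_analysis(manifest: dict, url: str, pdf_bytes: bytes) -> bool:
    """Return True if the file is new or has changed since the last run.

    `manifest` maps each discovered URL to its last-seen metadata;
    unchanged files are skipped, new or changed ones are (re)recorded.
    """
    entry = manifest.get(url)
    digest = file_md5(pdf_bytes)
    if entry is not None and entry.get("md5") == digest:
        return False  # unchanged since last run -> skip re-analysis
    manifest[url] = {"md5": digest}
    return True
```

Because only the digest is stored, the PDFs themselves can be deleted after analysis (step 5) without losing the ability to detect changes on the next crawl.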


Interpreting results

The automated checks above are a first step, not a complete accessibility audit.

| Level | What it tells you |
|-------|-------------------|
| All checks pass | The document meets a basic set of machine-testable criteria. |
| veraPDF pass | The document also conforms to PDF/A or PDF/UA, as verified by the leading open-source conformance checker (run separately). |
| Manual review complete | The document has been tested by a person using assistive technology: the only way to confirm true accessibility. |

Passing every automated check is good. Passing veraPDF as well is better. Manual testing is still required and will always be required. Automated tools cannot evaluate reading order, meaningful link text, appropriate use of heading levels, table header associations, or the accessibility of form fields, among other criteria.

Further reading

  • PDF Accessibility Checklist – Canada.ca — a practical, human-centred checklist for evaluating PDF accessibility.
  • Tagged PDF Q&A – PDF Association — authoritative answers on PDF tagging from the PDF standards body.
  • PDF Accessibility (pdfa11y) – Drupal.org — a Drupal module that validates PDFs for accessibility before they are published on Drupal sites. Its checks (tagging, title, language, alternate text for figures) inform the approach taken by this project. We do not duplicate the module but draw on the same underlying standards (WCAG 2.x / PDF/UA) to complement it.

Frequently asked questions

How does this tool compare to commercial PDF accessibility checkers such as Clarity by CommonLook?

This project is a free, open-source proof-of-concept built on simplA11yPDFCrawler (MIT) and veraPDF. It is not intended to replace commercial tools that offer deeper validation, remediation workflows, or dedicated support. Commercial tools typically cover a wider range of checks, integrate with document-authoring workflows, and carry vendor support agreements. This tool is best suited for a quick, no-cost, first-pass audit across many PDFs at once.

How accurate is the tool at detecting PDF accessibility problems?

The automated checks cover a specific subset of machine-detectable criteria (tagging, language, title, bookmarks, image alt text, encryption — see What it does above). They can surface common structural issues quickly and at scale, but they cannot replace a manual review. Automated tools cannot evaluate reading order, meaningful link text, appropriate heading levels, table header associations, or the accessibility of form fields, among other criteria. A passing result means the document meets this basic set of machine-testable criteria — it does not certify full accessibility. Running veraPDF separately provides a more rigorous conformance check against PDF/A and PDF/UA standards.
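As an illustration of how narrow each machine-testable criterion is, the default-language check (LanguageTest, WCAG 3.1.1) reduces to verifying that the PDF catalog's /Lang value exists and is shaped like a language tag. The sketch below is a simplified illustration, not the project's actual implementation, and the regex is far looser than a full BCP 47 parser:

```python
import re
from typing import Optional

# Illustrative sketch of LanguageTest (WCAG 3.1.1): the catalog's /Lang
# value must be present and look like a language tag, e.g. "en" or "en-US".
# This regex is NOT a full BCP 47 validator.
_LANG_RE = re.compile(r"^[A-Za-z]{2,3}(-[A-Za-z0-9]{2,8})*$")

def has_valid_default_language(lang: Optional[str]) -> bool:
    return bool(lang) and _LANG_RE.fullmatch(lang) is not None
```

A document can pass this check with a technically valid but wrong language tag, which is exactly the kind of error only manual review catches.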

Can the tool scan multiple domains?

Not in its current form — each scan request targets a single domain. The architecture (GitHub Actions + YAML manifest) could be extended to support multi-domain batching, and submitting multiple separate SCAN: issues is the simplest workaround today. This project is not set up to replace commercial providers, but wider-scale scanning is a potential future direction.

Can WCAG or Section 508 conformance levels be configured?

Not yet. The checks are mapped to specific WCAG 2.x success criteria (see the table in What it does), but there is no option to restrict or expand checks by conformance level. The tool exposes what the underlying open-source checkers provide; targeted conformance-level filtering is a potential future enhancement.

Why are some PDFs on a site not found?

PDFs hosted on a different domain from the one being crawled may not be discovered. The spider follows links within the target domain; PDF links that point to an external domain are not followed in the current implementation.
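The in-scope rule can be illustrated with a small helper. Scrapy's allowed_domains / offsite filtering applies essentially this host check; the snippet below is a standalone sketch, not the spider's actual code:

```python
from urllib.parse import urlparse

def in_scope(link: str, allowed_domain: str) -> bool:
    """True if the link's host is the allowed domain or a subdomain of it."""
    host = urlparse(link).netloc.lower()
    allowed = allowed_domain.lower()
    return host == allowed or host.endswith("." + allowed)
```

Under this rule a PDF at cdn.example.org linked from example.com is out of scope and will not be downloaded.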


Quick start

1 – Submit a crawl via the web form

Open the PDF Crawler form, enter a URL, and click Submit Crawl Request. You will be taken to GitHub with the issue title pre-filled as SCAN: https://… — just click Submit new issue to start the crawl.

The SCAN: prefix triggers the Crawl Site for PDFs workflow automatically. The workflow will post a comment when the crawl starts and another comment with the full accessibility report links when analysis is complete.

Note: Issues are processed when opened or reopened. Editing the issue body will not re-trigger a scan, so there is no risk of accidental recurring scans. The legacy PDF-CRAWL: prefix is still accepted for backward compatibility.

Restarting a failed scan

If a crawl fails (the issue is labelled scan-failed), you can restart it by closing and then reopening the issue. The crawler will pick up the reopened event and start a fresh crawl.

Abandoned scans

Occasionally the analysis step does not start after a successful crawl. This can happen when many issues are opened at the same time and GitHub's workflow_run event queue becomes saturated. The scan then appears stuck with a scan-in-progress label indefinitely.

The 4 – Rescue Abandoned Scans workflow detects these stuck issues automatically:

  • It runs every day at 06:00 UTC.
  • It can also be triggered manually via Actions → 4 – Rescue Abandoned Scans → Run workflow.
  • Any issue that has been in scan-in-progress for more than 3 hours without a scan-complete or scan-failed label is marked scan-failed and a comment is posted explaining what happened and how to retry.
  • Issues that carry both scan-in-progress and scan-complete (a stale label left over from an earlier run) are silently tidied up.
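The staleness rule above can be expressed as a small predicate. This is an illustrative sketch; the label names match the issue lifecycle, but the function itself is not the workflow's actual code:

```python
from datetime import datetime, timedelta, timezone

# An issue is abandoned if it is still scan-in-progress after 3 hours
# with no terminal (scan-complete / scan-failed) label.
STALE_AFTER = timedelta(hours=3)

def is_abandoned(labels: set, labelled_at: datetime, now: datetime) -> bool:
    if "scan-in-progress" not in labels:
        return False
    if labels & {"scan-complete", "scan-failed"}:
        return False  # a terminal label means the scan finished normally
    return now - labelled_at > STALE_AFTER
```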

Issue lifecycle

| Label | Meaning |
|-------|---------|
| scan-in-progress | Crawl or analysis is currently running |
| scan-failed | The crawl workflow failed; reopen the issue to retry |
| scan-complete | Analysis finished and reports have been generated |

Issues are automatically closed once the accessibility report is posted.

2 – Submit a crawl manually

Open a new issue and set the title to:

SCAN: https://example.com

3 – Trigger manually

Go to Actions → 1 – Crawl Site for PDFs → Run workflow and enter the URL you want to crawl.

Once the crawl finishes, the 2 – Analyse PDFs for Accessibility workflow starts automatically. You can also trigger it manually.


Limiting crawl scope

Large sites or sites that serve large PDFs can cause the crawl job to time out (the hard limit is 75 minutes). Use the options below to keep jobs within that budget.

Setting a page cap via the issue body

Add a Number: line to the body of your SCAN: issue to cap the maximum number of pages (URLs) the spider will visit:

SCAN: https://example.com

Number: 200

The default is 2,500 pages. For sites that are large or slow, start with a lower value such as 200–500 and increase it on subsequent scans once you have a feel for the site's size.

Tip: The web form on the PDF Crawler page has a Max pages field that inserts the Number: line for you.
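A sketch of how a workflow could extract the cap from the issue body (the actual parsing in the workflow may differ):

```python
import re

DEFAULT_MAX_PAGES = 2500  # default when no Number: line is present

def parse_page_cap(issue_body: str) -> int:
    """Return the page cap from a 'Number: N' line, or the default."""
    m = re.search(r"^Number:\s*(\d+)\s*$", issue_body, re.MULTILINE)
    return int(m.group(1)) if m else DEFAULT_MAX_PAGES
```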

Setting a page cap via workflow dispatch

When triggering the workflow manually (Actions → 1 – Crawl Site for PDFs → Run workflow) you can set:

| Input | Default | Description |
|-------|---------|-------------|
| max_pages | 2500 | Maximum number of URLs/pages to visit |
| timeout | 3600 | Maximum crawl time in seconds (1 hour) |

What happens when a crawl times out

If the job is cancelled because it exceeded the 75-minute limit:

  1. The workflow automatically halves the page cap (minimum: 100 pages) and writes the new Number: value back into the issue body.
  2. A comment is posted on the issue explaining what happened and showing the new cap.
  3. Close and reopen the issue to retry the crawl with the smaller batch.

Repeat this process until the crawl completes within the time limit.
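The retry sequence converges quickly because the cap is halved on each timeout, with a floor of 100 pages; as a sketch:

```python
# Cap-halving rule applied after a timed-out crawl: halve the page cap,
# never dropping below the 100-page minimum.
def next_page_cap(current_cap: int, floor: int = 100) -> int:
    return max(floor, current_cap // 2)
```

Starting from the 2,500-page default, at most a handful of retries reaches the floor (2500 → 1250 → 625 → 312 → 156 → 100).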

Tips for large or slow sites

  • Start small – use Number: 100 for an initial probe, then increase.
  • Check the workflow logs – the Run PDF crawler step shows how many pages were visited and how many PDFs were found before the timeout.
  • Large PDFs slow analysis – even a crawl of 50 pages can time out during the Analyse PDFs step if the individual files are very large. The scan-failed label signals an analysis failure; reopen the issue to retry.
  • Sequential queue – if multiple scans are queued, use Actions → 3 – Process Scan Queue to run them one-at-a-time instead of triggering them all simultaneously.

Workflows

| Workflow | File | Trigger |
|----------|------|---------|
| Crawl Site for PDFs | .github/workflows/crawl.yml | Manual dispatch, or issue opened/reopened with a SCAN: title (legacy: PDF-CRAWL:) |
| Analyse PDFs for Accessibility | .github/workflows/analyse.yml | After a crawl succeeds, or manual dispatch |
| Process Scan Queue | .github/workflows/process_scan_queue.yml | Manual dispatch |
| Rescue Abandoned Scans | .github/workflows/rescue_abandoned_scans.yml | Daily at 06:00 UTC, or manual dispatch |

Output files

| File | Description |
|------|-------------|
| reports/manifest.yaml | YAML tracking file; one entry per PDF |
| reports/report.md | Human-readable Markdown report |
| reports/report.json | Machine-readable JSON report |

See reports/README.md for the full manifest schema.


Local development

# Clone the repo
git clone https://github.com/mgifford/pdf-crawler.git
cd pdf-crawler

# Set up a Python virtual environment
python3 -m venv env
source env/bin/activate

# Install dependencies
pip install -r requirements.txt

# Crawl a site (runs for up to 1 hour by default)
python scripts/crawl.py --url https://example.com

# Analyse the downloaded PDFs
python scripts/pdf_analyser.py

# Generate reports
python scripts/generate_report.py

Architecture

pdf-crawler/
├── .github/
│   └── workflows/
│       ├── crawl.yml          # Step 1: crawl a site for PDFs
│       └── analyse.yml        # Step 2: analyse PDFs for accessibility
├── docs/
│   └── index.html             # GitHub Pages submission form
├── scripts/
│   ├── pdf_spider.py          # Scrapy spider (downloads PDF files)
│   ├── crawl.py               # Crawl wrapper + manifest update
│   ├── manifest.py            # YAML manifest management (MD5 dedup)
│   ├── pdf_analyser.py        # Accessibility checks (pikepdf-based)
│   └── generate_report.py     # Markdown + JSON report generator
├── reports/
│   ├── README.md              # Manifest schema docs
│   ├── manifest.yaml          # ← committed; grows over time
│   ├── report.md              # ← committed; regenerated each run
│   └── report.json            # ← committed; regenerated each run
├── requirements.txt           # Python dependencies
└── README.md

AI Disclosure

This section documents all AI tools used in this project. Transparency about AI involvement is a core commitment — see SUSTAINABILITY.md for the full AI usage policy.

Building the project

The following LLMs were used during development and are the only AI tools known to have been applied to this repository:

| LLM / tool | Provider | Used for |
|------------|----------|----------|
| GitHub Copilot (GPT-4-class) | GitHub / OpenAI | Code suggestions, CI workflow improvements, PR support |
| GPT-4-class models via Copilot Chat | GitHub / OpenAI | Content drafting, structural editing, documentation |
| Claude (Anthropic) | Anthropic, via the GitHub Copilot coding agent | Automated issue resolution and code changes |

Each use involved human review and editing before the output was merged.

Runtime AI usage

No AI runs automatically at runtime. When a crawl or analysis job executes, all processing is performed by deterministic Python scripts (pdf_spider.py, pdf_analyser.py, generate_report.py). No LLM is called during a scan.

Browser-based AI

No browser-based AI is enabled. The docs/index.html submission form is a static HTML page with no runtime AI features. Browser built-in AI APIs (if supported by the visitor's browser) are not activated by this page. Any future use of browser AI would require explicit user opt-in per the AI usage policy in SUSTAINABILITY.md.


Credits


Licence

Building on simplA11yPDFCrawler, this project is released under the MIT License.

About

An implementation of simplA11yPDFCrawler in GitHub Pages / Actions
