Name	Name	Last commit message	Last commit date
parent directory ..
.changeset	.changeset
bin	bin
examples	examples
src	src
tests	tests
.gitignore	.gitignore
.jscpd.json	.jscpd.json
.lenv	.lenv
.prettierignore	.prettierignore
.prettierrc	.prettierrc
CHANGELOG.md	CHANGELOG.md
Dockerfile	Dockerfile
README.md	README.md
babel.config.cjs	babel.config.cjs
docker-compose.yml	docker-compose.yml
eslint.config.js	eslint.config.js
jest.config.mjs	jest.config.mjs
package-lock.json	package-lock.json
package.json	package.json
yarn.lock	yarn.lock

web-capture (JavaScript/Node.js)

A CLI and microservice to fetch URLs and render them as:

Markdown: Converted from HTML with image extraction (default)
HTML: Rendered page content
PNG/JPEG screenshot: Viewport or full-page capture with theme support
ZIP archive: Markdown + locally downloaded images
PDF: Print-quality document with embedded images
DOCX: Word document with embedded images

Installation

From npm

npm install -g @link-assistant/web-capture

From Source

cd js
npm install

Quick Start

CLI Usage

# Capture a URL as Markdown (default format)
# Output auto-derived to ./data/web-capture/<host>/<path>/document.md
web-capture https://example.com

# Capture as Markdown to a specific file
web-capture https://example.com -o page.md

# Write to stdout explicitly
web-capture https://example.com -o -

# Capture as HTML
web-capture https://example.com --format html -o page.html

# Take a PNG screenshot
web-capture https://example.com --format png -o screenshot.png

# Take a JPEG screenshot with custom quality
web-capture https://example.com --format jpeg --quality 60 -o screenshot.jpg

# Dark theme full-page screenshot
web-capture https://example.com --format png --theme dark --fullPage -o dark.png

# Custom viewport size
web-capture https://example.com --format png --width 1920 --height 1080 -o wide.png

# Download as PDF
web-capture https://example.com --format pdf -o page.pdf

# Download as DOCX
web-capture https://example.com --format docx -o page.docx

# Create a ZIP archive (default: zip format)
web-capture https://example.com --archive

# Create a ZIP archive with specific format
web-capture https://example.com --archive zip -o site.zip
web-capture https://example.com --archive tar.gz -o site.tar.gz

# Keep images inline as base64 (opt-in)
web-capture https://example.com --embed-images -o page.md

# Custom images directory
web-capture https://example.com --images-dir assets -o page.md

# Disable specific features
web-capture https://example.com --no-extract-latex --no-post-process -o page.md

# Start as API server
web-capture --serve

# Start server on custom port
web-capture --serve --port 8080

Google Docs

# Capture the live editor model as Markdown (default --capture browser)
web-capture https://docs.google.com/document/d/DOC_ID/edit --capture browser

# Capture via the public export endpoint
web-capture https://docs.google.com/document/d/DOC_ID/edit --capture api

# Capture via the Google Docs REST API
web-capture https://docs.google.com/document/d/DOC_ID/edit --capture api --apiToken YOUR_TOKEN

API Endpoints (Server Mode)

Start the server with web-capture --serve and use the endpoints below.

HTML Endpoint

GET /html?url=<URL>&engine=<ENGINE>

Returns the raw HTML content of the specified URL.

Parameter	Required	Description	Default
`url`	Yes	URL to fetch	-
`engine`	No	Browser engine: puppeteer/playwright	puppeteer

Markdown Endpoint

GET /markdown?url=<URL>

Converts the HTML content of the specified URL to Markdown format. By default, original remote image URLs are preserved and base64 data URIs are stripped (clean single-file output). Use keepOriginalLinks=false to strip all images, or embedImages=true to keep base64 images inline.

Parameter	Required	Description	Default
`url`	Yes	URL to fetch	-
`embedImages`	No	Keep base64 images inline (`true`/`false`)	`false`
`keepOriginalLinks`	No	Keep original remote URLs, strip base64	`true`

Image Endpoint

GET /image?url=<URL>&format=png&theme=dark&width=1920&height=1080&fullPage=true

Returns a screenshot of the specified URL.

Parameter	Required	Description	Default
`url`	Yes	URL to capture	-
`engine`	No	Browser engine: puppeteer/playwright	puppeteer
`format`	No	Image format: `png` (lossless) or `jpeg`	png
`quality`	No	JPEG quality 0-100 (ignored for PNG)	80
`width`	No	Viewport width in pixels	1280
`height`	No	Viewport height in pixels	800
`fullPage`	No	Capture full scrollable page	false
`theme`	No	Color scheme: `light`, `dark`	browser default
`dismissPopups`	No	Auto-close cookie/consent popups	true

Archive Endpoint

GET /archive?url=<URL>&localImages=true&documentFormat=markdown

Returns a ZIP archive containing either document.md or document.html and asset directories (images/, css/).

Parameter	Required	Description	Default
`url`	Yes	URL to archive	-
`localImages`	No	Download images locally into the archive	`true`
`documentFormat`	No	Document format: `markdown` or `html`	`markdown`
`embedImages`	No	Keep base64 images inline in the document	`false`
`keepOriginalLinks`	No	Keep original remote URLs (skip downloading)	`false`

Archive structure (with localImages=true):

archive.zip
├── document.md       # or document.html when documentFormat=html
├── images/
│   ├── image-1.png
│   ├── image-2.jpg
│   └── ...
└── css/              # only when documentFormat=html
    ├── style-1.css
    └── ...

PDF Endpoint

GET /pdf?url=<URL>&theme=light

Returns a PDF document rendered by the browser engine (all images embedded).

Parameter	Required	Description	Default
`url`	Yes	URL to convert	-
`engine`	No	Browser engine	puppeteer
`width`	No	Viewport width in pixels	1280
`height`	No	Viewport height in pixels	800
`theme`	No	Color scheme: `light`, `dark`	browser default
`dismissPopups`	No	Auto-close popups before export	true

DOCX Endpoint

GET /docx?url=<URL>

Returns a DOCX document with embedded images.

Parameter	Required	Description	Default
`url`	Yes	URL to convert	-

Fetch/Stream Endpoints

GET /fetch?url=<URL>
GET /stream?url=<URL>

Proxy fetch and streaming proxy endpoints.

CLI Reference

Server Mode

web-capture --serve [--port <port>]

Option	Short	Description	Default
`--serve`	`-s`	Start as HTTP API server	-
`--port`	`-p`	Port to listen on	3000 (or PORT env)

Capture Mode

web-capture <url> [options]

Option	Short	Description	Default
`--format`	`-f`	Output format (see below)	`markdown`
`--output`	`-o`	Output file path. Use `-o -` for stdout	auto-derived from URL
`--data-dir`		Base directory for auto-derived output paths	`./data/web-capture`
`--engine`	`-e`	Browser engine: `puppeteer`, `playwright`	`puppeteer` (or BROWSER_ENGINE env)
`--theme`	`-t`	Color scheme: `light`, `dark`, `no-preference`	browser default
`--width`		Viewport width in pixels	1280
`--height`		Viewport height in pixels	800
`--quality`		JPEG quality 0-100	80
`--fullPage`		Capture full scrollable page	false
`--embed-images`		Keep images as inline base64 data URIs	false
`--no-extract-images`		Alias for `--embed-images`	false
`--keep-original-links`		Keep original remote URLs, strip base64	false
`--images-dir`		Subdirectory name for extracted images	`images`
`--archive`		Create archive: `zip`, `7z`, `tar.gz`, `tar`	-
`--document-format`		Document format in archive: `markdown`, `html`	`markdown`
`--localImages`		Download images locally in archive mode	true
`--extract-latex`		Extract LaTeX formulas	true
`--no-extract-latex`		Disable LaTeX extraction	-
`--extract-metadata`		Extract article metadata	true
`--no-extract-metadata`		Disable metadata extraction	-
`--post-process`		Apply post-processing	true
`--no-post-process`		Disable post-processing	-
`--detect-code-language`		Detect code block languages	true
`--no-detect-code-language`		Disable code language detection	-

Supported formats:

Format	Description
`markdown` / `md`	Markdown conversion (default)
`html`	Rendered HTML
`image` / `png`	PNG screenshot (lossless)
`jpeg`	JPEG screenshot (configurable quality)
`pdf`	PDF with embedded images
`docx`	Word document with embedded images
`archive` / `zip`	ZIP archive with markdown + local images

Configuration

Configuration values are resolved with the following priority (highest to lowest):

CLI arguments: --port 8080
Environment variables: PORT=8080
Default .lenv file: .lenv in the project root
Built-in defaults

Environment Variables

Variable	Description	Default
`PORT`	Server port	`3000`
`BROWSER_ENGINE`	Browser engine	`puppeteer`
`API_TOKEN`	API token for authenticated capture	-
`WEB_CAPTURE_DATA_DIR`	Base directory for output	`./data/web-capture`
`WEB_CAPTURE_EMBED_IMAGES`	`0`/`1` — keep images inline	`0`
`WEB_CAPTURE_KEEP_ORIGINAL_LINKS`	`0`/`1` — keep original remote URLs	`0`
`WEB_CAPTURE_IMAGES_DIR`	Subdirectory for extracted images	`images`
`WEB_CAPTURE_EXTRACT_LATEX`	`0`/`1` — extract LaTeX	`1`
`WEB_CAPTURE_EXTRACT_METADATA`	`0`/`1` — extract metadata	`1`
`WEB_CAPTURE_POST_PROCESS`	`0`/`1` — post-processing	`1`
`WEB_CAPTURE_DETECT_CODE_LANGUAGE`	`0`/`1` — detect code langs	`1`

Browser Engine Support

The service supports both Puppeteer and Playwright browser engines:

Puppeteer: Default engine, mature and well-tested
Playwright: Alternative engine with similar capabilities

Supported engine values:

puppeteer or pptr - Use Puppeteer
playwright or pw - Use Playwright

Popup/Modal Dismissal

By default, the image, PDF, and archive endpoints automatically dismiss common popups before capture:

Google Funding Choices consent dialogs
Cookie consent banners
Generic modals and overlays
Fixed-position overlays covering content

Set dismissPopups=false to disable this behavior.

Docker

# Build and run using Docker Compose
docker compose up -d

# Or manually
docker build -t web-capture-js .
docker run -p 3000:3000 web-capture-js

Development

Available Commands

npm run dev - Start the development server with hot reloading
npm run start - Start the service using Docker Compose
npm test - Run all unit tests
npm run lint - Check code with ESLint
npm run format - Format code with Prettier

Testing

npm test                    # Run unit and integration tests
npm run test:e2e            # Run end-to-end tests
npm run test:all            # Run all tests including build

Built With

Express.js for the web server
Puppeteer and Playwright for headless browser automation
Turndown for HTML to Markdown conversion
archiver for ZIP creation
docx for DOCX generation
Jest for testing

Library Usage

You can also use web-capture as a Node.js library:

import {
  fetchHtml,
  convertHtmlToMarkdown,
} from '@link-assistant/web-capture/src/lib.js';
import { createBrowser } from '@link-assistant/web-capture/src/browser.js';
import { dismissPopups } from '@link-assistant/web-capture/src/popups.js';

// Fetch and convert to markdown
const html = await fetchHtml('https://habr.com/en/articles/895896/');
const markdown = convertHtmlToMarkdown(
  html,
  'https://habr.com/en/articles/895896/'
);

// Take a themed screenshot
const browser = await createBrowser('playwright', { colorScheme: 'dark' });
const page = await browser.newPage();
await page.setViewport({ width: 1920, height: 1080 });
await page.goto('https://habr.com/en/articles/895896/', {
  waitUntil: 'networkidle0',
});
await dismissPopups(page);
const buffer = await page.screenshot({ type: 'png', fullPage: true });
await browser.close();

License

Unlicense — This is free and unencumbered software released into the public domain. You are free to copy, modify, publish, use, compile, sell, or distribute this software for any purpose, commercial or non-commercial, and by any means. See https://unlicense.org for details.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

web-capture (JavaScript/Node.js)

Installation

From npm

From Source

Quick Start

CLI Usage

Google Docs

API Endpoints (Server Mode)

HTML Endpoint

Markdown Endpoint

Image Endpoint

Archive Endpoint

PDF Endpoint

DOCX Endpoint

Fetch/Stream Endpoints

CLI Reference

Server Mode

Capture Mode

Configuration

Environment Variables

Browser Engine Support

Popup/Modal Dismissal

Docker

Development

Available Commands

Testing

Built With

Library Usage

License

FilesExpand file tree

js

Directory actions

More options

Directory actions

More options

Latest commit

History

js

Folders and files

parent directory

README.md

web-capture (JavaScript/Node.js)

Installation

From npm

From Source

Quick Start

CLI Usage

Google Docs

API Endpoints (Server Mode)

HTML Endpoint

Markdown Endpoint

Image Endpoint

Archive Endpoint

PDF Endpoint

DOCX Endpoint

Fetch/Stream Endpoints

CLI Reference

Server Mode

Capture Mode

Configuration

Environment Variables

Browser Engine Support

Popup/Modal Dismissal

Docker

Development

Available Commands

Testing

Built With

Library Usage

License