Add URL rewriting for imported SQL data #16

Closed
adamziel wants to merge 4 commits into trunk from rewrite-urls

Conversation

@adamziel
Owner

Summary

  • Adds a db-apply command that reads db.sql, rewrites URLs using wp-php-toolkit structured processors (BlockMarkupUrlProcessor, URLInTextProcessor), and executes statements against a target MySQL database
  • Domain discovery runs inline during db-sync, collecting all HTTP/HTTPS domains from decoded base64 values into .import-domains.json
  • Uses WP_MySQL_Naive_Query_Stream (vendored from WordPress/sqlite-database-integration#264) for streaming SQL parsing with cursor-based resumability

URL rewriting pipeline

For each base64-decoded string value in INSERT statements:

  1. Serialized PHP → detected via ContentClassifier (port of is_serialized()) → skipped (adjusting s:N: length prefixes is out of scope)
  2. JSON → decoded, all string values walked recursively, URLs rewritten via URLInTextProcessor, re-encoded
  3. Everything else (HTML, block markup, plain text, markdown) → wp_rewrite_urls() which handles HTML attributes, block comment JSON, text nodes, and CSS url() in style attributes
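To illustrate why step 1 skips serialized PHP rather than rewriting it in place, here is a minimal sketch. The option name and URLs are placeholders, and the "correct" path at the end is only shown for contrast; the PR explicitly leaves it out of scope.

```php
<?php
declare(strict_types=1);

// Hypothetical option value; the URLs are placeholders.
$original = serialize(['home' => 'https://source.example.com']);
// $original is: a:1:{s:4:"home";s:26:"https://source.example.com";}

// A plain text replacement changes the byte length of the URL
// but leaves the s:26: length prefix untouched.
$naive = str_replace(
    'https://source.example.com',
    'https://target.example.com/blog',
    $original
);

// The length prefix no longer matches, so unserialize() fails and returns false.
$result = @unserialize($naive);

// For contrast: a length-safe rewrite has to go through the decoded structure.
$data = unserialize($original);
$data['home'] = 'https://target.example.com/blog';
$fixed = serialize($data);
// $fixed is: a:1:{s:4:"home";s:31:"https://target.example.com/blog";}
```

Skipping these values keeps them byte-for-byte valid, at the cost of leaving their URLs unrewritten.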

No preg_match or DOMDocument — only the structured data processors from wp-php-toolkit/data-liberation.
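The three-way dispatch described above could be sketched roughly as follows. This is an assumption-laden illustration, not the code in SqlValueUrlRewriter: `rewrite_value()` and `looks_serialized()` are hypothetical names, the serialized check is a rough heuristic rather than a port of is_serialized(), and the actual text rewriter is passed in as a callable because the toolkit's API is not shown in this description.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical dispatch. $rewriteText stands in for the toolkit-backed
 * text rewriter; its real signature is not shown in this PR.
 */
function rewrite_value(string $value, callable $rewriteText): string
{
    // 1. Serialized PHP: returned untouched so length prefixes stay valid.
    if (looks_serialized($value)) {
        return $value;
    }

    // 2. JSON: decode, rewrite every string leaf, re-encode.
    $first = $value[0] ?? '';
    if ($first === '{' || $first === '[') {
        $decoded = json_decode($value, true);
        if (is_array($decoded)) {
            array_walk_recursive($decoded, function (&$leaf) use ($rewriteText) {
                if (is_string($leaf)) {
                    $leaf = $rewriteText($leaf);
                }
            });
            return json_encode($decoded, JSON_UNESCAPED_SLASHES | JSON_UNESCAPED_UNICODE);
        }
    }

    // 3. Everything else (HTML, block markup, plain text, markdown).
    return $rewriteText($value);
}

/** Rough heuristic only; NOT equivalent to WordPress's is_serialized(). */
function looks_serialized(string $value): bool
{
    $value = trim($value);
    if ($value === 'N;') {
        return true;
    }
    if (strlen($value) < 4 || $value[1] !== ':') {
        return false;
    }
    $last = substr($value, -1);
    return strpos('abdisO', $value[0]) !== false && ($last === ';' || $last === '}');
}
```

Two caveats of this simplified version: decoding to associative arrays turns an empty JSON object `{}` into `[]` on re-encode, and only values are rewritten, not object keys.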

New components

| File | Purpose |
|---|---|
| importer/lib/Base64ValueScanner.php | Scans SQL for FROM_BASE64('...') expressions, decodes values |
| importer/lib/ContentClassifier.php | Classifies values as serialized PHP, JSON, or text/HTML |
| importer/lib/DomainCollector.php | Collects unique domains from string values via URLInTextProcessor |
| importer/lib/SqlValueUrlRewriter.php | Rewrites URLs in a single value using the appropriate processor |
| importer/lib/SqlStatementRewriter.php | Orchestrates per-statement rewriting |
| importer/lib/mysql-query-stream/ | Vendored WP_MySQL_Naive_Query_Stream + WP_MySQL_Lexer |
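As a rough picture of what the scanner does, here is a simplified sketch that finds FROM_BASE64('...') expressions with plain string searches and decodes each payload. `scan_base64_values()` is a hypothetical name, and this version assumes the exact uppercase, single-quoted form; the real Base64ValueScanner may work differently (for example, on top of the lexer).

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical, simplified scanner.
 *
 * @return array<int, array{offset:int, length:int, value:string}>
 */
function scan_base64_values(string $sql): array
{
    $prefix = "FROM_BASE64('";
    $found  = [];
    $pos    = 0;

    while (($start = strpos($sql, $prefix, $pos)) !== false) {
        $payloadStart = $start + strlen($prefix);

        // The base64 alphabet contains no quotes, so the next "')" closes the call.
        $end = strpos($sql, "')", $payloadStart);
        if ($end === false) {
            break; // Truncated expression; nothing more to scan.
        }

        $payload = substr($sql, $payloadStart, $end - $payloadStart);
        $decoded = base64_decode($payload, true); // strict mode rejects invalid input
        if ($decoded !== false) {
            $found[] = [
                'offset' => $payloadStart,          // where the payload begins in $sql
                'length' => $end - $payloadStart,   // payload length in bytes
                'value'  => $decoded,
            ];
        }
        $pos = $end + 2;
    }

    return $found;
}
```

Recording the offset and length of each payload is one way to let a later step splice a re-encoded value back into the statement.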

Usage

# 1. Download SQL (domains discovered automatically)
php import.php db-sync https://source.example.com /path/to/import --secret=TOKEN

# 2. Apply with URL rewriting
php import.php db-apply - /path/to/import \
  --target-user=root --target-db=wp_new \
  --url-mapping=https://source.example.com::https://target.example.com
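The `--url-mapping` value uses `::` to separate the source and target URLs. A minimal sketch of parsing such a value might look like this; `parse_url_mapping()` is a hypothetical helper, and the validation rules are assumptions rather than what import.php actually enforces.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical parser for a FROM::TO mapping value.
 *
 * @return array{0:string, 1:string} [from, to]
 */
function parse_url_mapping(string $mapping): array
{
    $parts = explode('::', $mapping);
    if (count($parts) !== 2 || $parts[0] === '' || $parts[1] === '') {
        throw new InvalidArgumentException("Expected FROM::TO, got: {$mapping}");
    }

    foreach ($parts as $url) {
        // parse_url() returns null or false when no scheme can be extracted.
        $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
        if ($scheme !== 'http' && $scheme !== 'https') {
            throw new InvalidArgumentException("Not an HTTP(S) URL: {$url}");
        }
    }

    return [$parts[0], $parts[1]];
}
```

Note that this naive split would misparse URLs that themselves contain `::`, such as IPv6 literals like `http://[::1]/`.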

Test plan

  • 72 unit tests across 5 test files (Base64ValueScanner, ContentClassifier, DomainCollector, SqlValueUrlRewriter, SqlStatementRewriter)
  • E2E round-trip test (import-36-url-rewriting.test.js — scaffolded, needs implementation)
  • Manual test with a real WordPress SQL dump

🤖 Generated with Claude Code

@adamziel force-pushed the rewrite-urls branch 2 times, most recently from a44b435 to e0aa813, on February 17, 2026 22:55
@adamziel
Owner Author

adamziel commented Feb 18, 2026

It seems like the domain detection could break if the download stream pauses mid-query, e.g.

INSERT INTO ... VALUES
(FROM_BASE64("STRING1")),
<paused right here>

So we'll need some way of resuming from there – even if that way is oversimplified for now and requires going back to the initial INSERT. Maybe that could happen in a follow-up PR, since it's a relatively low-stakes error: in the worst case we'll report an incomplete list of detected domains, and we'll still be able to replace any listed domains when inserting the data.
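One oversimplified way to "go back to the initial INSERT" is to only advance the saved cursor past complete statements. The sketch below is hypothetical (`StatementBuffer` is not a class in this PR, and it is not how WP_MySQL_Naive_Query_Stream works): it splits on `;` outside of quotes, ignores SQL comments, and rescans the pending buffer on every chunk.

```php
<?php
declare(strict_types=1);

/** Hypothetical helper: tracks the byte offset of the last complete statement. */
final class StatementBuffer
{
    private string $pending = '';
    private int $committedOffset = 0;

    /** @return string[] Statements completed by this chunk. */
    public function append(string $chunk): array
    {
        $this->pending .= $chunk;
        $statements = [];
        $quote = null; // Current quote character, or null when outside quotes.
        $start = 0;    // Start of the statement currently being scanned.
        $len = strlen($this->pending);

        for ($i = 0; $i < $len; $i++) {
            $ch = $this->pending[$i];
            if ($quote !== null) {
                if ($ch === '\\' && $quote !== '`') {
                    $i++; // Skip the escaped byte.
                } elseif ($ch === $quote) {
                    $quote = null;
                }
            } elseif ($ch === "'" || $ch === '"' || $ch === '`') {
                $quote = $ch;
            } elseif ($ch === ';') {
                $statements[] = substr($this->pending, $start, $i - $start + 1);
                $start = $i + 1;
            }
        }

        // Only bytes belonging to complete statements are committed.
        $this->committedOffset += $start;
        $this->pending = (string) substr($this->pending, $start);

        return $statements;
    }

    /** Offset to resume the download from after a pause or crash. */
    public function committedOffset(): int
    {
        return $this->committedOffset;
    }
}
```

If the stream pauses after `(FROM_BASE64("STRING1")),`, the committed offset still points at the start of that INSERT, so a resumed download replays the whole statement instead of starting mid-query.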

It would be nice to auto-detect the source site domain and mark it for rewriting, and maybe the assets domain as well if it's separate.

Other than that, I'll need to take this for a spin.

@adamziel
Owner Author

Also – it would be cool to support downloading the media files loaded from external domains, but we don't strictly need it to land this feature.

Comment thread importer/import.php Outdated
adamziel and others added 4 commits February 26, 2026 16:58
Remove unused CONVERT_PREFIX and CONVERT_PREFIX_LEN constants from
Base64ValueScanner. Exclude vendored mysql-query-stream/ from PHPStan
analysis while keeping it in scan paths for class discovery.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Update port from 8102 to 8104 (matching site-registry.json)
- Fix URL construction: use & instead of ? for query string
  continuation (getSiteUrl already includes ?)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The url-rewriting feature depends on wp-php-toolkit/data-liberation
classes (URLInTextProcessor, WPURL) loaded via Composer autoloader.
Add composer install to setup.sh (runs in both CI and Docker) and
install Composer in the Dockerfile.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Domain discovery now persists to .import-domains.json alongside the
periodic cursor saves during db-sync. Previously domains were only
written after the full download completed, which meant a crash would
lose all discovered domains since the resumed download skips already-
downloaded SQL data.

The source site domain is also auto-detected from the export URL and
seeded into the domain collector, so it always appears in the domains
file even before SQL scanning starts.

Other fixes: update url-rewriting E2E test port to avoid collision
with file-deletions test, fix PHP 8.4 nullable parameter deprecations,
update composer.lock for wp-php-toolkit dependencies.
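
The incremental persistence and source-domain seeding described in the last commit might look roughly like the sketch below. `DomainSet` is a hypothetical stand-in for DomainCollector, and the on-disk format (a flat JSON list of hostnames) is an assumption; the actual layout of .import-domains.json is not shown here.

```php
<?php
declare(strict_types=1);

/** Hypothetical collector; not the PR's DomainCollector API. */
final class DomainSet
{
    /** @var array<string, true> Hostnames used as keys for de-duplication. */
    private array $domains = [];

    /** Records the host of a URL, e.g. the export URL passed to db-sync. */
    public function addFromUrl(string $url): void
    {
        $host = parse_url($url, PHP_URL_HOST);
        if (is_string($host) && $host !== '') {
            $this->domains[strtolower($host)] = true;
        }
    }

    /** @return string[] Sorted list of unique hostnames. */
    public function all(): array
    {
        $list = array_map('strval', array_keys($this->domains));
        sort($list);
        return $list;
    }

    /** Writes to a temp file, then renames, so a crash never leaves a partial file. */
    public function save(string $path): void
    {
        $tmp = $path . '.tmp';
        file_put_contents($tmp, json_encode($this->all(), JSON_PRETTY_PRINT));
        rename($tmp, $path);
    }
}
```

Calling `addFromUrl()` with the export URL before any SQL is scanned, and `save()` at each cursor save, would give the behavior described: the source domain is always listed, and a crash loses at most the domains found since the last save.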