WordPress import: size-variant image URLs in post content are not rewritten #645

@shinobiworks

Description

After importing a WordPress site via the /_emdash/api/import/wordpress/* endpoints, images embedded in post content still reference the original WordPress domain for URLs that include WordPress-generated size suffixes (e.g. .../image-1024x695.png).

The rewrite-urls endpoint normalizes query strings in getBaseUrl() but does not strip WordPress's -NNNxNNN size suffix, so variant URLs in content do not match the urlMap keys built from <wp:attachment_url> (which only lists originals).

Looking into this, I noticed:

  • WXR's <wp:attachment_url> contains only the original URL.
  • _wp_attachment_metadata (PHP-serialized postmeta) does contain all the generated variant filenames, but that path is not consumed during media import.
  • Post content references the variant URLs directly.
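To illustrate the second point, here is a rough sketch of how variant URLs could be derived from `_wp_attachment_metadata`. It is only an illustration: `variantUrls` is a hypothetical helper, the sample string is hand-written and trimmed to the `file` entries, and the regex is not a real PHP unserializer (it ignores the declared string lengths). The `wp.example` URL is a placeholder.

```typescript
// Hypothetical helper, not part of EmDash.
// Collects every `s:4:"file";s:N:"..."` value from a PHP-serialized
// _wp_attachment_metadata string and turns the variant filenames into full URLs.
function variantUrls(attachmentUrl: string, serializedMeta: string): string[] {
  // Directory of the original attachment, including the trailing slash.
  const dir = attachmentUrl.slice(0, attachmentUrl.lastIndexOf("/") + 1);
  const original = attachmentUrl.slice(dir.length);

  const filePair = /s:4:"file";s:\d+:"([^"]+)"/g;
  const names = new Set<string>();
  let match: RegExpExecArray | null;
  while ((match = filePair.exec(serializedMeta)) !== null) {
    const value = match[1];
    // The top-level "file" is a relative path; size entries are bare filenames.
    const name = value.slice(value.lastIndexOf("/") + 1);
    if (name !== original) names.add(name);
  }
  return Array.from(names, (name) => dir + name);
}

// Hand-written, trimmed sample (real metadata carries width, height, mime-type, etc.).
const sampleMeta =
  'a:2:{s:4:"file";s:15:"2024/01/foo.png";s:5:"sizes";a:2:{' +
  's:5:"large";a:1:{s:4:"file";s:16:"foo-1024x695.png";}' +
  's:6:"medium";a:1:{s:4:"file";s:15:"foo-300x204.png";}}}';

const sampleVariants = variantUrls(
  "https://wp.example/wp-content/uploads/2024/01/foo.png",
  sampleMeta,
);
console.log(sampleVariants);
```

Each returned URL could then be added to the urlMap, pointing either at the imported original or at a separately imported variant.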

I'm not sure whether the current behavior is intentional (e.g. a design choice to avoid serving larger originals in place of pre-resized variants) or simply not yet handled.

Possible fixes (if a fix is welcome):

  1. Minimal: strip -NNNxNNN before the extension inside getBaseUrl() and allow the same pattern in the regex built in rewriteStringUrls(). Variant URLs in content are rewritten to the imported original. Simple, but serves a larger file where a pre-resized variant was used.
  2. Thorough: parse _wp_attachment_metadata during analyze/media, import each variant as a separate media item, and map variant URLs individually.
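As a rough sketch of what option 1 amounts to (this is not the local patch and not EmDash's actual `getBaseUrl()`): `stripSizeSuffix` and `findImportedUrl` are hypothetical names, the urlMap is assumed to be a simple map from original WordPress URL to imported URL, and `/media/foo.png` is a made-up target.

```typescript
// Hypothetical helpers for illustration only.
// Matches a WordPress size suffix such as "-1024x695" directly before the extension.
const SIZE_SUFFIX = /-\d+x\d+(?=\.[A-Za-z0-9]+$)/;

// Drops the query string and fragment, then removes the size suffix if present.
function stripSizeSuffix(url: string): string {
  return url.replace(/[?#].*$/, "").replace(SIZE_SUFFIX, "");
}

// Looks up a content URL in the urlMap, falling back to the suffix-stripped form.
function findImportedUrl(url: string, urlMap: Map<string, string>): string | null {
  const base = url.replace(/[?#].*$/, "");
  // Try the exact URL first so an original whose name really ends in -NNNxNNN still wins.
  return urlMap.get(base) ?? urlMap.get(stripSizeSuffix(base)) ?? null;
}

const exampleMap = new Map<string, string>([
  ["https://wp.example/wp-content/uploads/2024/01/foo.png", "/media/foo.png"],
]);

console.log(
  findImportedUrl(
    "https://wp.example/wp-content/uploads/2024/01/foo-1024x695.png",
    exampleMap,
  ),
);
```

Checking the exact URL before the stripped one keeps the behavior unchanged for originals that are already in the map; only URLs that would otherwise be skipped take the fallback.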

I have a working local patch for option 1 I can turn into a PR if that direction is acceptable.

Steps to reproduce

Note: I haven't verified this through the admin UI "WordPress Import" wizard directly. The reproduction below uses the same underlying API endpoints that the wizard calls in sequence, so the same code path is exercised.

  1. Start a fresh EmDash dev server (emdash@0.4.0).
  2. Run dev-bypass?token=1 to get a PAT.
  3. POST /_emdash/api/import/wordpress/analyze with a WXR file where at least one post embeds a resized image (e.g. a WordPress site exporting a post that contains <img src=".../foo-1024x695.png"> while the corresponding attachment URL is .../foo.png).
  4. POST /_emdash/api/import/wordpress/prepare with the analyze output.
  5. POST /_emdash/api/import/wordpress/execute with the WXR and a basic config.
  6. POST /_emdash/api/import/wordpress/media with the attachments from analyze, and capture the returned urlMap.
  7. POST /_emdash/api/import/wordpress/rewrite-urls with that urlMap.
  8. Inspect a post that used a variant URL — its content.asset.url still points to https://<wp-domain>/.../foo-1024x695.png.

Expected: the URL is rewritten to the imported EmDash media URL.
Actual: the URL is left unchanged.

Environment

  • emdash: 0.4.0
  • @emdash-cms/cloudflare: 0.4.0
  • astro: 6.1.6
  • Node.js: 22.22.2
  • OS: Linux (Docker sandbox)
  • Template: starter-cloudflare

Logs / error output

No errors are emitted. The endpoint returns `success: true` with `updated: 8` / `urlsRewritten: 8` on a sample site where ~60 posts reference size variants — the non-matching URLs are silently skipped because `findMatchingUrl()` returns `null`.
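Because nothing is logged, a small check like the following can make the skipped URLs visible after a rewrite pass. It is a hypothetical diagnostic, not something EmDash provides: `unmatchedUrls` is an invented name, the content is assumed to be available as a plain string, and the urlMap is assumed to be keyed by original WordPress URL.

```typescript
// Hypothetical diagnostic, not part of the EmDash API.
// Lists URLs on the old WordPress origin that still appear in content
// and have no exact key in the urlMap (ignoring query string and fragment).
function unmatchedUrls(
  content: string,
  wpOrigin: string,
  urlMap: Map<string, string>,
): string[] {
  // Escape regex metacharacters in the origin before building the pattern.
  const escaped = wpOrigin.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const pattern = new RegExp(escaped + "[^\\s\"'<>)]+", "g");
  const found = content.match(pattern) ?? [];
  return found.filter((url) => !urlMap.has(url.replace(/[?#].*$/, "")));
}

const checkMap = new Map<string, string>([
  ["https://wp.example/wp-content/uploads/2024/01/foo.png", "/media/foo.png"],
]);
const checkContent =
  '<img src="https://wp.example/wp-content/uploads/2024/01/foo-1024x695.png">' +
  '<img src="https://wp.example/wp-content/uploads/2024/01/foo.png">';

const leftovers = unmatchedUrls(checkContent, "https://wp.example", checkMap);
console.log(leftovers);
```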

Labels: bug
