
bug: html where filenames have spaces (not %20) are not properly parsed #79

@hanoii

The following does not work: the URL doesn't get parsed.

<div><img class="img-responsive" src="/images/dir with spaces/file with spaces.png?ver=1.0" /></div>

I am actually working on submitting a few PRs, so I was navigating the code, and the likely cause is:

preg_match_all('/<img\s+[^>]*?src=["\']?([^"\'> ]+)["\']?[^>]*>/is', $html, $matches);

See

https://regex101.com/r/mHTaDk
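
For reference, a minimal reproduction using only the sample markup and the pattern quoted above. The variable names are placeholders for illustration, not taken from the project's code. The capture class `[^"\'> ]` excludes the space character, so the captured value is cut off at the first space in the path.

```php
<?php
// The sample markup from this issue.
$html = '<div><img class="img-responsive" src="/images/dir with spaces/file with spaces.png?ver=1.0" /></div>';

// The pattern exactly as quoted above.
preg_match_all('/<img\s+[^>]*?src=["\']?([^"\'> ]+)["\']?[^>]*>/is', $html, $matches);

// Group 1 stops at the first space because a space is not allowed
// inside the capture class, so only "/images/dir" is captured.
var_dump($matches[1]);
```

So the tag itself matches, but the captured `src` is truncated, which would explain why the URL is not recognized afterwards.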

There are obviously various ways of fixing this. I feel the regex is doing too much (attempting to match single-quoted, double-quoted, and unquoted HTML attributes all at once).
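
As a rough sketch of what handling each quoting style on its own could look like, here is one pattern with three explicit alternatives. This is only an illustration, not the project's code: the function name `extract_img_sources` is hypothetical, and the PCRE branch-reset group `(?|...)` is used so the value always lands in group 1. It is still a regex, so it can be fooled by things like `src=` appearing inside another attribute's value.

```php
<?php
// Hypothetical helper (not part of the project): collect the src value of
// every <img> tag, treating each quoting style as its own alternative.
function extract_img_sources(string $html): array
{
    $pattern = '/<img\b[^>]*?\ssrc\s*=\s*(?|'
        . '"([^"]*)"'          // double-quoted: spaces are allowed
        . '|\'([^\']*)\''      // single-quoted: spaces are allowed
        . '|([^\s"\'>]+)'      // unquoted: ends at whitespace or ">"
        . ')/i';

    preg_match_all($pattern, $html, $matches);

    // With the branch-reset group, every alternative captures into group 1.
    return $matches[1];
}

$sample = '<img class="img-responsive" src="/images/dir with spaces/file with spaces.png?ver=1.0" />'
    . "<img src='/a b.png'>"
    . '<img src=/c.png>';

print_r(extract_img_sources($sample));
```

Only the unquoted alternative needs to stop at whitespace; the quoted ones can run to the matching closing quote, which is what lets file names with spaces through.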

I am trying out a fix that would expand this into three different versions of the same regex. Now that I have seen the code, I have a ton of questions about some of the thought process. One is: why did you go with regex instead of trying to use any of the good XML/HTML parsers out there?
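
To make the parser question concrete, here is a minimal sketch using PHP's bundled `DOMDocument` (it needs the DOM extension). The function name `extract_img_sources_dom` is made up for this example, and I am assuming the input is a non-empty HTML string; `loadHTML()` also guesses the encoding when none is declared, so non-ASCII content would need extra care.

```php
<?php
// Hypothetical helper (not part of the project): read <img src> values
// with an HTML parser instead of a regex.
function extract_img_sources_dom(string $html): array
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely valid; collect libxml warnings internally
    // instead of emitting them, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $sources = [];
    foreach ($doc->getElementsByTagName('img') as $img) {
        if ($img->hasAttribute('src')) {
            // The parser deals with quoting, so spaces are preserved.
            $sources[] = $img->getAttribute('src');
        }
    }
    return $sources;
}

$html = '<div><img class="img-responsive" src="/images/dir with spaces/file with spaces.png?ver=1.0" /></div>';
print_r(extract_img_sources_dom($html));
```

The trade-off is that a parser returns the attribute values but not their exact position in the original string, so any code that rewrites the HTML in place would need a different approach than with `preg_match_all`.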

Regardless, I'll see if I can come up with a better solution, but I'm adding the issue in case you get to it first.

Related to #33 and tangential to #65, although in the latter the issue is more about the sanitization of the file than the parsing.
