bug: HTML where filenames have spaces (not %20) is not properly parsed #79

The following does not work: the URL is not parsed.

```html
<div><img class="img-responsive" src="/images/dir with spaces/file with spaces.png?ver=1.0" /></div>
```

I am currently working on submitting a few PRs, so I was navigating the code, and the likely cause is:

https://github.com/janreges/siteone-crawler/blob/cf4eccb53e766a3eb081cebd6fb2568b9f1c1976/src/Crawler/ContentProcessor/HtmlProcessor.php#L286

See

https://regex101.com/r/mHTaDk

There are obviously various ways of fixing this. I feel the regex is doing too much: it attempts to match single-quoted, double-quoted, and [unquoted HTML attributes](https://css-tricks.com/problems-with-unquoted-attributes/) all at once.

I am going to try a fix that expands it into three different versions of the same regex. Now that I have seen the code, I have a ton of questions about some of the thought process. One is: why did you go with regex instead of trying to use one of the good XML/HTML parsers out there?
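To make the "three versions of the same regex" idea concrete, here is a minimal sketch in Python rather than the project's PHP. The pattern names and the `extract_src_values` helper are hypothetical and are not taken from `HtmlProcessor.php`. It only illustrates that once each quoting style has its own pattern, a quoted value can safely contain spaces.

```python
import re

# Hypothetical patterns for illustration only; these are NOT the regexes
# used in HtmlProcessor.php. One pattern per quoting style.
# The lookbehind keeps attributes such as data-src from matching.
DOUBLE_QUOTED = re.compile(r'(?<![\w-])src\s*=\s*"([^"]*)"', re.IGNORECASE)
SINGLE_QUOTED = re.compile(r"(?<![\w-])src\s*=\s*'([^']*)'", re.IGNORECASE)
# Unquoted values end at whitespace or at a character that HTML does not
# allow inside an unquoted attribute value.
UNQUOTED = re.compile(r'(?<![\w-])src\s*=\s*([^\s"\'=<>`]+)', re.IGNORECASE)


def extract_src_values(html: str) -> list[str]:
    """Collect src values, grouped by quoting style (not document order)."""
    values = []
    for pattern in (DOUBLE_QUOTED, SINGLE_QUOTED, UNQUOTED):
        values.extend(match.group(1) for match in pattern.finditer(html))
    return values


html = ('<div><img class="img-responsive" '
        'src="/images/dir with spaces/file with spaces.png?ver=1.0" /></div>')
print(extract_src_values(html))
```

Splitting the pattern keeps each character class simple: the quoted variants stop only at the matching quote, and only the unquoted variant treats whitespace as a terminator.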

Regardless, I'll see if I can come up with a better solution but adding the issue in case you get to that first.
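For comparison, this is roughly what the parser-based alternative could look like. It is again a Python sketch using the standard library's `html.parser`, not a proposal for the actual PHP code. The `SrcCollector` class and `collect_src` function are hypothetical names introduced for this example.

```python
from html.parser import HTMLParser


class SrcCollector(HTMLParser):
    """Hypothetical helper: records every src attribute the parser sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for
        # attributes written without a value.
        for name, value in attrs:
            if name == "src" and value is not None:
                self.sources.append(value)


def collect_src(html):
    collector = SrcCollector()
    collector.feed(html)
    collector.close()
    return collector.sources


sample = ('<div><img class="img-responsive" '
          'src="/images/dir with spaces/file with spaces.png?ver=1.0" /></div>')
print(collect_src(sample))
```

The parser handles the quoting rules itself, so spaces inside a quoted value need no special treatment. The trade-off is that a parser reports values rather than their byte offsets, which may matter if the crawler rewrites URLs in place.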

Related to #33 and tangential to #65, although in the latter the issue is more about the sanitization of the file than the parsing.
