Feature Request: Add Support for Redirects in img2dataset #442

@AiratTop

Description

Currently, img2dataset fails to download files from URLs that go through a long chain of HTTP redirects. For example, downloading the following URL fails because of the multiple redirects involved:

https://hors.easymerch.ru/analytics/photos/view/f-hors/6/70836/70840/239501.jpg

Below is an example of how wget handles the redirects:

wget https://hors.easymerch.ru/analytics/photos/view/f-hors/6/70836/70840/239501.jpg
--2024-12-12 18:00:45--  https://hors.easymerch.ru/analytics/photos/view/f-hors/6/70836/70840/239501.jpg
Resolving hors.easymerch.ru (hors.easymerch.ru)... 77.223.102.239, 188.246.224.25, 5.182.4.205, ...
Connecting to hors.easymerch.ru (hors.easymerch.ru)|77.223.102.239|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://files21.easymerch.ru/f-hors/6/70836/70840/239501.jpg [following]
--2024-12-12 18:00:45--  https://files21.easymerch.ru/f-hors/6/70836/70840/239501.jpg
Resolving files21.easymerch.ru (files21.easymerch.ru)... 95.217.111.153
Connecting to files21.easymerch.ru (files21.easymerch.ru)|95.217.111.153|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://files20.easymerch.ru/f-hors/6/70836/70840/239501.jpg [following]
--2024-12-12 18:00:45--  https://files20.easymerch.ru/f-hors/6/70836/70840/239501.jpg
Resolving files20.easymerch.ru (files20.easymerch.ru)... 135.181.16.12
Connecting to files20.easymerch.ru (files20.easymerch.ru)|135.181.16.12|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://files19.easymerch.ru/f-hors/6/70836/70840/239501.jpg [following]
--2024-12-12 18:00:46--  https://files19.easymerch.ru/f-hors/6/70836/70840/239501.jpg
Resolving files19.easymerch.ru (files19.easymerch.ru)... 95.217.111.157
Connecting to files19.easymerch.ru (files19.easymerch.ru)|95.217.111.157|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://files18.easymerch.ru/f-hors/6/70836/70840/239501.jpg [following]
--2024-12-12 18:00:46--  https://files18.easymerch.ru/f-hors/6/70836/70840/239501.jpg
Resolving files18.easymerch.ru (files18.easymerch.ru)... 65.21.140.24
Connecting to files18.easymerch.ru (files18.easymerch.ru)|65.21.140.24|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://files17.easymerch.ru/f-hors/6/70836/70840/239501.jpg [following]
--2024-12-12 18:00:46--  https://files17.easymerch.ru/f-hors/6/70836/70840/239501.jpg
Resolving files17.easymerch.ru (files17.easymerch.ru)... 65.21.138.242
Connecting to files17.easymerch.ru (files17.easymerch.ru)|65.21.138.242|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://files16.easymerch.ru/f-hors/6/70836/70840/239501.jpg [following]
--2024-12-12 18:00:46--  https://files16.easymerch.ru/f-hors/6/70836/70840/239501.jpg
Resolving files16.easymerch.ru (files16.easymerch.ru)... 65.21.143.51
Connecting to files16.easymerch.ru (files16.easymerch.ru)|65.21.143.51|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://files15.easymerch.ru/f-hors/6/70836/70840/239501.jpg [following]
--2024-12-12 18:00:47--  https://files15.easymerch.ru/f-hors/6/70836/70840/239501.jpg
Resolving files15.easymerch.ru (files15.easymerch.ru)... 65.21.201.86
Connecting to files15.easymerch.ru (files15.easymerch.ru)|65.21.201.86|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://files14.easymerch.ru/f-hors/6/70836/70840/239501.jpg [following]
--2024-12-12 18:00:47--  https://files14.easymerch.ru/f-hors/6/70836/70840/239501.jpg
Resolving files14.easymerch.ru (files14.easymerch.ru)... 65.21.204.225
Connecting to files14.easymerch.ru (files14.easymerch.ru)|65.21.204.225|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://files13.easymerch.ru/f-hors/6/70836/70840/239501.jpg [following]
--2024-12-12 18:00:47--  https://files13.easymerch.ru/f-hors/6/70836/70840/239501.jpg
Resolving files13.easymerch.ru (files13.easymerch.ru)... 65.21.235.55
Connecting to files13.easymerch.ru (files13.easymerch.ru)|65.21.235.55|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://files12.easymerch.ru/f-hors/6/70836/70840/239501.jpg [following]
--2024-12-12 18:00:47--  https://files12.easymerch.ru/f-hors/6/70836/70840/239501.jpg
Resolving files12.easymerch.ru (files12.easymerch.ru)... 65.21.204.240
Connecting to files12.easymerch.ru (files12.easymerch.ru)|65.21.204.240|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://files11.easymerch.ru/f-hors/6/70836/70840/239501.jpg [following]
--2024-12-12 18:00:48--  https://files11.easymerch.ru/f-hors/6/70836/70840/239501.jpg
Resolving files11.easymerch.ru (files11.easymerch.ru)... 65.21.204.245
Connecting to files11.easymerch.ru (files11.easymerch.ru)|65.21.204.245|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5718993 (5,5M) [image/jpeg]
Saving to: ‘239501.jpg.3’

239501.jpg.3                                    100%[=====================================================================================================>]   5,45M  9,76MB/s    in 0,6s    

2024-12-12 18:00:48 (9,76 MB/s) - ‘239501.jpg.3’ saved [5718993/5718993]

To ensure img2dataset works seamlessly with such URLs, it would be helpful to add a feature that enables automatic following of HTTP redirects.

Proposed Solution

Add an optional parameter (e.g., follow_redirects) that allows enabling auto-following of redirects during the download process. The default behavior could remain unchanged to preserve backward compatibility.

For example, the requests library already supports this functionality with its default behavior:

import requests
# requests follows redirects by default (allow_redirects=True for GET)
response = requests.get(url, timeout=30)
response.raise_for_status()

Alternatively, this behavior could be activated with an additional CLI flag.
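
As a rough illustration of what such an option could look like, here is a minimal sketch using only Python's standard `urllib.request`. The names `make_opener`, `fetch`, `follow_redirects`, and `max_redirects` are hypothetical and are not part of img2dataset; the sketch only assumes that a custom `HTTPRedirectHandler` can be passed to `urllib.request.build_opener`, which replaces the default redirect handler.

```python
import urllib.request


class LimitedRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Follows redirects up to a configurable number of hops."""

    def __init__(self, max_redirects):
        # urllib consults this attribute before following each redirect.
        self.max_redirections = max_redirects


class NoRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Declines every redirect, so a 3xx response surfaces as an HTTPError."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


def make_opener(follow_redirects=True, max_redirects=30):
    # Hypothetical helper: picks the redirect policy for one opener.
    if follow_redirects:
        handler = LimitedRedirectHandler(max_redirects)
    else:
        handler = NoRedirectHandler()
    return urllib.request.build_opener(handler)


def fetch(url, timeout=30, follow_redirects=True, max_redirects=30):
    # Hypothetical download function: returns the response body as bytes.
    opener = make_opener(follow_redirects, max_redirects)
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

With this shape, `fetch(url)` would follow up to 30 hops, while `fetch(url, follow_redirects=False)` would raise on the first 302. Note that urllib also stops if the same URL is visited too many times, which this sketch leaves at its default.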

Benefits

Enables downloading resources from dynamically redirected URLs.
Improves usability for datasets hosted on platforms with redirect-based file access.

Example Use Case

Using img2dataset to download files from:

https://hors.easymerch.ru/analytics/photos/view/f-hors/6/70836/70840/239501.jpg

Without this feature, the download fails, but with redirect support, the process completes successfully.

List of files to test:

https://hors.easymerch.ru/analytics/photos/view/f-hors/6/70836/70837/239497.jpg
https://hors.easymerch.ru/analytics/photos/view/f-hors/6/70836/70838/239498.jpg
https://hors.easymerch.ru/analytics/photos/view/f-hors/6/70836/70839/239499.jpg
https://hors.easymerch.ru/analytics/photos/view/f-hors/6/70836/70840/239500.jpg
https://hors.easymerch.ru/analytics/photos/view/f-hors/6/70836/70840/239501.jpg

Command to run:

img2dataset --url_list=list.txt --output_folder=images --processes_count 2 --thread_count 8 --image_size=256 --timeout 30

Try downloading these images with img2dataset and you will get an error:

HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop.\nThe last 30x error message was:\nFound": 10
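
The wording of this error matches the message that Python's standard `urllib.request` raises when a redirect chain exceeds its built-in limits. That suggests (as an assumption, not something confirmed in this issue) that redirects are being followed but the chain is too long: the wget log above shows 11 consecutive 302 responses before the final 200. The snippet below simply inspects the standard-library defaults involved.

```python
import urllib.request

handler_cls = urllib.request.HTTPRedirectHandler

# Class-level defaults that urllib consults while following a redirect chain.
default_max_redirections = handler_cls.max_redirections  # total hops allowed
default_max_repeats = handler_cls.max_repeats            # visits allowed per URL
loop_message = handler_cls.inf_msg                       # text placed in the HTTPError

print(default_max_redirections, default_max_repeats)
print(loop_message)
```

If the download path does go through urllib, a chain of 11 hops would exceed the default limit of 10, which would explain why wget succeeds where img2dataset fails.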
