Description
Currently, img2dataset cannot download files from URLs that require following a chain of HTTP redirects. For example, downloading the file at the following URL fails because the server redirects the request multiple times before serving it:
https://hors.easymerch.ru/analytics/photos/view/f-hors/6/70836/70840/239501.jpg
Below is an example of how wget handles the redirects:
wget https://hors.easymerch.ru/analytics/photos/view/f-hors/6/70836/70840/239501.jpg
--2024-12-12 18:00:45-- https://hors.easymerch.ru/analytics/photos/view/f-hors/6/70836/70840/239501.jpg
Resolving hors.easymerch.ru (hors.easymerch.ru)... 77.223.102.239, 188.246.224.25, 5.182.4.205, ...
Connecting to hors.easymerch.ru (hors.easymerch.ru)|77.223.102.239|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://files21.easymerch.ru/f-hors/6/70836/70840/239501.jpg [following]
--2024-12-12 18:00:45-- https://files21.easymerch.ru/f-hors/6/70836/70840/239501.jpg
Resolving files21.easymerch.ru (files21.easymerch.ru)... 95.217.111.153
Connecting to files21.easymerch.ru (files21.easymerch.ru)|95.217.111.153|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://files20.easymerch.ru/f-hors/6/70836/70840/239501.jpg [following]
--2024-12-12 18:00:45-- https://files20.easymerch.ru/f-hors/6/70836/70840/239501.jpg
Resolving files20.easymerch.ru (files20.easymerch.ru)... 135.181.16.12
Connecting to files20.easymerch.ru (files20.easymerch.ru)|135.181.16.12|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://files19.easymerch.ru/f-hors/6/70836/70840/239501.jpg [following]
--2024-12-12 18:00:46-- https://files19.easymerch.ru/f-hors/6/70836/70840/239501.jpg
Resolving files19.easymerch.ru (files19.easymerch.ru)... 95.217.111.157
Connecting to files19.easymerch.ru (files19.easymerch.ru)|95.217.111.157|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://files18.easymerch.ru/f-hors/6/70836/70840/239501.jpg [following]
--2024-12-12 18:00:46-- https://files18.easymerch.ru/f-hors/6/70836/70840/239501.jpg
Resolving files18.easymerch.ru (files18.easymerch.ru)... 65.21.140.24
Connecting to files18.easymerch.ru (files18.easymerch.ru)|65.21.140.24|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://files17.easymerch.ru/f-hors/6/70836/70840/239501.jpg [following]
--2024-12-12 18:00:46-- https://files17.easymerch.ru/f-hors/6/70836/70840/239501.jpg
Resolving files17.easymerch.ru (files17.easymerch.ru)... 65.21.138.242
Connecting to files17.easymerch.ru (files17.easymerch.ru)|65.21.138.242|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://files16.easymerch.ru/f-hors/6/70836/70840/239501.jpg [following]
--2024-12-12 18:00:46-- https://files16.easymerch.ru/f-hors/6/70836/70840/239501.jpg
Resolving files16.easymerch.ru (files16.easymerch.ru)... 65.21.143.51
Connecting to files16.easymerch.ru (files16.easymerch.ru)|65.21.143.51|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://files15.easymerch.ru/f-hors/6/70836/70840/239501.jpg [following]
--2024-12-12 18:00:47-- https://files15.easymerch.ru/f-hors/6/70836/70840/239501.jpg
Resolving files15.easymerch.ru (files15.easymerch.ru)... 65.21.201.86
Connecting to files15.easymerch.ru (files15.easymerch.ru)|65.21.201.86|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://files14.easymerch.ru/f-hors/6/70836/70840/239501.jpg [following]
--2024-12-12 18:00:47-- https://files14.easymerch.ru/f-hors/6/70836/70840/239501.jpg
Resolving files14.easymerch.ru (files14.easymerch.ru)... 65.21.204.225
Connecting to files14.easymerch.ru (files14.easymerch.ru)|65.21.204.225|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://files13.easymerch.ru/f-hors/6/70836/70840/239501.jpg [following]
--2024-12-12 18:00:47-- https://files13.easymerch.ru/f-hors/6/70836/70840/239501.jpg
Resolving files13.easymerch.ru (files13.easymerch.ru)... 65.21.235.55
Connecting to files13.easymerch.ru (files13.easymerch.ru)|65.21.235.55|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://files12.easymerch.ru/f-hors/6/70836/70840/239501.jpg [following]
--2024-12-12 18:00:47-- https://files12.easymerch.ru/f-hors/6/70836/70840/239501.jpg
Resolving files12.easymerch.ru (files12.easymerch.ru)... 65.21.204.240
Connecting to files12.easymerch.ru (files12.easymerch.ru)|65.21.204.240|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://files11.easymerch.ru/f-hors/6/70836/70840/239501.jpg [following]
--2024-12-12 18:00:48-- https://files11.easymerch.ru/f-hors/6/70836/70840/239501.jpg
Resolving files11.easymerch.ru (files11.easymerch.ru)... 65.21.204.245
Connecting to files11.easymerch.ru (files11.easymerch.ru)|65.21.204.245|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5718993 (5,5M) [image/jpeg]
Saving to: ‘239501.jpg.3’
239501.jpg.3 100%[=====================================================================================================>] 5,45M 9,76MB/s in 0,6s
2024-12-12 18:00:48 (9,76 MB/s) - ‘239501.jpg.3’ saved [5718993/5718993]
To ensure img2dataset works seamlessly with such URLs, it would be helpful to add a feature that enables automatic following of HTTP redirects.
Proposed Solution
Add an optional parameter (e.g., follow_redirects) that enables automatic following of redirects during download. The default behavior could remain unchanged to preserve backward compatibility.
For example, the requests library already supports this functionality with its default behavior:
import requests
# requests.get follows redirects by default (allow_redirects=True)
response = requests.get(url, timeout=30)
response.raise_for_status()
Alternatively, this behavior could be activated with an additional CLI flag.
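As a rough illustration of what such an option could look like using only the Python standard library, the sketch below builds a `urllib` opener whose redirect handler accepts a longer chain of redirects. It assumes the downloader fetches images through `urllib.request`, which this issue does not confirm; `make_opener`, `download`, and `max_redirects` are hypothetical names, not existing img2dataset parameters.

```python
import urllib.request


def make_opener(max_redirects=30):
    """Build an opener that follows up to `max_redirects` redirects per request.

    Hypothetical helper for illustration; not part of img2dataset.
    """
    handler = urllib.request.HTTPRedirectHandler()
    # Override the class-level limit on this instance only.
    handler.max_redirections = max_redirects
    # Passing an HTTPRedirectHandler instance replaces the default one.
    return urllib.request.build_opener(handler)


def download(url, timeout=30, max_redirects=30):
    """Fetch `url` and return the response body as bytes."""
    opener = make_opener(max_redirects)
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

The default of 30 mirrors the redirect limit that `requests` applies; the chain in the wget log above needs 11 redirects.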
Benefits
Enables downloading resources from dynamically redirected URLs.
Improves usability for datasets hosted on platforms with redirect-based file access.
Example Use Case
Using img2dataset to download files from:
https://hors.easymerch.ru/analytics/photos/view/f-hors/6/70836/70840/239501.jpg
Without this feature, the download fails, but with redirect support, the process completes successfully.
List of files to test:
https://hors.easymerch.ru/analytics/photos/view/f-hors/6/70836/70837/239497.jpg
https://hors.easymerch.ru/analytics/photos/view/f-hors/6/70836/70838/239498.jpg
https://hors.easymerch.ru/analytics/photos/view/f-hors/6/70836/70839/239499.jpg
https://hors.easymerch.ru/analytics/photos/view/f-hors/6/70836/70840/239500.jpg
https://hors.easymerch.ru/analytics/photos/view/f-hors/6/70836/70840/239501.jpg
Command to run:
img2dataset --url_list=list.txt --output_folder=images --processes_count 2 --thread_count 8 --image_size=256 --timeout 30
Trying to download these images with img2dataset produces the following error:
HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop.\nThe last 30x error message was:\nFound": 10
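The wording of this error matches the message that Python's `urllib.request.HTTPRedirectHandler` raises once a request exceeds its redirect limit. That suggests (an assumption, not confirmed in this issue) that redirects are being followed until that limit is hit. The snippet below only inspects the standard-library defaults and makes no network requests.

```python
import urllib.request

handler_cls = urllib.request.HTTPRedirectHandler

# Maximum number of redirects followed for a single request.
print("max_redirections:", handler_cls.max_redirections)
# Maximum number of times the same URL may be revisited.
print("max_repeats:", handler_cls.max_repeats)
# Message prefix used when either limit is exceeded.
print(repr(handler_cls.inf_msg))
```

In CPython the redirect limit is 10, one fewer than the 11 redirects shown in the wget log above.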