A Python-based Reddit scraper for fetching posts from a specified subreddit. This tool can scrape posts based on various criteria and supports proxy usage for improved performance and anonymity.
- **Fetch Posts**: Retrieve posts from a subreddit with four fetch types (see the sketch after this list):
  - `all`: fetch all posts from the subreddit.
  - `random_n`: fetch a specified number of random posts.
  - `first_n`: fetch the first `n` posts.
  - `date_range`: fetch posts within a specified date range.
- **Proxy Support**: Optionally use a list of proxies to make requests to Reddit.
  - Tests proxies to ensure they are working.
  - Automatically switches proxies on failure.
  - Handles proxy file input and retry mechanisms.
- **Error Handling**: Robust error handling for network issues, proxy failures, and API errors.
- **Multithreading Option**: Use multiple threads for fetching posts (can be disabled for single-threaded requests).
- **Output**: Save scraped post data to a text file with details such as title, content, upvotes, comments, date, and time.
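The repository's scrape.py implements these fetch types itself; as a rough illustration of the general approach, the `first_n` case can be served from Reddit's public JSON listing. A minimal sketch, assuming the public `.json` endpoints and a hypothetical `fetch_first_n` helper (not the project's actual code):

```python
# Sketch only: page through a subreddit's public JSON listing with `requests`.
import requests

# Reddit rejects requests with a blank/default User-Agent.
HEADERS = {"User-Agent": "subreddit-post-scraper-example/0.1"}

def fetch_first_n(subreddit: str, n: int) -> list[dict]:
    """Fetch the first `n` newest posts from a subreddit."""
    posts, after = [], None
    while len(posts) < n:
        resp = requests.get(
            f"https://www.reddit.com/r/{subreddit}/new.json",
            headers=HEADERS,
            params={"limit": min(100, n - len(posts)), "after": after},
            timeout=10,
        )
        resp.raise_for_status()
        listing = resp.json()["data"]
        posts.extend(child["data"] for child in listing["children"])
        after = listing["after"]
        if after is None:  # no more pages
            break
    return posts[:n]
```

The other fetch types can build on the same listing, for example `date_range` filtering on each post's `created_utc` timestamp.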
- Clone the repository:

  ```bash
  git clone https://github.com/Devn913/subreddit_post_scrapping.git
  cd subreddit_post_scrapping
  ```

- Install the required packages:

  ```bash
  pip install -r requirements.txt
  ```
The following packages are required to run this script:
- `requests`: For making HTTP requests to Reddit.
You can install these packages using the requirements.txt file.
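For reference, a requirements.txt matching the list above needs only the one entry (the repository's file may pin a specific version):

```
requests
```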
To run the scraper, execute the following command in your terminal:
```bash
python3 scrape.py
```

You will be prompted for the following inputs:
- **Subreddit Name**: The name of the subreddit to scrape (e.g., `idm`).
- **Fetch Type**: The type of fetching operation: `all` for all posts, `random_n` for a random number of posts, `first_n` for the first `n` posts, or `date_range` for posts within a specific date range.
- **Number of Posts**: For the `random_n` and `first_n` fetch types, specify the number of posts to fetch.
- **Date Range**: For `date_range`, provide start and end dates in the format `yyyy/mm/dd`.
- **Use Proxy**: Choose whether to use a proxy list. If yes, provide the name of the proxy file.
- **Proxy File**: The name of the file containing the list of proxies, one per line (a proxy-handling sketch follows this list).
- **Output File Name**: The name of the file where scraped data will be saved (default is `reddit_posts.txt`).
- **Multithreading**: Choose whether to enable multithreading and specify the number of threads if enabled (a threading sketch follows the example command below).
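As a rough picture of the proxy handling described above, loading, testing, and rotating proxies with `requests` might look like the following. This is a hedged sketch with hypothetical helper names (`load_proxies`, `working_proxies`, `get_with_rotation`), not the repository's actual code:

```python
# Sketch only: load proxies from a file, keep the live ones, rotate on failure.
import requests

def load_proxies(path: str) -> list[str]:
    """Read one proxy per line, e.g. 'http://1.2.3.4:8080'."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

def working_proxies(proxies: list[str], timeout: float = 5.0) -> list[str]:
    """Keep only proxies that can actually reach Reddit."""
    alive = []
    for proxy in proxies:
        try:
            requests.get(
                "https://www.reddit.com/robots.txt",
                proxies={"http": proxy, "https": proxy},
                timeout=timeout,
            )
            alive.append(proxy)
        except requests.RequestException:
            pass  # dead or unreachable proxy; skip it
    return alive

def get_with_rotation(url: str, proxies: list[str], **kwargs) -> requests.Response:
    """Try each proxy in turn, switching to the next on failure."""
    for proxy in proxies:
        try:
            return requests.get(
                url, proxies={"http": proxy, "https": proxy}, timeout=10, **kwargs
            )
        except requests.RequestException:
            continue  # switch proxies on failure
    raise RuntimeError("All proxies failed")
```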
For example:

```bash
python3 scrape.py
```

Follow the prompts to enter the subreddit name, fetch type, number of posts, and other options.
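When multithreading is enabled, independent requests can be fanned out over a thread pool. A minimal sketch with `concurrent.futures` (the `fetch_post_json` helper and this threading model are illustrative assumptions, not the script's exact design):

```python
# Sketch only: fetch several posts' full JSON documents concurrently.
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests

HEADERS = {"User-Agent": "subreddit-post-scraper-example/0.1"}

def fetch_post_json(permalink: str) -> dict:
    """Fetch one post's full JSON document from its permalink."""
    url = f"https://www.reddit.com{permalink.rstrip('/')}.json"
    resp = requests.get(url, headers=HEADERS, timeout=10)
    resp.raise_for_status()
    return resp.json()

def fetch_all_threaded(permalinks: list[str], threads: int = 4) -> list[dict]:
    """Fan the per-post fetches out over `threads` worker threads."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = [pool.submit(fetch_post_json, p) for p in permalinks]
        return [f.result() for f in as_completed(futures)]
```

Disabling multithreading then just means running the same fetches sequentially in a plain loop.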
The output file will contain information about each post in the following format:
```
Title: Post Title
Content: Post Content
Upvotes: 123
Comments: 45
Date: 2024-07-13
Time: 14:32:10
--------------------------------------------------------------------------------
```
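A record in this shape is straightforward to emit. A sketch, assuming each post is the `data` dict from Reddit's JSON listing (fields `title`, `selftext`, `ups`, `num_comments`, `created_utc`):

```python
from datetime import datetime, timezone

def write_post(f, post: dict) -> None:
    """Append one post record in the output format shown above."""
    # Whether the script reports UTC or local time is an assumption here.
    created = datetime.fromtimestamp(post["created_utc"], tz=timezone.utc)
    f.write(f"Title: {post['title']}\n")
    f.write(f"Content: {post.get('selftext', '')}\n")
    f.write(f"Upvotes: {post['ups']}\n")
    f.write(f"Comments: {post['num_comments']}\n")
    f.write(f"Date: {created:%Y-%m-%d}\n")
    f.write(f"Time: {created:%H:%M:%S}\n")
    f.write("-" * 80 + "\n")
```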
The scraper includes error handling for:
- Network issues and API errors
- Proxy failures and invalid proxies
- Incorrect or missing user inputs
If all proxies are dead or invalid, the scraper will prompt for a new proxy file.
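That re-prompting behavior could be structured as a simple loop, reusing the hypothetical `load_proxies` and `working_proxies` helpers from the proxy sketch above:

```python
def prompt_until_alive() -> list[str]:
    """Keep asking for a proxy file until at least one proxy works."""
    while True:
        path = input("All proxies are dead. Enter a new proxy file: ").strip()
        try:
            alive = working_proxies(load_proxies(path))
        except FileNotFoundError:
            print(f"File not found: {path}")
            continue
        if alive:
            return alive
        print("No working proxies in that file; try again.")
```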
This project is licensed under the GPL.
Contributions are welcome! Please open an issue or submit a pull request if you have suggestions or improvements.
For questions or feedback, feel free to open an issue on the GitHub repository.
Happy scraping!