Reddit Crawl is a Python-based command-line tool designed to fetch posts from specified subreddits using the Reddit API (via PRAW) and store them in a local DuckDB database. It allows users to crawl single or multiple subreddits, search for posts using keywords, and retrieve subreddit information.
This tool is particularly useful for researchers, data analysts, or anyone looking to gather textual data from Reddit for analysis, archiving, or building datasets.
This project was created mainly as a technical PoC and because I was curious about the job market in Europe. I used a number of AI tools to augment my work and built this in 1-2 hours.
We'll see if this is useful for anyone else.
AI Tools used:
- Superwhisper to dictate my intentions
- Cursor to write the code together with their AI agents
- Gemini 2.5 Pro mostly with some auto model selection for simple parts
- Claude Task Master to convert the PRD to a number of tasks (I did not complete everything). See the `tasks` folder.
- Crawl Subreddits: Fetch posts from one or more specified subreddits.
- Keyword Search: Search for posts within subreddits based on a list of keywords, with options to sort and filter by time.
- Listing Sort Options: Fetch posts by 'hot', 'new', 'top', 'controversial', or 'rising' when not using search terms.
- Time Filtering: Filter search results or listings by time (e.g., 'all', 'day', 'week', 'month', 'year').
- Flexible Limits: Specify the maximum number of posts to fetch per subreddit or per search term.
- Database Storage: Saves fetched subreddit details, posts, and crawl session information into a DuckDB database using SQLModel for ORM.
- Environment Variable Configuration: Securely manage Reddit API credentials using a `.env` file.
- Caching: Utilizes joblib for caching API responses to speed up repeated requests and respect API rate limits.
- CLI Interface: Easy-to-use command-line interface powered by Typer.
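The joblib caching mentioned above follows a standard on-disk memoization pattern: a decorated function is only executed once per unique argument set, and later calls return the stored result. A minimal sketch of that pattern (`fetch_posts` is illustrative, not the tool's actual internals):

```python
import tempfile
from joblib import Memory

# Cache results on disk; repeated calls with the same arguments
# return the stored result instead of hitting the Reddit API again.
memory = Memory(tempfile.mkdtemp(), verbose=0)

@memory.cache
def fetch_posts(subreddit: str, limit: int) -> list[str]:
    # Stand-in for a real PRAW call.
    return [f"{subreddit}-post-{i}" for i in range(limit)]

first = fetch_posts("python", 3)   # executes the function
second = fetch_posts("python", 3)  # served from the on-disk cache
```

`Memory` and `@memory.cache` are joblib's actual API; only the fetch function is invented for the example.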
- Python 3.9+
- Access to the Reddit API (requires a Reddit account and a registered script application).
uv is an extremely fast Python package and project manager. If you don't have it, install it using the official installer:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

Refer to the official uv documentation for more installation methods.
```bash
git clone https://github.com/pascalwhoop/reddit-crawl
cd reddit-crawl
uv sync
source .venv/bin/activate
```

You need to provide your Reddit API credentials.
1. Copy the `.env.example` file to a new file named `.env`:

   ```bash
   cp .env.example .env
   ```

2. Edit the `.env` file and fill in your Reddit API credentials:
How to get Reddit API Credentials:
See PRAW docs for more information.
The database and tables will be created automatically when you run a crawl command if they don't already exist. By default, a DuckDB database file will be created at data/reddit_data.db (or as specified by the --output-db option or DATABASE_URL in .env).
```bash
reddit-crawl --help
```

or for a specific command:

```bash
reddit-crawl subreddit --help
reddit-crawl subreddit get --help
reddit-crawl subreddit info --help
```

This tool is designed for efficient content discovery through a two-phase approach:
1. Start with a subreddit you're familiar with and collect its top 100 posts:

   ```bash
   reddit-crawl subreddit get your_subreddit --limit-per-item 100 --sort-by-listing top
   ```
2. Export the collected content and analyze it using a large language model (e.g., Gemini 2.5):

   - Feed the content to the LLM:

     ```bash
     reddit-crawl view posts --subreddit your_subreddit --limit 100 --json > your_subreddit_posts.json
     ```

   - Request keyword extraction and search query generation for a specific area of interest to you:

     > please extract key keywords that would have found these reddit posts if I searched for it on reddit / google. separate them as a comma separated list in a single line. <your above output>

   - The LLM will generate 20-30 relevant search queries based on common themes and patterns
3. Use the LLM to:

   - Validate and refine the generated search queries
   - Suggest additional relevant subreddits for your topic
   - Identify related communities you might have missed
4. Execute a comprehensive crawl using the generated queries and subreddits (see the last example below):

   - Target 10-15 relevant subreddits
   - Apply 15-20 search queries per subreddit
   - This approach ensures thorough coverage of your topic across multiple communities
This methodical approach helps you:
- Discover relevant content efficiently
- Identify patterns and trends across communities
- Build a comprehensive dataset for your research or analysis
- Avoid missing important discussions in related subreddits
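The keyword-extraction prompt in step 2 can also be assembled programmatically from the exported file. A minimal sketch, assuming the `--json` export is a list of post objects with a `title` field (the field names here are assumptions, not a documented schema; check your own export):

```python
import json

# Hypothetical export contents, as if read from your_subreddit_posts.json
# (field names are assumed for illustration).
raw = json.dumps([
    {"title": "Laid off after 5 years, what now?", "score": 120},
    {"title": "Is the EU job market recovering?", "score": 87},
])

posts = json.loads(raw)
titles = "\n".join(p["title"] for p in posts)

# Prepend the extraction instruction from step 2 to the post titles.
prompt = (
    "please extract key keywords that would have found these reddit posts "
    "if I searched for it on reddit / google. separate them as a comma "
    "separated list in a single line.\n\n" + titles
)
```

In practice you would read the file with `json.load(open(...))` and paste (or send) `prompt` to the LLM of your choice.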
I got around 250 MB of data (roughly 54,000 posts) from this approach.
1. Crawl the 'python' subreddit (default settings):

   ```bash
   reddit-crawl subreddit get python
   ```
2. Crawl 'python' and 'learnpython', getting the top 10 posts of all time from listings:

   ```bash
   reddit-crawl subreddit get python learnpython --limit-per-item 10 --sort-by-listing top --time-filter-listing all
   ```
3. Complex search example:

   ```bash
   reddit-crawl subreddit get \
     cscareerquestionsEU \
     europe \
     digitalnomad \
     ExperiencedDevs \
     develeire \
     BerlinJobs \
     berlinsocialclub \
     Netherlands \
     AmsterdamJobs \
     cscareers \
     UKPersonalFinance \
     london \
     iOSProgramming \
     androiddev \
     webdev \
     datascience \
     sysadmin \
     devops \
     freelance \
     freelanceDE \
     --search-terms 'job market, layoff, unemployed, offer, rejection, interview, applying, ghosted, frustration, anxiety, demotivation, burnout, salary, compensation, work-life balance, AI, outsourcing, probation, career progression, company culture, non-EU, visa sponsorship, experience level, German language, LeetCode, networking, upskill, remote work, relocation, difficult market' \
     --limit-per-item 1000
   ```
The data is stored in a DuckDB database with the following main tables (defined in reddit_crawl/models.py using SQLModel):
- Subreddit: Stores details about each crawled subreddit.
  - `id` (primary key)
  - `name`
  - `title`
  - `description`
  - `subscribers`
  - `created_utc`
- Post: Stores details about each fetched post.
  - `id` (primary key)
  - `title`
  - `author_name`
  - `score`
  - `upvote_ratio`
  - `num_comments`
  - `created_utc`
  - `url`
  - `selftext`
  - `permalink`
  - `subreddit_id` (foreign key to `Subreddit.id`)
- Crawl: Records information about each crawl session.
  - `id` (primary key, auto-incrementing)
  - `subreddit_name_crawled`
  - `timestamp`
  - `post_count` (number of newly added posts during that session for that subreddit)
  - `parameters_used_json` (JSON string of crawl parameters like search terms, limit, etc.)
You can query this database using any DuckDB compatible tool or library.