Reddit Crawl is a Python-based command-line tool designed to fetch posts from specified subreddits using the Reddit API (via PRAW) and store them in a local DuckDB database. It allows users to crawl single or multiple subreddits, search for posts using keywords, and retrieve subreddit information.
This tool is particularly useful for researchers, data analysts, or anyone looking to gather textual data from Reddit for analysis, archiving, or building datasets.
This project was created mainly as a technical PoC and because I was curious about the job market in Europe. I used a number of AI tools to augment my work and built this in 1-2 hours.
We'll see if this is useful for anyone else.
AI Tools used:
- Superwhisper to dictate my intentions
- Cursor to write the code together with their AI agents
- Gemini 2.5 Pro mostly with some auto model selection for simple parts
- Claude Task Master to convert the PRD to a number of tasks (I did not complete everything). See the `tasks` folder.
- Crawl Subreddits: Fetch posts from one or more specified subreddits.
- Keyword Search: Search for posts within subreddits based on a list of keywords, with options to sort and filter by time.
- Listing Sort Options: Fetch posts by 'hot', 'new', 'top', 'controversial', or 'rising' when not using search terms.
- Time Filtering: Filter search results or listings by time (e.g., 'all', 'day', 'week', 'month', 'year').
- Flexible Limits: Specify the maximum number of posts to fetch per subreddit or per search term.
- Database Storage: Saves fetched subreddit details, posts, and crawl session information into a DuckDB database using SQLModel for ORM.
- Environment Variable Configuration: Securely manage Reddit API credentials using a `.env` file.
- Caching: Utilizes joblib for caching API responses to speed up repeated requests and respect API rate limits.
- CLI Interface: Easy-to-use command-line interface powered by Typer.
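The joblib caching mentioned above follows a standard on-disk memoization pattern: a decorated function is only executed once per unique argument set, and later calls return the stored result. A minimal sketch of that pattern (`fetch_posts` is illustrative, not the tool's actual internals):

```python
import tempfile
from joblib import Memory

# Cache results on disk; repeated calls with the same arguments
# return the stored result instead of hitting the Reddit API again.
memory = Memory(tempfile.mkdtemp(), verbose=0)

@memory.cache
def fetch_posts(subreddit: str, limit: int) -> list[str]:
    # Stand-in for a real PRAW call.
    return [f"{subreddit}-post-{i}" for i in range(limit)]

first = fetch_posts("python", 3)   # executes the function
second = fetch_posts("python", 3)  # served from the on-disk cache
```

`Memory` and `@memory.cache` are joblib's actual API; only the fetch function is invented for the example.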
- Python 3.9+
- Access to the Reddit API (requires a Reddit account and a registered script application).
uv is an extremely fast Python package and project manager. If you don't have it, install it using the official installer:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

Refer to the official uv documentation for more installation methods.
```bash
git clone https://github.com/pascalwhoop/reddit-crawl
cd reddit-crawl
uv sync
source .venv/bin/activate
```

You need to provide your Reddit API credentials.
1. Copy the `.env.example` file to a new file named `.env`:

   ```bash
   cp .env.example .env
   ```

2. Edit the `.env` file and fill in your Reddit API credentials:
How to get Reddit API Credentials:
See PRAW docs for more information.
The database and tables will be created automatically when you run a crawl command if they don't already exist. By default, a DuckDB database file will be created at data/reddit_data.db (or as specified by the --output-db option or DATABASE_URL in .env).
```bash
reddit-crawl --help
```

or for a specific command:

```bash
reddit-crawl subreddit --help
reddit-crawl subreddit get --help
reddit-crawl subreddit info --help
```

This tool is designed for efficient content discovery through a two-phase approach:
1. Start with a subreddit you're familiar with and collect its top 100 posts:

   ```bash
   reddit-crawl subreddit get your_subreddit --limit-per-item 100 --sort-by-listing top
   ```
2. Export the collected content and analyze it using a large language model (e.g., Gemini 2.5):

   - Feed the content to the LLM:

     ```bash
     reddit-crawl view posts --subreddit your_subreddit --limit 100 --json > your_subreddit_posts.json
     ```

   - Request keyword extraction and search query generation for a specific area of interest to you:

     > please extract key keywords that would have found these reddit posts if I searched for it on reddit / google. separate them as a comma separated list in a single line. <your above output>

   - The LLM will generate 20-30 relevant search queries based on common themes and patterns
3. Use the LLM to:

   - Validate and refine the generated search queries
   - Suggest additional relevant subreddits for your topic
   - Identify related communities you might have missed
4. Execute a comprehensive crawl using the generated queries and subreddits (see the last example below):

   - Target 10-15 relevant subreddits
   - Apply 15-20 search queries per subreddit
   - This approach ensures thorough coverage of your topic across multiple communities
This methodical approach helps you:
- Discover relevant content efficiently
- Identify patterns and trends across communities
- Build a comprehensive dataset for your research or analysis
- Avoid missing important discussions in related subreddits
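The keyword-extraction prompt in step 2 can also be assembled programmatically from the exported file. A minimal sketch, assuming the `--json` export is a list of post objects with a `title` field (the field names here are assumptions, not a documented schema; check your own export):

```python
import json

# Hypothetical export contents, as if read from your_subreddit_posts.json
# (field names are assumed for illustration).
raw = json.dumps([
    {"title": "Laid off after 5 years, what now?", "score": 120},
    {"title": "Is the EU job market recovering?", "score": 87},
])

posts = json.loads(raw)
titles = "\n".join(p["title"] for p in posts)

# Prepend the extraction instruction from step 2 to the post titles.
prompt = (
    "please extract key keywords that would have found these reddit posts "
    "if I searched for it on reddit / google. separate them as a comma "
    "separated list in a single line.\n\n" + titles
)
```

In practice you would read the file with `json.load(open(...))` and paste (or send) `prompt` to the LLM of your choice.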
I got around 250 MB of data (roughly 54,000 posts) from this approach.
1. Crawl the 'python' subreddit (default settings):

   ```bash
   reddit-crawl subreddit get python
   ```
2. Crawl 'python' and 'learnpython', getting the top 10 posts of all time from listings:

   ```bash
   reddit-crawl subreddit get python learnpython --limit-per-item 10 --sort-by-listing top --time-filter-listing all
   ```
3. Complex search example:

   ```bash
   reddit-crawl subreddit get \
     cscareerquestionsEU \
     europe \
     digitalnomad \
     ExperiencedDevs \
     develeire \
     BerlinJobs \
     berlinsocialclub \
     Netherlands \
     AmsterdamJobs \
     cscareers \
     UKPersonalFinance \
     london \
     iOSProgramming \
     androiddev \
     webdev \
     datascience \
     sysadmin \
     devops \
     freelance \
     freelanceDE \
     --search-terms 'job market, layoff, unemployed, offer, rejection, interview, applying, ghosted, frustration, anxiety, demotivation, burnout, salary, compensation, work-life balance, AI, outsourcing, probation, career progression, company culture, non-EU, visa sponsorship, experience level, German language, LeetCode, networking, upskill, remote work, relocation, difficult market' \
     --limit-per-item 1000
   ```
The data is stored in a DuckDB database with the following main tables (defined in reddit_crawl/models.py using SQLModel):
- Subreddit: Stores details about each crawled subreddit.
  - `id` (primary key)
  - `name`
  - `title`
  - `description`
  - `subscribers`
  - `created_utc`
- Post: Stores details about each fetched post.
  - `id` (primary key)
  - `title`
  - `author_name`
  - `score`
  - `upvote_ratio`
  - `num_comments`
  - `created_utc`
  - `url`
  - `selftext`
  - `permalink`
  - `subreddit_id` (foreign key to `Subreddit.id`)
- Crawl: Records information about each crawl session.
  - `id` (primary key, auto-incrementing)
  - `subreddit_name_crawled`
  - `timestamp`
  - `post_count` (number of newly added posts during that session for that subreddit)
  - `parameters_used_json` (JSON string of crawl parameters like search terms, limit, etc.)
You can query this database using any DuckDB compatible tool or library.