Commit 3715423

Merge pull request #62 from hannesill/local_full_dataset
feat: Add local MIMIC-IV full dataset support (Replace SQLite with Parquet + DuckDB)
2 parents 5a32f5c + 0dc5b37 commit 3715423

17 files changed: +2261 -512 lines changed
.github/workflows/pre-commit.yaml

Lines changed: 0 additions & 1 deletion
@@ -25,7 +25,6 @@ jobs:
  ln -s $(which uv) ~/.local/bin/uv
  - run: uv venv
  - run: uv sync --dev
- - run: uv add pytest==7.4.3
  - uses: tox-dev/action-pre-commit-uv@v1
  with:
  extra_args: --all-files

.github/workflows/publish.yaml

Lines changed: 5 additions & 10 deletions
@@ -33,23 +33,18 @@ jobs:
  echo "version=$VERSION" >> $GITHUB_OUTPUT
  echo "Publishing version: $VERSION"

- - name: Update version in pyproject.toml
+ - name: Update version in __init__.py
  run: |
- # Update version in pyproject.toml to match the git tag
- sed -i "s/version = \".*\"/version = \"${{ steps.get_version.outputs.version }}\"/" pyproject.toml
- echo "Updated pyproject.toml version to ${{ steps.get_version.outputs.version }}"
- cat pyproject.toml | grep version
-
- - name: Lock dependencies
- run: uv lock --locked
+ # Update version in src/m3/__init__.py to match the git tag
+ sed -i "s/__version__ = \".*\"/__version__ = \"${{ steps.get_version.outputs.version }}\"/" src/m3/__init__.py
+ echo "Updated src/m3/__init__.py version to ${{ steps.get_version.outputs.version }}"
+ grep "__version__" src/m3/__init__.py

  - name: Sync dependencies including dev
  run: uv sync --all-groups

  - name: Run quick tests
  run: |
- uv add pytest==7.4.3
- uv add pytest-asyncio
  uv run pytest tests/ -v --tb=short

  - name: Build package
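
Review note: the new version-stamping step can be tried outside CI. The snippet below is a minimal sketch, assuming a scratch file in place of `src/m3/__init__.py` and a made-up `VERSION` value; it writes to stdout instead of using `sed -i` so that it behaves the same with GNU and BSD `sed`.

```shell
# Stand-alone sketch of the substitution used in the workflow step.
# VERSION and the scratch file are illustrative assumptions, not project values.
VERSION="1.2.3"
scratch=$(mktemp)
printf '__version__ = "0.0.0"\n' > "$scratch"

# Same sed pattern as the workflow, applied without in-place editing.
stamped=$(sed "s/__version__ = \".*\"/__version__ = \"${VERSION}\"/" "$scratch")
echo "$stamped"

rm -f "$scratch"
```

Because `.*` is greedy, the pattern replaces everything between the first and last double quote on the matching line, so it relies on `__version__` sitting on a line of its own.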

.github/workflows/tests.yaml

Lines changed: 0 additions & 1 deletion
@@ -27,7 +27,6 @@ jobs:
  - name: Install dependencies
  run: |
  uv sync --all-groups
- uv add pytest==7.4.3

  - name: Run tests
  run: uv run pytest -v

README.md

Lines changed: 182 additions & 49 deletions
@@ -16,11 +16,11 @@ Transform medical data analysis with AI! Ask questions about MIMIC-IV data in pl

  ## Features

- - **Natural Language Queries**: Ask questions about MIMIC-IV data in plain English
- - **Local SQLite**: Fast queries on demo database (free, no setup)
- - **BigQuery Support**: Access full MIMIC-IV dataset on Google Cloud
- - **Enterprise Security**: OAuth2 authentication with JWT tokens and rate limiting
- - **SQL Injection Protection**: Read-only queries with comprehensive validation
+ - 🔍 **Natural Language Queries**: Ask questions about MIMIC-IV data in plain English
+ - 🏠 **Local DuckDB + Parquet**: Fast local queries for demo and full dataset using Parquet files with DuckDB views
+ - ☁️ **BigQuery Support**: Access full MIMIC-IV dataset on Google Cloud
+ - 🔒 **Enterprise Security**: OAuth2 authentication with JWT tokens and rate limiting
+ - 🛡️ **SQL Injection Protection**: Read-only queries with comprehensive validation

  ## 🚀 Quick Start

@@ -47,7 +47,7 @@ uv --version

  ### BigQuery Setup (Optional - Full Dataset)

- **Skip this if using SQLite demo database.**
+ **Skip this if using DuckDB demo database.**

  1. **Install Google Cloud SDK:**
  - macOS: `brew install google-cloud-sdk`
@@ -59,35 +59,32 @@ uv --version
  ```
  *Opens your browser - choose the Google account with BigQuery access to MIMIC-IV.*

- ### MCP Client Configuration
-
- Paste one of the following into your MCP client config, then restart your client.
+ ### M3 Initialization

  **Supported clients:** [Claude Desktop](https://www.claude.com/download), [Cursor](https://cursor.com/download), [Goose](https://block.github.io/goose/), and [more](https://github.com/punkpeye/awesome-mcp-clients).

  <table>
  <tr>
  <td width="50%">

- **SQLite (Demo Database)**
+ **DuckDB (Demo or Full Dataset)**

- Free, local, no setup required.

- ```json
- {
- "mcpServers": {
- "m3": {
- "command": "uvx",
- "args": ["m3-mcp"],
- "env": {
- "M3_BACKEND": "sqlite"
- }
- }
- }
- }
+ To create a m3 directory and navigate into it run:
+ ```shell
+ mkdir m3 && cd m3
+ ```
+ If you want to use the full dataset, download it manually from [PhysioNet](https://physionet.org/content/mimiciv/3.1/) and place it into `m3/m3_data/raw`. For using the demo set you can continue and run:
+
+ ```shell
+ uv init && uv add m3-mcp && \
+ uv run m3 init DATASET_NAME && uv run m3 config --quick
  ```
+ Replace `DATASET_NAME` with `mimic-iv-demo` or `mimic-iv-full` and copy & paste the output of this command into your client config JSON file.

- *Demo database (136MB, 100 patients, 275 admissions) downloads automatically on first query.*
+ *Demo dataset (16MB raw download size) downloads automatically on first query.*
+
+ *Full dataset (10.6GB raw download size) needs to be downloaded manually.*

  </td>
  <td width="50%">
@@ -96,6 +93,8 @@ Free, local, no setup required.

  Requires GCP credentials and PhysioNet access.

+ Paste this into your client config JSON file:
+
  ```json
  {
  "mcpServers": {
@@ -126,13 +125,13 @@ Requires GCP credentials and PhysioNet access.

  ## Backend Comparison

- | Feature | SQLite (Demo) | BigQuery (Full) |
- |---------|---------------|-----------------|
- | **Cost** | Free | BigQuery usage fees |
- | **Setup** | Zero config | GCP credentials required |
- | **Data Size** | 100 patients, 275 admissions | 365k patients, 546k admissions |
- | **Speed** | Fast (local) | Network latency |
- | **Use Case** | Learning, development | Research, production |
+ | Feature | DuckDB (Demo) | DuckDB (Full) | BigQuery (Full) |
+ |---------|---------------|---------------|-----------------|
+ | **Cost** | Free | Free | BigQuery usage fees |
+ | **Setup** | Zero config | Manual Download | GCP credentials required |
+ | **Data Size** | 100 patients, 275 admissions | 365k patients, 546k admissions | 365k patients, 546k admissions |
+ | **Speed** | Fast (local) | Fast (local) | Network latency |
+ | **Use Case** | Learning, development | Research (local) | Research, production |

  ---

@@ -146,7 +145,7 @@ Requires GCP credentials and PhysioNet access.
  <tr>
  <td width="50%">

- **SQLite:**
+ **DuckDB (Local):**
  ```bash
  git clone https://github.com/rafiattrach/m3.git && cd m3
  docker build -t m3:lite --target lite .
@@ -205,7 +204,7 @@ pip install m3-mcp
  "m3": {
  "command": "m3-mcp-server",
  "env": {
- "M3_BACKEND": "sqlite"
+ "M3_BACKEND": "duckdb"
  }
  }
  }
@@ -233,14 +232,146 @@ pre-commit install
  "args": ["-m", "m3.mcp_server"],
  "cwd": "/path/to/m3",
  "env": {
- "M3_BACKEND": "sqlite"
+ "M3_BACKEND": "duckdb"
  }
  }
  }
  }
  ```

- ## Advanced Configuration
+ #### Using `UV` (Recommended)
+ Assuming you have [UV](https://docs.astral.sh/uv/getting-started/installation/) installed.
+
+ **Step 1: Clone and Navigate**
+ ```bash
+ # Clone the repository
+ git clone https://github.com/rafiattrach/m3.git
+ cd m3
+ ```
+
+ **Step 2: Create `UV` Virtual Environment**
+ ```bash
+ # Create virtual environment
+ uv venv
+ ```
+
+ **Step 3: Install M3**
+ ```bash
+ uv sync
+ # Do not forget to use `uv run` to any subsequent commands to ensure you're using the `uv` virtual environment
+ ```
+
+ ### 🗄️ Database Configuration
+
+ After installation, choose your data source:
+
+ #### Option A: Local Demo (DuckDB + Parquet)
+
+ **Perfect for learning and development - completely free!**
+
+ 1. **Initialize demo dataset**:
+ ```bash
+ m3 init mimic-iv-demo
+ ```
+
+ 2. **Setup MCP Client**:
+ ```bash
+ m3 config
+ ```
+
+ *Alternative: For Claude Desktop specifically:*
+ ```bash
+ m3 config claude --backend duckdb --db-path /Users/you/path/to/m3_data/databases/mimic_iv_demo.duckdb
+ ```
+
+ 5. **Restart your MCP client** and ask:
+
+ - "What tools do you have for MIMIC-IV data?"
+ - "Show me patient demographics from the ICU"
+
+ #### Option B: Local Full Dataset (DuckDB + Parquet)
+
+ **Run the entire MIMIC-IV dataset locally with DuckDB views over Parquet.**
+
+ 1. **Acquire CSVs** (requires PhysioNet credentials):
+ - Download the official MIMIC-IV CSVs from PhysioNet and place them under:
+ - `/Users/you/path/to/m3/m3_data/raw_files/mimic-iv-full/hosp/`
+ - `/Users/you/path/to/m3/m3_data/raw_files/mimic-iv-full/icu/`
+ - Note: `m3 init`'s auto-download function currently only supports the demo dataset. Use your browser or `wget` to obtain the full dataset.
+
+ 2. **Initialize full dataset**:
+ ```bash
+ m3 init mimic-iv-full
+ ```
+ - This may take up to 30 minutes, depending on your system (e.g. 10 minutes for MacBook Pro M3)
+ - Performance knobs (optional):
+ ```bash
+ export M3_CONVERT_MAX_WORKERS=6 # number of parallel files (default=4)
+ export M3_DUCKDB_MEM=4GB # DuckDB memory limit per worker (default=3GB)
+ export M3_DUCKDB_THREADS=4 # DuckDB threads per worker (default=2)
+ ```
+ Pay attention to your system specifications, especially if you have enough memory.
+
+ 3. **Select dataset and verify**:
+ ```bash
+ m3 use full # optional, as this automatically got set to full
+ m3 status
+ ```
+ - Status prints active dataset, local DB path, Parquet presence, quick row counts and total Parquet size.
+
+ 4. **Configure MCP client** (uses the full local DB):
+ ```bash
+ m3 config
+ # or
+ m3 config claude --backend duckdb --db-path /Users/you/path/to/m3/m3_data/databases/mimic_iv_full.duckdb
+ ```
+
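
Review note on Option B: before running `m3 init mimic-iv-full`, it may help to confirm that the raw files are where the README says they should be, and to get a rough idea of the memory the conversion could claim. The sketch below is only an illustration. The helper names `check_raw_layout` and `estimate_peak_mem_gb` are hypothetical and not part of `m3`, and the memory estimate assumes that every worker can reach its `M3_DUCKDB_MEM` limit at the same time and that the limit is written as a whole number of GB.

```shell
# Hypothetical pre-flight helpers; neither function is provided by m3.

# Succeeds only if both hosp/ and icu/ exist under the given root
# directory and each contains at least one entry.
check_raw_layout() {
    root="$1"
    for sub in hosp icu; do
        dir="$root/$sub"
        if [ ! -d "$dir" ] || [ -z "$(ls -A "$dir" 2>/dev/null)" ]; then
            echo "missing or empty: $dir" >&2
            return 1
        fi
    done
    return 0
}

# Rough upper bound in GB: parallel workers times per-worker memory limit.
# Falls back to the defaults quoted in the README (4 workers, 3GB each).
estimate_peak_mem_gb() {
    workers="${M3_CONVERT_MAX_WORKERS:-4}"
    mem="${M3_DUCKDB_MEM:-3GB}"
    echo $(( workers * ${mem%GB} ))
}
```

A call such as `check_raw_layout m3_data/raw_files/mimic-iv-full && estimate_peak_mem_gb` would print the estimate only when both directories are populated. With the README's example values (6 workers at 4GB) the bound is 24 GB, so machines with less memory may want smaller settings.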
+ #### Option C: BigQuery (Full Dataset)
+
+ **For researchers needing complete MIMIC-IV data**
+
+ ##### Prerequisites
+ - Google Cloud account and project with billing enabled
+ - Access to MIMIC-IV on BigQuery (requires PhysioNet credentialing)
+
+ ##### Setup Steps
+
+ 1. **Install Google Cloud CLI**:
+
+ **macOS (with Homebrew):**
+ ```bash
+ brew install google-cloud-sdk
+ ```
+
+ **Windows:** Download from https://cloud.google.com/sdk/docs/install
+
+ **Linux:**
+ ```bash
+ curl https://sdk.cloud.google.com | bash
+ ```
+
+ 2. **Authenticate**:
+ ```bash
+ gcloud auth application-default login
+ ```
+ *This will open your browser - choose the Google account that has access to your BigQuery project with MIMIC-IV data.*
+
+ 3. **Setup MCP Client for BigQuery**:
+ ```bash
+ m3 config
+ ```
+
+ *Alternative: For Claude Desktop specifically:*
+ ```bash
+ m3 config claude --backend bigquery --project-id YOUR_PROJECT_ID
+ ```
+
+ 4. **Test BigQuery Access** - Restart your MCP client and ask:
+ ```
+ Use the get_race_distribution function to show me the top 5 races in MIMIC-IV admissions.
+ ```
+
+ ## 🔧 Advanced Configuration

  Need to configure other MCP clients or customize settings? Use these commands:

@@ -255,8 +386,8 @@ Generates configuration for any MCP client with step-by-step guidance.
  # Quick universal config with defaults
  m3 config --quick

- # Universal config with custom database
- m3 config --quick --backend sqlite --db-path /path/to/database.db
+ # Universal config with custom DuckDB database
+ m3 config --quick --backend duckdb --db-path /path/to/database.duckdb

  # Save config to file for other MCP clients
  m3 config --output my_config.json
@@ -291,7 +422,7 @@ m3 config # Choose OAuth2 option during setup

  ---

- ## Available MCP Tools
+ ## 🛠️ Available MCP Tools

  When your MCP client processes questions, it uses these tools automatically:

@@ -323,15 +454,17 @@ Try asking your MCP client these questions:
  - `Prompt:` *What tables are available in the database?*
  - `Prompt:` *What tools do you have for MIMIC-IV data?*

- ## Troubleshooting
+ ## 🎩 Pro Tips
+
+ - Do you want to pre-approve the usage of all tools in Claude Desktop? Use the prompt below and then select **Always Allow**
+ - `Prompt:` *Can you please call all your tools in a logical sequence?*
+
+ ## 🔍 Troubleshooting

  ### Common Issues

- **SQLite "Database not found" errors:**
- ```bash
- # Re-download demo database
- m3 init mimic-iv-demo
- ```
+ **Local "Parquet not found" or view errors:**
+ Rerun the `m3 init` command for your chosen dataset.

  **MCP client server not starting:**
  1. Check your MCP client logs (for Claude Desktop: Help → View Logs)
@@ -409,11 +542,11 @@ m3-mcp-server

  ## Roadmap

- - **Local Full Dataset**: Complete MIMIC-IV locally (no cloud costs)
- - **Advanced Tools**: More specialized medical data functions
- - **Visualization**: Built-in plotting and charting tools
- - **Enhanced Security**: Role-based access control, audit logging
- - **Multi-tenant Support**: Organization-level data isolation
+ - 🏠 **Complete Local Full Dataset**: Complete the support for `mimic-iv-full` (Download CLI)
+ - 🔧 **Advanced Tools**: More specialized medical data functions
+ - 📊 **Visualization**: Built-in plotting and charting tools
+ - 🔐 **Enhanced Security**: Role-based access control, audit logging
+ - 🌐 **Multi-tenant Support**: Organization-level data isolation

  ## Contributing
