This is a minimal implementation of MinHashing as described in *Mining of Massive Datasets* by Leskovec, Rajaraman, and Ullman.
It can be installed with `pip install minhashlib`.
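For readers unfamiliar with the technique, here is a self-contained sketch of MinHash in plain Python (illustrative only; the function names below are not `minhashlib`'s actual API). It uses universal hashing `h(x) = (a*x + b) % p` with the Mersenne prime 2147483647, one of the primes that appears in the benchmark flags below:

```python
import random

def minhash_signature(shingles, num_hashes=128, prime=2147483647, seed=42):
    """Build a MinHash signature for a set of integer-hashed shingles.

    Each signature entry is the minimum of (a*x + b) % prime over the set,
    for a fixed random (a, b) pair per hash function.
    """
    rng = random.Random(seed)
    coeffs = [(rng.randrange(1, prime), rng.randrange(0, prime))
              for _ in range(num_hashes)]
    return [min((a * x + b) % prime for x in shingles) for a, b in coeffs]

def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)
```

With 256 hash functions, the estimate for two sets with true Jaccard similarity 1/3 typically lands within a few percentage points of the true value; the estimator's standard error shrinks as `1/sqrt(num_hashes)`.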
Based on the benchmark outputs in this repository:

- On recent multi-seed CPU synthetic runs (seeds 42-46), `minhashlib` builds signatures about 6x faster than `datasketch`.
- Accuracy is comparable in magnitude (similar MAE scale and matching threshold-based metrics in these runs), though `datasketch` is slightly better on MAE in most synthetic scenarios.
- Memory results are workload-dependent, so this project does not claim universal memory superiority.
In short: this implementation is minimal and fast, with accuracy broadly comparable to `datasketch`, though outcomes vary with dataset and configuration.
Use benchmarks/benchmark_claims_suite.py to run comprehensive, reproducible benchmarks:
- Multiple datasets (`synthetic`, `20newsgroups`, `wikipedia`, `ag_news`, `local`)
- Multiple seeds with mean/std/95% CI
- Metrics: MAE (Mean Absolute Error), Precision/Recall/F1 at threshold, Precision@K/Recall@K
- Speed: build/pair-eval/retrieval latency and throughput
- Memory: peak allocation and bytes/signature
- Optional scaling sweeps over docs/number of hashes/doc length
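As a rough sketch of how the accuracy metrics above are typically computed (illustrative code; the suite's actual internals may differ), given paired lists of true and estimated Jaccard similarities:

```python
def accuracy_metrics(true_sims, est_sims, threshold=0.5):
    """Compute MAE and threshold-based precision/recall/F1 for paired
    true vs. estimated similarity values."""
    mae = sum(abs(t - e) for t, e in zip(true_sims, est_sims)) / len(true_sims)
    # A pair is "positive" if its true similarity clears the threshold,
    # and "predicted positive" if its estimate does.
    tp = sum(1 for t, e in zip(true_sims, est_sims) if t >= threshold and e >= threshold)
    fp = sum(1 for t, e in zip(true_sims, est_sims) if t < threshold and e >= threshold)
    fn = sum(1 for t, e in zip(true_sims, est_sims) if t >= threshold and e < threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return mae, precision, recall, f1
```

Precision@K/Recall@K follow the same idea but rank candidate pairs per query and score only the top K.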
Example (full):

```
python3 benchmarks/benchmark_claims_suite.py \
  --datasets synthetic,20newsgroups,wikipedia \
  --wiki-dump-path data/simplewiki-latest-pages-articles.xml.bz2 \
  --seeds 42,43,44 \
  --p-values 2147483647,3037000493 \
  --max-docs 2000 \
  --random-pairs 3000 \
  --num-queries 200 \
  --include-scaling
```

Example (offline/local corpus only):

```
python3 benchmarks/benchmark_claims_suite.py \
  --datasets synthetic,local \
  --local-docs /path/to/docs.jsonl \
  --seeds 42,43,44
```

Outputs are written to `benchmark_outputs/` by default:
- `raw_runs.json` / `raw_runs.csv`
- `summary_stats.json` / `summary_stats.csv`
- `run_metadata.json`
- `skipped_runs.json`
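Assuming `summary_stats.json` is plain JSON (its exact schema depends on the suite version, so inspect the structure before relying on any particular key), the outputs can be loaded for inspection like this:

```python
import json
from pathlib import Path

def load_summary(out_dir="benchmark_outputs"):
    """Load the suite's summary statistics as parsed JSON.

    Returns whatever structure the suite wrote; check its keys
    before relying on a particular layout.
    """
    path = Path(out_dir) / "summary_stats.json"
    return json.loads(path.read_text())
```

The CSV variants (`summary_stats.csv`, `raw_runs.csv`) can be loaded the same way with the standard-library `csv` module or with pandas.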
Pull required benchmark datasets into local project paths:

```
python3 scripts/setup_benchmark_data.py
```

This prepares:

- `data/simplewiki-latest-pages-articles.xml.bz2` (for Wikipedia benchmarks)
- `.cache/scikit_learn_data` (for `20newsgroups` benchmarks)
Optional flags:

```
python3 scripts/setup_benchmark_data.py --force
python3 scripts/setup_benchmark_data.py --skip-wikipedia
python3 scripts/setup_benchmark_data.py --skip-20newsgroups
```

You can run individual benchmarks instead of the full suite:
```
# Accuracy-only
python3 benchmarks/benchmark_claims_accuracy.py --datasets synthetic,20newsgroups,wikipedia

# Performance-only
python3 benchmarks/benchmark_claims_performance.py --datasets synthetic,20newsgroups,wikipedia

# Memory-only
python3 benchmarks/benchmark_claims_memory.py --datasets synthetic,20newsgroups,wikipedia

# Scaling-only (synthetic sweeps)
python3 benchmarks/benchmark_claims_scaling.py --datasets synthetic
```

Each individual benchmark writes outputs under `benchmark_outputs/<test_name>/`.