Democratizing ML Training Across Consumer Hardware
VulkanBLAS aims to be a PyTorch for everyone: a lightweight, Vulkan-native ML framework that runs on ANY GPU. On top of that, we're building distributed training across heterogeneous consumer hardware, Folding@Home style.
- Today: Train MNIST on your AMD RX 6500 XT
- Month 2: Train with friends' GPUs (local network)
- Month 6: 100+ GPUs worldwide training collaboratively
- Year 1: "VulkanBLAS Compute" - Airbnb for GPUs
Impact: Anyone with any GPU can contribute to training useful models.
Working:
- ✅ Vulkan backend via `ggml-vulkan` (llama.cpp integration)
- ✅ Tensor abstraction with RAII memory management
- ✅ Operations: add, mul (tensor & scalar), sum, repeat
- ✅ Autograd engine with reverse-mode AD
- ✅ Topological sort for gradient propagation
- ✅ Broadcasting support via `repeat()` method
- ✅ Graph isolation (proxy tensors fix)
Lines of Code: ~1,700 (core implementation)
Progress: ~30% complete
```cpp
auto x = make_tensor({1}, true);  // requires_grad=true
auto y = make_tensor({1}, true);

// f(x, y) = x*y + x
auto xy = x->mul(y);
auto f  = xy->add(x);

f->backward();  // compute gradients
// df/dx = y + 1, df/dy = x
```

✓ All tests passing! 🎉
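Under the hood, `backward()` does the standard reverse-mode walk: topologically sort the graph reachable from the output, seed the output gradient with 1, then run each node's backward closure in reverse order. A minimal, self-contained sketch of that idea (the `Node`/`backward_fn` names are illustrative, not the VulkanBLAS internals):

```cpp
#include <functional>
#include <memory>
#include <unordered_set>
#include <vector>

// Illustrative scalar autograd node: value, accumulated gradient, parents,
// and a closure that pushes this node's gradient back to its parents.
struct Node {
    float value = 0.0f;
    float grad  = 0.0f;
    std::vector<std::shared_ptr<Node>> parents;
    std::function<void()> backward_fn;
};
using NodePtr = std::shared_ptr<Node>;

NodePtr mul(const NodePtr& a, const NodePtr& b) {
    auto out = std::make_shared<Node>();
    out->value   = a->value * b->value;
    out->parents = {a, b};
    Node* o = out.get();  // capture a raw pointer to avoid a shared_ptr cycle
    out->backward_fn = [a, b, o] {
        a->grad += b->value * o->grad;  // d(a*b)/da = b
        b->grad += a->value * o->grad;  // d(a*b)/db = a
    };
    return out;
}

NodePtr add(const NodePtr& a, const NodePtr& b) {
    auto out = std::make_shared<Node>();
    out->value   = a->value + b->value;
    out->parents = {a, b};
    Node* o = out.get();
    out->backward_fn = [a, b, o] {
        a->grad += o->grad;  // d(a+b)/da = 1
        b->grad += o->grad;  // d(a+b)/db = 1
    };
    return out;
}

// Depth-first topological sort, then propagate gradients in reverse order.
void backward(const NodePtr& root) {
    std::vector<NodePtr> order;
    std::unordered_set<Node*> visited;
    std::function<void(const NodePtr&)> visit = [&](const NodePtr& n) {
        if (!visited.insert(n.get()).second) return;
        for (const auto& p : n->parents) visit(p);
        order.push_back(n);  // parents are pushed before their consumers
    };
    visit(root);

    root->grad = 1.0f;  // seed d(root)/d(root) = 1
    for (auto it = order.rbegin(); it != order.rend(); ++it) {
        if ((*it)->backward_fn) (*it)->backward_fn();
    }
}
```

For `f = x*y + x` this leaves `x->grad == y + 1` and `y->grad == x`, the same gradients the demo checks.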
Build high-level NN primitives on top of GGML + autograd (usage sketch after this list):
- `nn::Linear`, `nn::ReLU`, `nn::Sigmoid`, `nn::GELU`
- `nn::LayerNorm`, `nn::RMSNorm`
- Sequential container
- Test on XOR problem
- Optimizers: SGD, Adam
- Loss functions: MSE, CrossEntropy
- Training loop utilities
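To make the target concrete, here is how the planned layer API might read on the XOR problem. Every name below (`nn::Sequential`, `nn::Linear`, `optim::SGD`, `loss::mse`, the data-initializing `make_tensor` overload) is a sketch of intended usage, not the current API:

```cpp
#include <memory>

#include "vkblas.hpp"  // hypothetical public header (see the src/ layout below)

using namespace vkblas;  // assumed namespace

int main() {
    // 2-4-1 MLP for XOR. All layer/optimizer/loss names are the planned API.
    nn::Sequential model({
        std::make_shared<nn::Linear>(2, 4),
        std::make_shared<nn::Sigmoid>(),
        std::make_shared<nn::Linear>(4, 1),
        std::make_shared<nn::Sigmoid>(),
    });

    optim::SGD opt(model.parameters(), /*lr=*/0.5f);

    // XOR truth table: 4x2 inputs, 4x1 targets (data-initializing overload assumed).
    auto X = make_tensor({4, 2}, {0, 0, 0, 1, 1, 0, 1, 1});
    auto T = make_tensor({4, 1}, {0, 1, 1, 0});

    for (int epoch = 0; epoch < 2000; ++epoch) {
        auto Y    = model.forward(X);
        auto loss = loss::mse(Y, T);

        opt.zero_grad();
        loss->backward();
        opt.step();
    }
}
```

Whatever the final signatures look like, the shape of the training loop is the deliverable: forward, loss, zero_grad, backward, step.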
```python
import vkblas

x = vkblas.tensor([1, 2, 3], requires_grad=True)
y = x * 2 + 3
y.backward()
print(x.grad)  # [2, 2, 2]
```

- MNIST, Fashion-MNIST
- Benchmark vs PyTorch
- Performance analysis
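One plausible route to the `import vkblas` API shown above is a thin pybind11 layer over the C++ tensor. The sketch below assumes a `vkblas::Tensor` with `backward()`, a `grad()` accessor, and the scalar `mul`/`add` overloads from the operations list; none of this is the actual binding code yet:

```cpp
// bindings.cpp -- hypothetical pybind11 module exposing the C++ Tensor as `vkblas`.
#include <memory>
#include <vector>

#include <pybind11/pybind11.h>
#include <pybind11/stl.h>

#include "vkblas.hpp"  // assumed to expose vkblas::Tensor and a make_tensor factory

namespace py = pybind11;

PYBIND11_MODULE(vkblas, m) {
    m.doc() = "VulkanBLAS Python bindings (sketch)";

    py::class_<vkblas::Tensor, std::shared_ptr<vkblas::Tensor>>(m, "tensor")
        .def(py::init([](const std::vector<float>& data, bool requires_grad) {
                 return vkblas::make_tensor(data, requires_grad);  // assumed factory
             }),
             py::arg("data"), py::arg("requires_grad") = false)
        .def("backward", &vkblas::Tensor::backward)
        .def_property_readonly("grad", [](const vkblas::Tensor& t) {
            return t.grad();  // assumed accessor
        })
        // Operator sugar so `x * 2 + 3` from the example above works.
        .def("__mul__", [](vkblas::Tensor& self, float s) { return self.mul(s); })
        .def("__add__", [](vkblas::Tensor& self, float s) { return self.add(s); });
}
```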
Inspired by: INTELLECT-2 (PrimeIntellect), AReaL-Hex research
Prove that asynchronous training works on two local GPUs (staleness-filter sketch after this list):
- AsyncWorker (inference + gradients)
- Staleness filtering (discard old gradients)
- 2x speedup target
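A minimal sketch of the staleness filter: every worker tags its gradient with the model version it started from, and the coordinator simply drops anything more than a few versions behind. The window of 2 is a placeholder, as are the struct and function names:

```cpp
#include <cstdint>
#include <vector>

// A gradient contribution tagged with the model version it was computed from.
struct GradientUpdate {
    uint64_t           model_version;  // version the worker pulled before computing
    std::vector<float> grad;           // flattened gradients
};

// Accept an update only if it is at most `max_staleness` versions behind the
// coordinator's current model; anything older is discarded.
bool accept_update(const GradientUpdate& u,
                   uint64_t current_version,
                   uint64_t max_staleness = 2) {
    return current_version - u.model_version <= max_staleness;
}

// Coordinator side: keep only the fresh updates from a batch of arrivals.
std::vector<GradientUpdate> filter_stale(std::vector<GradientUpdate> incoming,
                                         uint64_t current_version,
                                         uint64_t max_staleness = 2) {
    std::vector<GradientUpdate> fresh;
    for (auto& u : incoming) {
        if (accept_update(u, current_version, max_staleness)) {
            fresh.push_back(std::move(u));
        }
    }
    return fresh;
}
```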
Multi-machine training over a local network (gradient-averaging sketch after this list):
- Lightweight gradient upload/download
- Model broadcasting (Shardcast-style)
- 2.5x speedup with 3 machines
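The gradient exchange itself can stay simple at this stage: workers push flattened gradient buffers, the coordinator averages whatever passed the staleness filter, then broadcasts fresh weights. A sketch of the averaging step (transport, compression, and the Shardcast-style broadcast are out of scope here):

```cpp
#include <cassert>
#include <vector>

// Average same-length gradient buffers from several workers into one update.
std::vector<float> average_gradients(const std::vector<std::vector<float>>& per_worker) {
    assert(!per_worker.empty());
    const size_t n = per_worker.front().size();

    std::vector<float> avg(n, 0.0f);
    for (const auto& g : per_worker) {
        assert(g.size() == n);  // all workers train the same model
        for (size_t i = 0; i < n; ++i) avg[i] += g[i];
    }
    const float scale = 1.0f / static_cast<float>(per_worker.size());
    for (float& v : avg) v *= scale;
    return avg;
}
```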
Smart workload distribution (batch-sizing sketch after this list):
- Dynamic batch sizing (fast GPU: 512, slow GPU: 64)
- RX 6500 XT + RTX 4090 working efficiently together
- <10% idle time on any GPU
Handle node dropouts gracefully (heartbeat sketch after this list):
- Heartbeat monitoring
- Graceful degradation
- Survive 50% worker dropout
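Heartbeat monitoring can be as small as a timestamp map on the coordinator: every message refreshes a worker's `last_seen`, and anyone silent past a timeout is marked dead and its work reassigned. The 15-second timeout and struct name are placeholders:

```cpp
#include <chrono>
#include <string>
#include <unordered_map>
#include <vector>

using Clock = std::chrono::steady_clock;

// Coordinator-side liveness tracking for graceful degradation.
struct HeartbeatMonitor {
    std::unordered_map<std::string, Clock::time_point> last_seen;
    std::chrono::seconds timeout{15};  // placeholder value

    // Call on every heartbeat (or any other message) from a worker.
    void ping(const std::string& worker_id) {
        last_seen[worker_id] = Clock::now();
    }

    // Workers that went silent: reassign their shards and keep training
    // with whoever is left.
    std::vector<std::string> dead_workers() const {
        std::vector<std::string> dead;
        const auto now = Clock::now();
        for (const auto& [id, seen] : last_seen) {
            if (now - seen > timeout) dead.push_back(id);
        }
        return dead;
    }
};
```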
Folding@Home for ML:
```bash
# One line to donate GPU time
vkblas distributed join --server train.vulkanblas.org
```

- Public coordinator server
- Gradient validation (anti-cheat; sketched below)
- Contribution tracking & leaderboard
- Train 1B parameter models collaboratively
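Gradient validation in a permissionless network can start with cheap sanity checks before anything fancier (redundant computation, reputation, cryptographic schemes): reject non-finite values, absurd norms, and updates that disagree wildly with what a trusted replica computed on the same batch. A sketch; the thresholds are placeholders:

```cpp
#include <cmath>
#include <vector>

static double l2_norm(const std::vector<float>& v) {
    double s = 0.0;
    for (float x : v) s += static_cast<double>(x) * x;
    return std::sqrt(s);
}

static double cosine(const std::vector<float>& a, const std::vector<float>& b) {
    double dot = 0.0;
    for (size_t i = 0; i < a.size(); ++i) dot += static_cast<double>(a[i]) * b[i];
    const double na = l2_norm(a), nb = l2_norm(b);
    return (na > 0.0 && nb > 0.0) ? dot / (na * nb) : 0.0;
}

// First-line anti-cheat: `reference` is the gradient a trusted (or randomly
// re-assigned) worker computed for the same batch.
bool validate_gradient(const std::vector<float>& untrusted,
                       const std::vector<float>& reference,
                       double max_norm = 1e3,      // placeholder threshold
                       double min_cosine = 0.5) {  // placeholder threshold
    if (untrusted.size() != reference.size()) return false;
    for (float x : untrusted) {
        if (!std::isfinite(x)) return false;          // NaN/Inf -> reject
    }
    if (l2_norm(untrusted) > max_norm) return false;  // absurd magnitude -> reject
    return cosine(untrusted, reference) >= min_cosine;
}
```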
Benchmark: Single Precision GEMM
Hardware: AMD Radeon RX 6500 XT (GPU) vs AMD Ryzen 5 5600X (CPU)
| Matrix Size | CPU (GFLOPS) | GPU (GFLOPS) | Speedup |
|---|---|---|---|
| 256x256 | ~4.0 | ~17.6 | 4.4x |
| 512x512 | ~4.1 | ~66.3 | 16x |
| 1024x1024 | ~0.8 | ~247.9 | 296x |
| 2048x2048 | N/A | ~501.5 | N/A |
| 4096x4096 | N/A | ~702.7 | N/A |
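For context on how such numbers are derived: a square N×N×N single-precision GEMM performs 2·N³ floating-point operations, so GFLOPS = 2·N³ / (seconds × 10⁹). A sketch of the measurement; the timed `gemm` callable stands in for whichever backend (CPU loop or Vulkan path) is being benchmarked:

```cpp
#include <chrono>

// Time one C = A * B for n x n matrices and report GFLOPS.
// `gemm` is a placeholder callable for the implementation under test.
template <typename GemmFn>
double gemm_gflops(GemmFn&& gemm, int n) {
    const auto t0 = std::chrono::steady_clock::now();
    gemm(n);
    const auto t1 = std::chrono::steady_clock::now();

    const double seconds = std::chrono::duration<double>(t1 - t0).count();
    const double flops   = 2.0 * n * n * n;  // one multiply + one add per inner step
    return flops / seconds / 1e9;
}
```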
- CMake 3.15+
- Vulkan SDK
- C++17 compiler
```bash
git clone --recursive https://github.com/cafeTechne/VulkanBLAS
cd VulkanBLAS
cmake -B build -S .
cmake --build build --config Release

# Run examples
./build/Release/autograd_demo
./build/Release/hello_tensor
```

```
VulkanBLAS/
├── src/
│   ├── vkblas.hpp/cpp       # Public API
│   ├── tensor.hpp/cpp       # Tensor + Autograd
│   ├── backend_ggml.cpp     # ggml-vulkan wrapper
│   └── backend.hpp          # Backend interface
├── examples/
│   ├── autograd_demo.cpp    # Autograd validation
│   ├── hello_tensor.cpp     # Basic tensor ops
│   └── tensor_ops_demo.cpp  # Element-wise ops
├── external/
│   └── llama.cpp/           # ggml-vulkan submodule
└── docs/
    ├── INTEGRATION.md       # Architecture guide
    └── DISTRIBUTED.md       # Distributed training roadmap
```
First framework for distributed training on ANY Vulkan GPU:
- ✅ AMD (RDNA 2/3/4)
- ✅ Intel Arc
- ✅ Qualcomm Adreno (mobile!)
- ✅ Apple M-series (via MoltenVK)
- ✅ NVIDIA (if you want)
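The "any Vulkan GPU" claim reduces to plain instance creation plus device enumeration: anything the loader reports (RDNA, Arc, Adreno, MoltenVK-backed Apple silicon) is a candidate backend. A standalone check, independent of the VulkanBLAS code:

```cpp
// list_vulkan_devices.cpp -- enumerate every Vulkan-capable GPU on the system.
// Build: link against the Vulkan loader (e.g. -lvulkan with the Vulkan SDK).
#include <cstdio>
#include <vector>
#include <vulkan/vulkan.h>

int main() {
    VkApplicationInfo app{};
    app.sType      = VK_STRUCTURE_TYPE_APPLICATION_INFO;
    app.apiVersion = VK_API_VERSION_1_1;

    VkInstanceCreateInfo info{};
    info.sType            = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO;
    info.pApplicationInfo = &app;

    VkInstance instance = VK_NULL_HANDLE;
    if (vkCreateInstance(&info, nullptr, &instance) != VK_SUCCESS) {
        std::fprintf(stderr, "No Vulkan loader/driver available\n");
        return 1;
    }

    uint32_t count = 0;
    vkEnumeratePhysicalDevices(instance, &count, nullptr);
    std::vector<VkPhysicalDevice> devices(count);
    vkEnumeratePhysicalDevices(instance, &count, devices.data());

    for (VkPhysicalDevice dev : devices) {
        VkPhysicalDeviceProperties props{};
        vkGetPhysicalDeviceProperties(dev, &props);
        std::printf("Found: %s (Vulkan %u.%u)\n",
                    props.deviceName,
                    VK_VERSION_MAJOR(props.apiVersion),
                    VK_VERSION_MINOR(props.apiVersion));
    }

    vkDestroyInstance(instance, nullptr);
    return 0;
}
```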
| Metric | VulkanBLAS | PyTorch |
|---|---|---|
| Binary Size | ~5MB | ~500MB |
| Build Time | 2 min | 30+ min |
| Dependencies | ggml only | CUDA, MKL, etc. |
Permissionless compute contribution - anyone can donate GPU time to train models.
- INTELLECT-2 (PrimeIntellect): First 32B globally distributed RL training
- AReaL-Hex: 1.5x throughput on heterogeneous GPUs
- DistDGLv2 (Amazon): Async pipeline training
- HAP/Poplar: Heterogeneity-aware scheduling
This project democratizes ML hardware access. Contributions welcome!
Current Focus: Phase 6 (Neural Network Layers)
- Implement `nn::Linear` using `ggml_mul_mat` (sketched below)
- Add activation functions
- Test on XOR problem
See task.md for detailed roadmap.
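For the `nn::Linear` item above, the forward pass is one `ggml_mul_mat` plus a bias add. A sketch in raw ggml, remembering that ggml orders dimensions fastest-first, so a weight of shape `[n_in, n_out]` times an input of shape `[n_in, batch]` yields `[n_out, batch]`; the wrapper name is illustrative, and older ggml revisions may need an explicit repeat for the bias broadcast:

```cpp
#include "ggml.h"

// y = W x + b expressed directly in ggml.
//   W: [n_in, n_out], x: [n_in, batch], b: [n_out]   (ggml dimension order)
struct ggml_tensor * linear_forward(struct ggml_context * ctx,
                                    struct ggml_tensor  * W,
                                    struct ggml_tensor  * b,
                                    struct ggml_tensor  * x) {
    struct ggml_tensor * wx = ggml_mul_mat(ctx, W, x);  // -> [n_out, batch]
    return ggml_add(ctx, wx, b);  // bias broadcast across the batch dimension
}
```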
MIT (TBD)
Breaking NVIDIA's CUDA monopoly. Enabling ML on ANY GPU. Building the future of distributed training. 🚀
Built on the shoulders of giants: llama.cpp, ggml, Vulkan.