Democratizing ML Training Across Consumer Hardware
VulkanBLAS aims to be a PyTorch for everyone: a lightweight, Vulkan-native ML framework that runs on ANY GPU. On top of that, we're building distributed training across heterogeneous consumer hardware, Folding@Home style.
- Today: Train MNIST on your AMD RX 6500 XT
- Month 2: Train with friends' GPUs (local network)
- Month 6: 100+ GPUs worldwide training collaboratively
- Year 1: "VulkanBLAS Compute" - Airbnb for GPUs
Impact: Anyone with any GPU can contribute to training useful models.
Working:
- ✅ Vulkan backend via `ggml-vulkan` (llama.cpp integration)
- ✅ Tensor abstraction with RAII memory management
- ✅ Operations: add, mul (tensor & scalar), sum, repeat
- ✅ Autograd engine with reverse-mode AD
- ✅ Topological sort for gradient propagation
- ✅ Broadcasting support via `repeat()` method
- ✅ Graph isolation (proxy tensors fix)
Lines of Code: ~1,700 (core implementation)
Progress: ~30% complete
```cpp
auto x = make_tensor({1}, true);  // requires_grad=true
auto y = make_tensor({1}, true);

// f(x, y) = x*y + x
auto xy = x->mul(y);
auto f  = xy->add(x);

f->backward();  // compute gradients
// df/dx = y + 1, df/dy = x
```

✓ All tests passing! 🎉
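Under the hood, `backward()` does the standard reverse-mode walk: topologically sort the graph reachable from the output, seed the output gradient with 1, then run each node's backward closure in reverse order. A minimal, self-contained sketch of that idea (the `Node`/`backward_fn` names are illustrative, not the VulkanBLAS internals):

```cpp
#include <functional>
#include <memory>
#include <unordered_set>
#include <vector>

// Illustrative scalar autograd node: value, accumulated gradient, parents,
// and a closure that pushes this node's gradient back to its parents.
struct Node {
    float value = 0.0f;
    float grad  = 0.0f;
    std::vector<std::shared_ptr<Node>> parents;
    std::function<void()> backward_fn;
};
using NodePtr = std::shared_ptr<Node>;

NodePtr mul(const NodePtr& a, const NodePtr& b) {
    auto out = std::make_shared<Node>();
    out->value   = a->value * b->value;
    out->parents = {a, b};
    Node* o = out.get();  // capture a raw pointer to avoid a shared_ptr cycle
    out->backward_fn = [a, b, o] {
        a->grad += b->value * o->grad;  // d(a*b)/da = b
        b->grad += a->value * o->grad;  // d(a*b)/db = a
    };
    return out;
}

NodePtr add(const NodePtr& a, const NodePtr& b) {
    auto out = std::make_shared<Node>();
    out->value   = a->value + b->value;
    out->parents = {a, b};
    Node* o = out.get();
    out->backward_fn = [a, b, o] {
        a->grad += o->grad;  // d(a+b)/da = 1
        b->grad += o->grad;  // d(a+b)/db = 1
    };
    return out;
}

// Depth-first topological sort, then propagate gradients in reverse order.
void backward(const NodePtr& root) {
    std::vector<NodePtr> order;
    std::unordered_set<Node*> visited;
    std::function<void(const NodePtr&)> visit = [&](const NodePtr& n) {
        if (!visited.insert(n.get()).second) return;
        for (const auto& p : n->parents) visit(p);
        order.push_back(n);  // parents are pushed before their consumers
    };
    visit(root);

    root->grad = 1.0f;  // seed d(root)/d(root) = 1
    for (auto it = order.rbegin(); it != order.rend(); ++it) {
        if ((*it)->backward_fn) (*it)->backward_fn();
    }
}
```

For `f = x*y + x` this leaves `x->grad == y + 1` and `y->grad == x`, the same gradients the demo checks.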
Build high-level NN primitives on top of GGML + autograd (usage sketch after this list):
- `nn::Linear`, `nn::ReLU`, `nn::Sigmoid`, `nn::GELU`
- `nn::LayerNorm`, `nn::RMSNorm`
- Sequential container
- Test on XOR problem
- Optimizers: SGD, Adam
- Loss functions: MSE, CrossEntropy
- Training loop utilities
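To make the target concrete, here is how the planned layer API might read on the XOR problem. Every name below (`nn::Sequential`, `nn::Linear`, `optim::SGD`, `loss::mse`, the data-initializing `make_tensor` overload) is a sketch of intended usage, not the current API:

```cpp
#include <memory>

#include "vkblas.hpp"  // hypothetical public header (see the src/ layout below)

using namespace vkblas;  // assumed namespace

int main() {
    // 2-4-1 MLP for XOR. All layer/optimizer/loss names are the planned API.
    nn::Sequential model({
        std::make_shared<nn::Linear>(2, 4),
        std::make_shared<nn::Sigmoid>(),
        std::make_shared<nn::Linear>(4, 1),
        std::make_shared<nn::Sigmoid>(),
    });

    optim::SGD opt(model.parameters(), /*lr=*/0.5f);

    // XOR truth table: 4x2 inputs, 4x1 targets (data-initializing overload assumed).
    auto X = make_tensor({4, 2}, {0, 0, 0, 1, 1, 0, 1, 1});
    auto T = make_tensor({4, 1}, {0, 1, 1, 0});

    for (int epoch = 0; epoch < 2000; ++epoch) {
        auto Y    = model.forward(X);
        auto loss = loss::mse(Y, T);

        opt.zero_grad();
        loss->backward();
        opt.step();
    }
}
```

Whatever the final signatures look like, the shape of the training loop is the deliverable: forward, loss, zero_grad, backward, step.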
```python
import vkblas

x = vkblas.tensor([1, 2, 3], requires_grad=True)
y = x * 2 + 3
y.backward()
print(x.grad)  # [2, 2, 2]
```

- MNIST, Fashion-MNIST
- Benchmark vs PyTorch
- Performance analysis
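One plausible route to the `import vkblas` API shown above is a thin pybind11 layer over the C++ tensor. The sketch below assumes a `vkblas::Tensor` with `backward()`, a `grad()` accessor, and the scalar `mul`/`add` overloads from the operations list; none of this is the actual binding code yet:

```cpp
// bindings.cpp -- hypothetical pybind11 module exposing the C++ Tensor as `vkblas`.
#include <memory>
#include <vector>

#include <pybind11/pybind11.h>
#include <pybind11/stl.h>

#include "vkblas.hpp"  // assumed to expose vkblas::Tensor and a make_tensor factory

namespace py = pybind11;

PYBIND11_MODULE(vkblas, m) {
    m.doc() = "VulkanBLAS Python bindings (sketch)";

    py::class_<vkblas::Tensor, std::shared_ptr<vkblas::Tensor>>(m, "tensor")
        .def(py::init([](const std::vector<float>& data, bool requires_grad) {
                 return vkblas::make_tensor(data, requires_grad);  // assumed factory
             }),
             py::arg("data"), py::arg("requires_grad") = false)
        .def("backward", &vkblas::Tensor::backward)
        .def_property_readonly("grad", [](const vkblas::Tensor& t) {
            return t.grad();  // assumed accessor
        })
        // Operator sugar so `x * 2 + 3` from the example above works.
        .def("__mul__", [](vkblas::Tensor& self, float s) { return self.mul(s); })
        .def("__add__", [](vkblas::Tensor& self, float s) { return self.add(s); });
}
```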
Inspired by: INTELLECT-2 (PrimeIntellect), AReaL-Hex research
Prove that asynchronous training works on two local GPUs (staleness-filter sketch after this list):
- AsyncWorker (inference + gradients)
- Staleness filtering (discard old gradients)
- 2x speedup target
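A minimal sketch of the staleness filter: every worker tags its gradient with the model version it started from, and the coordinator simply drops anything more than a few versions behind. The window of 2 is a placeholder, as are the struct and function names:

```cpp
#include <cstdint>
#include <vector>

// A gradient contribution tagged with the model version it was computed from.
struct GradientUpdate {
    uint64_t           model_version;  // version the worker pulled before computing
    std::vector<float> grad;           // flattened gradients
};

// Accept an update only if it is at most `max_staleness` versions behind the
// coordinator's current model; anything older is discarded.
bool accept_update(const GradientUpdate& u,
                   uint64_t current_version,
                   uint64_t max_staleness = 2) {
    return current_version - u.model_version <= max_staleness;
}

// Coordinator side: keep only the fresh updates from a batch of arrivals.
std::vector<GradientUpdate> filter_stale(std::vector<GradientUpdate> incoming,
                                         uint64_t current_version,
                                         uint64_t max_staleness = 2) {
    std::vector<GradientUpdate> fresh;
    for (auto& u : incoming) {
        if (accept_update(u, current_version, max_staleness)) {
            fresh.push_back(std::move(u));
        }
    }
    return fresh;
}
```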
Multi-machine training over a local network (gradient-averaging sketch after this list):
- Lightweight gradient upload/download
- Model broadcasting (Shardcast-style)
- 2.5x speedup with 3 machines
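The gradient exchange itself can stay simple at this stage: workers push flattened gradient buffers, the coordinator averages whatever passed the staleness filter, then broadcasts fresh weights. A sketch of the averaging step (transport, compression, and the Shardcast-style broadcast are out of scope here):

```cpp
#include <cassert>
#include <vector>

// Average same-length gradient buffers from several workers into one update.
std::vector<float> average_gradients(const std::vector<std::vector<float>>& per_worker) {
    assert(!per_worker.empty());
    const size_t n = per_worker.front().size();

    std::vector<float> avg(n, 0.0f);
    for (const auto& g : per_worker) {
        assert(g.size() == n);  // all workers train the same model
        for (size_t i = 0; i < n; ++i) avg[i] += g[i];
    }
    const float scale = 1.0f / static_cast<float>(per_worker.size());
    for (float& v : avg) v *= scale;
    return avg;
}
```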
Smart workload distribution (batch-sizing sketch after this list):
- Dynamic batch sizing (fast GPU: 512, slow GPU: 64)
- RX 6500 XT + RTX 4090 working efficiently together
- <10% idle time on any GPU
Handle node dropouts gracefully (heartbeat sketch after this list):
- Heartbeat monitoring
- Graceful degradation
- Survive 50% worker dropout
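Heartbeat monitoring can be as small as a timestamp map on the coordinator: every message refreshes a worker's `last_seen`, and anyone silent past a timeout is marked dead and its work reassigned. The 15-second timeout and struct name are placeholders:

```cpp
#include <chrono>
#include <string>
#include <unordered_map>
#include <vector>

using Clock = std::chrono::steady_clock;

// Coordinator-side liveness tracking for graceful degradation.
struct HeartbeatMonitor {
    std::unordered_map<std::string, Clock::time_point> last_seen;
    std::chrono::seconds timeout{15};  // placeholder value

    // Call on every heartbeat (or any other message) from a worker.
    void ping(const std::string& worker_id) {
        last_seen[worker_id] = Clock::now();
    }

    // Workers that went silent: reassign their shards and keep training
    // with whoever is left.
    std::vector<std::string> dead_workers() const {
        std::vector<std::string> dead;
        const auto now = Clock::now();
        for (const auto& [id, seen] : last_seen) {
            if (now - seen > timeout) dead.push_back(id);
        }
        return dead;
    }
};
```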
Folding@Home for ML:
```bash
# One line to donate GPU time
vkblas distributed join --server train.vulkanblas.org
```

- Public coordinator server
- Gradient validation (anti-cheat; sketched below)
- Contribution tracking & leaderboard
- Train 1B parameter models collaboratively
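Gradient validation in a permissionless network can start with cheap sanity checks before anything fancier (redundant computation, reputation, cryptographic schemes): reject non-finite values, absurd norms, and updates that disagree wildly with what a trusted replica computed on the same batch. A sketch; the thresholds are placeholders:

```cpp
#include <cmath>
#include <vector>

static double l2_norm(const std::vector<float>& v) {
    double s = 0.0;
    for (float x : v) s += static_cast<double>(x) * x;
    return std::sqrt(s);
}

static double cosine(const std::vector<float>& a, const std::vector<float>& b) {
    double dot = 0.0;
    for (size_t i = 0; i < a.size(); ++i) dot += static_cast<double>(a[i]) * b[i];
    const double na = l2_norm(a), nb = l2_norm(b);
    return (na > 0.0 && nb > 0.0) ? dot / (na * nb) : 0.0;
}

// First-line anti-cheat: `reference` is the gradient a trusted (or randomly
// re-assigned) worker computed for the same batch.
bool validate_gradient(const std::vector<float>& untrusted,
                       const std::vector<float>& reference,
                       double max_norm = 1e3,      // placeholder threshold
                       double min_cosine = 0.5) {  // placeholder threshold
    if (untrusted.size() != reference.size()) return false;
    for (float x : untrusted) {
        if (!std::isfinite(x)) return false;          // NaN/Inf -> reject
    }
    if (l2_norm(untrusted) > max_norm) return false;  // absurd magnitude -> reject
    return cosine(untrusted, reference) >= min_cosine;
}
```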
Benchmark: Single Precision GEMM
Hardware: AMD Radeon RX 6500 XT (GPU) vs AMD Ryzen 5 5600X (CPU)
| Matrix Size | CPU (GFLOPS) | GPU (GFLOPS) | Speedup |
|---|---|---|---|
| 256x256 | ~4.0 | ~17.6 | 4.4x |
| 512x512 | ~4.1 | ~66.3 | 16x |
| 1024x1024 | ~0.8 | ~247.9 | 296x |
| 2048x2048 | N/A | ~501.5 | N/A |
| 4096x4096 | N/A | ~702.7 | N/A |
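For context on how such numbers are derived: a square N×N×N single-precision GEMM performs 2·N³ floating-point operations, so GFLOPS = 2·N³ / (seconds × 10⁹). A sketch of the measurement; the timed `gemm` callable stands in for whichever backend (CPU loop or Vulkan path) is being benchmarked:

```cpp
#include <chrono>

// Time one C = A * B for n x n matrices and report GFLOPS.
// `gemm` is a placeholder callable for the implementation under test.
template <typename GemmFn>
double gemm_gflops(GemmFn&& gemm, int n) {
    const auto t0 = std::chrono::steady_clock::now();
    gemm(n);
    const auto t1 = std::chrono::steady_clock::now();

    const double seconds = std::chrono::duration<double>(t1 - t0).count();
    const double flops   = 2.0 * n * n * n;  // one multiply + one add per inner step
    return flops / seconds / 1e9;
}
```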
- CMake 3.15+
- Vulkan SDK
- C++17 compiler
```bash
git clone --recursive https://github.com/cafeTechne/VulkanBLAS
cd VulkanBLAS
cmake -B build -S .
cmake --build build --config Release

# Run examples
./build/Release/autograd_demo
./build/Release/hello_tensor
```

```
VulkanBLAS/
├── src/
│   ├── vkblas.hpp/cpp       # Public API
│   ├── tensor.hpp/cpp       # Tensor + Autograd
│   ├── backend_ggml.cpp     # ggml-vulkan wrapper
│   └── backend.hpp          # Backend interface
├── examples/
│   ├── autograd_demo.cpp    # Autograd validation
│   ├── hello_tensor.cpp     # Basic tensor ops
│   └── tensor_ops_demo.cpp  # Element-wise ops
├── external/
│   └── llama.cpp/           # ggml-vulkan submodule
└── docs/
    ├── INTEGRATION.md       # Architecture guide
    └── DISTRIBUTED.md       # Distributed training roadmap
```
First framework for distributed training on ANY Vulkan GPU:
- ✅ AMD (RDNA 2/3/4)
- ✅ Intel Arc
- ✅ Qualcomm Adreno (mobile!)
- ✅ Apple M-series (via MoltenVK)
- ✅ NVIDIA (if you want)
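The "any Vulkan GPU" claim reduces to plain instance creation plus device enumeration: anything the loader reports (RDNA, Arc, Adreno, MoltenVK-backed Apple silicon) is a candidate backend. A standalone check, independent of the VulkanBLAS code:

```cpp
// list_vulkan_devices.cpp -- enumerate every Vulkan-capable GPU on the system.
// Build: link against the Vulkan loader (e.g. -lvulkan with the Vulkan SDK).
#include <cstdio>
#include <vector>
#include <vulkan/vulkan.h>

int main() {
    VkApplicationInfo app{};
    app.sType      = VK_STRUCTURE_TYPE_APPLICATION_INFO;
    app.apiVersion = VK_API_VERSION_1_1;

    VkInstanceCreateInfo info{};
    info.sType            = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO;
    info.pApplicationInfo = &app;

    VkInstance instance = VK_NULL_HANDLE;
    if (vkCreateInstance(&info, nullptr, &instance) != VK_SUCCESS) {
        std::fprintf(stderr, "No Vulkan loader/driver available\n");
        return 1;
    }

    uint32_t count = 0;
    vkEnumeratePhysicalDevices(instance, &count, nullptr);
    std::vector<VkPhysicalDevice> devices(count);
    vkEnumeratePhysicalDevices(instance, &count, devices.data());

    for (VkPhysicalDevice dev : devices) {
        VkPhysicalDeviceProperties props{};
        vkGetPhysicalDeviceProperties(dev, &props);
        std::printf("Found: %s (Vulkan %u.%u)\n",
                    props.deviceName,
                    VK_VERSION_MAJOR(props.apiVersion),
                    VK_VERSION_MINOR(props.apiVersion));
    }

    vkDestroyInstance(instance, nullptr);
    return 0;
}
```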
| Metric | VulkanBLAS | PyTorch |
|---|---|---|
| Binary Size | ~5MB | ~500MB |
| Build Time | 2 min | 30+ min |
| Dependencies | ggml only | CUDA, MKL, etc. |
Permissionless compute contribution - anyone can donate GPU time to train models.
- INTELLECT-2 (PrimeIntellect): First 32B globally distributed RL training
- AReaL-Hex: 1.5x throughput on heterogeneous GPUs
- DistDGLv2 (Amazon): Async pipeline training
- HAP/Poplar: Heterogeneity-aware scheduling
This project democratizes ML hardware access. Contributions welcome!
Current Focus: Phase 6 (Neural Network Layers)
- Implement `nn::Linear` using `ggml_mul_mat` (sketched below)
- Add activation functions
- Test on XOR problem
See task.md for detailed roadmap.
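For the `nn::Linear` item above, the forward pass is one `ggml_mul_mat` plus a bias add. A sketch in raw ggml, remembering that ggml orders dimensions fastest-first, so a weight of shape `[n_in, n_out]` times an input of shape `[n_in, batch]` yields `[n_out, batch]`; the wrapper name is illustrative, and older ggml revisions may need an explicit repeat for the bias broadcast:

```cpp
#include "ggml.h"

// y = W x + b expressed directly in ggml.
//   W: [n_in, n_out], x: [n_in, batch], b: [n_out]   (ggml dimension order)
struct ggml_tensor * linear_forward(struct ggml_context * ctx,
                                    struct ggml_tensor  * W,
                                    struct ggml_tensor  * b,
                                    struct ggml_tensor  * x) {
    struct ggml_tensor * wx = ggml_mul_mat(ctx, W, x);  // -> [n_out, batch]
    return ggml_add(ctx, wx, b);  // bias broadcast across the batch dimension
}
```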
MIT (TBD)
Breaking NVIDIA's CUDA monopoly. Enabling ML on ANY GPU. Building the future of distributed training. 🚀
Built on the shoulders of giants: llama.cpp, ggml, Vulkan.