🌐 VulkanBLAS

Democratizing ML Training Across Consumer Hardware

VulkanBLAS is building the PyTorch for everyone: a lightweight, Vulkan-native ML framework that works on ANY GPU. The next step is distributed training on heterogeneous consumer hardware, Folding@Home style.


🎯 The Vision

  • Today: Train MNIST on your AMD RX 6500 XT
  • Month 2: Train with friends' GPUs (local network)
  • Month 6: 100+ GPUs worldwide training collaboratively
  • Year 1: "VulkanBLAS Compute" - an Airbnb for GPUs

Impact: Anyone with any GPU can contribute to training useful models.


✅ Current Status (Phase 5 Complete!)

Working:

  • ✅ Vulkan backend via ggml-vulkan (llama.cpp integration)
  • ✅ Tensor abstraction with RAII memory management
  • ✅ Operations: add, mul (tensor & scalar), sum, repeat
  • ✅ Autograd engine with reverse-mode AD
  • ✅ Topological sort for gradient propagation
  • ✅ Broadcasting support via repeat() method (sketch below)
  • ✅ Graph isolation (proxy tensors fix)
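
Broadcasting works by expanding the smaller tensor to the larger one's shape before an element-wise op. A minimal sketch in the same style as the demo further down (the exact repeat() and make_tensor signatures are assumptions, not the final API):

// Broadcast a per-feature bias across a batch by repeating it to the
// batch's shape, then adding element-wise.
auto bias  = make_tensor({1, 4}, true);   // shape {1, 4}
auto batch = make_tensor({8, 4}, true);   // shape {8, 4}
auto out   = batch->add(bias->repeat(batch));  // repeat() expands bias to {8, 4}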

Lines of Code: ~1,700 (core implementation)
Progress: ~30% complete

Autograd Demo

auto x = make_tensor({1}, true);  // requires_grad=true
auto y = make_tensor({1}, true);

// f(x,y) = x*y + x
auto xy = x->mul(y);
auto f = xy->add(x);

f->backward();  // Compute gradients

// df/dx = y+1, df/dy = x ✓

All tests passing! 🎉


🚀 Roadmap

Phase 6: Neural Network Layers (Current - 2 weeks)

Build high-level NN primitives on GGML + Autograd (reference sketch after the list):

  • nn::Linear, nn::ReLU, nn::Sigmoid, nn::GELU
  • nn::LayerNorm, nn::RMSNorm
  • Sequential container
  • Test on XOR problem
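
For orientation, the math these layers need to reproduce on the GPU is small. A plain-C++ reference sketch of Linear and ReLU (illustrative only, not the framework code; in VulkanBLAS the inner loops become ggml_mul_mat plus an add, with autograd supplying the backward pass):

#include <algorithm>
#include <vector>

// Reference math for nn::Linear: y = W * x + b
// x: [in], W: [out * in] row-major, b: [out]
std::vector<float> linear_ref(const std::vector<float>& x,
                              const std::vector<float>& W,
                              const std::vector<float>& b,
                              int in, int out) {
    std::vector<float> y(out);
    for (int o = 0; o < out; ++o) {
        float acc = b[o];
        for (int i = 0; i < in; ++i)
            acc += W[o * in + i] * x[i];
        y[o] = acc;
    }
    return y;
}

// Reference math for nn::ReLU: max(0, x), element-wise
std::vector<float> relu_ref(std::vector<float> v) {
    for (float& e : v) e = std::max(0.0f, e);
    return v;
}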

Phase 7: Training Infrastructure (Weeks 4-5)

  • Optimizers: SGD, Adam
  • Loss functions: MSE, CrossEntropy
  • Training loop utilities
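
As a reference for the Phase 7 building blocks, the vanilla SGD update and MSE loss in plain C++ (a sketch of the math, not the eventual Tensor-based API):

#include <cstddef>
#include <vector>

// SGD: p <- p - lr * grad, element-wise over each parameter buffer.
void sgd_step(std::vector<float>& params, const std::vector<float>& grads, float lr) {
    for (std::size_t i = 0; i < params.size(); ++i)
        params[i] -= lr * grads[i];
}

// MSE: mean((pred - target)^2)
float mse(const std::vector<float>& pred, const std::vector<float>& target) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < pred.size(); ++i) {
        float d = pred[i] - target[i];
        sum += d * d;
    }
    return sum / static_cast<float>(pred.size());
}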

Phase 8: Python Bindings (Weeks 6-7)

import vkblas

x = vkblas.tensor([1, 2, 3], requires_grad=True)
y = x * 2 + 3
y.backward()
print(x.grad)  # [2, 2, 2]

Phase 9: Validation (Week 8)

  • MNIST, Fashion-MNIST
  • Benchmark vs PyTorch
  • Performance analysis

🌐 Distributed Training Vision (Phases 10-14)

Inspired by: INTELLECT-2 (PrimeIntellect), AReaL-Hex research

Phase 10: Local Async Training (Weeks 9-10)

Prove asynchronous training works on 2 local GPUs:

  • AsyncWorker (inference + gradients)
  • Staleness filtering (discard old gradients)
  • 2x speedup target
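
Staleness filtering boils down to one check: a gradient computed against model version v is only applied if the current model is at most k versions ahead. A minimal sketch (the threshold and names are placeholders):

#include <cstdint>

// Accept a gradient only if it was computed against a recent enough model.
// Assumes current_version >= gradient_version; max_staleness is a tunable knob.
bool accept_gradient(std::uint64_t gradient_version,
                     std::uint64_t current_version,
                     std::uint64_t max_staleness = 2) {
    return current_version - gradient_version <= max_staleness;
}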

Phase 11: Network Protocol (Weeks 11-12)

Multi-machine training over local network:

  • Lightweight gradient upload/download
  • Model broadcasting (Shardcast-style)
  • 2.5x speedup with 3 machines
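
Nothing about the wire format is fixed yet; as a placeholder, a gradient upload can start as a small header plus a flat float payload (every field below is an assumption):

#include <cstdint>
#include <vector>

// Hypothetical gradient-upload message for the Phase 11 protocol.
struct GradientUpload {
    std::uint32_t worker_id;        // who computed it
    std::uint64_t model_version;    // model version the gradients came from
    std::uint32_t num_samples;      // batch size behind these gradients
    std::vector<float> gradients;   // flattened, in a fixed parameter order
};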

Phase 12: Heterogeneity-Aware Scheduling (Weeks 13-14)

Smart workload distribution:

  • Dynamic batch sizing (fast GPU: 512, slow GPU: 64)
  • RX 6500 XT + RTX 4090 efficient together
  • <10% idle time on any GPU
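
The scheduling rule is deliberately simple: give each GPU a share of the global batch proportional to its measured throughput, so fast and slow cards finish each step at roughly the same time. A sketch (throughput assumed to be measured in samples/second):

#include <cmath>
#include <vector>

// Split a global batch across workers in proportion to measured throughput.
// E.g. throughputs {8.0, 1.0} with a global batch of 576 give 512 and 64.
std::vector<int> split_batch(const std::vector<double>& samples_per_sec, int global_batch) {
    double total = 0.0;
    for (double t : samples_per_sec) total += t;

    std::vector<int> batches;
    for (double t : samples_per_sec)
        batches.push_back(static_cast<int>(std::round(global_batch * t / total)));
    return batches;  // rounding may drift by a sample or two; fine for a sketch
}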

Phase 13: Fault Tolerance (Weeks 15-16)

Handle node dropouts gracefully:

  • Heartbeat monitoring
  • Graceful degradation
  • Survive 50% worker dropout
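
Dropout detection can start as a plain heartbeat timeout: if a worker has not checked in within some window, the coordinator treats it as gone and reassigns its work. A minimal sketch (the 30 s timeout is an arbitrary placeholder):

#include <chrono>

using Clock = std::chrono::steady_clock;

// A worker counts as alive if its last heartbeat is newer than the timeout.
bool worker_alive(Clock::time_point last_heartbeat,
                  std::chrono::seconds timeout = std::chrono::seconds(30)) {
    return Clock::now() - last_heartbeat < timeout;
}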

Phase 14: Public Infrastructure (Months 5-6)

Folding@Home for ML:

# One line to donate GPU time
vkblas distributed join --server train.vulkanblas.org

  • Public coordinator server
  • Gradient validation (anti-cheat)
  • Contribution tracking & leaderboard
  • Train 1B parameter models collaboratively

⚡ Performance

Benchmark: Single Precision GEMM
Hardware: AMD Radeon RX 6500 XT vs Ryzen 5 5600X

Matrix Size   CPU (GFLOPS)   GPU (GFLOPS)   Speedup
256x256       ~4.0           ~17.6          4.4x
512x512       ~4.1           ~66.3          16x
1024x1024     ~0.8           ~247.9         296x
2048x2048     N/A            ~501.5         N/A
4096x4096     N/A            ~702.7         N/A
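
For reference, GFLOPS for an N×N single-precision GEMM follows the usual 2·N³ operation count (N³ multiplies plus N³ adds) divided by wall-clock time:

// GFLOPS = 2 * N^3 / seconds / 1e9 for an N x N GEMM.
double gemm_gflops(int n, double seconds) {
    double flops = 2.0 * static_cast<double>(n) * n * n;
    return flops / seconds / 1e9;
}

At N = 1024, ~247.9 GFLOPS corresponds to roughly 8.7 ms per multiply.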

🛠️ Building

Prerequisites

  • CMake 3.15+
  • Vulkan SDK
  • C++17 compiler

Build Steps

git clone --recursive https://github.com/cafeTechne/VulkanBLAS
cd VulkanBLAS
cmake -B build -S .
cmake --build build --config Release

# Run examples
./build/Release/autograd_demo
./build/Release/hello_tensor

📁 Project Structure

VulkanBLAS/
├── src/
│   ├── vkblas.hpp/cpp      # Public API
│   ├── tensor.hpp/cpp      # Tensor + Autograd
│   ├── backend_ggml.cpp    # ggml-vulkan wrapper
│   └── backend.hpp         # Backend interface
├── examples/
│   ├── autograd_demo.cpp   # Autograd validation
│   ├── hello_tensor.cpp    # Basic tensor ops
│   └── tensor_ops_demo.cpp # Element-wise ops
├── external/
│   └── llama.cpp/          # ggml-vulkan submodule
└── docs/
    ├── INTEGRATION.md      # Architecture guide
    └── DISTRIBUTED.md      # Distributed training roadmap

🔥 Killer Features

1. Universal GPU Support

First framework for distributed training on ANY Vulkan GPU:

  • ✅ AMD (RDNA 2/3/4)
  • ✅ Intel Arc
  • ✅ Qualcomm Adreno (mobile!)
  • ✅ Apple M-series (via MoltenVK)
  • ✅ NVIDIA (if you want)

2. Lightweight

Metric         VulkanBLAS   PyTorch
Binary Size    ~5MB         ~500MB
Build Time     2 min        30+ min
Dependencies   ggml only    CUDA, MKL, etc.

3. Folding@Home for ML

Permissionless compute contribution - anyone can donate GPU time to train models.


📚 Research Foundation

  • INTELLECT-2 (PrimeIntellect): First 32B globally distributed RL training
  • AReaL-Hex: 1.5x throughput on heterogeneous GPUs
  • DistDGLv2 (Amazon): Async pipeline training
  • HAP/Poplar: Heterogeneity-aware scheduling

🤝 Contributing

This project democratizes ML hardware access. Contributions welcome!

Current Focus: Phase 6 (Neural Network Layers)

  • Implement nn::Linear using ggml_mul_mat
  • Add activation functions
  • Test on XOR problem

See task.md for detailed roadmap.


📄 License

MIT (TBD)


Breaking NVIDIA's CUDA monopoly. Enabling ML on ANY GPU. Building the future of distributed training. 🚀

Built on the shoulders of giants: llama.cpp, ggml, Vulkan.
