Releases · zerfoo/ztensor
v1.5.0
10 Apr 23:43
1.5.0 (2026-04-10)
Features
compute: add AllocDeviceFloat32 and CopyToDevice to FusedEncoderProvider (8d6c90b)
compute: add fused PatchTST encoder layer CUDA kernels (4dfd46e)
Bug Fixes
compute: GPUEngine.Reshape honors dst argument (18a53fe)
compute: reuse dst GPU memory instead of allocating per call (#84) (26bbd49)
kernels: rename kernel_add in fused_encoder_bwd to avoid symbol clash (716bbd6)
v1.4.0
06 Apr 21:15
1.4.0 (2026-04-06)
Features
graph: add NewPJRTClient for external PJRT usage (c8db036)
graph: add PJRTPlan execution wrapper with KV cache state management (3e5cb40)
Bug Fixes
ci: exclude metal and pjrt from go vet (5a7fdc3)
kernels: update GemvQ5_0F32 test to match qhOffset/qsOffset signature (70f8fd5)
v1.3.0
03 Apr 01:09
1.3.0 (2026-04-03)
Features
graph: add CompilePJRT for PJRT backend compilation (dfd77a4)
pjrt: add buffer management (host-device transfer, readback, lifecycle) (9b5dc75)
pjrt: add KV cache I/O rewriting and executable cache (c8decc5)
pjrt: add PJRT C API purego bindings for plugin loading, client, and device (c675807)
pjrt: add program execution, serialization, and full StableHLO emitter (382ea0a)
pjrt: add StableHLO program compilation wrapper (7fcdde7)
stablehlo: add emitter for element-wise and unary ops (499cef2)
stablehlo: add emitter for MatMul and structural ops (13d87df)
stablehlo: add emitter for reductions and Softmax decomposition (c07b287)
stablehlo: add MLIR type system and SSA naming (7c68d1e)
stablehlo: add shape inference for arithmetic ops (cac094e)
stablehlo: add shape inference for structural ops (8bf132c)
Bug Fixes
pjrt: centralize internal/cuda import in pjrt.go (aa8c170)
pjrt: remove duplicate ccall/goStringN declarations (3e5fba9)
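
The 1.3.0 feature list mentions a "Softmax decomposition" in the StableHLO emitter. The notes do not show how the emitter does this, so the following is only a minimal CPU sketch in Go of the usual decomposition into primitive steps (reduce-max, subtract, exponentiate, reduce-sum, divide). The function name `softmaxDecomposed` is hypothetical and is not part of ztensor.

```go
package main

import (
	"fmt"
	"math"
)

// softmaxDecomposed is a hypothetical reference, not a ztensor API.
// It computes softmax over one row as a chain of primitive steps that
// a graph emitter could map to separate ops.
func softmaxDecomposed(row []float64) []float64 {
	out := make([]float64, len(row))
	if len(row) == 0 {
		return out
	}

	// Step 1: reduce-max, used only for numerical stability.
	maxVal := math.Inf(-1)
	for _, v := range row {
		if v > maxVal {
			maxVal = v
		}
	}

	// Steps 2-4: subtract the max, exponentiate, and reduce-sum.
	sum := 0.0
	for i, v := range row {
		out[i] = math.Exp(v - maxVal)
		sum += out[i]
	}

	// Step 5: divide each element by the sum.
	for i := range out {
		out[i] /= sum
	}
	return out
}

func main() {
	fmt.Println(softmaxDecomposed([]float64{1, 2, 3}))
}
```

Subtracting the row maximum before exponentiating does not change the result mathematically, but it keeps the exponentials from overflowing for large inputs.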
v1.2.0
02 Apr 07:26
1.2.0 (2026-04-01)
Features
cuda: add Q6_K, Q5_K, Q5_0 GPU dequant kernels for M>1 prefill (d57e37e)
cuda: add Q8 Gather kernel for GPU embedding lookup (30eb9c4)
tensor: add QuantizeQ4K for float32 to Q4_K quantization (d0d3a82)
Bug Fixes
compute: add Q4KStorage to UploadWeights F32 skip list (cc071b6)
compute: CPU dequant fallback for Q4_K when K%256!=0 (f50ffa7)
compute: use dequant+cuBLAS for Q4_K when K%256!=0 (5f21cbb)
compute: use pool-backed GPUStorage for pool allocations (4367330)
cuda: byte-wise loads in Q5_0 GEMV for ARM64 alignment (5f19e54)
kernels: check null function pointer in FusedSoftmaxVMulF32 (935ad61)
Performance Improvements
cuda: separated GPU layout for Q5_0 GEMV (d456c39)
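
Two of the 1.2.0 fixes switch Q4_K MatMul to a dequantize-first path when K%256!=0. The sketch below shows that kind of dispatch in Go. It assumes the 256 in those entries is the Q4_K super-block size (as in GGML-style formats), and every name in it (`q4kSuperBlock`, `matmulPath`, `chooseQ4KPath`) is hypothetical rather than taken from ztensor.

```go
package main

import "fmt"

// q4kSuperBlock is the assumed number of weights per Q4_K super-block.
const q4kSuperBlock = 256

// matmulPath names the two hypothetical execution strategies.
type matmulPath int

const (
	pathFusedQ4K        matmulPath = iota // block-wise quantized kernel
	pathDequantThenGEMM                   // dequantize to float32, then dense GEMM
)

// chooseQ4KPath picks the fused kernel only when the inner dimension K
// is a whole number of super-blocks; otherwise it falls back.
func chooseQ4KPath(k int) matmulPath {
	if k > 0 && k%q4kSuperBlock == 0 {
		return pathFusedQ4K
	}
	return pathDequantThenGEMM
}

func main() {
	for _, k := range []int{4096, 1000} {
		fmt.Printf("K=%d fused=%v\n", k, chooseQ4KPath(k) == pathFusedQ4K)
	}
}
```

The fallback trades speed for correctness: a dense GEMM over dequantized weights places no constraint on K.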
v1.1.3
01 Apr 04:34
1.1.3 (2026-04-01)
Bug Fixes
compute: add Q5_0Storage B-weight handling to CPU MatMul (e7927e5)
compute: Q5_0 GEMV byte-wise loads for ARM64 alignment (5c7ec7a)
compute: skip Q4Storage in UploadWeights F32 loop (revert overaggressive skip) (2e91650)
compute: skip transpose reshape fast-path for square matrices (eab19d0)
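
The Q5_0 GEMV fix above replaces wide loads with byte-wise loads for alignment reasons. As an illustration only, the Go sketch below assembles a 32-bit little-endian field one byte at a time. It assumes a GGML-style Q5_0 block (2-byte fp16 scale, 4 bytes of high bits, 16 bytes of packed low nibbles, 22 bytes total), in which the 4-byte field starts at offset 2 and so is not always 4-byte aligned. The constants and `loadQHBytewise` are hypothetical names, not ztensor code.

```go
package main

import "fmt"

// Assumed GGML-style Q5_0 block layout (not taken from ztensor):
// bytes 0-1 fp16 scale, bytes 2-5 high bits, bytes 6-21 packed nibbles.
const (
	q5_0BlockBytes = 22
	q5_0QHOffset   = 2
)

// loadQHBytewise reads the 4-byte high-bit field as a little-endian
// uint32 using only single-byte accesses, so the address of the field
// never needs to be 4-byte aligned.
func loadQHBytewise(block []byte) uint32 {
	b := block[q5_0QHOffset : q5_0QHOffset+4]
	return uint32(b[0]) | uint32(b[1])<<8 | uint32(b[2])<<16 | uint32(b[3])<<24
}

func main() {
	block := make([]byte, q5_0BlockBytes)
	copy(block[q5_0QHOffset:], []byte{0x78, 0x56, 0x34, 0x12})
	fmt.Printf("%#x\n", loadQHBytewise(block)) // prints 0x12345678
}
```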
v1.1.2
31 Mar 06:18
1.1.2 (2026-03-31)
Bug Fixes
compute: upload CPU fallback MatMul results to GPU for device consistency (5bc914b)
v1.1.1
31 Mar 05:30
1.1.1 (2026-03-31)
Bug Fixes
cuda: remove float4 alignment requirement from gemv_q8_kernel (1313605)
cuda: remove float4 alignment requirement from gemv_q8_kernel (34aba3b)
v1.1.0
31 Mar 05:10
1.1.0 (2026-03-31)
Features
compute: add GPUFusedSoftmaxVMul method with provider interface (d659e76)
compute: add GPURepeatInterleave method with purego bindings (6af7b96)
compute: add GraphCapturer interface for CUDA graph capture/replay (1f37c69)
compute: GPU-native Copy using cudaMemcpyAsync D2D (efc8b42)
compute: wire capture-aware pool into GPUEngine BeginCapture/EndCapture (e39b318)
cuda: add cudaMallocAsync and cudaFreeAsync bindings (e339656)
cuda: add cudaMemsetAsync binding and GPU-native Zero (47b5d39)
cuda: add fused repeat-interleave kernel for GQA head expansion (91e2469)
cuda: add fused softmax + V multiply kernel for decode attention (ef6f7ce)
cuda: make MemPool capture-aware with SetCaptureStream (58b6337)
gpuapi: wire FusedSoftmaxVMulF32 into KernelRunner interface (9afdb01)
Bug Fixes
compute: copy mmap bytes to heap in mmapDevicePtr fallback (0ad23b5)
compute: revert H2D to sync Memcpy (async breaks mmap'd tensors) (9a87e36)
compute: use async memcpy in getDevicePtr for CUDA graph capture (b36b7ed)
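
The 1.1.0 feature list includes a repeat-interleave kernel for GQA head expansion. As a rough CPU reference for what such an operation computes, the Go sketch below repeats each key/value head a fixed number of consecutive times so that the result lines up with a larger number of query heads. The function `repeatInterleaveHeads` and its flat `[numKVHeads, headDim]` layout are assumptions made for illustration, not the ztensor kernel or its signature.

```go
package main

import "fmt"

// repeatInterleaveHeads is a hypothetical CPU reference. It expands a
// flat [numKVHeads, headDim] buffer to [numKVHeads*rep, headDim], with
// each source head copied rep times in a row.
func repeatInterleaveHeads(src []float32, numKVHeads, headDim, rep int) []float32 {
	dst := make([]float32, numKVHeads*rep*headDim)
	for h := 0; h < numKVHeads; h++ {
		head := src[h*headDim : (h+1)*headDim]
		for r := 0; r < rep; r++ {
			// Output head index is h*rep + r, so copies of one
			// source head are adjacent.
			o := (h*rep + r) * headDim
			copy(dst[o:o+headDim], head)
		}
	}
	return dst
}

func main() {
	kv := []float32{1, 2, 3, 4} // 2 KV heads, head dimension 2
	fmt.Println(repeatInterleaveHeads(kv, 2, 2, 2)) // prints [1 2 1 2 3 4 3 4]
}
```

Interleaving (head 0, head 0, head 1, head 1) differs from tiling (head 0, head 1, head 0, head 1); the former keeps each group of query heads next to the KV head it shares.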
v1.0.0
30 Mar 17:40
1.0.0 (2026-03-30)
Miscellaneous Chores
v0.15.0
29 Mar 21:07
0.15.0 (2026-03-29)
Features
tensor: MmapStorage.SliceElements for zero-copy expert weight slicing (0a40e11)
xblas: streaming GEMM for mmap'd tensors, unblocks over-RAM inference (8d80b91)