@@ -159,7 +159,7 @@ with 128 blocks, that's the difference between 50μs and 5s of added latency per
159159├── build.rs # NVCC build script (sm80+sm90 by default)
160160├── cuda/
161161│ ├── tensor_kernels.cu # Batched CUDA kernels + memcpy fallback
162- │ └── prebuilt/ # Prebuilt .fatbin files with MD5 checksums
162+ │ └── prebuilt/ # Prebuilt .fatbin, .a (static libs), and .md5 checksums
163163├── src/
164164│ ├── lib.rs # Rust facade for the kernels
165165│ └── tensor_kernels.rs # FFI wrappers + integration tests
@@ -230,21 +230,22 @@ path (block ⇄ universal ⇄ operational), and asserts lossless round-trips.
230230
231231#### Prebuilt Kernels
232232
233- By default, the build system uses prebuilt `.fatbin` files from `cuda/prebuilt/`
234- if `nvcc` is not available. To force building from source:
233+ By default, the build system automatically:
234+ - **Uses prebuilt** `.fatbin` and `.a` files from `cuda/prebuilt/` if `nvcc` is **not available**
235+ - **Builds from source** if `nvcc` is **available**
236+
237+ To force using prebuilt kernels even when `nvcc` is available:
235238
236239``` bash
237- # Disable prebuilt kernels
238- export DYNAMO_USE_PREBUILT_KERNELS=false
239- cargo build
240+ cargo build --features prebuilt-kernels
240241```
241242
242243After modifying CUDA source, regenerate prebuilt kernels and update checksums:
243244
244245``` bash
245246# This rebuilds tensor_kernels.cu and updates MD5 hashes
246247cargo build --release
247- # Commit the updated cuda/prebuilt/tensor_kernels.{fatbin,md5}
248+ # Commit the updated cuda/prebuilt/tensor_kernels.{fatbin,a,md5}
248249```
249250
250251**Important:** If you change `CUDA_ARCHS` or update your nvcc version, you need to
@@ -260,6 +261,30 @@ cargo build --release
260261The build system only checks if the `.cu` source has changed, not build configuration.
261262This prevents CI from regenerating non-reproducible `.a` files unnecessarily.
262263
264+ ##### Architecture Limitations
265+
266+ **Prebuilt mode currently only supports x86_64 architecture.**
267+
268+ Static libraries (`.a` files) contain compiled host-side C++ code and are CPU architecture-specific.
269+ The prebuilt `libtensor_kernels.a` is built for x86_64. On ARM (aarch64) or other architectures,
270+ you must install `nvcc` and build `tensor_kernels` from source.
271+
272+ The build will fail with a clear error message if you attempt prebuilt mode on ARM:
273+
274+ ```
275+ ╔════════════════════════════════════════════════════════════════════════╗
276+ ║ Prebuilt mode is not supported on aarch64 architecture ║
277+ ║ ║
278+ ║ Static libraries (.a files) are CPU architecture-specific. ║
279+ ║ Prebuilt libtensor_kernels.a is only available for x86_64. ║
280+ ║ ║
281+ ║ Please install nvcc to build from source, or use an x86_64 system. ║
282+ ╚════════════════════════════════════════════════════════════════════════╝
283+ ```
284+
285+ **Note:** Only `tensor_kernels.cu` requires a static library (`.a`) for FFI linking. The
286+ `vectorized_copy.cu` kernel loads at runtime via `.fatbin` and works on all architectures.
287+
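The selection rules described in this section can be summarized as a small decision function. The sketch below is illustrative only and is not taken from the actual `build.rs`: the names `BuildMode` and `select_build_mode` are hypothetical, and it assumes that any non-x86_64 target is rejected in prebuilt mode, as the text above states for ARM.

```rust
/// How the kernels get built. Hypothetical type, for illustration only.
#[derive(Debug, PartialEq)]
enum BuildMode {
    /// Compile tensor_kernels.cu with nvcc.
    FromSource,
    /// Link the checked-in artifacts from cuda/prebuilt/.
    Prebuilt,
}

/// Decide the build mode from three inputs (hypothetical helper):
/// - `nvcc_available`: whether an nvcc binary was found
/// - `force_prebuilt`: whether the `prebuilt-kernels` feature is enabled
/// - `target_arch`: the target CPU architecture, e.g. "x86_64" or "aarch64"
fn select_build_mode(
    nvcc_available: bool,
    force_prebuilt: bool,
    target_arch: &str,
) -> Result<BuildMode, String> {
    // nvcc present and prebuilt not forced: build from source on any architecture.
    if nvcc_available && !force_prebuilt {
        return Ok(BuildMode::FromSource);
    }
    // Prebuilt mode needs the static library, which exists only for x86_64.
    if target_arch != "x86_64" {
        return Err(format!(
            "Prebuilt mode is not supported on {} architecture",
            target_arch
        ));
    }
    Ok(BuildMode::Prebuilt)
}
```

In a Cargo build script, the last two inputs would typically come from the `CARGO_FEATURE_PREBUILT_KERNELS` and `CARGO_CFG_TARGET_ARCH` environment variables that Cargo sets for build scripts. How the real script detects `nvcc` is not shown in this diff.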
263288---
264289
265290### Python Bindings & Tests