Failure to serve MiniMax-M2 on H100 #13046

Karthikksamy · 2025-11-11T04:55:16Z

I am launching the server with the following command:
uv run python -m sglang.launch_server --model-path MiniMaxAI/MiniMax-M2 --tp-size 4 --tool-call-parser minimax-m2 --reasoning-parser minimax-append-think --host 0.0.0.0 --trust-remote-code --port 8000 --mem-fraction-static 0.8 --cuda-graph-max-bs 16

The FlashInfer JIT build then fails with this error:

nvcc fatal : Unsupported gpu architecture 'compute_90a'
[7/11] /usr/bin/nvcc --generate-dependencies-with-compile --dependency-output batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill_paged_kernel_mask_3.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /home/ubuntu/.local/share/uv/python/cpython-3.10.18-linux-x86_64-gnu/include/python3.10 -isystem /usr/include/cccl -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/tvm_ffi/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/tvm_ffi/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/csrc -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/cutlass/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/cutlass/tools/util/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -gencode=arch=compute_90a,code=sm_90a -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -c /mnt/ssd4/.cache/flashinfer/0.4.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill_paged_kernel_mask_3.cu -o batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill_paged_kernel_mask_3.cuda.o
FAILED: [code=1] batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill_paged_kernel_mask_3.cuda.o
/usr/bin/nvcc --generate-dependencies-with-compile --dependency-output batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill_paged_kernel_mask_3.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /home/ubuntu/.local/share/uv/python/cpython-3.10.18-linux-x86_64-gnu/include/python3.10 -isystem /usr/include/cccl -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/tvm_ffi/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/tvm_ffi/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/csrc -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/cutlass/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/cutlass/tools/util/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -gencode=arch=compute_90a,code=sm_90a -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -c /mnt/ssd4/.cache/flashinfer/0.4.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill_paged_kernel_mask_3.cu -o batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill_paged_kernel_mask_3.cuda.o
nvcc fatal : Unsupported gpu architecture 'compute_90a'
[8/11] /usr/bin/nvcc --generate-dependencies-with-compile --dependency-output batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill_ragged_kernel_mask_3.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /home/ubuntu/.local/share/uv/python/cpython-3.10.18-linux-x86_64-gnu/include/python3.10 -isystem /usr/include/cccl -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/tvm_ffi/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/tvm_ffi/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/csrc -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/cutlass/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/cutlass/tools/util/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -gencode=arch=compute_90a,code=sm_90a -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -c /mnt/ssd4/.cache/flashinfer/0.4.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill_ragged_kernel_mask_3.cu -o batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill_ragged_kernel_mask_3.cuda.o
FAILED: [code=1] batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill_ragged_kernel_mask_3.cuda.o
/usr/bin/nvcc --generate-dependencies-with-compile --dependency-output batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill_ragged_kernel_mask_3.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /home/ubuntu/.local/share/uv/python/cpython-3.10.18-linux-x86_64-gnu/include/python3.10 -isystem /usr/include/cccl -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/tvm_ffi/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/tvm_ffi/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/csrc -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/cutlass/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/cutlass/tools/util/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -gencode=arch=compute_90a,code=sm_90a -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -c /mnt/ssd4/.cache/flashinfer/0.4.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill_ragged_kernel_mask_3.cu -o batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill_ragged_kernel_mask_3.cuda.o
nvcc fatal : Unsupported gpu architecture 'compute_90a'
[9/11] /usr/bin/nvcc --generate-dependencies-with-compile --dependency-output batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /home/ubuntu/.local/share/uv/python/cpython-3.10.18-linux-x86_64-gnu/include/python3.10 -isystem /usr/include/cccl -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/tvm_ffi/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/tvm_ffi/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/csrc -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/cutlass/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/cutlass/tools/util/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -gencode=arch=compute_90a,code=sm_90a -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -c /mnt/ssd4/.cache/flashinfer/0.4.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill.cu -o batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill.cuda.o
FAILED: [code=1] batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill.cuda.o
/usr/bin/nvcc --generate-dependencies-with-compile --dependency-output batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /home/ubuntu/.local/share/uv/python/cpython-3.10.18-linux-x86_64-gnu/include/python3.10 -isystem /usr/include/cccl -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/tvm_ffi/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/tvm_ffi/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/csrc -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/cutlass/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/cutlass/tools/util/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -gencode=arch=compute_90a,code=sm_90a -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -c /mnt/ssd4/.cache/flashinfer/0.4.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill.cu -o batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill.cuda.o
nvcc fatal : Unsupported gpu architecture 'compute_90a'
[10/11] /usr/bin/nvcc --generate-dependencies-with-compile --dependency-output batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill_jit_binding.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /home/ubuntu/.local/share/uv/python/cpython-3.10.18-linux-x86_64-gnu/include/python3.10 -isystem /usr/include/cccl -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/tvm_ffi/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/tvm_ffi/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/csrc -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/cutlass/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/cutlass/tools/util/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -gencode=arch=compute_90a,code=sm_90a -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -c /mnt/ssd4/.cache/flashinfer/0.4.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill_jit_binding.cu -o batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill_jit_binding.cuda.o
FAILED: [code=1] batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill_jit_binding.cuda.o
/usr/bin/nvcc --generate-dependencies-with-compile --dependency-output batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill_jit_binding.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /home/ubuntu/.local/share/uv/python/cpython-3.10.18-linux-x86_64-gnu/include/python3.10 -isystem /usr/include/cccl -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/tvm_ffi/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/tvm_ffi/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/csrc -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/cutlass/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/cutlass/tools/util/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -gencode=arch=compute_90a,code=sm_90a -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -c /mnt/ssd4/.cache/flashinfer/0.4.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill_jit_binding.cu -o batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill_jit_binding.cuda.o
nvcc fatal : Unsupported gpu architecture 'compute_90a'
ninja: build stopped: subcommand failed.

Possible solutions:

set --mem-fraction-static to a smaller value (e.g., 0.8 or 0.7)
set --cuda-graph-max-bs to a smaller value (e.g., 16)
disable torch compile by not using --enable-torch-compile
disable CUDA graph by --disable-cuda-graph. (Not recommended. Huge performance loss)
Open an issue on GitHub https://github.com/sgl-project/sglang/issues/new/choose

Karthikksamy · 2025-11-11T04:57:18Z

Any idea why I am getting an "Unsupported gpu architecture" error on an H100?

nvcc fatal : Unsupported gpu architecture 'compute_90a'
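
One thing worth checking, as a sketch rather than a confirmed diagnosis: nvcc only accepts the `compute_90a` / `sm_90a` target from CUDA 12.0 onward, and the log shows the build calling `/usr/bin/nvcc`, which may be an older distribution-packaged toolkit rather than the one PyTorch was built against. The script below parses `nvcc --version` and reports whether that release is new enough. The helper names `parse_nvcc_release` and `supports_sm90a` are made up for this illustration, and the fallback to whatever `nvcc` is on `PATH` is an assumption about the environment.

```python
import os
import re
import shutil
import subprocess

# sm_90a / compute_90a is accepted by nvcc starting with CUDA 12.0.
MIN_CUDA_FOR_SM90A = (12, 0)

# Path taken from the failing build log above; adjust if your setup differs.
NVCC_FROM_LOG = "/usr/bin/nvcc"


def parse_nvcc_release(version_text):
    """Return (major, minor) from `nvcc --version` output, or None if not found."""
    match = re.search(r"release (\d+)\.(\d+)", version_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def supports_sm90a(version_text):
    """True if the reported nvcc release is at least CUDA 12.0."""
    release = parse_nvcc_release(version_text)
    return release is not None and release >= MIN_CUDA_FOR_SM90A


if __name__ == "__main__":
    # Prefer the exact compiler the log shows; otherwise fall back to PATH.
    nvcc = NVCC_FROM_LOG if os.path.exists(NVCC_FROM_LOG) else shutil.which("nvcc")
    if nvcc is None:
        print("nvcc not found")
    else:
        result = subprocess.run(
            [nvcc, "--version"], capture_output=True, text=True, check=False
        )
        release = parse_nvcc_release(result.stdout)
        verdict = "accepts" if supports_sm90a(result.stdout) else "is too old for"
        print(f"{nvcc}: release {release} {verdict} compute_90a")
```

If the reported release is below 12.0, the likely fix is to install a CUDA 12.x toolkit and make sure its `nvcc` is the one the build finds; how the JIT build locates `nvcc` is not shown in the log, so that part needs checking against the FlashInfer documentation.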

0 replies
