Failure to serve MiniMax-M2 on H100 #13046
Unanswered
Karthikksamy asked this question in Q&A
Replies: 1 comment
Any idea why I am getting an unsupported GPU architecture error? nvcc fatal : Unsupported gpu architecture 'compute_90a'
With the command below,
uv run python -m sglang.launch_server --model-path MiniMaxAI/MiniMax-M2 --tp-size 4 --tool-call-parser minimax-m2 --reasoning-parser minimax-append-think --host 0.0.0.0 --trust-remote-code --port 8000 --mem-fraction-static 0.8 --cuda-graph-max-bs 16
I get this error:
nvcc fatal : Unsupported gpu architecture 'compute_90a'
[7/11] /usr/bin/nvcc --generate-dependencies-with-compile --dependency-output batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill_paged_kernel_mask_3.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /home/ubuntu/.local/share/uv/python/cpython-3.10.18-linux-x86_64-gnu/include/python3.10 -isystem /usr/include/cccl -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/tvm_ffi/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/tvm_ffi/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/csrc -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/cutlass/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/cutlass/tools/util/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -gencode=arch=compute_90a,code=sm_90a -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -c /mnt/ssd4/.cache/flashinfer/0.4.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill_paged_kernel_mask_3.cu -o batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill_paged_kernel_mask_3.cuda.o
FAILED: [code=1] batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill_paged_kernel_mask_3.cuda.o
/usr/bin/nvcc --generate-dependencies-with-compile --dependency-output batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill_paged_kernel_mask_3.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /home/ubuntu/.local/share/uv/python/cpython-3.10.18-linux-x86_64-gnu/include/python3.10 -isystem /usr/include/cccl -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/tvm_ffi/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/tvm_ffi/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/csrc -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/cutlass/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/cutlass/tools/util/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -gencode=arch=compute_90a,code=sm_90a -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -c /mnt/ssd4/.cache/flashinfer/0.4.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill_paged_kernel_mask_3.cu -o batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill_paged_kernel_mask_3.cuda.o
nvcc fatal : Unsupported gpu architecture 'compute_90a'
[8/11] /usr/bin/nvcc --generate-dependencies-with-compile --dependency-output batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill_ragged_kernel_mask_3.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /home/ubuntu/.local/share/uv/python/cpython-3.10.18-linux-x86_64-gnu/include/python3.10 -isystem /usr/include/cccl -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/tvm_ffi/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/tvm_ffi/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/csrc -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/cutlass/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/cutlass/tools/util/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -gencode=arch=compute_90a,code=sm_90a -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -c /mnt/ssd4/.cache/flashinfer/0.4.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill_ragged_kernel_mask_3.cu -o batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill_ragged_kernel_mask_3.cuda.o
FAILED: [code=1] batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill_ragged_kernel_mask_3.cuda.o
/usr/bin/nvcc --generate-dependencies-with-compile --dependency-output batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill_ragged_kernel_mask_3.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /home/ubuntu/.local/share/uv/python/cpython-3.10.18-linux-x86_64-gnu/include/python3.10 -isystem /usr/include/cccl -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/tvm_ffi/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/tvm_ffi/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/csrc -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/cutlass/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/cutlass/tools/util/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -gencode=arch=compute_90a,code=sm_90a -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -c /mnt/ssd4/.cache/flashinfer/0.4.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill_ragged_kernel_mask_3.cu -o batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill_ragged_kernel_mask_3.cuda.o
nvcc fatal : Unsupported gpu architecture 'compute_90a'
[9/11] /usr/bin/nvcc --generate-dependencies-with-compile --dependency-output batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /home/ubuntu/.local/share/uv/python/cpython-3.10.18-linux-x86_64-gnu/include/python3.10 -isystem /usr/include/cccl -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/tvm_ffi/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/tvm_ffi/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/csrc -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/cutlass/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/cutlass/tools/util/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -gencode=arch=compute_90a,code=sm_90a -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -c /mnt/ssd4/.cache/flashinfer/0.4.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill.cu -o batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill.cuda.o
FAILED: [code=1] batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill.cuda.o
/usr/bin/nvcc --generate-dependencies-with-compile --dependency-output batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /home/ubuntu/.local/share/uv/python/cpython-3.10.18-linux-x86_64-gnu/include/python3.10 -isystem /usr/include/cccl -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/tvm_ffi/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/tvm_ffi/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/csrc -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/cutlass/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/cutlass/tools/util/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -gencode=arch=compute_90a,code=sm_90a -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -c /mnt/ssd4/.cache/flashinfer/0.4.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill.cu -o batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill.cuda.o
nvcc fatal : Unsupported gpu architecture 'compute_90a'
[10/11] /usr/bin/nvcc --generate-dependencies-with-compile --dependency-output batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill_jit_binding.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /home/ubuntu/.local/share/uv/python/cpython-3.10.18-linux-x86_64-gnu/include/python3.10 -isystem /usr/include/cccl -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/tvm_ffi/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/tvm_ffi/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/csrc -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/cutlass/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/cutlass/tools/util/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -gencode=arch=compute_90a,code=sm_90a -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -c /mnt/ssd4/.cache/flashinfer/0.4.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill_jit_binding.cu -o batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill_jit_binding.cuda.o
FAILED: [code=1] batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill_jit_binding.cuda.o
/usr/bin/nvcc --generate-dependencies-with-compile --dependency-output batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill_jit_binding.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /home/ubuntu/.local/share/uv/python/cpython-3.10.18-linux-x86_64-gnu/include/python3.10 -isystem /usr/include/cccl -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/tvm_ffi/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/tvm_ffi/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/csrc -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/cutlass/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/cutlass/tools/util/include -isystem /mnt/ssd3/sglang/.venv/lib/python3.10/site-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -gencode=arch=compute_90a,code=sm_90a -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -c /mnt/ssd4/.cache/flashinfer/0.4.1/90a/generated/batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill_jit_binding.cu -o batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill_jit_binding.cuda.o
nvcc fatal : Unsupported gpu architecture 'compute_90a'
ninja: build stopped: subcommand failed.
Possible solutions:
Open an issue on GitHub https://github.com/sgl-project/sglang/issues/new/choose
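One thing worth checking before opening an issue, offered as a likely cause rather than a confirmed diagnosis: the log shows FlashInfer's JIT build invoking /usr/bin/nvcc with -gencode=arch=compute_90a,code=sm_90a. The compute_90a / sm_90a targets were introduced in CUDA 12.0, so an older nvcc rejects them with exactly this message regardless of which GPU is installed. The sketch below parses the output of `nvcc --version` and reports whether that toolkit is new enough. The function names are hypothetical helpers written for this example, and the default path is simply the one seen in the log.

```python
import re
import subprocess

# compute_90a / sm_90a (Hopper arch-specific targets) require CUDA 12.0 or newer.
MIN_RELEASE_FOR_SM90A = (12, 0)


def parse_nvcc_release(version_text):
    """Return (major, minor) parsed from `nvcc --version` output, or None."""
    match = re.search(r"release (\d+)\.(\d+)", version_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def supports_sm90a(release):
    """True if the given (major, minor) toolkit release knows compute_90a."""
    return release is not None and release >= MIN_RELEASE_FOR_SM90A


def check_nvcc(nvcc_path="/usr/bin/nvcc"):
    """Run nvcc and describe whether it can target compute_90a.

    nvcc_path defaults to the binary that appears in the failing build log.
    """
    try:
        result = subprocess.run(
            [nvcc_path, "--version"], capture_output=True, text=True, check=True
        )
    except (FileNotFoundError, subprocess.CalledProcessError) as exc:
        return f"could not run {nvcc_path}: {exc}"
    release = parse_nvcc_release(result.stdout)
    if release is None:
        return f"could not parse a release number from {nvcc_path} output"
    verdict = "supports" if supports_sm90a(release) else "does NOT support"
    return f"{nvcc_path} is CUDA {release[0]}.{release[1]} and {verdict} compute_90a"
```

Running `print(check_nvcc())` on the affected machine shows which case applies. If the reported release is below 12.0, installing a CUDA 12.x toolkit and making sure its nvcc is the one the build picks up would be the next thing to try; how FlashInfer locates nvcc is not shown in this thread, so that part is left as an assumption to verify.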