From repo root:

```sh
cd test/int/nnc
make debug -j4
./mpsdnn.tests
./mpsblas.tests
```

Expected result for both test binaries:

```
all test case(s) passed, congratulations!
```

- `make debug -j4`: success
- `./mpsdnn.tests`: success (108/108, EXIT:0)
- `./mpsblas.tests`: success (130/130, EXIT:0)
In this Codex environment, MPS test binaries may crash in the restricted sandbox (EXIT:139). Running these binaries with unrestricted execution resolves that environment-specific issue.
If a generic compilation flow fails, remove stale `.dep.mk` files in directories and retry the build. Example from repo root:

```sh
find . -name .dep.mk -delete
```

From this workspace, run UBSAN from `test/unit/nnc` (not repo root):
```sh
cd test/unit/nnc
make clean
make ubsan -j64
```

Notes:

- `make clean` is not available at repo root (`make: *** No rule to make target 'clean'.`).
- In this Codex sandbox, ASAN leak detection may fail early with `LeakSanitizer has encountered a fatal error` (`LeakSanitizer does not work under ptrace`).
- To run UBSAN/ASAN-built unit tests in this environment, disable leak detection:
  `ASAN_OPTIONS=detect_leaks=0 ./dynamic.graph.tests "<filter>"`

Do not hand-edit these generated files:
- `lib/nnc/cmd/ccv_nnc_cmd.inc`
- `lib/nnc/cmd/ccv_nnc_cmd.h`
- `lib/nnc/cmd/ccv_nnc_backend.h`
- `lib/nnc/cmd/ccv_nnc_cmd_easy.h`

Generate them with the script:

```sh
cd lib/nnc/cmd
./build-cmd.rb .
```

Then verify build artifacts are still compilable (example):

```sh
cd test/int/nnc
make debug -j4
```

Note: generated content may differ slightly across machines / environments, but should still produce compilable artifacts.
To avoid polluting commit history after local validation, restore generated files back to tip:

```sh
git checkout -- lib/nnc/cmd/ccv_nnc_cmd.inc lib/nnc/cmd/ccv_nnc_cmd.h lib/nnc/cmd/ccv_nnc_backend.h lib/nnc/cmd/ccv_nnc_cmd_easy.h lib/nnc/cmd/config.mk
```

- Branch sync policy: when asked to keep a branch up to date with another branch, use `git rebase` instead of `git merge` (unless explicitly requested otherwise).
- Operator file naming convention (generic):
  - `ccv_nnc_OPS.c`: operator metadata / registry logic (in-place support, tensor shape inference, etc.).
  - `ccv_nnc_OPS_cpu_ref.c`: CPU reference implementation.
  - `gpu/ccv_nnc_OPS_gpu_cudnn.cu`: GPU implementation via cuDNN.
  - `gpu/ccv_nnc_OPS_gpu_ref.cu`: GPU implementation via direct CUDA kernels.
  - `mps/ccv_nnc_OPS_mps.m`: Apple MPS backend implementation (MPSGraph / MFA).
- Integration test philosophy (`test/int/nnc`):
  - Prefer larger randomized tensors over tiny hand-crafted examples.
  - Validate GPU outputs by comparing against CPU reference implementation outputs, using the same command and compatible tensor formats.
  - Cover both `NCHW` and `NHWC` layouts when backend support exists.
  - Use tolerance-based comparisons for floating-point parity (`REQUIRE_ARRAY_EQ_WITH_TOLERANCE`).
- MPS SDPA test notes:
  - For the `scaled dot product attention + unify head` integration test in `test/int/nnc/mpsblas.tests.c`, prefer a relative-difference check over a pure max-absolute-difference check.
  - In the current workspace, the intermediate attention output drift was small (`~4.5e-4` max abs diff), but the downstream `512`-wide unify-head projection amplified that to about `0.2` max abs diff in the final output.
  - A robust comparison there is `fabs(a - b) / max(max(fabs(a), fabs(b)), 1)` with a threshold around `2e-3`.
- MPS SDPA NA-attention gating note:
  - The neural-accelerator attention path lowers the head tile to the MPP `matmul2d` `N` dimension, which must be a multiple of `8` or `16`.
  - Small head dimensions such as `D = 4` can compile-fail on the NA path even though the generic MFA / non-NA path is valid.
  - In `lib/nnc/cmd/scaled_dot_product_attention/mps/ccv_nnc_scaled_dot_product_attention_mps.m`, gate `use_neural_accelerators` conservatively for SDPA so `D <= 128` only uses NA attention when `(D % 8) == 0`.
- `grid_sample` integration test specifics:
  - NCHW path can be guarded by `CCV_NNC_BACKEND_GPU_CUDNN || CCV_NNC_BACKEND_MPS`.
  - NHWC path is currently guarded by `CCV_NNC_BACKEND_MPS` (cuDNN implementation path is NCHW-only internally).
- Variable naming convention preference:
  - Prefer `<tensor>_nd`-style names (for example, `a_nd`, `b_nd`, `w_nd`) as the default rule.
  - `adim` / `bdim` / `astride`-style names are acceptable as specific shape/stride-array exceptions, not the general naming pattern.
- Assertion-only conditional style:
  - Do not write `if (...) assert(...);`.
  - Always brace assertion-only conditionals as `if (...) { assert(...); }` so release builds do not change control flow when assertions compile out.
- `ctags` usage expectation and workflow:
  - `ctags` should be available and used to discover reusable helpers before introducing local utility functions.
  - Practical flow:
    - List relevant helper functions from common headers: `ctags -x --c-kinds=f lib/nnc/ccv_nnc_easy.h lib/nnc/ccv_nnc_internal.h`.
    - Filter by intent (example): `ctags -x --c-kinds=f lib/nnc/ccv_nnc_easy.h lib/nnc/ccv_nnc_internal.h | rg 'tensor_get_|tensor_hw|tensor_view_get_'`.
    - Reuse discovered existing helpers when possible, instead of adding local utility functions.
- MFA cache / dispatch rule:
  - Distinguish the two cache layers clearly:
    - the kernel-object cache should only key source-generation properties;
    - the pipeline / function-constant layer should carry shape-specific values such as `length`, `rowLength`, `scaleOffset`, etc.
  - If a kernel's dispatch geometry depends on runtime shape, do not store that shape inside the cached kernel object.
    - Instead, derive `gridSize(...)` from the current descriptor / params at encode time.
    - Concrete example: `Dequantize8iRowwiseKernel` should stay shape-agnostic, and `gridSize(length)` should use the current `length`; otherwise a later larger dequant can silently reuse a stale smaller dispatch and leave the tail zeroed.
- MFA Conv3D / `NAConv3D` implementation notes:
  - Frontend selection in `lib/nnc/cmd/convolution/mps/ccv_nnc_conv_mps.m` should keep `use_mfa_gemm` and `use_mfa_conv3d` separate.
  - Current `use_mfa_conv3d` support surface is intentionally narrow:
    - 3D convolution only.
    - kernel depth must be `3`.
    - spatial kernel may be any odd square (`3x3`, `5x5`, `7x7`, ...).
    - stride and dilation must be `1`.
    - depth padding is unsupported.
    - input / output channels must both be divisible by `16`.
    - NA hardware must be available in production code.
  - `ccv_nnc_mfa_prepare_conv3d(...)` should stay a no-op, like other MFA prepare entry points that do not need eager work.
  - `NAConv3D` uses the same batching pattern as `NAMatMul`:
    - do not loop batch on the host;
    - use `threadgroup_position_in_grid.z` to encode `batch * output_depth`;
    - host dispatch should iterate only across kernel-depth slices.
  - Conv3D weights are currently accepted in OIDHW / NCHW layout and must be permuted to DHWIO scratch before the MFA kernel runs.
  - Conv3D scratch reservation should mirror GEMM:
    - reserve the front of MFA scratch for permuted weights with `ccv_nnc_mfa_conv3d_reserved_scratch_size(...)`;
    - if weights are palettized, depalettize after that reserved region so the two scratch uses do not overlap.
  - Bias support in `NAConv3D` is fused only into the first multiply kernel:
    - `conv3d_multiply` initializes the destination from bias (or zero);
    - later `conv3d_multiply_accumulate` slices only accumulate.
- Spatial padding support rules:
  - use named fields `padding_left`, `padding_right`, `padding_top`, `padding_bottom` in the C MFA params.
  - normalize asymmetric padding conservatively by preserving `right` / `bottom` and deriving `left` / `top` from the output shape.
  - this matches the repo's existing "prefer more padding in the beginning" rule from `ccv_nnc_hint_auto(...)`.
- Descriptor / kernel descriptor rule:
  - any non-derived value needed by `NAConv3DKernelDescriptor` must also be present in `NAConv3DDescriptor`; otherwise shader cache keys and generated kernel source can diverge.
  - padding is a `KernelDescriptor` property and should be inlined into kernel source, not passed with `setBytes`.
- Padded Conv3D test specifics:
  - `hint.border.begin/end` for 3D convolution are `D`/`H`/`W` only; channels are not part of hint borders.
  - for padded 3D test cases, prefer explicit `stride = 1` hints instead of `ccv_nnc_hint_auto(...)`; `ccv_nnc_hint_auto(...)` can infer the wrong depth stride for padded 3D shapes, causing `ccv_nnc_hint_verify(...)` to fail before the MFA path is exercised.
- Test style for `test/int/nnc/mpsdnn.tests.c`:
  - keep focused test cases self-contained instead of factoring a small local helper shared across several similar cases.
- Local validation workflow for `NAConv3D` on a machine without neural accelerators:
  - temporarily force `use_neural_accelerators = 1` in `ccv_nnc_conv_mps.m`;
  - run `./mpsdnn.tests "mfa conv3d"` from `test/int/nnc`;
  - revert the force after validation so production code uses `ccv_nnc_mfa_has_neural_accelerators(context)`.
- `NAInt8Attention` backward `dS` fallback note:
  - Earlier exploration suggested `dS -> half` might be a fallback worth keeping in mind, but on the current shipped `D=128` fixed-quant setup it is not a win.
  - Rechecked on `4096 x 4096 x 128` with the current selector:
    - fixed-quant `dS`: forward median `4.0495 ms`, backward median `21.8308 ms`, ratio `5.3910x`
    - `dS -> half`: forward median `4.0552 ms`, backward median `23.0083 ms`, ratio `5.6737x`
  - Takeaway:
    - on the current `NAInt8Attention` backward path, `dS -> half` regresses relative to fixed-quant `dS`;
    - do not treat it as the preferred fallback without reworking the kernel again.
- `NAInt8Attention` backward fixed-quant selector note:
  - For the shipping `D=128` low-precision backward path, the safe production rule is:
    - query: `blockR=16`, `blockC=32`, `blockD=32`, `executionSIMDGroups=4`
    - key/value: `blockR=16`, `blockC=64`, `blockD=64`, `executionSIMDGroups=16`
  - Trust the backward absolute times more than any single reported ratio; forward medians on the probe can move enough to make one-off ratios look too optimistic.
  - Reliable current probe numbers are in this range:
    - `4096 x 4096 x 128`: backward median about `21-23 ms`, typically around `5.2x-5.6x`
    - `8192 x 8192 x 128`: backward median about `82-87 ms`, typically around `5.2x-5.4x`
  - Wider key/value traversal (`blockC=96`) can benchmark slightly faster in the probe but is not accuracy-safe on the real gradient test surface; keep `blockC=64` in production.