Memory performance optimizations#88

Open
RatanKokal wants to merge 5 commits into certik:main from RatanKokal:contiguous

Conversation

@RatanKokal

The following optimizations were performed:

  • Subroutine Refactoring: Switched from array-returning functions to subroutines with intent(inout) to enable in-place mutation and kill the "hidden allocation tax."

  • Zero-Copy Cache Passing: Eliminated the token-by-token allocate/deallocate of kv_cache2 by passing a fixed-size buffer with a max_seq_cache pointer.

  • The Transpose/itcopy Fix: Re-aligned the wte and attn_w weight dimensions to match Fortran's Column-Major storage, stopping OpenBLAS from calling itcopy (internal transpose) on every single layer.

  • Contiguous Head Slicing: Moved the n_head index to the last dimension of the cache so that k_cache(:, :, l) is a single contiguous memory block.

  • Strict Aliasing Protection: Added a dedicated block buffer for the final layer_norm to prevent the compiler from generating defensive memory copies when input and output arrays overlapped.

  • Small GEMV Heuristic: Optimized the linear layer to use the intrinsic matmul for 1×N (single-token) vectors, where BLAS packing overhead dominates.

  • Softmax In-Place: Decoupled the softmax calculation from the matmul call to prevent the creation of intermediate temporary arrays during the attention mechanism.

  • incopy and Alignment: Aligned memory for direct SIMD streaming, eliminating incopy and internal buffering.
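The first bullet (the subroutine refactor) can be sketched as follows. This is a minimal illustration of the pattern, not the PR's actual code; `gelu` is used as a stand-in hot-path routine:

```fortran
module inplace_demo
  implicit none
  real, parameter :: pi = 3.14159265358979323846
contains
  ! Before: a function result forces a fresh heap allocation on every
  ! call -- the "hidden allocation tax" paid once per token per layer.
  function gelu_f(x) result(y)
    real, intent(in) :: x(:)
    real, allocatable :: y(:)
    y = 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x**3)))
  end function gelu_f

  ! After: the caller owns the buffer and the routine mutates it in
  ! place, so the hot loop performs no allocation at all.
  subroutine gelu_s(x)
    real, intent(inout) :: x(:)
    x = 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x**3)))
  end subroutine gelu_s
end module inplace_demo
```

The same transformation applies to any array-returning function on the token path: the allocation moves out of the loop and into one-time setup.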

Made memory access contiguous and minimized transpose operations.
Converted functions to subroutines to eliminate the "hidden allocation tax" and memmove overhead.
Replaced array slicing with a full-block pointer.
Added a dedicated buffer for the final layer_norm to satisfy Fortran's strict-aliasing rules.
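The contiguous head-slicing change can be illustrated with a small self-contained program (dimension sizes here are illustrative, not the PR's actual constants). In Fortran's column-major order the left-most index varies fastest, so fixing only the last dimension selects one contiguous block:

```fortran
program cache_layout
  implicit none
  integer, parameter :: head_dim = 64, n_ctx = 1024, n_head = 12
  real, allocatable, target :: k_cache(:,:,:)
  real, pointer :: one_head(:,:)

  ! Head index last: k_cache(:, :, h) is a single contiguous block,
  ! which BLAS can consume without packing or copying.
  allocate(k_cache(head_dim, n_ctx, n_head))
  one_head => k_cache(:, :, 1)
  print *, is_contiguous(one_head)   ! prints T

  ! With the head index first (the old layout), the equivalent slice
  ! k_cache(1, :, :) is strided and forces a gather on every access.
end program cache_layout
```

The same reasoning drives the wte/attn_w transpose fix: laying weights out so the GEMM reads them column-major removes OpenBLAS's internal itcopy pass.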
@RatanKokal
Copy link
Copy Markdown
Author

RatanKokal commented Mar 23, 2026

These optimizations were tuned for GFortran and turn out to hurt LFortran; I'm looking into this. My initial analysis suggests LFortran is not using OpenBLAS for matmul.

@certik
Owner

certik commented Mar 23, 2026

Don't worry about LFortran, it's not ready for performance yet.

If it got faster with GFortran, then it's a win.

Can you separate code formatting from actual changes? Don't format the code on this PR, just add the changes and let's see.

@RatanKokal
Author

> Don't worry about LFortran, it's not ready for performance yet.
>
> If it got faster with GFortran, then it's a win.
>
> Can you separate code formatting from actual changes? Don't format the code on this PR, just add the changes and let's see.

Done

@certik
Owner

certik commented Mar 23, 2026

Excellent. Now let's evaluate the changes. Can you post timings before and after using gfortran and all optimizations on? Also post what mode you are using (openblas/accelerate/etc.) and exactly how you are benchmarking it.

@RatanKokal
Author

RatanKokal commented Mar 23, 2026

fastGPT Performance Benchmark: OpenBLAS

Summary

The contiguous branch of fastGPT introduces a suite of low-level architectural optimizations designed to eliminate the "Hidden Memory Tax" inherent in standard Fortran array abstractions. By aligning data structures to Column-Major storage and enforcing In-Place mutations, we achieved a 3.19x peak speedup and reduced total CPU instruction volume by 68%.


1. Throughput & Scaling: Full Data (Tokens / Second)

Measured across different sequence lengths using Linux taskset for strict CPU affinity.

Short Context (20 Tokens)

| Core Config | Original Code | Optimized Code | Impact |
|---|---|---|---|
| 0 (1 Core) | 14.95 tok/s (1.337s) | 41.23 tok/s (0.485s) | 2.75x Faster |
| 0,1 (2 Cores) | 17.05 tok/s (1.173s) | 39.60 tok/s (0.505s) | 2.32x Faster |
| 0,2 (2 Cores) | 25.51 tok/s (0.784s) | 43.29 tok/s (0.462s) | 1.69x Faster |
| 0,1,2,3 (4 Cores) | 27.24 tok/s (0.734s) | 41.23 tok/s (0.485s) | 1.51x Faster |

Medium Context (200 Tokens)

| Core Config | Original Code | Optimized Code | Impact |
|---|---|---|---|
| 0 (1 Core) | 14.47 tok/s (13.81s) | 46.30 tok/s (4.31s) | 3.20x Faster |
| 0,1 (2 Cores) | 15.81 tok/s (12.64s) | 45.89 tok/s (4.35s) | 2.90x Faster |
| 0,2 (2 Cores) | 20.97 tok/s (9.53s) | 47.31 tok/s (4.22s) | 2.25x Faster |
| 0,1,2,3 (4 Cores) | 15.00 tok/s (13.33s) | 47.98 tok/s (4.16s) | 3.19x Faster |

Long Context (1000 Tokens) - The "Memory Cliff"

| Core Config | Original Code | Optimized Code | Impact |
|---|---|---|---|
| 0 (1 Core) | 10.92 tok/s (91.55s) | 39.22 tok/s (25.49s) | 3.59x Faster |
| 0,1 (2 Cores) | 10.79 tok/s (92.61s) | 38.30 tok/s (26.10s) | 3.54x Faster |
| 0,2 (2 Cores) | 12.28 tok/s (81.36s) | 36.97 tok/s (27.04s) | 3.01x Faster |
| 0,1,2,3 (4 Cores) | 9.43 tok/s (106.04s) | 38.63 tok/s (25.88s) | 4.09x Faster |

Scaling Analysis: The unoptimized code suffered a massive scaling collapse at 1000 tokens (4 cores performed worse than 1 core due to memory allocation contention). The optimized code flattened this cliff, allowing a single core to cleanly outperform 4 cores running the old code by nearly 4x.


2. Hardware Efficiency (perf stat)

Single-core baseline (taskset -c 0), generating 20 tokens.

| Metric | Original Code | Optimized Code | Impact |
|---|---|---|---|
| Wall-Clock Time | 1.858s | 1.231s | 33.7% Faster |
| Total Instructions | 8.05 Billion | 2.54 Billion | -68.4% (5.5B fewer) |
| CPU Cycles | 5.84 Billion | 3.82 Billion | -34.6% |
| Cache References | 284.2 Million | 209.5 Million | -26.3% |
| Cache Misses | 157.4 Million | 185.9 Million | (See Profiling Note) |
| IPC (Efficiency) | 1.38 | 0.66 | (See Profiling Note) |

Profiling Note: The original code executed 5.5 billion "empty calorie" instructions packing temporary arrays in the L1 cache. That traffic artificially inflated Cache References and IPC while stalling actual generation. The optimized code executes the math natively; IPC drops, but the run finishes significantly faster because it is now bounded mainly by main-memory bandwidth.


3. Control Flow & Branch Prediction

Measuring the CPU branch predictor's workload during the 20-token run.

| Metric | Original Code | Optimized Code | Impact |
|---|---|---|---|
| Total Branches | 480.7 Million | 192.1 Million | -60% |
| Branch Misses | 2.84 Million | 1.40 Million | -50% pipeline flushes |

Key Takeaway: Fusing nested layer_norm loops and stripping array temporaries deleted 288 million useless branch evaluations, cutting the absolute number of expensive CPU pipeline flushes in half.


4. Hotspot Analysis (perf report)

Original Code (Main Branch)

The CPU is dominated by OpenBLAS memory management. 72.3% of all cycles are spent simply rearranging data for the math library rather than calculating tokens.

  • sgemm_incopy (47.60%): Internal BLAS memory packing.
  • sgemm_itcopy (24.70%): Internal BLAS matrix transposing.
  • sgemm_kernel (19.69%): Actual computation.

Optimized Code (Contiguous PR)

Packing overhead is virtually eliminated. The CPU now spends its time natively in the optimized Fortran logic.

  • __gpt2_mod_MOD_linear (25.84%): Direct computation via our optimized n_seq_x == 1 path.
  • __driver_MOD_load_model (42.40%): Fixed I/O cost (Percentage will drop as token count increases).
  • sgemm_kernel (5.81%): Drastic reduction in BLAS dependency.
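The `n_seq_x == 1` path mentioned above might look roughly like the following. This is a hypothetical sketch, not the PR's actual `linear` routine; the argument names, shapes, and the `spread` fallback are all assumptions:

```fortran
module linear_demo
  implicit none
contains
  subroutine linear(w, b, x, y)
    real, intent(in)  :: w(:,:)   ! (n_out, n_in), laid out for column-major reads
    real, intent(in)  :: b(:)     ! (n_out)
    real, intent(in)  :: x(:,:)   ! (n_in, n_seq_x)
    real, intent(out) :: y(:,:)   ! (n_out, n_seq_x)
    if (size(x, 2) == 1) then
      ! Single-token decode step: a GEMV-sized product. The intrinsic
      ! matmul avoids OpenBLAS's sgemm_incopy/itcopy packing entirely.
      y(:, 1) = matmul(w, x(:, 1)) + b
    else
      ! Batched prompt pass: large enough that BLAS packing pays off.
      ! Sketched here with matmul; the real code would call sgemm.
      y = matmul(w, x) + spread(b, 2, size(x, 2))
    end if
  end subroutine linear
end module linear_demo
```

The dispatch exploits the fact that during generation every step after the prompt is a single-token (1×N) multiply, so the packing-free path dominates runtime.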

Appendix: Understanding the Core Configurations (taskset)

During this benchmark, taskset -c was used to pin the execution to specific hardware threads. Modern CPUs (specifically Intel 12th/13th Gen+ architectures) use a hybrid layout of Performance Cores (P-cores) and Efficiency Cores (E-cores), alongside Hyper-Threading.

The perf stat logs confirmed the presence of cpu_core (P-cores) and cpu_atom (E-cores). This explains the specific scaling behaviors observed:

  • taskset -c 0 (1 Logical Core): Pins the process strictly to CPU 0, which is a high-speed P-core. This provides the cleanest look at single-thread IPC and cache behavior without OS context switching.
  • taskset -c 0,1 (2 Logical Cores): In many architectures, CPU 0 and CPU 1 are two Hyper-Threads sharing the exact same physical P-core. They share the same L1/L2 cache and execution units. This is why 0,1 often showed little improvement (or even regression) compared to 0 alone—the threads were fighting over the same physical hardware resources.
  • taskset -c 0,2 (2 Physical Cores): CPU 0 and CPU 2 are typically two distinct physical P-cores. This configuration gave the system access to double the L1/L2 cache and double the execution units. This explains why 0,2 consistently outperformed 0,1 across the benchmarks.
  • taskset -c 0,1,2,3 (4 Logical Cores): Utilizes multiple physical cores and their Hyper-Threads. While this provided maximum compute, it triggered the "Memory Cliff" at 1000 tokens in the old code due to 4 threads simultaneously fighting for memory bus bandwidth to read the massive KV cache.

@certik
Owner

certik commented Mar 23, 2026

Please answer what I asked for:

> Can you post timings before and after using gfortran and all optimizations on? Also post what mode you are using (openblas/accelerate/etc.) and exactly how you are benchmarking it.

Post what GFortran options and a version you used. Post what configuration you used for fastGPT. And post the exact commands you used to benchmark.

You need to post all the details, so that I can reproduce the benchmarks locally.

@RatanKokal
Author

> Please answer what I asked for:
>
> Can you post timings before and after using gfortran and all optimizations on? Also post what mode you are using (openblas/accelerate/etc.) and exactly how you are benchmarking it.
>
> Post what GFortran options and a version you used. Post what configuration you used for fastGPT. And post the exact commands you used to benchmark.
>
> You need to post all the details, so that I can reproduce the benchmarks locally.

fastGPT Benchmarking Process & Methodology


1. Environment Specifications

  • Processor: Intel Core i5-1235U (12th Gen)

    • Architecture: Hybrid (2 Performance-cores / 8 Efficient-cores)
    • Topology: 12 Logical Threads
  • Operating System: Fedora Linux (64-bit)

  • Compiler: gfortran 15.2.0 (conda-forge gcc 15.2.0-18)

  • Build System: CMake

  • Math Library: OpenBLAS (Linked via CMake)


2. Build Pipeline

The project is compiled in Release mode to enable high-level loop optimizations and vectorization. We utilize a modern CMake workflow for cross-platform consistency.

Configuration

```shell
cmake -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DFASTGPT_BLAS=OpenBLAS \
  -DCMAKE_Fortran_COMPILER=gfortran
```

Parallel Compilation

The -j flag is used to leverage the i5's multiple threads for faster build times.

```shell
cmake --build build -j
```

Active Optimization Flags

```
-O3 -march=native -ffast-math -fopenmp
```

3. Benchmarking Methodology

To account for the hybrid P/E-core split of the i5-1235U, strict CPU affinity is enforced using Linux taskset. This prevents the OS from migrating high-intensity GPT-2 inference loops to the slower Efficient-cores, which would skew latency results.

Execution & Core Isolation

Single P-Core Isolation (Baseline)

```shell
taskset -c 0 ./build/gpt2
```

Dual P-Core Scaling

```shell
taskset -c 0,1 ./build/gpt2
```

4. Hardware Telemetry & Profiling

The perf toolsuite is utilized to verify that the optimizations successfully bypassed software-level overhead in the math library and reached the memory-bandwidth ceiling.

Micro-architectural Analysis

To measure the reduction in total instruction volume and monitor cache-miss behavior:

```shell
perf stat -e instructions,cycles,cache-references,cache-misses,branches,branch-misses \
  taskset -c 0 ./build/gpt2
```

Hotspot Identification

To confirm the elimination of OpenBLAS internal packing routines (sgemm_incopy and sgemm_itcopy):

```shell
# 1. Record the hardware profile
perf record -g taskset -c 0 ./build/gpt2

# 2. Inspect the call-graph hotspots
perf report
```

@RatanKokal
Author

bench.sh

@RatanKokal
Author

Also, I had a query regarding the absence of multi-core benchmarks for Accelerate. Is there an architectural reason behind that?

@certik
Owner

certik commented Mar 23, 2026

Yes, there is only one unit that Accelerate runs on, so multiple cores do not speed it up.

I'll try to reproduce your benchmarks soon and then we can discuss more.

@RatanKokal
Author

I have been reviewing the proposed use of MPI and coarrays. While I recognize that these frameworks are the standard for multi-node computation and distributed memory systems, I’m trying to better understand their specific intent within our current hardware constraints.

Specifically, on a single multi-core machine (8–12 cores), the overhead of message passing often outweighs the benefits compared to shared-memory parallelization. Since GEMM in BLAS is already highly optimized for these architectures, and fastGPT inference is currently memory-bound, introducing a parallelized message-passing system may increase communication overhead without yielding performance gains, as compute is not the primary bottleneck.

That the workload is memory-bound after optimization can be observed directly in the benchmarks: throughput does not scale with core count.

@RatanKokal
Author


One use case I can foresee is gpt2-xl, the 1.5B-parameter model: if not quantized, it may need more than one machine to fit within memory constraints.

@RatanKokal
Author

Hi @certik,

Were you able to run the benchmarks successfully?
