Memory performance optimizations#88

Open
RatanKokal wants to merge 5 commits into certik:main from RatanKokal:contiguous

Conversation

@RatanKokal

The following optimizations were performed:

  • Subroutine Refactoring: Switched from array-returning functions to subroutines with intent(inout) to enable in-place mutation and kill the "hidden allocation tax."

  • Zero-Copy Cache Passing: Eliminated the token-by-token allocate/deallocate of kv_cache2 by passing a fixed-size buffer with a max_seq_cache pointer.

  • The Transpose/itcopy Fix: Re-aligned the wte and attn_w weight dimensions to match Fortran's Column-Major storage, stopping OpenBLAS from calling itcopy (internal transpose) on every single layer.

  • Contiguous Head Slicing: Moved the n_head index to the last dimension of the cache so that k_cache(:, :, l) is a single contiguous memory block.

  • Strict Aliasing Protection: Added a dedicated block buffer for the final layer_norm to prevent the compiler from generating defensive memory copies when input and output arrays overlapped.

  • Small GEMV Heuristic: Optimized the linear layer to use the intrinsic matmul for 1×N (single-token) vectors, where BLAS packing overhead dominates.

  • Softmax In-Place: Decoupled the softmax calculation from the matmul call to prevent the creation of intermediate temporary arrays during the attention mechanism.

  • incopy and Alignment: Aligned memory for direct SIMD streaming, eliminating incopy and internal buffering.
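The first bullet (the subroutine refactor) can be sketched as follows. This is a minimal illustration of the pattern, not the PR's actual code; `gelu` is used as a stand-in hot-path routine:

```fortran
module inplace_demo
  implicit none
  real, parameter :: pi = 3.14159265358979323846
contains
  ! Before: a function result forces a fresh heap allocation on every
  ! call -- the "hidden allocation tax" paid once per token per layer.
  function gelu_f(x) result(y)
    real, intent(in) :: x(:)
    real, allocatable :: y(:)
    y = 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x**3)))
  end function gelu_f

  ! After: the caller owns the buffer and the routine mutates it in
  ! place, so the hot loop performs no allocation at all.
  subroutine gelu_s(x)
    real, intent(inout) :: x(:)
    x = 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x**3)))
  end subroutine gelu_s
end module inplace_demo
```

The same transformation applies to any array-returning function on the token path: the allocation moves out of the loop and into one-time setup.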

Made memory access contiguous and minimized transpose operations.
Converted functions to subroutines to eliminate the "hidden allocation tax" and memmove overhead.
Replaced array slicing with a full-block pointer.
Added a dedicated buffer for the final layer_norm to satisfy Fortran's strict-aliasing rules.
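The contiguous head-slicing change can be illustrated with a small self-contained program (dimension sizes here are illustrative, not the PR's actual constants). In Fortran's column-major order the left-most index varies fastest, so fixing only the last dimension selects one contiguous block:

```fortran
program cache_layout
  implicit none
  integer, parameter :: head_dim = 64, n_ctx = 1024, n_head = 12
  real, allocatable, target :: k_cache(:,:,:)
  real, pointer :: one_head(:,:)

  ! Head index last: k_cache(:, :, h) is a single contiguous block,
  ! which BLAS can consume without packing or copying.
  allocate(k_cache(head_dim, n_ctx, n_head))
  one_head => k_cache(:, :, 1)
  print *, is_contiguous(one_head)   ! prints T

  ! With the head index first (the old layout), the equivalent slice
  ! k_cache(1, :, :) is strided and forces a gather on every access.
end program cache_layout
```

The same reasoning drives the wte/attn_w transpose fix: laying weights out so the GEMM reads them column-major removes OpenBLAS's internal itcopy pass.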
@RatanKokal
Copy link
Copy Markdown
Author

RatanKokal commented Mar 23, 2026

These optimizations were tuned for GFortran and turn out to hurt LFortran; I'm looking into this. My initial analysis suggests LFortran is not using OpenBLAS for matmul.

@certik
Owner

certik commented Mar 23, 2026

Don't worry about LFortran, it's not ready for performance yet.

If it got faster with GFortran, then it's a win.

Can you separate code formatting from actual changes? Don't format the code on this PR, just add the changes and let's see.

@RatanKokal
Author

> Don't worry about LFortran, it's not ready for performance yet.
>
> If it got faster with GFortran, then it's a win.
>
> Can you separate code formatting from actual changes? Don't format the code on this PR, just add the changes and let's see.

Done

@certik
Owner

certik commented Mar 23, 2026

Excellent. Now let's evaluate the changes. Can you post timings before and after using gfortran and all optimizations on? Also post what mode you are using (openblas/accelerate/etc.) and exactly how you are benchmarking it.

@RatanKokal
Author

RatanKokal commented Mar 23, 2026

fastGPT Performance Benchmark: OpenBLAS

Summary

The contiguous branch of fastGPT introduces a suite of low-level architectural optimizations designed to eliminate the "Hidden Memory Tax" inherent in standard Fortran array abstractions. By aligning data structures to Column-Major storage and enforcing In-Place mutations, we achieved a 3.19x peak speedup and reduced total CPU instruction volume by 68%.


1. Throughput & Scaling: Full Data (Tokens / Second)

Measured across different sequence lengths using Linux taskset for strict CPU affinity.

Short Context (20 Tokens)

| Core Config | Original Code | Optimized Code | Impact |
|---|---|---|---|
| 0 (1 Core) | 14.95 tok/s (1.337s) | 41.23 tok/s (0.485s) | 2.75x Faster |
| 0,1 (2 Cores) | 17.05 tok/s (1.173s) | 39.60 tok/s (0.505s) | 2.32x Faster |
| 0,2 (2 Cores) | 25.51 tok/s (0.784s) | 43.29 tok/s (0.462s) | 1.69x Faster |
| 0,1,2,3 (4 Cores) | 27.24 tok/s (0.734s) | 41.23 tok/s (0.485s) | 1.51x Faster |

Medium Context (200 Tokens)

| Core Config | Original Code | Optimized Code | Impact |
|---|---|---|---|
| 0 (1 Core) | 14.47 tok/s (13.81s) | 46.30 tok/s (4.31s) | 3.20x Faster |
| 0,1 (2 Cores) | 15.81 tok/s (12.64s) | 45.89 tok/s (4.35s) | 2.90x Faster |
| 0,2 (2 Cores) | 20.97 tok/s (9.53s) | 47.31 tok/s (4.22s) | 2.25x Faster |
| 0,1,2,3 (4 Cores) | 15.00 tok/s (13.33s) | 47.98 tok/s (4.16s) | 3.19x Faster |

Long Context (1000 Tokens) - The "Memory Cliff"

| Core Config | Original Code | Optimized Code | Impact |
|---|---|---|---|
| 0 (1 Core) | 10.92 tok/s (91.55s) | 39.22 tok/s (25.49s) | 3.59x Faster |
| 0,1 (2 Cores) | 10.79 tok/s (92.61s) | 38.30 tok/s (26.10s) | 3.54x Faster |
| 0,2 (2 Cores) | 12.28 tok/s (81.36s) | 36.97 tok/s (27.04s) | 3.01x Faster |
| 0,1,2,3 (4 Cores) | 9.43 tok/s (106.04s) | 38.63 tok/s (25.88s) | 4.09x Faster |

Scaling Analysis: The unoptimized code suffered a massive scaling collapse at 1000 tokens (4 cores performed worse than 1 core due to memory allocation contention). The optimized code flattened this cliff, allowing a single core to cleanly outperform 4 cores running the old code by nearly 4x.


2. Hardware Efficiency (perf stat)

Single-core baseline (taskset -c 0), generating 20 tokens.

| Metric | Original Code | Optimized Code | Impact |
|---|---|---|---|
| Wall-Clock Time | 1.858s | 1.231s | 33.7% Faster |
| Total Instructions | 8.05 Billion | 2.54 Billion | -68.4% (5.5B fewer) |
| CPU Cycles | 5.84 Billion | 3.82 Billion | -34.6% |
| Cache References | 284.2 Million | 209.5 Million | -26.3% |
| Cache Misses | 157.4 Million | 185.9 Million | (See Profiling Note) |
| IPC (Efficiency) | 1.38 | 0.66 | (See Profiling Note) |

Profiling Note: The original code executed 5.5 billion "empty calorie" instructions packing temporary arrays in the L1 cache. That traffic artificially inflated Cache References and IPC while stalling actual generation. The optimized code executes the math natively; IPC drops, but the run finishes significantly faster because it is now bounded mainly by main-memory bandwidth.


3. Control Flow & Branch Prediction

Measuring the CPU branch predictor's workload during the 20-token run.

| Metric | Original Code | Optimized Code | Impact |
|---|---|---|---|
| Total Branches | 480.7 Million | 192.1 Million | -60% |
| Branch Misses | 2.84 Million | 1.40 Million | -50% pipeline flushes |

Key Takeaway: Fusing nested layer_norm loops and stripping array temporaries deleted 288 million useless branch evaluations, cutting the absolute number of expensive CPU pipeline flushes in half.


4. Hotspot Analysis (perf report)

Original Code (Main Branch)

The CPU is dominated by OpenBLAS memory management. 72.3% of all cycles are spent simply rearranging data for the math library rather than calculating tokens.

  • sgemm_incopy (47.60%): Internal BLAS memory packing.
  • sgemm_itcopy (24.70%): Internal BLAS matrix transposing.
  • sgemm_kernel (19.69%): Actual computation.

Optimized Code (Contiguous PR)

Packing overhead is virtually eliminated. The CPU now spends its time natively in the optimized Fortran logic.

  • __gpt2_mod_MOD_linear (25.84%): Direct computation via our optimized n_seq_x == 1 path.
  • __driver_MOD_load_model (42.40%): Fixed I/O cost (Percentage will drop as token count increases).
  • sgemm_kernel (5.81%): Drastic reduction in BLAS dependency.
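The `n_seq_x == 1` path mentioned above might look roughly like the following. This is a hypothetical sketch, not the PR's actual `linear` routine; the argument names, shapes, and the `spread` fallback are all assumptions:

```fortran
module linear_demo
  implicit none
contains
  subroutine linear(w, b, x, y)
    real, intent(in)  :: w(:,:)   ! (n_out, n_in), laid out for column-major reads
    real, intent(in)  :: b(:)     ! (n_out)
    real, intent(in)  :: x(:,:)   ! (n_in, n_seq_x)
    real, intent(out) :: y(:,:)   ! (n_out, n_seq_x)
    if (size(x, 2) == 1) then
      ! Single-token decode step: a GEMV-sized product. The intrinsic
      ! matmul avoids OpenBLAS's sgemm_incopy/itcopy packing entirely.
      y(:, 1) = matmul(w, x(:, 1)) + b
    else
      ! Batched prompt pass: large enough that BLAS packing pays off.
      ! Sketched here with matmul; the real code would call sgemm.
      y = matmul(w, x) + spread(b, 2, size(x, 2))
    end if
  end subroutine linear
end module linear_demo
```

The dispatch exploits the fact that during generation every step after the prompt is a single-token (1×N) multiply, so the packing-free path dominates runtime.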

Appendix: Understanding the Core Configurations (taskset)

During this benchmark, taskset -c was used to pin the execution to specific hardware threads. Modern CPUs (specifically Intel 12th/13th Gen+ architectures) use a hybrid layout of Performance Cores (P-cores) and Efficiency Cores (E-cores), alongside Hyper-Threading.

The perf stat logs confirmed the presence of cpu_core (P-cores) and cpu_atom (E-cores). This explains the specific scaling behaviors observed:

  • taskset -c 0 (1 Logical Core): Pins the process strictly to CPU 0, which is a high-speed P-core. This provides the cleanest look at single-thread IPC and cache behavior without OS context switching.
  • taskset -c 0,1 (2 Logical Cores): In many architectures, CPU 0 and CPU 1 are two Hyper-Threads sharing the exact same physical P-core. They share the same L1/L2 cache and execution units. This is why 0,1 often showed little improvement (or even regression) compared to 0 alone—the threads were fighting over the same physical hardware resources.
  • taskset -c 0,2 (2 Physical Cores): CPU 0 and CPU 2 are typically two distinct physical P-cores. This configuration gave the system access to double the L1/L2 cache and double the execution units. This explains why 0,2 consistently outperformed 0,1 across the benchmarks.
  • taskset -c 0,1,2,3 (4 Logical Cores): Utilizes multiple physical cores and their Hyper-Threads. While this provided maximum compute, it triggered the "Memory Cliff" at 1000 tokens in the old code due to 4 threads simultaneously fighting for memory bus bandwidth to read the massive KV cache.

@certik
Owner

certik commented Mar 23, 2026

Please answer what I asked for:

> Can you post timings before and after using gfortran and all optimizations on? Also post what mode you are using (openblas/accelerate/etc.) and exactly how you are benchmarking it.

Post what GFortran options and a version you used. Post what configuration you used for fastGPT. And post the exact commands you used to benchmark.

You need to post all the details, so that I can reproduce the benchmarks locally.

@RatanKokal
Author

> Please answer what I asked for:
>
> Can you post timings before and after using gfortran and all optimizations on? Also post what mode you are using (openblas/accelerate/etc.) and exactly how you are benchmarking it.
>
> Post what GFortran options and a version you used. Post what configuration you used for fastGPT. And post the exact commands you used to benchmark.
>
> You need to post all the details, so that I can reproduce the benchmarks locally.

fastGPT Benchmarking Process & Methodology


1. Environment Specifications

  • Processor: Intel Core i5-1235U (12th Gen)

    • Architecture: Hybrid (2 Performance-cores / 8 Efficient-cores)
    • Topology: 12 Logical Threads
  • Operating System: Fedora Linux (64-bit)

  • Compiler: gfortran 15.2.0 (conda-forge gcc 15.2.0-18)

  • Build System: CMake

  • Math Library: OpenBLAS (Linked via CMake)


2. Build Pipeline

The project is compiled in Release mode to enable high-level loop optimizations and vectorization. We utilize a modern CMake workflow for cross-platform consistency.

Configuration

```shell
cmake -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DFASTGPT_BLAS=OpenBLAS \
  -DCMAKE_Fortran_COMPILER=gfortran
```

Parallel Compilation

The -j flag is used to leverage the i5's multiple threads for faster build times.

```shell
cmake --build build -j
```

Active Optimization Flags

```
-O3 -march=native -ffast-math -fopenmp
```

3. Benchmarking Methodology

To account for the hybrid P/E-core split of the i5-1235U, strict CPU affinity is enforced using Linux taskset. This prevents the OS from migrating high-intensity GPT-2 inference loops to the slower Efficient-cores, which would skew latency results.

Execution & Core Isolation

Single P-Core Isolation (Baseline)

```shell
taskset -c 0 ./build/gpt2
```

Dual P-Core Scaling

```shell
taskset -c 0,1 ./build/gpt2
```

4. Hardware Telemetry & Profiling

The perf toolsuite is utilized to verify that the optimizations successfully bypassed software-level overhead in the math library and reached the memory-bandwidth ceiling.

Micro-architectural Analysis

To measure the reduction in total instruction volume and monitor cache-miss behavior:

```shell
perf stat -e instructions,cycles,cache-references,cache-misses,branches,branch-misses \
  taskset -c 0 ./build/gpt2
```

Hotspot Identification

To confirm the elimination of OpenBLAS internal packing routines (sgemm_incopy and sgemm_itcopy):

```shell
# 1. Record the hardware profile
perf record -g taskset -c 0 ./build/gpt2

# 2. Inspect the call-graph hotspots
perf report
```

@RatanKokal
Author

bench.sh

@RatanKokal
Author

Also, I had a query regarding the absence of multi-core benchmarks for Accelerate. Is there an architectural reason behind that?

@certik
Owner

certik commented Mar 23, 2026

Yes, there is only one unit that Accelerate runs on, so multiple cores do not speed it up.

I'll try to reproduce your benchmarks soon and then we can discuss more.

@RatanKokal
Author

I have been reviewing the proposed use of MPI and coarrays. While I recognize that these frameworks are the standard for multi-node computation and distributed memory systems, I’m trying to better understand their specific intent within our current hardware constraints.

Specifically, on a single multi-core machine (8–12 cores), the overhead of message passing often outweighs the benefits compared to shared-memory parallelization. Since GEMM in BLAS is already highly optimized for these architectures, and fastGPT inference is currently memory-bound, introducing a parallelized message-passing system may increase communication overhead without yielding performance gains, as compute is not the primary bottleneck.

That the workload is memory-bound after optimization can be observed directly in the benchmarks: throughput does not scale with core count.

@RatanKokal
Author


One use case I can foresee is gpt2-xl, the 1.5B-parameter model: if not quantized, it may need more than one machine to fit within memory constraints.

@RatanKokal
Author

Hi @certik,

Were you able to run the benchmarks successfully?
