Conversation
Made memory access contiguous and also tried to minimize transpose operations.
Converted functions to subroutines to eliminate the "hidden allocation tax" and memmove overhead. Replaced array slicing with full-block pointers. Added a dedicated buffer for the final layer_norm to satisfy Fortran's aliasing rules.
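The function-to-subroutine change can be sketched as follows (a minimal illustration, not the actual fastGPT code; `gelu` is used here as a stand-in): an array-valued function materializes a fresh result array on every call, while the `intent(inout)` subroutine mutates a caller-owned buffer in place.

```fortran
! Array-valued function: the result y is a temporary the compiler
! must allocate (and later free) on every single call.
function gelu_fn(x) result(y)
  real, intent(in) :: x(:)
  real :: y(size(x))                       ! hidden allocation per call
  y = 0.5 * x * (1.0 + tanh(0.7978845608 * (x + 0.044715 * x**3)))
end function gelu_fn

! Subroutine form: the same math updates x in place, so there is no
! temporary result array and no memmove back into the caller's storage.
subroutine gelu_sub(x)
  real, intent(inout) :: x(:)
  x = 0.5 * x * (1.0 + tanh(0.7978845608 * (x + 0.044715 * x**3)))
end subroutine gelu_sub
```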
|
These optimizations were done for GFortran and turn out to be bad for LFortran; I'm looking into this. My initial analysis suggests that LFortran is not using OpenBLAS for matmul. |
|
Don't worry about LFortran, it's not ready for performance yet. If it got faster with GFortran, then it's a win. Can you separate code formatting from actual changes? Don't format the code in this PR; just add the changes and let's see. |
Done |
|
Excellent. Now let's evaluate the changes. Can you post timings before and after using gfortran and all optimizations on? Also post what mode you are using (openblas/accelerate/etc.) and exactly how you are benchmarking it. |
fastGPT Performance Benchmark: OpenBLAS

Summary

1. Throughput & Scaling

Full data (tokens/second), measured across different sequence lengths on Linux.

Short Context (20 Tokens)

Medium Context (200 Tokens)

Long Context (1000 Tokens): the "Memory Cliff"

Scaling Analysis: The unoptimized code suffered a massive scaling collapse at 1000 tokens (4 cores performed worse than 1 core due to memory-allocation contention). The optimized code flattened this cliff, allowing a single core to cleanly outperform 4 cores running the old code by nearly 4x.

2. Hardware Efficiency (perf stat)
| Metric | Original Code | Optimized Code | Impact |
|---|---|---|---|
| Wall-Clock Time | 1.858s | 1.231s | 33.7% Faster |
| Total Instructions | 8.05 Billion | 2.54 Billion | - 68.4% (5.5B fewer) |
| CPU Cycles | 5.84 Billion | 3.82 Billion | - 34.6% |
| Cache References | 284.2 Million | 209.5 Million | - 26.3% |
| Cache Misses | 157.4 Million | 185.9 Million | (See Profiling Note) |
| IPC (Efficiency) | 1.38 | 0.66 | (See Profiling Note) |
Profiling Note: The original code executed 5.5 billion "empty calorie" instructions packing temporary arrays in the L1 cache. This artificially inflated Cache References and IPC while stalling actual token generation. The optimized code executes the math natively; IPC drops because the remaining work waits purely on main-memory bandwidth, but it finishes significantly faster.
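The packing problem can be demonstrated directly with the `is_contiguous` intrinsic (a standalone sketch with illustrative dimensions, not the fastGPT arrays): a slice along the first dimension is strided, so passing it to an external routine forces the compiler or BLAS to pack a contiguous temporary, while a slice along the last dimension is already a single memory block.

```fortran
program contig_demo
  implicit none
  real :: a(64, 64, 12)
  a = 0.0
  ! Strided view: successive elements are 64 reals apart, so an
  ! external call would trigger copy-in/copy-out packing.
  print *, is_contiguous(a(1, :, :))   ! F
  ! Last-dimension slice: one contiguous block, passed by reference.
  print *, is_contiguous(a(:, :, 3))   ! T
end program contig_demo
```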
3. Control Flow & Branch Prediction
Measuring the CPU branch predictor's workload during the 20-token run.
| Metric | Original Code | Optimized Code | Impact |
|---|---|---|---|
| Total Branches | 480.7 Million | 192.1 Million | - 60% |
| Branch Misses | 2.84 Million | 1.40 Million | - 50% pipeline flushes |
Key Takeaway: Fusing nested layer_norm loops and stripping array temporaries deleted 288 million useless branch evaluations, cutting the absolute number of expensive CPU pipeline flushes in half.
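The loop-fusion idea can be sketched roughly like this (a simplified illustration that omits the learned gain/bias parameters, not the exact fastGPT routine): computing the sum and the sum of squares in one pass halves the loop-control branches versus two separate passes over the data.

```fortran
! Fused single-pass normalization: one traversal produces both the
! mean and the variance, instead of one loop for the mean followed by
! a second loop for the squared deviations.
subroutine layer_norm_fused(x, eps)
  real, intent(inout) :: x(:)
  real, intent(in) :: eps
  real :: s, ss, mean, var
  integer :: i
  s = 0.0; ss = 0.0
  do i = 1, size(x)          ! one pass: sum and sum of squares together
     s  = s  + x(i)
     ss = ss + x(i)*x(i)
  end do
  mean = s / size(x)
  var  = ss / size(x) - mean*mean
  x = (x - mean) / sqrt(var + eps)
end subroutine layer_norm_fused
```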
4. Hotspot Analysis (perf report)
Original Code (Main Branch)
The CPU is dominated by OpenBLAS memory management. 72.3% of all cycles are spent simply rearranging data for the math library rather than calculating tokens.
- `sgemm_incopy` (47.60%): internal BLAS memory packing.
- `sgemm_itcopy` (24.70%): internal BLAS matrix transposing.
- `sgemm_kernel` (19.69%): actual computation.
Optimized Code (Contiguous PR)
Packing overhead is virtually eliminated. The CPU now spends its time natively in the optimized Fortran logic.
- `__gpt2_mod_MOD_linear` (25.84%): direct computation via our optimized `n_seq_x == 1` path.
- `__driver_MOD_load_model` (42.40%): fixed I/O cost (percentage will drop as token count increases).
- `sgemm_kernel` (5.81%): drastic reduction in BLAS dependency.
Appendix: Understanding the Core Configurations (taskset)
During this benchmark, taskset -c was used to pin the execution to specific hardware threads. Modern CPUs (specifically Intel 12th/13th Gen+ architectures) use a hybrid layout of Performance Cores (P-cores) and Efficiency Cores (E-cores), alongside Hyper-Threading.
The perf stat logs confirmed the presence of cpu_core (P-cores) and cpu_atom (E-cores). This explains the specific scaling behaviors observed:
- `taskset -c 0` (1 logical core): Pins the process strictly to CPU 0, a high-speed P-core. This provides the cleanest look at single-thread IPC and cache behavior without OS context switching.
- `taskset -c 0,1` (2 logical cores): On many architectures, CPU 0 and CPU 1 are two Hyper-Threads sharing the exact same physical P-core, and thus the same L1/L2 cache and execution units. This is why `0,1` often showed little improvement (or even regression) compared to `0` alone: the threads were fighting over the same physical hardware resources.
- `taskset -c 0,2` (2 physical cores): CPU 0 and CPU 2 are typically two distinct physical P-cores. This configuration gave the system access to double the L1/L2 cache and double the execution units, which explains why `0,2` consistently outperformed `0,1` across the benchmarks.
- `taskset -c 0,1,2,3` (4 logical cores): Utilizes multiple physical cores and their Hyper-Threads. While this provided maximum compute, it triggered the "Memory Cliff" at 1000 tokens in the old code, with 4 threads simultaneously fighting for memory-bus bandwidth to read the massive KV cache.
|
Please answer what I asked for:
Post the GFortran options and version you used. Post the configuration you used for fastGPT. And post the exact commands you used to benchmark. You need to post all the details so that I can reproduce the benchmarks locally. |
fastGPT Benchmarking Process & Methodology

1. Environment Specifications

2. Build Pipeline

The project is compiled in Release mode to enable high-level loop optimizations and vectorization. We utilize a modern CMake workflow for cross-platform consistency.

Configuration:

```
cmake -B build \
    -DCMAKE_BUILD_TYPE=Release \
    -DFASTGPT_BLAS=OpenBLAS \
    -DCMAKE_Fortran_COMPILER=gfortran
```

Parallel compilation:

```
cmake --build build -j
```

Active Optimization Flags

3. Benchmarking Methodology

To account for the hybrid P/E-core split of the i5-1235U, strict CPU affinity is enforced using Linux taskset.

Execution & Core Isolation

Single P-core isolation (baseline):

```
taskset -c 0 ./build/gpt2
```

Dual P-core scaling:

```
taskset -c 0,1 ./build/gpt2
```

4. Hardware Telemetry & Profiling

Micro-architectural analysis, to measure the reduction in total instruction volume and monitor cache-miss behavior:

```
perf stat -e instructions,cycles,cache-references,cache-misses,branches,branch-misses \
    taskset -c 0 ./build/gpt2
```

Hotspot identification, to confirm the elimination of the OpenBLAS internal packing routines (`sgemm_incopy`/`sgemm_itcopy`):

```
# 1. Record the hardware profile
perf record -g taskset -c 0 ./build/gpt2

# 2. Inspect the call-graph hotspots
perf report
```
|
|
Also, I had a query regarding the absence of multi-core benchmarks in the case of Accelerate. Is there any architectural reason behind that? |
|
Yes, there is only one unit that Accelerate runs on, so multiple cores do not speed it up. I'll try to reproduce your benchmarks soon and then we can discuss more. |
|
I have been reviewing the proposed use of MPI and coarrays. While I recognize that these frameworks are the standard for multi-node computation and distributed memory systems, I’m trying to better understand their specific intent within our current hardware constraints. Specifically, on a single multi-core machine (8–12 cores), the overhead of message passing often outweighs the benefits compared to shared-memory parallelization. Since GEMM in BLAS is already highly optimized for these architectures, and fastGPT inference is currently memory-bound, introducing a parallelized message-passing system may increase communication overhead without yielding performance gains, as compute is not the primary bottleneck. The operation being memory-bound after optimization can be observed, as throughput does not scale with cores. |
One use case I can foresee is gpt2-xl, the 1.5B-parameter model. It may need more than one machine, if not quantized, to handle memory constraints. |
|
Hi @certik, were you able to successfully run the benchmarks? |
The following optimizations were performed:
Subroutine Refactoring: Switched from array-returning functions to subroutines with intent(inout) to enable in-place mutation and kill the "hidden allocation tax."
Zero-Copy Cache Passing: Eliminated the token-by-token allocate/deallocate of kv_cache2 by passing a fixed-size buffer with a max_seq_cache pointer.
The Transpose/itcopy Fix: Re-aligned the wte and attn_w weight dimensions to match Fortran's Column-Major storage, stopping OpenBLAS from calling itcopy (internal transpose) on every single layer.
Contiguous Head Slicing: Moved the n_head index to the last dimension of the cache so that k_cache(:, :, l) is a single contiguous memory block.
Strict Aliasing Protection: Added a dedicated block buffer for the final layer_norm to prevent the compiler from generating defensive memory copies when input and output arrays overlapped.
Small GEMV Heuristic: Optimized the linear layer to use pure matmul for single-token (1×N) inputs instead of calling BLAS.
Softmax In-Place: Decoupled the softmax calculation from the matmul call to prevent the creation of intermediate temporary arrays during the attention mechanism.
incopy and Alignment: Aligned memory for direct SIMD streaming, eliminating incopy and internal buffering.
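The small-GEMV heuristic above can be sketched roughly like this (the interface, dimension order, and names are illustrative, not the exact fastGPT signature): when the sequence dimension is 1, the GEMM degenerates to a matrix-vector product, where the BLAS packing stage dominates, so a plain matmul on the already-contiguous arrays is faster.

```fortran
! Hypothetical linear layer: w(n_in, n_out), x(n_in, n_seq),
! b(n_out), y(n_out, n_seq). Dispatch on the sequence length.
subroutine linear(w, x, b, y)
  real, intent(in)  :: w(:, :), x(:, :), b(:)
  real, intent(out) :: y(:, :)
  integer :: j
  if (size(x, 2) == 1) then
     ! Single-token step: matrix-vector product, no sgemm packing.
     y(:, 1) = matmul(transpose(w), x(:, 1)) + b
  else
     ! Full-sequence step: large enough that a real GEMM (here the
     ! intrinsic matmul; sgemm in an OpenBLAS build) pays off.
     y = matmul(transpose(w), x)
     do j = 1, size(y, 2)
        y(:, j) = y(:, j) + b
     end do
  end if
end subroutine linear
```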