Motivation: Why is this needed?
Fortran currently lacks any standard facility for pattern matching and text processing. Researchers and engineers frequently deal with:
- Complex log file parsing from simulation outputs.
- Data cleaning from instrumentation that doesn't follow strict CSV/TSV formats.
- Advanced string validation that goes beyond simple index() and verify() calls.
- While languages like Python (1997), Julia (2012), and Rust (2014) have robust built-in or standard-library regex support, Fortran developers are
forced to write "cascading if-chains" or "manual ASCII arithmetic" for even basic tasks (like finding a digit or optional characters). Providing a native, pure Fortran regex engine fulfills a critical gap in the modern Fortran ecosystem.
Implementation Idea: The Pure Fortran Thompson NFA
I propose implementing a Thompson NFA (Nondeterministic Finite Automaton) engine entirely in Fortran.
Why Pure Fortran? It avoids the "C-interop fragility" of POSIX <regex.h> (which is non-existent on Windows) and adheres to the stdlib principle of zero external dependencies.
Guaranteed Performance: Unlike PCRE (which uses backtracking and can explode to $O(2^n)$ time), Thompson's algorithm guarantees linear time $O(n \times m)$ relative to input length.
Zero Dependencies: Fully portable across gfortran, ifx, flang, and nvfortran without linking external libraries like PCRE2.
The engine will follow a 4-stage pipeline:
Lexer: Tokenize the pattern string into atoms (literals, operators, escapes).
Parser: Build an Abstract Syntax Tree (AST) using recursive descent.
NFA Builder: Convert the AST into a state graph using Thompson's construction.
Executor: Simulate the NFA on the input string using a linear-time set-based simulation.
Prior Art
Go (golang/regexp): The regexp package in Go is the gold standard for a production-grade Thompson NFA implementation. We will use its API (Find, Match, Replace) as a blueprint for the Fortran interface.
Russ Cox : The definitive work proving that Thompson NFA is the superior choice for predictable performance.
Additional Information
Initial Scope: Support for *, +, ?, |, (), [], ^, $.
Memory Management: The engine will leverage modern Fortran's allocatable arrays for state management to ensure safety and efficiency.
Motivation: Why is this needed?
Fortran currently lacks any standard facility for pattern matching and text processing. Researchers and engineers frequently deal with:
forced to write "cascading if-chains" or "manual ASCII arithmetic" for even basic tasks (like finding a digit or optional characters). Providing a native, pure Fortran regex engine fulfills a critical gap in the modern Fortran ecosystem.
Implementation Idea: The Pure Fortran Thompson NFA
I propose implementing a Thompson NFA (Nondeterministic Finite Automaton) engine entirely in Fortran.
Why Pure Fortran? It avoids the "C-interop fragility" of POSIX$O(2^n)$ time), Thompson's algorithm guarantees linear time $O(n \times m)$ relative to input length.
<regex.h>(which is non-existent on Windows) and adheres to thestdlibprinciple of zero external dependencies.Guaranteed Performance: Unlike PCRE (which uses backtracking and can explode to
Zero Dependencies: Fully portable across
gfortran,ifx,flang, andnvfortranwithout linking external libraries like PCRE2.The engine will follow a 4-stage pipeline:
Lexer: Tokenize the pattern string into atoms (literals, operators, escapes).
Parser: Build an Abstract Syntax Tree (AST) using recursive descent.
NFA Builder: Convert the AST into a state graph using Thompson's construction.
Executor: Simulate the NFA on the input string using a linear-time set-based simulation.
Prior Art
Go (golang/regexp): The
regexppackage in Go is the gold standard for a production-grade Thompson NFA implementation. We will use its API (Find, Match, Replace) as a blueprint for the Fortran interface.Russ Cox : The definitive work proving that Thompson NFA is the superior choice for predictable performance.
Additional Information
Initial Scope: Support for
*,+,?,|,(),[],^,$.Memory Management: The engine will leverage modern Fortran's allocatable arrays for state management to ensure safety and efficiency.