Skip to content

Pure Fortran Regular Expression (Regex) Module #1163

@JAi-SATHVIK

Description

@JAi-SATHVIK

Motivation: Why is this needed?

Fortran currently lacks any standard facility for pattern matching and text processing. Researchers and engineers frequently deal with:

  • Complex log file parsing from simulation outputs.
  • Data cleaning from instrumentation that doesn't follow strict CSV/TSV formats.
  • Advanced string validation that goes beyond simple index() and verify() calls.
  • While languages like Python (1997), Julia (2012), and Rust (2014) have robust built-in or standard-library regex support, Fortran developers are

forced to write "cascading if-chains" or "manual ASCII arithmetic" for even basic tasks (like finding a digit or optional characters). Providing a native, pure Fortran regex engine fulfills a critical gap in the modern Fortran ecosystem.

Implementation Idea: The Pure Fortran Thompson NFA

I propose implementing a Thompson NFA (Nondeterministic Finite Automaton) engine entirely in Fortran.

Why Pure Fortran? It avoids the "C-interop fragility" of POSIX <regex.h> (which is non-existent on Windows) and adheres to the stdlib principle of zero external dependencies.
Guaranteed Performance: Unlike PCRE (which uses backtracking and can explode to $O(2^n)$ time), Thompson's algorithm guarantees linear time $O(n \times m)$ relative to input length.
Zero Dependencies: Fully portable across gfortran, ifx, flang, and nvfortran without linking external libraries like PCRE2.

The engine will follow a 4-stage pipeline:

Lexer: Tokenize the pattern string into atoms (literals, operators, escapes).
Parser: Build an Abstract Syntax Tree (AST) using recursive descent.
NFA Builder: Convert the AST into a state graph using Thompson's construction.
Executor: Simulate the NFA on the input string using a linear-time set-based simulation.

Prior Art

Go (golang/regexp): The regexp package in Go is the gold standard for a production-grade Thompson NFA implementation. We will use its API (Find, Match, Replace) as a blueprint for the Fortran interface.

Russ Cox : The definitive work proving that Thompson NFA is the superior choice for predictable performance.

Additional Information

Initial Scope: Support for *, +, ?, |, (), [], ^, $.
Memory Management: The engine will leverage modern Fortran's allocatable arrays for state management to ensure safety and efficiency.

Metadata

Metadata

Assignees

No one assigned

    Labels

    ideaProposition of an idea and opening an issue to discuss it

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions