libfsm compiles regular expressions to deterministic finite state machines (FSMs) and generates executable code. FSM-based matching runs in linear time O(n) with no backtracking.
Regex engines like PCRE use backtracking to explore multiple possible match paths at runtime. This means the same pattern can have different execution costs depending on the input.
libfsm instead resolves all match decisions at compile time by constructing a Deterministic Finite Automaton (DFA). At runtime, matching is a single linear pass over the input with no alternative paths to explore.
As a result, libfsm avoids input-dependent slowdowns and is not susceptible to regular expression–based denial-of-service (ReDoS) attacks.
libfsm is not a drop-in replacement for traditional regex engines. It only supports patterns that can be compiled to FSMs.
- What libfsm Cannot Do
- Quick Start
- Supported Code Generation Targets
- Workflow Overview
- Writing Effective libfsm Patterns
- Byte Search Optimization
- Troubleshooting
- Pattern Matches Empty String Unintentionally
These PCRE features will not compile:
- Word boundaries (
\b) - Non-greedy quantifiers (
*?,+?,??) - Group capture (coming soon!) and backreferences
- Lookahead/lookbehind assertions (
(?=,(?!,(?<=,(?<!) - Conditional expressions (
(?(condition)then|else)) - Recursion and subroutines (
(?R),(?1))
Generate a matcher from a regex:
# Generate a Go matcher
re -p -r pcre -l go -k str 'user\d+' > user_detector.goThis produces a standalone matcher function.
libfsm provides stable, “first-class” code generation for:
- High-level languages: C (via
-l vmc), Go, Rust - LLVM IR
- Native WebAssembly
Adding code generation for new languages is straightforward and is defined in src/libfsm/print/.
libfsm provides two main tools for pattern matching:
retakes patterns from the command linerxtakes patterns from a file
A recommended workflow when using libfsm is:
-
Validate the regex
Test behavior using any PCRE-compatible tool (e.g., pcregrep(1) on the CLI or https://regex101.com/ in the browser).
-
Verify libfsm compatibility
If unsupported constructs exist, libfsm reports the failing location:
re -r pcre -l ast 'x*?' # Output: /x*?/:3: Unsupported operator
In this example,
:3indicates that the character at byte offset three in the pattern is an unsupported feature.# patterns with unsupported operators are output to declined.txt rx -r pcre -l ast -d declined.txt 'x*?'
-
Generate code
re -p -r pcre -l rust -k str '^item-[A-Z]{3}\z' > item_detector.rs
-
Use multiple patterns
Execution complexity for the generated code is proportional to the length of the text being matched, not to the number of patterns. Assuming your generated code isn't too large to compile, this means you can have as many patterns as you want, for the same time it takes to execute a single pattern.
Take advantage of this.
# re - patterns from command line: re -p -r pcre -l go -k str '^x?a b+c$' '^x*def?$' '^x$' # rx - patterns from file: rx -p -r pcre -l vmc -k str -d skipped.txt patterns.txt > detectors.c
-
Call the generated code from your program somehow
You're on your own for this.
-kcontrols the API for the generated code to read in data to match. Try different options for the language you're using and see which suits you.The generated API can also vary depending on how you want libfsm to handle ambiguities between different patterns. See the
AMBIG_*flags in include/fsm/options.h for different approaches there.
Both tools:
- Combine all patterns into one function (like using
|to join them) - Generate code that can return
(bool, int)for the match status and pattern ID - Pattern ID is argument position for
re, line number forrx - When encountering unsupported patterns:
rxcan decline them to-dfile and generates code with working patterns only;refails completely
| Flag | Purpose | Common Options | Notes |
|---|---|---|---|
-r |
Regex dialect | pcre, literal, glob, native, sql |
pcre supports the widest set of features |
-l |
Output language for printing | go, rust, vmc, llvm, wasm, dot |
Use vmc for C code. Pipe dot into idot for visualization |
-k |
Generated function I/O API | str, getc, pair |
str takes string, pair takes byte array, getc uses callback for streaming |
-p |
Print mode | (no value) | Abbrv. of -l fsm. Print the constructed fsm, rather than executing it. |
-d |
Declined patterns | filename | Only applies to rx (batch mode) |
This is not an exhaustive list. For full flag details, see include/fsm/options.h and the man pages.
The man pages can be built by running bmake -r doc, then view with man build/man/re.1/re.1.
Generally, to keep generated code compact, stick to the least expressive subset of features.
libfsm has no way to know in advance what text you'll be passing to its generated code. For example, are you matching a string that you know will never contain a newline? libfsm doesn't know that. It has to generate code that's capable of handling any input. You can help it out by making your patterns precise.
Think about what you intend your pattern to match, and what it's actually capable of matching given arbitrary text. This helps restrict the scope of your pattern from arbitrary text to exactly what you mean. The following bits of advice illustrate various specific ways to bring down this scope.
-
Replace broad wildcards
Avoid
.*and.+when possible. Wildcards match “anything,” which is often imprecise. And although they look compact, libfsm must enumerate every possible byte and continuation. This quickly leads to large DFAs.For example, a double-quoted string should not use
".*"because the content cannot contain an unescaped quote. Using.*forces libfsm to consider all characters -- including both the presence and absence of the closing"at every step. This greatly increases the number of states.Instead, restrict it to the actual valid characters
"[^"\r\n]*", which matches only what is allowed and will keep the DFA more compact.Use negated character classes to match only the allowed content:
Avoid Better <.*><[^>]*>\((.*)\)\([^)]*\)price=.+price=[0-9]+var\s.+=var\s[^=]+=The overlap between
.*or.+and strings that follow is often the cause of an “explosion” in the size of the generated FSM. So when compilation is slow or generated output is large, look for.*and.+first and replace them with a narrower character class. -
Take care with bounded repetition
If you have the pattern
^x{3,5}$, libfsm's resulting DFA will be structured like "match an x, then match an x, then match an x, then match an x or skip it, then match an x or skip it, then report an overall match if at the end of input". It has to repeat the pattern, noting each time whether it's required or optional (beyond the lower count in{min,max}), because DFA execution doesn't have a counter, just the current state within the overall DFA.When the subexpression (represented by
x) unintentionally matches too many things, they all have to be spelled out every time. So pay especially close attention to tightening up subexpressions in bounded repetition clauses. -
Anchor when matching full string
When the intention is to match an entire string, use anchors. Use
^at the beginning and\zfor the true end of the string.# Correct: matches only this exact hostname # Matches "web12.example.com" # Does not match "foo-web12.example.com-bar" ^web\d+\.example\.com\z # Incorrect: would match inside a larger string # Matches "web12.example.com" # Also matches "foo-web12.example.com-bar" web\d+\.example\.com
-
Prefer
\zover$for End-of-String\zalways matches the end of the string.$will also match a trailing newline at the end of the string, so if you use this in combination with capturing groups, you may not be capturing what you expect. Also,\zproduces a smaller FSM, so it is better to use it in places where\ncannot appear.# Preferred: matches only if the string ends with "bar" # Matches "/foo/bar" # Does NOT match "/foo/bar\n" /bar\z # Incorrect: allows a trailing newline, # which is usually unintended and adds unnecessary complexity # Matches "/foo/bar" # Also matches "/foo/bar\n" /bar$
-
Escape special characters when used as literals
Many characters have special meaning in regex (for example
.,+,*,?,[,(). If you mean to match them literally, escape them:Literal You Want Correct Regex Explanation example.comexample\.com.matches any character unless escapeda+ba\+b+means “one or more”price?price\??means “optional”[value]\[value\][and]start/end a character class(test)\(test\)(and)begin/end a groupMarkdown link [t](u)(\[[^]]*\]\([^)]*\))Matches [text](url)without crossing]or)The
.wildcard in particular is often mistakenly left unescaped in practice. On testing, it will match a literal.as intended. But it will also match any other character. This means that not only is your pattern incorrect (write negative test cases!), but also this part of your FSM is 256 times larger than it should be. -
Use non-capturing groups
Capture groups are currently not supported (coming soon!).
If you don't need to capture things, don't use capture. If you need grouping for alternation or precedence, use PCRE's non-capturing syntax
(?:...):# Correct (?:private|no-store) # Not what's intended (private|no-store)
Patterns that start with an uncommon character can be accelerated using an initial byte scan before running the FSM. This quickly jumps to likely match positions instead of scanning every byte.
Good candidates are patterns that start with uncommon prefix characters, for example:
#tag-[a-z]+
@user-[0-9]+
\[section\]
{"key":
"name='[^']+'"These prefixes (#, @, [, {, ', ") are rare in normal text, so a byte search can skip ahead before running the matcher.
We found using strings.IndexByte before calling the generated matcher in Go code significantly improved performance when matching strings with a large (>5k) leading prefix.
Pattern:
\s*Will compile to code that always returns true.
This is only an issue if that is not what you intend.
Fix options:
- Require at least one match:
\s+ - Anchor context:
^\s+$or alternatively, use-Fbflag