
December 23, 2025
Inspired by a talk by Yudai Takada (@ydah) at Hokuriku Ruby Kaigi 01
Regular expressions are one of the most widely used tools in programming, yet they are often treated as black boxes. We use them daily, but rarely stop to ask a fundamental question: how does a regular expression engine actually work?
In his talk βWalking with Computer Science in Ruby β Building a DFA-based Regular Expression Engineβ, Yudai Takada, product engineer at SmartHR and CRuby contributor, takes us on a deep but accessible journey into the internals of regex engines β not by theory alone, but by building one step by step in Ruby.
What βRegular Expressionsβ Really Mean

Takada begins by clarifying an important misconception: many features commonly called βregular expressionsβ are not regular in the strict computational sense.
As he explains, regular expressions are:
βA mathematical and computational concept for describing a set of strings using a single notation.β
However, modern engines often include extensions such as backreferences, lookaheads, and recursion β features that go beyond regular languages. Takada references Larry Wallβs famous remark:
βWhat we call βregular expressionsβ are only marginally related to real regular expressions.β
This distinction matters, because once we restrict ourselves to true regular expressions, we gain powerful theoretical guarantees β including linear-time matching.
Four Ways to Match a Regular Expression
The talk outlines four major approaches to regex matching:
- DFA-based engines (e.g. RE2, Hyperscan)
- Backtracking NFA engines (e.g. PCRE, Python, Ruby)
- VM / bytecode-based engines
- Regex derivatives (Brzozowski derivatives)
Takada deliberately focuses on DFA-based engines, stating that this approach offers:
βSimple and fast matching with no ambiguity.β
Unlike backtracking engines, DFA matching guarantees O(n) runtime with no catastrophic backtracking.
From Pattern to Machine: The Compilation Pipeline
A key contribution of the talk is its clear explanation of the classic regex compilation pipeline:
Pattern String
β
Abstract Syntax Tree (AST)
β
NFA (Non-deterministic Finite Automaton)
β
DFA (Deterministic Finite Automaton)
Takada emphasizes that regex engines are not magic:
βA regular expression engine is a pipeline that parses a pattern and transforms it into a fast matching machine.β
Each stage serves a purpose:
- The AST removes ambiguity from operator precedence.
- The NFA is easy to construct from the AST using Thompsonβs construction.
- The DFA removes non-determinism to enable fast execution.
Thompsonβs Construction: Turning Regex into an NFA
Using Thompsonβs algorithm, Takada shows how to construct NFA βfragmentsβ for:
- literals
- concatenation
- alternation (a|b)
- repetition (*)
Each fragment has:
- a single start state
- a single accept state
- optional Ξ΅-transitions
This compositional approach allows even complex patterns like a(b|c)* to be built recursively.
The Cost of Non-Determinism
While NFAs are easy to build, Takada points out their main drawback:
βDuring matching, an NFA must track all possible current states at the same time.β
This means managing sets of states at every character, which increases runtime cost. To solve this, the engine converts the NFA into a DFA.
Subset Construction: From NFA to DFA
The conversion uses the classic subset construction algorithm:
βOne DFA state represents a set of NFA states.β
Key steps include:
- Computing Ξ΅-closures
- Generating transitions for each input symbol
- Registering new DFA states
- Marking accept states
- Repeating until no new states remain
Once complete, the DFA has no Ξ΅-transitions and no ambiguity.

Matching with a DFA
The final matching algorithm is intentionally simple:
current = dfa.start_state
input.each_char do |char|
current = dfa.transition(current, char)
end
dfa.accept_states.include?(current)
Takada highlights the beauty of this approach:
βMatching becomes simple and fast.β
No recursion. No backtracking. No surprises.
Why Ruby?
One of the most compelling arguments in the talk is why Ruby is the right language for learning this.
Takada explains:
βRuby allows us to focus on the algorithm itself.β
Compared to lower-level languages, Ruby offers:
- expressive syntax
- powerful built-in data structures (Set, Hash)
- automatic memory management
This makes Ruby ideal for learning computer science concepts through implementation.
Conclusion
This presentation is not about replacing Rubyβs regex engine. It is about understanding.
By walking through the construction of a DFA-based regex engine, Yudai Takada demonstrates that even complex systems can be understood β and built β with clarity, rigor, and elegant Ruby code.
As he humorously concludes:
βAutumn to winter is regex engine season.β
The full source code for the regex engine presented in the talk, named Hoozuki, is publicly available on GitHub and serves as an excellent reference for anyone curious to go deeper.
