
Regular expressions, explained

Regular expressions are a compact way to describe text patterns. You use them when "contains this exact word" is too simple and full parsing would be overkill.

Need to find every email-looking string in a log file? Regex can help. Want to validate a date-like format before deeper processing? Regex is often the first pass. Need to split, search, replace, or extract repeated text structures? Same story.

The flip side is that regex gets hard to read fast. A short pattern can feel magical when it works and hostile when it breaks. The trick is to stop treating a regex as one mysterious blob. Read it as pieces: what characters can appear here, how many times, where in the string, and whether the match should be captured for later use.

This guide covers the pieces that matter most in practice: matching, character classes, quantifiers, anchors, groups, flags, escaping, and the performance mistakes that turn a handy pattern into a slow one.

What regex matching actually does

A regex engine looks at input text and tries to find a match for the pattern you gave it.

Sometimes that means "find a match anywhere in the string." Sometimes it means "the whole string must match from start to end." That difference matters. A pattern like cat will match "concatenate" because those three letters appear inside it. A pattern like ^cat$ matches only the string "cat" and nothing more.
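The difference between the two modes can be sketched in a few lines. This example assumes Python's standard re module, which the guide itself does not prescribe; other engines expose the same idea under different function names.

```python
import re

# re.search scans for the pattern anywhere in the input.
found_inside = re.search(r"cat", "concatenate") is not None

# With ^ and $, the whole string has to be "cat".
anchored_long = re.search(r"^cat$", "concatenate") is not None
anchored_exact = re.search(r"^cat$", "cat") is not None

# re.fullmatch applies the whole-string requirement without explicit anchors.
full_long = re.fullmatch(r"cat", "concatenate") is not None

print(found_inside, anchored_long, anchored_exact, full_long)
# True False True False
```

Which function you call matters as much as the pattern, so it is worth checking whether your language's default is "search" or "full match."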

That is why people often get surprising results at first. The regex is not wrong. It is doing exactly what the engine was asked to do.

Character classes

Character classes say what kind of character is allowed at a position.

Some common examples:

  • [abc] matches one character that is either a, b, or c
  • [a-z] matches one lowercase ASCII letter
  • [0-9] matches one digit
  • [^0-9] matches one character that is not a digit

You will also see shorthand classes:

  • \d for a digit
  • \w for a "word" character in many engines (typically letters, digits, and underscore)
  • \s for whitespace

Each engine has its own details, especially around Unicode, so it helps to test patterns against real input instead of trusting memory.

Character classes are one of the easiest ways to make a regex more specific. Instead of saying "anything goes here," you can say "only digits," "only hex characters," or "only letters, hyphens, and spaces."
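As a rough illustration, here is how a specific class and a negated class behave. The sample strings are made up for this sketch, and Python's re module is an assumed choice of engine.

```python
import re

# [0-9a-fA-F] allows only hex digits; {6} requires exactly six of them.
hex_color = re.compile(r"#[0-9a-fA-F]{6}")
colors = hex_color.findall("primary: #1A2b3C, broken: #12345G, accent: #ffffff")
print(colors)  # ['#1A2b3C', '#ffffff']

# [^0-9] is the negated class: every character that is NOT a digit.
digits_only = re.sub(r"[^0-9]", "", "(555) 010-9999")
print(digits_only)  # 5550109999
```

The "broken" value is skipped because G is outside the class, which is exactly the kind of specificity a bare . would not give you.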

Quantifiers

Quantifiers say how many times something may repeat.

The big ones are:

  • * = zero or more
  • + = one or more
  • ? = zero or one
  • {3} = exactly three
  • {2,5} = between two and five
  • {2,} = two or more

So:

  • \d+ matches one or more digits
  • [A-Z]{2} matches exactly two capital letters
  • colou?r matches both color and colour
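The three patterns above can be tried directly. This is a small sketch using Python's re module (an assumption, since the guide is engine-neutral) with invented sample text.

```python
import re

numbers = re.findall(r"\d+", "order 66, aisle 7, shelf 1024")
print(numbers)  # ['66', '7', '1024']

codes = re.findall(r"[A-Z]{2}", "US, uk, DE")
print(codes)  # ['US', 'DE']

spellings = re.findall(r"colou?r", "color and colour")
print(spellings)  # ['color', 'colour']

# {2,5} sets both a lower and an upper bound on the repeat count.
too_long = re.fullmatch(r"\d{2,5}", "123456")
print(too_long)  # None: six digits is more than five
```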

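Quantifiers are where greediness enters the story. By default, quantifiers in most engines are greedy, meaning they match as much as they can while still letting the rest of the pattern succeed. That is why the pattern ".*" can swallow more text than you meant when matching quoted strings: on a line with two quoted strings, it runs from the first opening quote to the last closing quote.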

In many engines you can make a quantifier lazy by adding ?, like .*?, which asks for the shortest match that still lets the overall pattern succeed.
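A quick way to see the difference is to run the greedy and lazy forms against the same input. The sketch below assumes Python's re module and a made-up sample sentence; it also shows a negated character class, which is a common alternative to a lazy quantifier for this job.

```python
import re

text = 'say "hello" and "goodbye"'

# Greedy: .* runs to the LAST closing quote on the line.
greedy = re.findall(r'".*"', text)
print(greedy)  # ['"hello" and "goodbye"']

# Lazy: .*? stops at the first closing quote that lets the match succeed.
lazy = re.findall(r'".*?"', text)
print(lazy)  # ['"hello"', '"goodbye"']

# Alternative: say what is allowed between the quotes instead of "anything".
explicit = re.findall(r'"[^"]*"', text)
print(explicit)  # ['"hello"', '"goodbye"']
```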

Anchors

Anchors do not match visible characters. They match positions.

The two anchors most people learn first are:

  • ^ = start of the string or line
  • $ = end of the string or line

These matter whenever you want to validate an entire field instead of finding a substring inside it.

For example:

  • \d{5} matches a five-digit sequence anywhere
  • ^\d{5}$ matches only a string that is exactly five digits long

Without anchors, form validation patterns are often much looser than people think.

You may also run into boundaries such as \b, which tries to match a word boundary. Useful, but not magical. Whether a position counts as a word boundary depends on how the engine defines "word character."
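Here is a minimal sketch of anchors and word boundaries, assuming Python's re module. The trailing-newline behavior shown is specific to Python and a few similar engines, so treat it as one engine's detail rather than a universal rule.

```python
import re

# Unanchored: finds five digits inside a longer run.
loose = re.search(r"\d{5}", "ZIP 902101 ext").group()
print(loose)  # 90210

# Anchored: a six-digit string is rejected.
strict = re.search(r"^\d{5}$", "902101")
print(strict)  # None

# In Python, $ also matches just before a final newline; fullmatch does not.
dollar_newline = re.search(r"^\d{5}$", "90210\n") is not None
full_newline = re.fullmatch(r"\d{5}", "90210\n") is not None
print(dollar_newline, full_newline)  # True False

# \b keeps "cat" from matching inside "concatenate".
words = re.findall(r"\bcat\b", "cat concatenate cat.")
print(words)  # ['cat', 'cat']
```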

Groups and capturing

Parentheses create groups.

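Groups do two common jobs:

1. they treat several characters as one unit, so a quantifier or alternation applies to the whole group
2. they capture matched text for later use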

For example:

(ha)+

matches ha, haha, hahaha, and so on, because the quantifier applies to the grouped ha.

Capturing groups are helpful when you want pieces back. If you match:

^(\d{4})-(\d{2})-(\d{2})$

against 2026-06-06, group 1 is the year, group 2 the month, group 3 the day.

Many engines also support non-capturing groups:

(?:pattern)

Use these when you need grouping for structure but do not actually need to store the matched text.
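The sketch below shows capturing and non-capturing groups side by side. It assumes Python's re module; the way you read captured groups (here, groups()) differs from language to language.

```python
import re

# Three capturing groups: year, month, day.
m = re.fullmatch(r"(\d{4})-(\d{2})-(\d{2})", "2026-06-06")
parts = m.groups()
print(parts)  # ('2026', '06', '06')

# Grouping for repetition: the + applies to the whole "ha".
laugh = re.fullmatch(r"(ha)+", "hahaha") is not None
print(laugh)  # True

# Non-capturing group: same structure, nothing stored.
stored = re.fullmatch(r"(?:ha)+", "hahaha").groups()
print(stored)  # ()
```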

Flags

Flags change how the engine interprets the pattern.

Common ones include:

  • i for case-insensitive matching
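  • g for global matching, meaning find every match rather than only the first (some languages, such as Python, use a find-all function instead of a flag)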
  • m for multiline behavior
  • s for dot-matches-newline behavior in engines that support it

Flags are easy to forget when debugging, because the exact same pattern can behave differently once a flag is turned on.

For example, ^error with multiline mode may match the start of any line in a multi-line string, not just the start of the whole string. And . usually does not cross line breaks unless a dot-all mode is enabled.

If a pattern feels "randomly broken," check the flags before rewriting the whole thing.
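To make the effect of flags concrete, here is a short sketch using Python's re module (an assumption; the letters i, m, and s map to re.IGNORECASE, re.MULTILINE, and re.DOTALL there). The log text is invented for the example.

```python
import re

log = "info: started\nerror: disk full\nerror: retry"

# Without multiline mode, ^ matches only at the start of the whole string.
plain = re.findall(r"^error", log)
multi = re.findall(r"^error", log, flags=re.MULTILINE)
print(plain, multi)  # [] ['error', 'error']

# Without dot-all mode, . stops at the line break.
no_dotall = re.search(r"started.*disk", log) is not None
dotall = re.search(r"started.*disk", log, flags=re.DOTALL) is not None
print(no_dotall, dotall)  # False True

# Case-insensitive matching.
shout = re.search(r"error", "ERROR: disk full", flags=re.IGNORECASE) is not None
print(shout)  # True
```

Same patterns, same input, different results: that is why the flags belong at the top of any debugging checklist.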

Escaping

Regex uses punctuation as syntax, so literal punctuation often needs escaping.

These characters are frequent trouble spots:

  • .
  • *
  • +
  • ?
  • (
  • )
  • [
  • ]
  • {
  • }
  • ^
  • $
  • |
  • \

If you want a literal dot, write \. rather than a bare dot. Otherwise you are saying "any character" rather than "a period."

There is also a second escape layer in many programming languages. In JavaScript, for example, a string literal may need \\d to produce a regex that contains \d. Regex syntax and language-string syntax can stack on top of each other, which is why copying a pattern from a tester into code sometimes breaks it.
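Both escape layers can be sketched as follows. This assumes Python, where raw strings (the r"..." prefix) play the role that doubled backslashes play in a JavaScript string literal, and where re.escape escapes literal text for you.

```python
import re

samples = "v1.0 v1x0 v1-0"

# Unescaped dot: matches any character in that position.
loose = re.findall(r"v1.0", samples)
print(loose)  # ['v1.0', 'v1x0', 'v1-0']

# Escaped dot: matches only a literal period.
strict = re.findall(r"v1\.0", samples)
print(strict)  # ['v1.0']

# String-literal layer: a raw string avoids doubling the backslash.
same_pattern = ("\\d" == r"\d")
print(same_pattern)  # True

# re.escape handles the punctuation when the text comes from elsewhere.
user_text = "1+1=2?"
literal_hit = re.search(re.escape(user_text), "is 1+1=2? yes") is not None
print(literal_hit)  # True
```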

A worked example: build a date regex step by step

Let us say you want to match a simple date in YYYY-MM-DD format.

Start too small:

\d+

That matches digits, but it is far too loose. It will happily match 2026 inside a sentence and stop there.

Now shape the full pattern:

\d{4}-\d{2}-\d{2}

Better. This matches something that looks like 2026-06-06.

But it can still match inside longer text, so anchor it:

^\d{4}-\d{2}-\d{2}$

Now the whole string must look like that format.

Suppose you want the parts separately:

^(\d{4})-(\d{2})-(\d{2})$

Now you can capture year, month, and day as separate groups.

That is useful, but it still allows nonsense like 2026-99-99. You can tighten the ranges:

^(\d{4})-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$

This is a better pattern for the shape of the date:

  • month is 01 through 12
  • day is 01 through 31

It still does not know whether February has 29 days this year or whether April has 30 days. That is the point where regex has done its job and normal date parsing should take over.

This is a great example of the right boundary for regex. Use it to confirm the basic format. Do not force it to become a full calendar engine unless you truly have to.
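One way to draw that boundary in code is to let the regex check the shape and hand the numbers to a date library. The sketch below assumes Python's re and datetime modules; parse_iso_date is a hypothetical helper name invented for this example, not something the guide defines.

```python
import re
from datetime import date

# The final pattern from the walkthrough above.
DATE_SHAPE = re.compile(r"^(\d{4})-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$")

def parse_iso_date(text):
    """Return a date for a valid YYYY-MM-DD string, or None otherwise."""
    m = DATE_SHAPE.fullmatch(text)      # step 1: regex confirms the format
    if m is None:
        return None
    year, month, day = (int(part) for part in m.groups())
    try:
        return date(year, month, day)   # step 2: the calendar check
    except ValueError:
        return None                     # e.g. February 30

print(parse_iso_date("2026-06-06"))  # 2026-06-06
print(parse_iso_date("2026-99-99"))  # None (rejected by the regex)
print(parse_iso_date("2026-02-30"))  # None (rejected by the date check)
```

The regex stays short and readable, and questions like leap years are answered by code that already knows the calendar.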

Try it in your browser

Our Regex Tester is useful when you want to build a pattern step by step, try different flags, and see exactly what matched before you drop the regex into code.

That kind of feedback loop matters because regex mistakes are often tiny:

  • one missing escape
  • a greedy quantifier that should be lazy
  • a class that is too wide
  • anchors missing from a validation pattern

A browser tester makes those mistakes visible faster than squinting at a source file and guessing.

Common performance pitfalls

Most everyday regexes are fine. Performance trouble usually shows up when the pattern mixes broad wildcards, nested repetition, and lots of backtracking.

Classic trouble patterns look like this:

(.*)+

or

(a+)+

These can lead to catastrophic backtracking, where the engine tries huge numbers of paths before giving up on a near-match. On a large string, that can turn a tiny pattern into a real slowdown.

Some safer habits:

  • prefer specific character classes over .* when you know the allowed set
  • anchor validation patterns when possible
  • avoid nested greedy quantifiers unless you really understand the engine behavior
  • test with long failure cases, not just short success cases

If a regex might run on user input in a hot path, performance is part of correctness.
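A small timing experiment makes the problem visible. This sketch assumes Python's backtracking re engine and uses a hypothetical helper, time_failure; the input sizes are kept deliberately small, and the exact timings will vary by machine, so no numbers are shown.

```python
import re
import time

nested = re.compile(r"^(a+)+$")  # nested repetition: prone to heavy backtracking
flat = re.compile(r"^a+$")       # accepts the same strings without nesting

def time_failure(pattern, n):
    """Seconds spent failing to match n 'a' characters followed by 'b'."""
    text = "a" * n + "b"         # a near-match that must fail at the end
    start = time.perf_counter()
    result = pattern.match(text)
    elapsed = time.perf_counter() - start
    assert result is None
    return elapsed

for n in (10, 14, 18):
    slow = time_failure(nested, n)
    fast = time_failure(flat, n)
    print(f"n={n:2d}  nested={slow:.5f}s  flat={fast:.5f}s")
```

On a backtracking engine, the nested pattern's time typically grows steeply as n increases while the flat one stays roughly constant. Avoid raising n much further on such an engine, since the run time can grow very quickly.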

Common mistakes

Forgetting anchors in validation. \d{5} is not the same as ^\d{5}$.

Using . when you mean a literal dot. That one character changes the whole pattern.

Expecting regex to parse everything. Regex is great for format checks and extraction. It is not always the right tool for full parsing with nested grammar rules.

Ignoring engine differences. JavaScript, PCRE, Python, Rust, and RE2-style engines overlap a lot, but they do not match feature-for-feature.

Writing the full pattern in one shot. Stepwise construction is slower at the keyboard and much faster in debugging time.

FAQ

What is the difference between regex and plain text search?

Plain text search looks for exact characters. Regex describes a pattern, so it can match many possible strings at once.

Why does my regex match inside a longer string?

Because many engines search for a match anywhere by default. Use anchors like ^ and $ when you want full-string validation.

What does .* mean?

It means "zero or more of any character," usually except newline unless a flag changes that behavior. It is powerful and easy to overuse.

Can regex validate an email address?

Only to a point. A simple regex can catch obvious mistakes, but full email validation gets complicated quickly. In most apps, a moderate format check plus a real verification step is better.

Why is my regex fast on some inputs and slow on others?

Because certain failing inputs can trigger heavy backtracking. Success cases may look fast while worst-case failures expose the real cost.
