Regular expressions in PHP (PCRE) — syntax and common patterns

Intermediate · 5 min read · php-03-005
interview

Concept

PHP's preg_* functions use the PCRE (Perl-Compatible Regular Expressions) library. A pattern string is written as delimiter + pattern + delimiter + optional modifiers. The delimiter can be any character that is not alphanumeric, a backslash, or whitespace (most commonly /, #, ~, or @). Choosing # or ~ is useful when matching URLs, because the / characters in the pattern then need no escaping.

Core syntax: . matches any character except a newline (unless the s modifier is set), * is zero-or-more, + one-or-more, ? zero-or-one, and {n,m} a bounded range. ^ anchors to the start and $ to the end of the subject ($ also matches just before a final trailing newline). Character classes: [abc], negated [^abc]. Shorthand: \d digit, \w word character (letter, digit, or underscore), \s whitespace, and their uppercase negations \D, \W, \S. Parentheses capture; (?:...) is a non-capturing group.
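A minimal sketch of these building blocks in use. The sample strings and variable names ($isShortHex, $fields, $file, and so on) are made up for illustration; preg_split() is not mentioned in this section but is part of the same preg_* family.

```php
<?php
declare(strict_types=1);

// Quantifiers: {3,6} bounds the repeat count; ^ and $ anchor the whole subject.
$isShortHex = preg_match('/^[0-9a-f]{3,6}$/', 'ff00aa');   // 1
$tooLong    = preg_match('/^[0-9a-f]{3,6}$/', 'ff00aa11'); // 0

// Negated class: [^,]+ takes a run of anything that is not a comma.
preg_match_all('/[^,]+/', 'red,green,blue', $fields);
// $fields[0] is ['red', 'green', 'blue']

// Shorthand classes: \d+ pulls out runs of digits, \s+ splits on whitespace.
preg_match_all('/\d+/', 'Order 66 shipped in 3 days', $numbers);
// $numbers[0] is ['66', '3']
$tokens = preg_split('/\s+/', "a  b\tc");
// $tokens is ['a', 'b', 'c']

// (?:...) groups without capturing, so the only numbered group is the extension.
preg_match('/^(?:img|photo)_\d+\.(png|jpg)$/', 'photo_42.jpg', $file);
// $file is ['photo_42.jpg', 'jpg']
```

Because the (?:img|photo) group does not capture, the extension lands in $file[1] rather than $file[2].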

Modifiers: i case-insensitive, m multi-line (^/$ match line boundaries), s dotall (. matches newline too), u UTF-8 mode (critical for non-ASCII input), x extended (whitespace ignored, allows comments).
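The i, m, s, and x modifiers can be seen side by side in a short sketch like the one below. The log text, the HTML fragment, and the variable names are invented for the example. Modifiers go after the closing delimiter and can be combined.

```php
<?php
declare(strict_types=1);

$log = "info: started\nERROR: disk full\nerror: retrying";

// i + m: case-insensitive, and ^ / $ match at the start / end of every line.
preg_match_all('/^error: (.+)$/im', $log, $errors);
// $errors[1] is ['disk full', 'retrying']

// Without m, ^ only matches at the very start of the subject.
$withoutM = preg_match('/^error/i', $log); // 0

// s: . also matches "\n", so .* can span lines.
$block = "<p>line one\nline two</p>";
$noDotall = preg_match('/<p>(.*)<\/p>/', $block);         // 0
$dotall   = preg_match('/<p>(.*)<\/p>/s', $block, $para); // 1
// $para[1] is "line one\nline two"

// x: whitespace in the pattern is ignored and # starts a comment.
$time = '/
    ^ (\d{2})   # hours
    : (\d{2})   # minutes
    $
/x';
preg_match($time, '09:30', $t);
// $t is ['09:30', '09', '30']
```

One thing to watch with x: a comment inside the pattern must not contain the delimiter character, since it would end the pattern early.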

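Lookaheads and lookbehinds: (?=...) positive lookahead, (?!...) negative lookahead, (?<=...) positive lookbehind, (?<!...) negative lookbehind. These assert what comes after or before the current position without consuming characters, so the asserted text is not part of the match.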
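The following sketch exercises each kind of assertion on made-up input. It also uses the negative lookbehind (?<!...), the counterpart of (?<=...). The password rule at the end is only an illustrative policy, not a recommendation.

```php
<?php
declare(strict_types=1);

$text = 'cost: $40, refund: $15, qty: 3';

// Positive lookbehind: digits preceded by "$", without including the "$".
preg_match_all('/(?<=\$)\d+/', $text, $prices);
// $prices[0] is ['40', '15']

// Negative lookbehind: digits NOT preceded by "$" or by another digit.
preg_match_all('/(?<![$\d])\d+/', $text, $plain);
// $plain[0] is ['3']

// Negative lookahead: "foo" only when it is not followed by "bar".
preg_match_all('/foo(?!bar)/', 'foobar foobaz', $hits, PREG_OFFSET_CAPTURE);
// $hits[0] is [['foo', 7]] (only the second "foo", at byte offset 7)

// Lookaheads stacked at the start act as independent conditions:
// at least 8 characters, containing a digit and an uppercase letter.
$strong = '/^(?=.*\d)(?=.*[A-Z]).{8,}$/';
$ok  = preg_match($strong, 'Passw0rdX'); // 1
$bad = preg_match($strong, 'password');  // 0
```

Without the \d inside (?<![$\d]), the second example would also return "0" and "5", because those digits are preceded by a digit rather than by "$".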

Greedy vs lazy: quantifiers are greedy by default (match as much as possible). Append ? to make them lazy (*?, +?). Classic mistake: using <.+> to match HTML tags — it matches from the first < to the LAST > in the string.

Code Example

php
<?php
declare(strict_types=1);

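// Simplified email check (filter_var() with FILTER_VALIDATE_EMAIL is usually a better fit for real validation)
// The i modifier already covers uppercase, so the ranges only need a-z
$emailPattern = '/^[a-z0-9._%+\-]+@[a-z0-9.\-]+\.[a-z]{2,}$/i';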
var_dump(preg_match($emailPattern, 'user@example.com')); // int(1)
var_dump(preg_match($emailPattern, 'not-an-email'));      // int(0)

// Capture groups
$datePattern = '/^(\d{4})-(\d{2})-(\d{2})$/';
if (preg_match($datePattern, '2024-01-15', $matches)) {
    [$full, $year, $month, $day] = $matches;
    echo "$day/$month/$year"; // "15/01/2024"
}

// Named captures — more readable than positional
$pattern = '/^(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})$/';
preg_match($pattern, '2024-01-15', $m);
echo $m['year'];  // "2024"
echo $m['month']; // "01"

// UTF-8 mode — REQUIRED for non-ASCII matching
$text = 'Héllo wörld';
preg_match_all('/\w+/u', $text, $words); // /u enables UTF-8
print_r($words[0]); // ["Héllo", "wörld"] — correct

// Without /u the subject is treated as single bytes, so \w usually stops at é and ö
preg_match_all('/\w+/', $text, $words_broken);
// Typically ["H", "llo", "w", "rld"]; the exact result depends on the locale

// Greedy vs lazy
$html = '<b>bold</b> and <b>more</b>';
preg_match('/<.+>/', $html, $greedy);   // "<b>bold</b> and <b>more</b>" — too much
preg_match('/<.+?>/', $html, $lazy);    // "<b>" — just the first tag

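// Lookahead: match letters followed by a digit (without consuming the digit)
// [a-z]+ rather than \w+, because \w also matches digits and would return "laravel1"
preg_match_all('/[a-z]+(?=\d)/', 'php8 laravel11', $m);
// $m[0] is ["php", "laravel"]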

// URL-friendly delimiter to avoid escaping slashes
$url = 'https://example.com/path/to/page';
preg_match('#https?://([^/]+)(.*)#', $url, $m);
echo $m[1]; // "example.com"
echo $m[2]; // "/path/to/page"