
Multibyte strings — mbstring extension, mb_strlen vs strlen

Intermediate · 5 min read · php-03-007
interview

Concept

The mbstring extension provides multibyte-aware string functions: the safe replacement for PHP's byte-oriented string builtins when dealing with UTF-8, UTF-16, Shift-JIS, or other multibyte encodings. The standard functions strlen, substr, and strtolower operate on bytes, so on non-ASCII text they miscount, split characters mid-sequence, or skip case conversion, all without raising an error.

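The core problem: PHP's strlen('é') returns 2 because é is encoded as 2 bytes in UTF-8. substr('café', 0, 4) returns 'caf' plus only the first byte of the é sequence, which is a broken string (substr('café', 0, 3) happens to be safe only because the first three characters are single-byte ASCII). strtolower('HÉLLO') leaves É uppercase because it only converts the ASCII range (unconditionally since PHP 8.2; in earlier versions the result depended on the current locale).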
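To make the byte-level behaviour visible, here is a minimal sketch that dumps the UTF-8 bytes of é and then validates a byte-based slice with mb_check_encoding. It assumes the source file is saved as UTF-8 and that the mbstring extension is loaded; the variable names are illustrative only.

```php
<?php
declare(strict_types=1);

// Assumption: this file is saved as UTF-8 and ext-mbstring is loaded.
$word = 'café';

echo bin2hex('é'), PHP_EOL;              // c3a9: two bytes for one character
echo strlen($word), PHP_EOL;             // 5 (bytes)
echo mb_strlen($word, 'UTF-8'), PHP_EOL; // 4 (characters)

// Cutting at byte 4 keeps the first byte of é and drops the second.
$byteSlice = substr($word, 0, 4);
$charSlice = mb_substr($word, 0, 4, 'UTF-8');

var_dump(mb_check_encoding($byteSlice, 'UTF-8')); // bool(false): dangling 0xC3 byte
var_dump(mb_check_encoding($charSlice, 'UTF-8')); // bool(true)
```

Passing 'UTF-8' explicitly keeps the sketch independent of whatever internal encoding the runtime happens to be configured with.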

Key mb_* functions: mb_strlen (character count, not bytes), mb_substr (character-based slicing), mb_strtolower/mb_strtoupper (Unicode-aware case conversion), mb_strpos (character-based search), mb_convert_encoding (transcode between encodings), mb_detect_encoding (guess encoding from byte patterns; a heuristic, not a guarantee), mb_internal_encoding (get/set the default encoding for all mb functions).

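Best practice: Make UTF-8 the default encoding explicitly. The mbstring.internal_encoding ini setting has been deprecated since PHP 5.6 in favour of default_charset, which already defaults to UTF-8, so set default_charset = "UTF-8" in php.ini (or call mb_internal_encoding('UTF-8') at bootstrap). This makes all mb_* functions default to UTF-8 without needing the encoding parameter on every call.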
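The bootstrap step can be sketched roughly as follows. The helper name ensureUtf8Defaults() is hypothetical and not part of PHP or any framework; it only wraps the real extension_loaded() and mb_internal_encoding() calls.

```php
<?php
declare(strict_types=1);

// Hypothetical bootstrap helper; the name is illustrative only.
function ensureUtf8Defaults(): void
{
    // Fail early instead of silently falling back to byte-based functions.
    if (!extension_loaded('mbstring')) {
        throw new RuntimeException('The mbstring extension is required.');
    }

    // Applies to every later mb_* call that omits its encoding argument.
    mb_internal_encoding('UTF-8');
}

ensureUtf8Defaults();

echo mb_internal_encoding(), PHP_EOL; // UTF-8
echo mb_strlen('Wörld'), PHP_EOL;     // 5
```

Calling this once at application start keeps the rest of the code free of repeated encoding arguments, while the extension check turns a missing mbstring into an immediate, visible failure.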

Laravel's Str helper class (Illuminate\Support\Str) uses mb_* functions where appropriate: Str::length() calls mb_strlen, Str::lower() calls mb_strtolower. Prefer Str::* helpers in Laravel code over raw string functions.

Code Example

php
<?php
declare(strict_types=1);

mb_internal_encoding('UTF-8');

$str = 'Héllo, Wörld! 🌍';

// Byte count vs character count
echo strlen($str);     // 20 bytes (é=2, ö=2, emoji=4)
echo mb_strlen($str);  // 15 characters

// Safe substring
echo substr($str, 0, 2);     // "H" + the first byte of é (invalid UTF-8)
echo substr($str, 0, 5);     // "Héll": 5 bytes, but only 4 characters
echo mb_substr($str, 0, 5);  // "Héllo"

// Case conversion
echo strtolower('HÉLLO');    // "hÉllo" — É not lowercased
echo mb_strtolower('HÉLLO'); // "héllo" — correct

// String search
$haystack = 'Üniversität';
echo strpos($haystack, 'si');     // 7: byte offset (Ü occupies 2 bytes)
echo mb_strpos($haystack, 'si');  // 6: character offset

// Encoding detection and conversion
$latin1 = "\xE9\xE8\xEA"; // é, è, ê in ISO-8859-1
$encoding = mb_detect_encoding($latin1, ['UTF-8', 'ISO-8859-1'], strict: true);
echo $encoding; // "ISO-8859-1"

$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
echo $utf8; // "éèê" as valid UTF-8

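// String padding (mb-safe). Not named mb_str_pad: PHP 8.3+ ships a native
// function with that name, so redeclaring it would be a fatal error.
function mb_pad_right(string $str, int $length, string $pad = ' '): string
{
    $padLen = max(0, $length - mb_strlen($str));
    // Repeat the pad string, then trim to the exact number of characters needed.
    return $str . mb_substr(str_repeat($pad, $padLen), 0, $padLen);
}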

// Truncate with ellipsis — character-safe
function truncate(string $str, int $maxChars, string $ellipsis = '…'): string
{
    if (mb_strlen($str) <= $maxChars) {
        return $str;
    }
    // max(0, ...) avoids a negative length when $maxChars is shorter than the ellipsis.
    return mb_substr($str, 0, max(0, $maxChars - mb_strlen($ellipsis))) . $ellipsis;
}

echo truncate('This is a long string with unicode: héllo', 20);
// "This is a long stri…"