
How PHP strings work internally (byte strings, not Unicode)

Intermediate · 5 min read · php-03-001
interview · compare

Concept

PHP strings are byte strings: sequences of arbitrary bytes with no inherent encoding. PHP has no native Unicode string type. A PHP string is an array of bytes stored together with its length, so character "position 3" means byte offset 3, which may fall in the middle of a multi-byte UTF-8 character. This is a common root cause of internationalization bugs in PHP codebases.
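To make the byte-offset point concrete, here is a minimal sketch. The variable names are illustrative only, and the `\u{E9}` escape is used so the result does not depend on how the source file itself is encoded.

```php
<?php
declare(strict_types=1);

// "née" written with an escape: \u{E9} expands to the UTF-8 bytes C3 A9.
$word = "n\u{E9}e";

// Offset access indexes bytes, not characters.
$secondByte = $word[1];

echo strlen($word), "\n";        // 4 (n + two bytes for é + e)
echo bin2hex($word), "\n";       // 6ec3a965
echo bin2hex($secondByte), "\n"; // c3 (the lead byte of é, not a whole character)
```

Offset 1 lands on the first half of the two-byte character, which is why printing `$word[1]` on its own produces invalid UTF-8.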

Internally, a PHP string is a zend_string heap allocation containing a reference count (in the GC header), an h field (the cached hash used for hash-table lookups), a len field (the byte count), and a val array (the raw bytes, NUL-terminated for C interop). The hash is not computed up front; it is computed lazily on first use and then cached.
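One consequence of storing the length explicitly, rather than relying on the trailing NUL, is that a string may contain NUL bytes anywhere in its body. The short sketch below illustrates this from userland; it does not inspect zend_string itself, and the variable name is only for illustration.

```php
<?php
declare(strict_types=1);

// A string with an embedded NUL byte in the middle.
$withNul = "ab\0cd";

echo strlen($withNul), "\n";    // 5 (the NUL counts as an ordinary byte)
echo bin2hex($withNul), "\n";   // 6162006364
var_dump($withNul === "ab");    // bool(false): comparison does not stop at the NUL
```

The terminating NUL in val exists only so that C code can treat the buffer as a C string; PHP itself always relies on len.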

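strlen() returns byte count, not character count. For ASCII text the two are the same. In UTF-8, a single accented character or emoji may take 2–4 bytes, so strlen('é') returns 2, not 1 (assuming the source file is saved as UTF-8). The core string functions, including substr, strpos, strtolower, and ucfirst, operate on bytes. On multi-byte text, substr can cut a character in half, strpos returns byte offsets rather than character offsets, and strtolower and ucfirst leave non-ASCII letters unchanged (since PHP 8.2 they are ASCII-only regardless of locale). The mbstring extension provides encoding-aware equivalents (mb_strlen, mb_substr, etc.).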
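A frequent bug is mixing the two families, for example passing a character offset from an mb_* function to a byte-based function. The following is a small sketch of that mistake, assuming the mbstring extension is loaded; the variable names are hypothetical.

```php
<?php
declare(strict_types=1);

// "été 2024": each é is two bytes in UTF-8.
$text = "\u{E9}t\u{E9} 2024";

$bytePos = strpos($text, '2024');                // 6 (byte offset)
$charPos = mb_strpos($text, '2024', 0, 'UTF-8'); // 4 (character offset)

// Wrong: a character offset handed to a byte-based function
// starts the slice in the middle of the second é.
$broken = substr($text, $charPos);

// Right: keep offsets and slicing in the same family.
$correct = mb_substr($text, $charPos, null, 'UTF-8');

var_dump(mb_check_encoding($broken, 'UTF-8')); // bool(false)
echo $correct, "\n";                           // 2024
```

Passing 'UTF-8' explicitly keeps the result independent of the configured default encoding.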

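Practical rules: Use mb_* functions whenever you need character semantics on text that may contain non-ASCII characters, such as user input. Keep default_charset = "UTF-8" in php.ini (this is the default); the older mbstring.internal_encoding setting has been deprecated since PHP 5.6. For data from external sources, mb_detect_encoding can suggest an encoding, but it is a heuristic; mb_check_encoding tells you whether a string is actually valid in a given encoding. Laravel's Str helper class (Illuminate\Support\Str, not a facade) uses mb_* functions in methods such as Str::length and Str::lower, which is why those methods handle UTF-8 safely.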
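As one way to apply these rules at an input boundary, here is a hedged sketch of a validation helper. The function name `toUtf8` and the ISO-8859-1 fallback are assumptions made for illustration; the right fallback depends on where your data comes from.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: return $raw as valid UTF-8.
 * Assumes that anything which is not valid UTF-8 is in $fallback encoding.
 */
function toUtf8(string $raw, string $fallback = 'ISO-8859-1'): string
{
    // Already valid UTF-8: return it unchanged.
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw;
    }

    // Otherwise reinterpret the bytes using the assumed fallback encoding.
    $converted = mb_convert_encoding($raw, 'UTF-8', $fallback);
    if (!is_string($converted)) {
        throw new RuntimeException('Could not convert input to UTF-8');
    }

    return $converted;
}

// "café" as ISO-8859-1 bytes (é is the single byte E9).
$latin1 = "caf\xE9";

echo bin2hex(toUtf8($latin1)), "\n"; // 636166c3a9
```

Checking validity first avoids double-encoding strings that were already UTF-8, which is the usual failure when conversion is applied unconditionally.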

Code Example

php
<?php
declare(strict_types=1);

$str = 'Héllo'; // UTF-8: H=1 byte, é=2 bytes, l=1, l=1, o=1 → total 6 bytes

// Byte-based functions — WRONG for multi-byte text
echo strlen($str);          // 6  (bytes, not characters!)
echo substr($str, 0, 2);    // "H" + first byte of é → garbled output
echo strtolower('HÉLLO');   // "hÉllo": É is not lowercased (non-ASCII bytes are left alone)

// MB-safe functions — CORRECT
echo mb_strlen($str);           // 5  (characters)
echo mb_substr($str, 0, 2);     // "Hé" (2 characters, 3 bytes)
echo mb_strtolower('HÉLLO');    // "héllo"

// String comparison is byte-by-byte
$a = 'café';
$b = "cafe\u{0301}"; // 'e' + combining accent — same visual, different bytes
var_dump($a === $b);  // false — different byte sequences

// Checking encoding
$input = "user input from form";
$encoding = mb_detect_encoding($input, ['UTF-8', 'ISO-8859-1'], strict: true);

// Binary-safe operations
$binary = file_get_contents('/path/to/image.png');
if ($binary !== false) {     // file_get_contents returns false on failure
    echo strlen($binary);    // byte size, which is correct for binary data
}

// PHP string as binary buffer
$packed = pack('N', 1024); // 4-byte big-endian integer
echo bin2hex($packed);     // "00000400"
$unpacked = unpack('N', $packed); // [1 => 1024] (unnamed fields are keyed from 1)