How PHP strings work internally (byte strings, not Unicode)
Concept
PHP strings are byte strings — sequences of arbitrary bytes with no inherent encoding. PHP has no native Unicode string type. A PHP string is just a length-prefixed array of bytes; character "position 3" means byte offset 3, which may be the middle of a multi-byte UTF-8 character. This is the root cause of most internationalization bugs in PHP codebases.
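To make the "position means byte offset" point concrete, here is a minimal sketch. The variable names are illustrative only, and the string is written with a \u{...} escape so the bytes do not depend on how the source file is saved.

```php
<?php
declare(strict_types=1);

// "\u{00E9}" is é (U+00E9), encoded in UTF-8 as the two bytes 0xC3 0xA9.
$word = "h\u{00E9}llo";

// Offset access returns single bytes, not characters.
$first  = $word[0];   // "h", a complete ASCII character
$second = $word[1];   // the lone byte 0xC3, only the first half of é

echo bin2hex($second), PHP_EOL;                  // c3
var_dump(mb_check_encoding($second, 'UTF-8'));   // bool(false): not valid UTF-8 on its own
echo bin2hex(substr($word, 1, 2)), PHP_EOL;      // c3a9, both bytes of é
```

Offset 1 is a valid position in the byte array but not a character boundary, which is exactly how truncated or garbled characters end up in output.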
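Internally, a PHP string is a zend_string allocation containing: a reference count (held in the shared GC header), an h field (a cached hash for hash-table lookups), a len field (the byte count), and a val array (the raw bytes, followed by a terminating NUL for C interop). The hash is not computed up front; it is computed lazily on first use and then cached. Because the length is stored explicitly rather than inferred from the terminator, a string may contain NUL bytes and strlen() runs in constant time.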
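The zend_string layout itself is not visible from PHP code, but one consequence of storing the length explicitly is easy to observe. The following is a small sketch with illustrative variable names:

```php
<?php
declare(strict_types=1);

// A string with an embedded NUL byte; double quotes enable the \0 escape.
$data = "ab\0cd";

echo strlen($data), PHP_EOL;    // 5: the NUL in the middle is ordinary data
echo bin2hex($data), PHP_EOL;   // 6162006364

// Equal bytes give equal strings, regardless of how each one was built.
$copy = 'ab' . chr(0) . 'cd';
var_dump($data === $copy);      // bool(true)
```

A C-style string would end at the first NUL; a PHP string does not, which is what makes it usable as a binary buffer.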
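strlen() returns the byte count, not the character count. For ASCII text the two are the same. In UTF-8, an accented Latin character typically takes 2 bytes and an emoji 4, so strlen('é') returns 2, not 1 (assuming the source file is saved as UTF-8). The core string functions (substr, strpos, strtolower, ucfirst, and so on) operate on bytes, and the effect on multi-byte text depends on the function. Length- and offset-based functions such as substr can cut a character in half and produce invalid UTF-8. Case functions such as strtolower and ucfirst leave non-ASCII letters unchanged as of PHP 8.2 and were locale-dependent before that. strpos finds the right match but reports it as a byte offset. The mbstring extension provides encoding-aware equivalents (mb_strlen, mb_substr, etc.).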
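One case worth illustrating is strpos, whose result looks fine until it is combined with an mb_* function. This is a minimal sketch; the sample text and variable names are made up for the example, and the encoding is passed explicitly rather than relying on configuration.

```php
<?php
declare(strict_types=1);

// "naïve café": ï and é each take 2 bytes in UTF-8.
$text = "na\u{00EF}ve caf\u{00E9}";

// strpos counts bytes: n, a, ï (2 bytes), v, e, space = 7 bytes before "caf".
$bytePos = strpos($text, 'caf');
// mb_strpos counts characters: 6 characters before "caf".
$charPos = mb_strpos($text, 'caf', 0, 'UTF-8');

// Mixing the two units silently skips one character too many.
$wrong = mb_substr($text, $bytePos, null, 'UTF-8');
$right = mb_substr($text, $charPos, null, 'UTF-8');

echo $bytePos, ' ', $charPos, PHP_EOL;   // 7 6
echo $wrong, ' ', $right, PHP_EOL;       // afé café
```

The safe habit is to keep byte offsets with byte functions and character offsets with mb_* functions, and never pass one kind to the other.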
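Practical rules: Use mb_* functions when handling text that may contain non-ASCII characters, which includes almost all user input. Set default_charset = "UTF-8" in php.ini (it has been the default since PHP 5.6); the older mbstring.internal_encoding directive is deprecated. For data from external sources, validate with mb_check_encoding, and treat mb_detect_encoding as a best-effort guess, since it cannot reliably tell similar encodings apart. Laravel's Str helper (the Illuminate\Support\Str class, which is a plain static helper rather than a facade) uses mb_* functions in many of its methods, such as Str::length and Str::upper, which is why those methods handle UTF-8 safely.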
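As a sketch of what validating at the boundary can look like: the helper below is hypothetical (to_utf8 and its $fallback parameter are not part of PHP or any library), and it assumes the caller knows which legacy encoding to fall back to, here ISO-8859-1 as an example.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: returns $raw as valid UTF-8.
 * $fallback is the encoding assumed when $raw is not valid UTF-8;
 * it is a guess that the caller has to supply.
 */
function to_utf8(string $raw, string $fallback = 'ISO-8859-1'): string
{
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw; // already valid UTF-8, keep as-is
    }
    return mb_convert_encoding($raw, 'UTF-8', $fallback);
}

$latin1 = "caf\xE9";                        // "café" in ISO-8859-1: é is the single byte 0xE9
$clean  = to_utf8($latin1);

echo bin2hex($clean), PHP_EOL;              // 636166c3a9
echo mb_strlen($clean, 'UTF-8'), PHP_EOL;   // 4
```

Checking validity first, rather than asking mb_detect_encoding to guess, keeps already-valid UTF-8 untouched and confines the guesswork to the fallback the caller chose.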
Code Example
<?php
declare(strict_types=1);
$str = 'Héllo'; // UTF-8: H=1 byte, é=2 bytes, l=1, l=1, o=1 → total 6 bytes
// Byte-based functions — WRONG for multi-byte text
echo strlen($str); // 6 (bytes, not characters!)
echo substr($str, 0, 2); // "H" + first byte of é → garbled output
echo strtolower('HÉLLO'); // "hÉllo": É is left unchanged (ASCII-only since PHP 8.2; locale-dependent before)
// MB-safe functions — CORRECT
echo mb_strlen($str); // 5 (characters)
echo mb_substr($str, 0, 2); // "Hé" (2 characters, 3 bytes)
echo mb_strtolower('HÉLLO'); // "héllo"
// String comparison is byte-by-byte
$a = "caf\u{00E9}"; // precomposed é (U+00E9), written as an escape so the bytes are unambiguous
$b = "cafe\u{0301}"; // 'e' + combining acute accent (U+0301): same visual, different bytes
var_dump($a === $b); // false — different byte sequences
// Checking encoding
$input = "user input from form";
// Heuristic: returns the name of the best-matching candidate, or false if none matches
$encoding = mb_detect_encoding($input, ['UTF-8', 'ISO-8859-1'], strict: true);
// Binary-safe operations
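$binary = file_get_contents('/path/to/image.png'); // returns false on failure
if ($binary !== false) { // strlen(false) would throw a TypeError under strict_types
    echo strlen($binary); // byte size, which is exactly what binary data needs
}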
// PHP string as binary buffer
$packed = pack('N', 1024); // 4-byte big-endian integer
echo bin2hex($packed); // "00000400"
$unpacked = unpack('N', $packed); // [1 => 1024]: unnamed fields are keyed starting at 1
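The byte-by-byte comparison shown in the listing means that visually identical strings can compare as different. One common way to handle this is Unicode normalization, sketched below. The helper name same_text is hypothetical, and the sketch assumes the intl extension (which provides the Normalizer class) is installed; without it, the helper falls back to a plain byte comparison.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: compares two UTF-8 strings after normalizing both to NFC.
 * Assumes the intl extension; falls back to byte comparison if it is missing
 * or if either input cannot be normalized (for example, invalid UTF-8).
 */
function same_text(string $a, string $b): bool
{
    if (!class_exists('Normalizer')) {
        return $a === $b;
    }
    $na = Normalizer::normalize($a, Normalizer::FORM_C);
    $nb = Normalizer::normalize($b, Normalizer::FORM_C);
    if ($na === false || $nb === false) {
        return $a === $b;
    }
    return $na === $nb;
}

$precomposed = "caf\u{00E9}";    // é as one code point, U+00E9
$decomposed  = "cafe\u{0301}";   // e followed by U+0301, combining acute accent

var_dump($precomposed === $decomposed);           // bool(false)
var_dump(same_text($precomposed, $decomposed));   // bool(true) when intl is loaded
```

Normalizing once when data enters the system, rather than at every comparison, is usually the simpler design, since stored values can then be compared with === directly.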