0

Sharding — splitting a database horizontally across multiple servers

Advanced5 min read·eng-15-018
sqlperformanceinterview

Concept

Sharding — horizontally partitioning a database by splitting rows across multiple separate database servers (shards) based on a shard key.

Partitioning vs sharding:

  • Partitioning: Splitting a table across multiple files/disks on the SAME server. Database handles it transparently.
  • Sharding: Splitting data across MULTIPLE SERVERS. Application must know which server to query.

The shard key: The column used to determine which shard a row belongs to. user_id, tenant_id, or geographic_region. Choosing the right shard key is critical.

Sharding strategies:

  • Hash-based: shard = hash(user_id) % num_shards. Distributes evenly but resharding (adding shards) is painful.
  • Range-based: Shard 1: user_id 1-1M, Shard 2: 1M-2M. Easy to understand but can lead to hot spots.
  • Directory-based: A lookup table maps entity → shard. Flexible but lookup table is a bottleneck.

Why shard:

  • Single database server hit a limit: too many rows, too many writes for one machine.
  • Write throughput exceeds what one primary can handle.
  • Dataset too large for one server's storage/RAM.

Sharding problems:

  • Cross-shard queries: Joining users from shard 1 with orders from shard 2 requires a fan-out query to both shards and merging results in PHP.
  • Distributed transactions: Transactions across shards require 2-phase commit (complex) or eventual consistency.
  • Resharding: Adding a new shard requires rebalancing all data.
  • Hotspots: If one shard has much more traffic (a viral user), it overloads.

When to shard: Late-stage scaling concern. Most applications never need sharding. Try vertical scaling, caching, read replicas, and table partitioning first.

Code Example

php
<?php
// SHARDING IMPLEMENTATION IN LARAVEL
// config/database.php — define multiple shard connections
return [
    'connections' => [
        'shard_0' => ['driver' => 'mysql', 'host' => 'shard0.internal', 'database' => 'app_shard0', /* ... */],
        'shard_1' => ['driver' => 'mysql', 'host' => 'shard1.internal', 'database' => 'app_shard1', /* ... */],
        'shard_2' => ['driver' => 'mysql', 'host' => 'shard2.internal', 'database' => 'app_shard2', /* ... */],
    ],
];

// Shard router — determines which connection to use
class ShardRouter
{
    private const NUM_SHARDS = 3;

    public function connectionFor(int $userId): string
    {
        $shardId = $userId % self::NUM_SHARDS; // hash-based: 0, 1, or 2
        return "shard_{$shardId}";
    }
}

// Sharded repository
class ShardedUserRepository
{
    private ShardRouter $router;

    public function find(int $userId): ?array
    {
        $connection = $this->router->connectionFor($userId);
        return \DB::connection($connection)
                   ->table('users')
                   ->where('id', $userId)
                   ->first();
    }

    public function create(array $data): array
    {
        $userId     = $data['id'] ?? $this->generateId();
        $connection = $this->router->connectionFor($userId);
        \DB::connection($connection)->table('users')->insert($data + ['id' => $userId]);
        return $data + ['id' => $userId];
    }

    // CROSS-SHARD QUERY — fan-out to all shards
    public function findByEmail(string $email): ?array
    {
        foreach (range(0, 2) as $shardId) {
            $result = \DB::connection("shard_{$shardId}")
                          ->table('users')
                          ->where('email', $email)
                          ->first();
            if ($result) return (array) $result;
        }
        return null;
    }
}

// MULTI-TENANT SHARDING (common pattern)
class TenantShardMiddleware
{
    public function handle(Request $request, \Closure $next): mixed
    {
        $tenant     = $request->user()->tenant;
        $shardId    = $tenant->shard_id; // stored in a central directory
        $connection = "shard_{$shardId}";

        // Set the default connection for this request
        \DB::setDefaultConnection($connection);
        config(['database.default' => $connection]);

        return $next($request);
    }
}

// TABLE PARTITIONING (simpler alternative to sharding — one server)
// MySQL partition by range of created_at
\DB::statement("
    ALTER TABLE events PARTITION BY RANGE (YEAR(created_at)) (
        PARTITION p2023 VALUES LESS THAN (2024),
        PARTITION p2024 VALUES LESS THAN (2025),
        PARTITION p2025 VALUES LESS THAN (2026),
        PARTITION pmax  VALUES LESS THAN MAXVALUE
    )
");
// Queries with WHERE created_at BETWEEN '2024-01-01' AND '2024-12-31'
// only scan the p2024 partition — much faster!