0

Cardinality — the number of unique values in a column (affects index choice)

Intermediate5 min read·eng-15-007
sqlinterviewperformance

Concept

Cardinality — the number of distinct/unique values in a column. High cardinality = many unique values. Low cardinality = few unique values.

Why cardinality matters for indexes:

  • High cardinality columns make EXCELLENT index targets — the index can eliminate most rows quickly.
  • Low cardinality columns make POOR index targets — an index on a boolean column might return 50% of all rows on any query.

Cardinality examples:

  • id (PRIMARY KEY): Maximum cardinality — every value is unique. Perfect index target.
  • email: High cardinality — typically unique. Good index target.
  • created_at (timestamp): High cardinality — thousands of distinct values. Good for range queries.
  • country_code: Medium cardinality — ~200 distinct values. Index useful if you query by specific country.
  • status (enum: pending/paid/shipped/cancelled): Low cardinality — only 4 values. Poor index target.
  • is_active (boolean): Extreme low cardinality — only 2 values. Rarely useful as an index.
  • gender (M/F/Other): Very low cardinality. Index rarely beneficial.

When low cardinality STILL warrants an index:

  • Highly selective filter: WHERE is_active = 1 AND user_id = 42 — even if is_active alone is low-cardinality, combined with a high-cardinality column in a composite index, it can be very selective.
  • Very skewed distribution: 99.9% of rows are status = 'delivered' and 0.1% are status = 'pending'. An index can quickly find that 0.1%.

Partial index (PostgreSQL): Create an index only for specific values. CREATE INDEX ON orders (status) WHERE status = 'pending' — tiny index covering only pending orders.

Check cardinality with SQL:

sql
SELECT COUNT(DISTINCT status) FROM orders;  -- cardinality of status column

Code Example

php
<?php
// HIGH CARDINALITY — make great index targets
Schema::table('users', function (Blueprint $table) {
    $table->unique('email');       // cardinality = total users, very high — excellent!
    $table->index('last_login_at'); // many distinct timestamps — good for range queries
});

// Effective query: SELECT * FROM users WHERE email = 'alice@example.com'
// Index on email eliminates all but 1 row immediately — O(log n) → O(1)

// LOW CARDINALITY — poor standalone index targets
Schema::table('users', function (Blueprint $table) {
    // $table->index('is_active'); // ← AVOID: only 2 values, scans half the table
    // $table->index('gender');    // ← AVOID: 3-4 values, not selective enough
});

// BUT — composite index saves it!
// Low cardinality column + high cardinality = selective together
$table->index(['is_active', 'user_id']); // user_id is high cardinality → selective!

// Checking cardinality in code
$statusCardinality = \DB::selectOne('SELECT COUNT(DISTINCT status) as n FROM orders')->n;
// If n = 4 (only 4 statuses), don't index status alone

$emailCardinality = \DB::selectOne('SELECT COUNT(DISTINCT email) / COUNT(*) as ratio FROM users')->ratio;
// ratio close to 1.0 = high cardinality = good index target

// EXPLAIN shows the optimizer's decision
$result = \DB::select('EXPLAIN SELECT * FROM orders WHERE status = ?', ['pending']);
// If type = 'ALL' even with index on 'status' → optimizer chose full scan (low cardinality)
// The optimizer says: "50% of rows match 'pending'? Faster to scan all than use index."

// PARTIAL INDEX (PostgreSQL) — for skewed data
\DB::statement("
    CREATE INDEX idx_orders_pending ON orders (created_at)
    WHERE status = 'pending'
"); // Index ONLY for pending orders — tiny and very selective!

// Useful cardinality SQL patterns:
// SELECT column, COUNT(*) FROM table GROUP BY column ORDER BY COUNT(*) DESC;
// → Shows value distribution — low cardinality if few distinct rows