Let's dive deep into indexes and shared buffers in databases, particularly PostgreSQL. But before we get there, let's first understand what covering indexes are and how they impact performance.
A covering index is an index that stores extra, non-key columns alongside the indexed key. So when you query for a particular column and that column is already included in the index, the database doesn't need to go back to the table on disk and perform another I/O operation; it can return that column straight from the index itself (an index-only scan).
```sql
-- Regular index
CREATE INDEX idx_user_email ON users (email);

-- Covering index (includes additional columns)
CREATE INDEX idx_user_email_covering ON users (email)
INCLUDE (first_name, last_name, created_at);
```
With the covering index, this query can be satisfied entirely from the index:
```sql
SELECT first_name, last_name, created_at
FROM users
WHERE email = 'john@example.com';
```
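If you want to confirm that the query really is being satisfied from the index alone, you can ask the planner. A minimal check, assuming the users table and the idx_user_email_covering index from above exist and have been analyzed:

```sql
-- Look for "Index Only Scan using idx_user_email_covering" in the plan.
-- The exact plan depends on table statistics and visibility-map freshness.
EXPLAIN (ANALYZE, BUFFERS)
SELECT first_name, last_name, created_at
FROM users
WHERE email = 'john@example.com';
```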
Here's an important thing to remember: with a multi-column index, we cannot skip the leading (first) key column and filter only on the second or third column while still getting an efficient index lookup.
```sql
-- This index is created on (last_name, first_name, email)
CREATE INDEX idx_name_email ON users (last_name, first_name, email);

-- This will use the index
SELECT * FROM users WHERE last_name = 'Smith';

-- This will also use the index
SELECT * FROM users WHERE last_name = 'Smith' AND first_name = 'John';

-- This will NOT use the index efficiently
SELECT * FROM users WHERE first_name = 'John';

-- This will also NOT use the index
SELECT * FROM users WHERE email = 'john@example.com';
```
So if we filter only on a non-leading column, the index is no longer useful for that lookup, and the planner will typically fall back to a sequential scan over the whole table unless there's a separate index created specifically for that column.
Since we're storing extra columns inside the index data structure, the index itself naturally gets bigger; how much bigger depends on the width of the included columns and the number of rows in the table. The conclusion here is: covering indexes can be large.
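To get a feel for that size difference, you can compare the two indexes from the earlier example directly (just a sketch; the index names are the ones defined above):

```sql
-- Compare on-disk size of the plain index vs. the covering index
SELECT pg_size_pretty(pg_relation_size('idx_user_email'))          AS regular_index_size,
       pg_size_pretty(pg_relation_size('idx_user_email_covering')) AS covering_index_size;
```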
Now let's talk about shared buffers. The shared buffer pool is where PostgreSQL caches table and index pages in memory, so the most frequently used pages can be served from RAM instead of hitting the disk on every access.
```sql
-- Check current shared_buffers setting
SHOW shared_buffers;
-- Default: 128MB (which is quite small for production!)
```
By default, shared_buffers in PostgreSQL is only 128MB. Here's what happens:

Let's assume we're constantly querying a particular table and only reading a few columns (the most popular ones). We can create a covering index for them, but here's the problem: if the table has a really large number of rows, the index may not fit in shared buffers.
```sql
# In postgresql.conf
shared_buffers = 2GB    # Good rule of thumb: ~25% of total system memory
```
A good rule of thumb is to set it to around 25% of total system memory. So in production we almost always have to raise the shared buffer limit so that our hot indexes actually fit in memory.
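A rough way to check whether your hot indexes are actually staying resident in shared buffers is the block hit ratio from the statistics views. A sketch, using the users table from the running example (note that reads missed here may still be served by the OS page cache):

```sql
-- Index block hit ratio for one table; values well below ~0.99 on a
-- frequently queried table suggest its indexes are not fitting in shared buffers.
SELECT relname,
       idx_blks_hit,
       idx_blks_read,
       round(idx_blks_hit::numeric / nullif(idx_blks_hit + idx_blks_read, 0), 4) AS idx_hit_ratio
FROM pg_statio_user_tables
WHERE relname = 'users';
```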
But there's only so much memory a system can have, right? With extremely large datasets, even a generously sized shared buffer pool may not hold the indexes. In that case, another approach is to partition the table.
```sql
-- Example: Partitioning by date range
CREATE TABLE orders (
    order_id     SERIAL,
    customer_id  INT,
    order_date   DATE,
    total_amount DECIMAL
) PARTITION BY RANGE (order_date);

-- Create partitions
CREATE TABLE orders_2024_q1 PARTITION OF orders
    FOR VALUES FROM ('2024-01-01') TO ('2024-04-01');

CREATE TABLE orders_2024_q2 PARTITION OF orders
    FOR VALUES FROM ('2024-04-01') TO ('2024-07-01');

-- Each partition gets its own indexes
CREATE INDEX idx_orders_2024_q1_customer ON orders_2024_q1 (customer_id);
CREATE INDEX idx_orders_2024_q2_customer ON orders_2024_q2 (customer_id);
```
We could partition the table by a key that matches how we query it, such as a date range (as in the example above). This way, each partition gets its own, much smaller index that can easily fit in memory, so the indexes for the popular partitions stay cached and queries are served quickly.
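You can verify that a query really touches only its relevant partition by looking at the plan. A quick sketch against the orders table defined above; with partition pruning, only orders_2024_q1 should show up for a Q1 date range:

```sql
-- Only the orders_2024_q1 partition (and its index) should appear in this plan.
EXPLAIN
SELECT customer_id, total_amount
FROM orders
WHERE order_date >= DATE '2024-01-01'
  AND order_date <  DATE '2024-04-01';
```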
Here's another interesting thing I observed in PostgreSQL: let's say you're updating rows at a really large scale, or constantly updating the same rows. Every such update interacts with a structure known as the visibility map.
```sql
-- Example of frequent updates
UPDATE users SET last_login = NOW() WHERE user_id = 123;
UPDATE users SET login_count = login_count + 1 WHERE user_id = 123;
```
The visibility map records, for each page of a table, whether every row version on that page is visible to all transactions, i.e. whether there might be other versions of the same rows still to consider. This comes down to the MVCC (Multi-Version Concurrency Control) architecture of PostgreSQL.
So when you update a row, PostgreSQL clears the visibility-map bit for that page. Now when a query comes in as an index-only scan, the database first checks the visibility map and makes a decision: if the page is still marked all-visible, it can answer from the index alone; if not, it has to verify row visibility against the table itself.

That verification is a heap fetch, and heap fetches are expensive!
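You can see this cost directly: on an index-only scan, EXPLAIN ANALYZE reports a "Heap Fetches" counter, which is the number of times the scan had to fall back to the table to check row visibility. A sketch using the covering index from earlier:

```sql
-- A large "Heap Fetches" value here means the visibility map is stale,
-- so the "index-only" scan keeps visiting the heap anyway.
EXPLAIN ANALYZE
SELECT first_name, last_name, created_at
FROM users
WHERE email = 'john@example.com';
```

The way to bring that number back down is to vacuum the table: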
```sql
-- Manual vacuum
VACUUM ANALYZE users;

-- Check when tables were last vacuumed
SELECT relname, last_vacuum, last_autovacuum, n_tup_upd, n_tup_del
FROM pg_stat_user_tables
WHERE relname = 'users';
```
The only way the visibility-map bit gets set again is through a VACUUM operation. When vacuum runs, PostgreSQL removes the dead row versions left behind by updates and marks the pages whose remaining rows are visible to everyone as all-visible in the visibility map.

Once that's done, index-only scans can again be served straight from the index in memory, without the extra heap fetches.
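If you're curious how much of a table the visibility map currently covers, pg_class keeps a statistics-based count of all-visible pages. A rough check for the users table:

```sql
-- relallvisible close to relpages means index-only scans will rarely need heap fetches.
-- These counters are estimates, refreshed by VACUUM and ANALYZE.
SELECT relname,
       relpages,
       relallvisible,
       round(relallvisible::numeric / nullif(relpages, 0), 4) AS frac_all_visible
FROM pg_class
WHERE relname = 'users';
```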
If your updates are really frequent, there's fine-tuning you can do on autovacuum:
```sql
-- Check current autovacuum settings
SELECT name, setting, unit, context
FROM pg_settings
WHERE name LIKE 'autovacuum%';

-- Adjust autovacuum frequency for specific tables
ALTER TABLE users SET (
    autovacuum_vacuum_scale_factor = 0.1,   -- Vacuum when 10% of rows change
    autovacuum_analyze_scale_factor = 0.05  -- Analyze when 5% of rows change
);
```
But honestly, this shouldn't be that much of a problem in practice: often we're not reading the exact rows we're updating, and even when we are, it's usually not at a large scale but more of an occasional, scattered access pattern.
These are just a few observations about PostgreSQL performance optimization. The key is understanding your workload patterns and tuning accordingly!
Understanding these concepts helps you make better decisions about indexing strategies and memory allocation in PostgreSQL deployments.