Let's dive deep into indexes and shared buffers in databases, particularly PostgreSQL. But before we get there, let's first understand what covering indexes are and how they impact performance.
Understanding Covering Indexes
A covering index is an index that stores additional columns alongside the indexed key. So when you query for a column that's already included in the index, the database doesn't need to go back to the table on disk and perform an extra I/O operation: it can return that column directly from the index itself. PostgreSQL calls this an index-only scan.
Example of a Covering Index
```sql
-- Regular index
CREATE INDEX idx_user_email ON users (email);

-- Covering index (includes additional columns)
CREATE INDEX idx_user_email_covering ON users (email)
INCLUDE (first_name, last_name, created_at);
```
With the covering index, this query can be satisfied entirely from the index:
```sql
SELECT first_name, last_name, created_at
FROM users
WHERE email = 'john@example.com';
```
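You can verify this in the query plan. A quick sanity check, assuming the idx_user_email_covering index from above (and a recently vacuumed table, which matters for reasons we'll get to later):

```sql
-- If the index covers the query, the plan shows an "Index Only Scan"
EXPLAIN (COSTS OFF)
SELECT first_name, last_name, created_at
FROM users
WHERE email = 'john@example.com';
```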
Important Limitation: Column Order Matters
Here's an important thing to remember: with a multi-column index, we cannot skip the first (leading) column and search directly on the second or third column.
```sql
-- This index is created on (last_name, first_name, email)
CREATE INDEX idx_name_email ON users (last_name, first_name, email);

-- This will use the index
SELECT * FROM users WHERE last_name = 'Smith';

-- This will also use the index
SELECT * FROM users WHERE last_name = 'Smith' AND first_name = 'John';

-- This will NOT use the index efficiently
SELECT * FROM users WHERE first_name = 'John';

-- This will also NOT use the index
SELECT * FROM users WHERE email = 'john@example.com';
```
So if we try to skip the leading column, the index is no longer useful for the search, and the database will typically fall back to a sequential scan over the whole table unless there's a separate index created specifically for that column. Note that this leftmost-prefix rule applies to key columns; columns added via INCLUDE are payload only and can never be searched through the index at all, as the sketch below shows.
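To make that last distinction concrete, here's a small sketch reusing the covering index from earlier:

```sql
-- INCLUDE columns are stored in the index but are not part of the search key.
-- idx_user_email_covering can satisfy the SELECT list here:
SELECT first_name, last_name FROM users WHERE email = 'john@example.com';

-- ...but it cannot be used to search on an INCLUDE column, so this query
-- falls back to a sequential scan (unless some other index covers first_name):
SELECT email FROM users WHERE first_name = 'John';
```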
The Size Trade-off
Since we're storing extra columns inside the index data structure, it's quite obvious that the index gets larger: its size grows with both the number of rows and the width of the included columns. The conclusion here is simple: covering indexes can be large.
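You can measure this directly with PostgreSQL's size functions, reusing the example index names from above:

```sql
-- Compare the size of the table against its plain and covering indexes
SELECT pg_size_pretty(pg_relation_size('users'))                   AS table_size,
       pg_size_pretty(pg_relation_size('idx_user_email'))          AS plain_index_size,
       pg_size_pretty(pg_relation_size('idx_user_email_covering')) AS covering_index_size;
```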
Shared Buffers: The Memory Cache
Now let's talk about shared buffers. Shared buffers are PostgreSQL's in-memory page cache: the database keeps the most frequently used table and index pages there so it won't have to go to disk as often.
Default Configuration
```sql
-- Check current shared_buffers setting
SHOW shared_buffers;

-- Default: 128MB (which is quite small for production!)
```
The size of shared buffers by default in PostgreSQL is 128MB. Here's what happens:
- Buffer Hit: If the required query data is already in the shared buffer → No I/O operation (best case scenario!)
- Buffer Miss: If the data is not in the buffer → Database has to go to disk to fetch it
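PostgreSQL tracks hits and misses per database, so you can see how often you land in the good case. A simple check using the standard pg_stat_database view:

```sql
-- Buffer cache hit ratio for the current database (closer to 100% is better)
SELECT datname,
       blks_hit,
       blks_read,
       round(blks_hit * 100.0 / nullif(blks_hit + blks_read, 0), 2) AS hit_ratio_pct
FROM pg_stat_database
WHERE datname = current_database();
```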
The Memory Challenge with Large Indexes
Let's assume we're constantly querying a particular table, and say we're only fetching a few columns (the most popular ones). We can create a covering index, but here's the problem: if the table has a really large number of rows, the index won't fit in shared buffers.
Solution 1: Increase Shared Buffer Size
```sql
-- In postgresql.conf
shared_buffers = 2GB

-- Good rule of thumb: ~25% of total system memory
```
A good rule of thumb is to set it around 25% of the total system memory. So obviously in production, we have to change the shared buffer limit so that our indexes actually fit into memory.
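If you want to see what's actually occupying shared buffers, the pg_buffercache contrib extension exposes the cache contents (a sketch, assuming the extension is available on your install and you have permission to create it):

```sql
CREATE EXTENSION IF NOT EXISTS pg_buffercache;

-- Top 10 relations by number of 8KB buffers currently cached
SELECT c.relname, count(*) AS buffers
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
WHERE b.reldatabase IN (0, (SELECT oid FROM pg_database
                            WHERE datname = current_database()))
GROUP BY c.relname
ORDER BY buffers DESC
LIMIT 10;
```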
Solution 2: Database Partitioning
But there's only so much memory a system can have, right? If we're dealing with extremely large datasets, even if we increase the shared buffer to really large values, often the indexes might not fit into memory. Then another approach would be partitioning your database.
```sql
-- Example: Partitioning by date range
CREATE TABLE orders (
    order_id     SERIAL,
    customer_id  INT,
    order_date   DATE,
    total_amount DECIMAL
) PARTITION BY RANGE (order_date);

-- Create partitions
CREATE TABLE orders_2024_q1 PARTITION OF orders
    FOR VALUES FROM ('2024-01-01') TO ('2024-04-01');

CREATE TABLE orders_2024_q2 PARTITION OF orders
    FOR VALUES FROM ('2024-04-01') TO ('2024-07-01');

-- Each partition gets its own indexes
CREATE INDEX idx_orders_2024_q1_customer ON orders_2024_q1 (customer_id);
CREATE INDEX idx_orders_2024_q2_customer ON orders_2024_q2 (customer_id);
```
We could partition the table by:
- Range-based strategy (range partitioning)
- Hash-based strategy (hash partitioning)
This way, each partition gets its own smaller index that can easily fit into memory, keeping the popular indexes cached and serving queries quickly.
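A nice side effect: when a query filters on the partition key, PostgreSQL only touches the relevant partitions (partition pruning). You can confirm it in the plan with the orders table from above:

```sql
-- Partition pruning: the plan should scan only orders_2024_q1
EXPLAIN (COSTS OFF)
SELECT * FROM orders
WHERE order_date >= '2024-02-01' AND order_date < '2024-03-01';
```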
MVCC and Visibility Maps: The Update Challenge
Here's another interesting thing I observed in PostgreSQL: say you're updating rows at a really large scale, or updating the same rows very frequently. Every update interacts with something known as the visibility map in PostgreSQL.
Understanding MVCC Impact
```sql
-- Example of frequent updates
UPDATE users SET last_login = NOW() WHERE user_id = 123;
UPDATE users SET login_count = login_count + 1 WHERE user_id = 123;
```
The visibility map tracks, for each page of a table, whether every row version on that page is visible to all transactions. Multiple row versions exist in the first place because of the MVCC (Multi-Version Concurrency Control) architecture of PostgreSQL: an UPDATE doesn't overwrite a row in place, it writes a new version of it.
So when you update a row, PostgreSQL clears the all-visible bit for that page in the visibility map. Now when an index-only scan comes along, the database first checks the visibility map and makes a decision:
- Serve the data directly from the index (the page is marked all-visible), OR
- Actually go to the table and verify which version of the row is visible to the current transaction
To do that verification, the scan performs a heap fetch, which is expensive!
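You can see this cost in the plan: for an index-only scan, EXPLAIN ANALYZE reports a Heap Fetches counter. After heavy updates it climbs; after a vacuum it should drop back toward zero (again assuming the users index from earlier):

```sql
EXPLAIN ANALYZE
SELECT first_name, last_name, created_at
FROM users
WHERE email = 'john@example.com';
-- Look for:  Index Only Scan using idx_user_email_covering on users ...
--              Heap Fetches: <n>  -- nonzero means pages weren't all-visible
```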
The Vacuum Solution
```sql
-- Manual vacuum
VACUUM ANALYZE users;

-- Check when tables were last vacuumed
SELECT relname, last_vacuum, last_autovacuum, n_tup_upd, n_tup_del
FROM pg_stat_user_tables
WHERE relname = 'users';
```
The way these all-visible bits get set again is through a VACUUM operation. When a vacuum runs, PostgreSQL will:
- Remove all the dead row versions that are no longer visible to any transaction
- Mark the cleaned-up pages as all-visible again in the visibility map
Once that's done, the database can again serve those queries straight from the index, with no heap fetches.
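One way to see where a table currently stands is the planner statistics in pg_class, which VACUUM keeps up to date:

```sql
-- Fraction of the table's pages marked all-visible in the visibility map
SELECT relpages,
       relallvisible,
       round(relallvisible * 100.0 / greatest(relpages, 1), 1) AS pct_all_visible
FROM pg_class
WHERE relname = 'users';
```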
Fine-tuning Auto Vacuum
If your updates are really frequent, then there's fine-tuning that can be done on auto vacuum:
```sql
-- Check current autovacuum settings
SELECT name, setting, unit, context
FROM pg_settings
WHERE name LIKE 'autovacuum%';

-- Adjust autovacuum frequency for specific tables
ALTER TABLE users SET (
    autovacuum_vacuum_scale_factor = 0.1,   -- Vacuum when 10% of rows change
    autovacuum_analyze_scale_factor = 0.05  -- Analyze when 5% of rows change
);
```
But honestly, this shouldn't be that much of a problem in most workloads: we're usually not reading exactly the same rows we're updating through index-only scans. And even when we are, it tends to be a small, randomized slice of the table rather than something happening at large scale.
Key Takeaways
- Covering indexes are powerful but can become large – use them wisely for frequently accessed column combinations
- Shared buffers should be sized appropriately (~25% of system RAM) to keep hot indexes in memory
- Partitioning helps when dealing with extremely large datasets that don't fit in memory
- MVCC and visibility maps can impact index-only scans during heavy update workloads
- Regular vacuuming is essential to maintain optimal query performance
These are just a few observations about PostgreSQL performance optimization. The key is understanding your workload patterns and tuning accordingly!
Understanding these concepts helps you make better decisions about indexing strategies and memory allocation in PostgreSQL deployments.