Let's dive deep into indexes and shared buffers in databases, particularly PostgreSQL. But before we get there, let's first understand what covering indexes are and how they impact performance.
A covering index is an index that stores extra, non-key columns alongside the indexed key. So when you query for a particular column and that column is already included in the index, the database doesn't need to go back to the table on disk and perform another I/O operation; it can return that column straight from the index itself (an index-only scan).
```sql
-- Regular index
CREATE INDEX idx_user_email ON users (email);

-- Covering index (includes additional columns)
CREATE INDEX idx_user_email_covering ON users (email)
INCLUDE (first_name, last_name, created_at);
```
With the covering index, this query can be satisfied entirely from the index:
```sql
SELECT first_name, last_name, created_at
FROM users
WHERE email = 'john@example.com';
```
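If you want to confirm that the query really is being satisfied from the index alone, you can ask the planner. A minimal check, assuming the users table and the idx_user_email_covering index from above exist and have been analyzed:

```sql
-- Look for "Index Only Scan using idx_user_email_covering" in the plan.
-- The exact plan depends on table statistics and visibility-map freshness.
EXPLAIN (ANALYZE, BUFFERS)
SELECT first_name, last_name, created_at
FROM users
WHERE email = 'john@example.com';
```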
Here's an important thing to remember: with a multi-column index, we cannot skip the leading (first) key column and filter only on the second or third column while still getting an efficient index lookup.
```sql
-- This index is created on (last_name, first_name, email)
CREATE INDEX idx_name_email ON users (last_name, first_name, email);

-- This will use the index
SELECT * FROM users WHERE last_name = 'Smith';

-- This will also use the index
SELECT * FROM users WHERE last_name = 'Smith' AND first_name = 'John';

-- This will NOT use the index efficiently
SELECT * FROM users WHERE first_name = 'John';

-- This will also NOT use the index
SELECT * FROM users WHERE email = 'john@example.com';
```
So if we filter only on a non-leading column, the index is no longer useful for that lookup, and the planner will typically fall back to a sequential scan over the whole table unless there's a separate index created specifically for that column.
Since we're storing extra columns inside the index data structure, the index itself naturally gets bigger; how much bigger depends on the width of the included columns and the number of rows in the table. The conclusion here is: covering indexes can be large.
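To get a feel for that size difference, you can compare the two indexes from the earlier example directly (just a sketch; the index names are the ones defined above):

```sql
-- Compare on-disk size of the plain index vs. the covering index
SELECT pg_size_pretty(pg_relation_size('idx_user_email'))          AS regular_index_size,
       pg_size_pretty(pg_relation_size('idx_user_email_covering')) AS covering_index_size;
```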
Now let's talk about shared buffers. The shared buffer pool is where PostgreSQL caches table and index pages in memory, so the most frequently used pages can be served from RAM instead of hitting the disk on every access.
```sql
-- Check current shared_buffers setting
SHOW shared_buffers;
-- Default: 128MB (which is quite small for production!)
```
By default, shared_buffers in PostgreSQL is only 128MB. Here's what happens:

Let's assume we're constantly querying a particular table and only reading a few columns (the most popular ones). We can create a covering index for them, but here's the problem: if the table has a really large number of rows, the index may not fit in shared buffers.
```sql
# In postgresql.conf
shared_buffers = 2GB    # Good rule of thumb: ~25% of total system memory
```
A good rule of thumb is to set it to around 25% of total system memory. So in production we almost always have to raise the shared buffer limit so that our hot indexes actually fit in memory.
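A rough way to check whether your hot indexes are actually staying resident in shared buffers is the block hit ratio from the statistics views. A sketch, using the users table from the running example (note that reads missed here may still be served by the OS page cache):

```sql
-- Index block hit ratio for one table; values well below ~0.99 on a
-- frequently queried table suggest its indexes are not fitting in shared buffers.
SELECT relname,
       idx_blks_hit,
       idx_blks_read,
       round(idx_blks_hit::numeric / nullif(idx_blks_hit + idx_blks_read, 0), 4) AS idx_hit_ratio
FROM pg_statio_user_tables
WHERE relname = 'users';
```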
But there's only so much memory a system can have, right? With extremely large datasets, even a generously sized shared buffer pool may not hold the indexes. In that case, another approach is to partition the table.
```sql
-- Example: Partitioning by date range
CREATE TABLE orders (
    order_id     SERIAL,
    customer_id  INT,
    order_date   DATE,
    total_amount DECIMAL
) PARTITION BY RANGE (order_date);

-- Create partitions
CREATE TABLE orders_2024_q1 PARTITION OF orders
    FOR VALUES FROM ('2024-01-01') TO ('2024-04-01');

CREATE TABLE orders_2024_q2 PARTITION OF orders
    FOR VALUES FROM ('2024-04-01') TO ('2024-07-01');

-- Each partition gets its own indexes
CREATE INDEX idx_orders_2024_q1_customer ON orders_2024_q1 (customer_id);
CREATE INDEX idx_orders_2024_q2_customer ON orders_2024_q2 (customer_id);
```
We could partition the table by a key that matches how we query it, such as a date range (as in the example above). This way, each partition gets its own, much smaller index that can easily fit in memory, so the indexes for the popular partitions stay cached and queries are served quickly.
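You can verify that a query really touches only its relevant partition by looking at the plan. A quick sketch against the orders table defined above; with partition pruning, only orders_2024_q1 should show up for a Q1 date range:

```sql
-- Only the orders_2024_q1 partition (and its index) should appear in this plan.
EXPLAIN
SELECT customer_id, total_amount
FROM orders
WHERE order_date >= DATE '2024-01-01'
  AND order_date <  DATE '2024-04-01';
```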
Here's another interesting thing I observed in PostgreSQL: let's say you're updating rows at a really large scale, or constantly updating the same rows. Every such update interacts with a structure known as the visibility map.
```sql
-- Example of frequent updates
UPDATE users SET last_login = NOW() WHERE user_id = 123;
UPDATE users SET login_count = login_count + 1 WHERE user_id = 123;
```
The visibility map records, for each page of a table, whether every row version on that page is visible to all transactions, i.e. whether there might be other versions of the same rows still to consider. This comes down to the MVCC (Multi-Version Concurrency Control) architecture of PostgreSQL.
So when you update a row, PostgreSQL clears the visibility-map bit for that page. Now when a query comes in as an index-only scan, the database first checks the visibility map and makes a decision: if the page is still marked all-visible, it can answer from the index alone; if not, it has to verify row visibility against the table itself.

That verification is a heap fetch, and heap fetches are expensive!
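You can see this cost directly: on an index-only scan, EXPLAIN ANALYZE reports a "Heap Fetches" counter, which is the number of times the scan had to fall back to the table to check row visibility. A sketch using the covering index from earlier:

```sql
-- A large "Heap Fetches" value here means the visibility map is stale,
-- so the "index-only" scan keeps visiting the heap anyway.
EXPLAIN ANALYZE
SELECT first_name, last_name, created_at
FROM users
WHERE email = 'john@example.com';
```

The way to bring that number back down is to vacuum the table: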
```sql
-- Manual vacuum
VACUUM ANALYZE users;

-- Check when tables were last vacuumed
SELECT relname, last_vacuum, last_autovacuum, n_tup_upd, n_tup_del
FROM pg_stat_user_tables
WHERE relname = 'users';
```
The only way the visibility-map bit gets set again is through a VACUUM operation. When vacuum runs, PostgreSQL removes the dead row versions left behind by updates and marks the pages whose remaining rows are visible to everyone as all-visible in the visibility map.

Once that's done, index-only scans can again be served straight from the index in memory, without the extra heap fetches.
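If you're curious how much of a table the visibility map currently covers, pg_class keeps a statistics-based count of all-visible pages. A rough check for the users table:

```sql
-- relallvisible close to relpages means index-only scans will rarely need heap fetches.
-- These counters are estimates, refreshed by VACUUM and ANALYZE.
SELECT relname,
       relpages,
       relallvisible,
       round(relallvisible::numeric / nullif(relpages, 0), 4) AS frac_all_visible
FROM pg_class
WHERE relname = 'users';
```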
If your updates are really frequent, there's fine-tuning you can do on autovacuum:
```sql
-- Check current autovacuum settings
SELECT name, setting, unit, context
FROM pg_settings
WHERE name LIKE 'autovacuum%';

-- Adjust autovacuum frequency for specific tables
ALTER TABLE users SET (
    autovacuum_vacuum_scale_factor = 0.1,   -- Vacuum when 10% of rows change
    autovacuum_analyze_scale_factor = 0.05  -- Analyze when 5% of rows change
);
```
But honestly, this shouldn't be that much of a problem in practice: often we're not reading the exact rows we're updating, and even when we are, it's usually not at a large scale but more of an occasional, scattered access pattern.
These are just a few observations about PostgreSQL performance optimization. The key is understanding your workload patterns and tuning accordingly!
Understanding these concepts helps you make better decisions about indexing strategies and memory allocation in PostgreSQL deployments.