Database indexing is a powerful technique that can help optimize query performance and speed up data retrieval. In this beginner’s guide, we’ll explore the basics of database indexes, including their structure, storage blocks, and searching complexity. We’ll also look at the advantages and disadvantages of using indexes, and introduce some best practices for working with indexes in your database management system.
By the end of this guide, you’ll have a solid understanding of how indexes work and how to leverage them to improve your database performance.
Read moreLong Story Short – Why Do We Need Database Indexes?
Database search requests can be slow and sluggish due to the sheer amount of data being stored. As the data grows, so does the need for a more sophisticated way to filter out irrelevant information – leading to longer processing times.
One of the ways of query optimization is indexing.
The concept of database indexes in a relational database model was described over 50 years ago. Let’s figure out how they work nowadays!
Search in a table without an index
Database tables have many records. Each record stores the fields you defined for the table. If you search for something by a not-indexed field, the search engine should read every record and check if it contains the value we need.
In the example above, if you search by DateOfBirth in a table of 1000 records, the search engine will need to check the DateOfBirth field in each of the 1K records.
The complexity is the same as for a search in an unordered array, which is O(N). It means that query execution time will grow proportionally to the number of blocks.
Search in a table with an index
To improve this, you can create an index for one or more columns. The values of the indexed columns will be copied and organized into a data structure optimized for search – usually, a Hash Set or a Binary Tree.
As well as in the original table, the index stores data in rows. Each row contains the copied values plus a pointer to the original table record:
When you search by an indexed column, the search engine utilizes the index instead of scanning the original table. The index stores significantly less data than the table does, so the search requires fewer reads from the disk. Obviously, fewer means faster.
The most significant performance improvement though is brought by the optimized search algorithms.
As the data is sorted and organized, the complexity of the search is O (log N). It means that query execution time will grow logarithmically.
Data Search Complexity Comparison
Search in a table | Search in an index | |
Search algorithm | Linear scan | Binary search |
Complexity | O (N) | O (log N) |
Execution time grow | Linear | Logarithmic |
The number of takes (on average) | x = (N+1)/2 10K – 5K 100K – 50K 1M – 500K | x = Log2(N) 10K – 14 100K – 17 1M – 20 |
The table structure you saw before is just a simplified logical representation. When it comes to physical data storage, databases work with data blocks, not records and fields.
Let’s open the hood
Databases Storage Blocks
A data block is a specific number of bytes of physical storage on a disk. It is used by the database management system (DBMS) to store records, tables, and other types of data.
For instance, in PostgreSQL, a block is of 8 Kb size by default. The data is stored across multiple blocks linked in the way of a Linked List. Each block holds its unique address on the disk and a pointer to the next block:
The linked list structure is required because data is not stored contiguously on physical storage. The blocks are scattered across the disk as stars over the sky, depending on the fragmentation level of the storage:
How Many Storage Blocks Are Required to Store a Database Table?
The size of a row with the overhead
Given the example table we discussed before:
Field | Data Type | Size, bytes |
Id | Number(10) | 10 |
FirstName | Char(50) | 50 |
LastName | Char(50) | 50 |
DateOfBirth | DATE | 7 |
Total: 117 |
We need 117 bytes of mean size to store a row.
A row overhead on the disk is about 10 bytes. It includes metadata like row directory, row header, and column size.
So, the overall row size is about 127 bytes.
How many rows can a block enclose?
A block overhead is formed by the header (at least 107 bytes, depending on the RDBMS and the database) and the sum of each row overhead it contains.
So, in the best case, one block of data can contain up to (8192 – 107)/127) ≈ 63 records. Let’s assume 60, for good measure.
How much does the table weigh?
Let’s assume there are 12 M records in the table. To store it, we need at least 12 000 000/60 = 200 000 blocks or 200 000 * 8 Kib ≈ 1.6 Gib.
Searching in the table
Let’s assume, we need to select all underaged users:
SELECT * FROM users WHERE add_months(DateOfBirth, 18 * 12) > CURRENT_DATE
Here will count just the most time-consuming operation – reading the blocks from the disk. If we have an HDD with stable throughput of 160 Mb/s, then to read 1.6 Gib Data from the disk, it takes:
1.6 Gb/[160 Mb/s] ≈ 10 seconds!
Quite long, huh?
A Table With The Right Index
We can think of an index as a small ordered table when it comes to storage. The index requires fewer block stores and as the result, fewer block accesses which dramatically improves the speed of search.
If we create an ordered index for the DateOfBirth field, the index row will take as little as 11 bytes:
Field | Data Type | Size, bytes |
DateOfBirth | DATE | 7 |
(record pointer) | 4 | |
Total: 11 |
Remember the row overhead of 10 bytes and the block overhead of 107 bytes? One block can store up to:
(8192-107)/21 ≈ 385 index records
To store the whole index we need:
12 000 000/385 ≈ 31 169 blocks; 31 169 blocks * 8Kb ≈ 250 Mb
It takes about 7 times less time to read 250 Mb from the disk than for 1.6 Gb. If the indexes were only about the size, then this one would provide 7x performance over the regular search.
250 MB/[160 Mb/s] ≈ 1.5625 seconds!
But the size is not everything when it comes to indexes and I’ve got here some great news for you!
The Order-of-Magnitude Better Performance Provided by Database Indexes
The 7x improvement of performance we’ve had in this theoretical experiment is much slower than we can actually get from indexes. As the indexes store their records in an ordered way, the database does not need to scan the whole index.
When it comes to an indexed search a binary algorithm is applied. The number of blocks the search engine must take to find the right records will be about
Log2(31169)≈ 15 block reads
15 * 8Kb / [160 Mb/s] ≈ 0.00075 s = 0.75 ms
Regular table scan | Index scan | Binary search in the index |
10,000 ms | 1,562.5 ms | 0.75 ms |
Chiefly, we made the calculations just by one factor. The real execution time will depend on multiple other factors, for example, the number of rows to return, the network performance, the speed of the processor and RAM capacity, etc.
Conclusion
Database indexes are a critical component of modern database management systems, allowing for faster data retrieval and more efficient query processing. In this beginner’s guide, we explore the basics of index structure, storage blocks, and searching complexity, providing a comprehensive overview of how indexes work and how they can be used to optimize database performance.
Whether you’re new to database management or looking to deepen your knowledge of indexing, this guide provides a solid foundation in database indexing that’s easy to understand. With the right knowledge and best practices, you can unleash the power of database indexes to take your database performance to the next level.
Here is a short summary of what we’ve learned:
- When executing a query, the database search engine builds an execution plan
- Firstly, the engine analyses the query and checks if there are any appropriate indexes to use
- Then, if the right index is found, instead of scanning the whole table, the database picks up the most effective search algorithm and works with the index to filter out the records
- Indexes have some similarities with tables: both have rows and columns, but the index row contains just a part of the columns and a reference to the original table row
- If built properly, indexes provide significantly better performance for read operations.
- They make data updates slower and require additional space on disk, though. However, it is another story
Learn more about RDBMs that benefit from database indexing, such as Oracle, PostgreSQL, MS SQL Server, and others.
Hey guys,
Leave your comments and subscribe to receive updates and the most recent posts!
Stay inspired!
Oh, how thoughtful of you to remind me to leave a comment and subscribe to your non-existent subscription button. I mean, sure, I’d love to subscribe to your updates and recent posts, but I can’t seem to find that magical button you’re talking about, amigo. Maybe you could add a little more info on where to find it? Or better yet, just let me know when you’ve actually got your act together and have a functioning blog worth subscribing to. No hard feelings, eh?
It can be easy to overlook the importance of database indexes, but they really are a game-changer when it comes to performance. What other topics would you like to see covered in a beginner’s guide to database management?
Oh no, wait a minute homie, I know indexes are important, but let’s not forget who we’re dealing with here, a beginner! Sure, data blocks and storage details might sound cool, but that’s not something you need to worry about just yet. Let’s focus on getting you to understand how to manage a database, and then we can start talking about more advanced stuff. Don’t sweat it compa, I got your back!