The SQL Server stores Data, lots of it. And you usually want to retrieve the data as fast as possible. So what are you going to do? You create an index on your data tables.
Well great. That’s a step in the right direction, and most likely you have improved the performance in the order of magnitudes. But what kind of index did you create and what are the differences? Is it a clustered index? Is the index covering your query? Why should it cover? What’s “new” in SQL Server 2005 or 2008? What’s a fill factor?
I wanted to cover everything in one article, but it looks like this is getting out of hand, so this will be a small series.
A table in the SQL Server can have 2 different kinds of indexes. Today we will take a closer look at the:
- Clustered Index
This index can only exist ONCE. It defines the “physical layout” of your data. Basically the clustered index is your table. All your data will be contained in the leaf nodes of a binary-tree (the clustered index).
Let’s look at the properties of a clustered index.
Pros:
· Extreme performance for selecting ranges of values ( between , > , < …)
· Good performance for foreign key columns
· Grouped columns
· Columns with very distinct values (NOT primary key)
· Data that is processed sequentially
· Columns that grow constantly (e.g. DateCreated, ID,…)
· Static data (Write once, change never)
· The Clustered index will always “cover” your query, since it contains all data
Cons:
· Bad for data with “random” inserts. This can cause “page splits”
· Lookup of specific Values (this can be handled better by a regular index)
· Dynamic data (Data that changes often. It will force the Serer to relocate the entire row, if the data changes)
· Might need extra space, if the index value is not constantly growing (Called “Fill Factor”).
· Must (well really should but it’s a very urgent should) be aligned with a partitioned table
What are page splits and fill factors? Those 2 terms are somewhat related to each other. I already mentioned that a clustered index will contain the data in its physical layout. So imagine a phonebook. This stores people by location & name. So it’s very easy to retrieve all “Smith in New York” (e.g. 10 pages).
But if we have a new Mr. Smith moving to NY, then we have a problem. We won’t be able to insert him into the phonebook, since there is no extra room. Instead we have to rearrange all persons and print a new phonebook. SQL Server is not that extreme, it will only process pages of data. Each data page is only 8 kb in size.
The fill factor will tell SQL Server how much of this space it should use when (re)creating an index. If we would pick the Location/Name as our clustered Index, then we know that we will most likely have some inserts/deletes/updates over time. So we will let the SQL Server know that it should keep some spare room on every page. This will allow us to insert our new Mr. Smith if he chooses to move to NY.
An important point to note is that a fill factor will only force an INITIAL amount of free space. When the index gets filled over time, since Mr. Smith is also bringing his whole clan to NY, the free space will decrease, until there is no more space left, and that will introduce the “page split”.
When a page split happens, the SQL Server will distribute the data from the page and move some of it to a new page, so there will be some room. After the new room is created, we can insert our Mr. Smith on the new, half empty page.
So you could avoid this problem by telling SQL Server to leave lots of room (e.g.: Fill factor 50%) for new data. Well… That’s not really that good either, since now we will spread the data over many more pages. And if we ask the server for “Smith in New York” again, it will now be forced to read twice as many pages (20) from the storage system to answer the same question. And the bad news is that IO is still the slowest part of your average Database…
This can be avoided, if we have a static, always increasing key on our clustered index, since the data will always be larger then the last one we inserted, we can set the fill factor to 100% and use all the space in the in the data pages.
Also a relation like order-order_items (Order_ID, Order_Position) is a good candidate for a clustered index. Since the orders don’t change often, once they are created. And the typical access is usually performed by coming from an order. And you can retrieve all related information without the need for an extra table lookup. Also all the Information will be close together, since the data is stored in the Index leave nodes. So there is no need to load any other datapages to answer the order query.
Something to note… If you create a primary key constraint on a table, SQL Server will automatically create a clustered index for you. This might not always be the optimal choice, so be aware of this.
So when should you create a clustered index? There is no right answer that can be given here. If it makes sense to keep the data together or if it will almost allways be retrieved in a “grouped” / aggregated way. So I would settle for a very general “It depends…”
That’s it for this post. I tried to keep it “simple”, so let me know if you like this style, or if you would prefer a more detailed approach with lots of SQL and DBCC statements etc…
On the next post I will take a closer at the regular indexes, covering queries and filtered index in SQL 2008.
P.s.: This is not my 1st language… So have mercy