5 Things I Never Wanted To Know About SQL Server

I've never loved SQL Server, but it's unfortunately the database I'm most familiar with. So here are a few things I learned about it that I didn't know before I started using it.

1. Adding new columns can be a metadata-only operation.

In some databases, adding a column requires rebuilding the entire table, which means that adding a column to a table with millions of records can take minutes or hours. In SQL Server, however, adding a column with a default value (or a nullable column) stores the default in the table's metadata rather than rewriting every row, so the operation is almost instantaneous.
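As a sketch (table1 and its columns are hypothetical), both of the following should complete near-instantly even on a huge table, because neither rewrites existing rows:

-- a nullable column: metadata-only
alter table table1 add column2 int null

-- a not null column with a default: on versions/editions that support it,
-- the default lives in the table's metadata rather than being written to every row
alter table table1 add column3 int not null default 0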

2. The ghost record "problem".

When SQL Server deletes a record, it doesn't actually delete it straight away. Instead, as an optimisation, it marks the record as a "ghost" record to be cleaned up later by a background task, the ghost cleanup process. Unfortunately, this has a big downside: queries need to skip over ghost records, and when there are massive numbers of them, this slows those queries down.

Unfortunately, SQL Server does not provide any means to effectively fine-tune ghost record cleanup, and certain things, such as long-running queries, can delay it for a long time.

A weird trick: if the table is very small (because records are frequently added and deleted, for example when the table is being used as an ad-hoc queue), this can be worked around by rebuilding the table, which effectively flushes all ghost records, as shown below.
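As a sketch (table1 is hypothetical), you can see how many ghost records are lingering with sys.dm_db_index_physical_stats, and flush them with a rebuild:

-- 'DETAILED' mode is needed to populate the ghost record counts
select index_id, ghost_record_count
from sys.dm_db_index_physical_stats(db_id(), object_id('table1'), null, null, 'DETAILED')

-- rebuilding the table (and its indexes) flushes the ghost records
alter table table1 rebuild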

3. Column order in indexes matters.

Something that I ought to have learned at university but never did is that the column order of indexes matters. This is because an index is a tree: entries are sorted by the first column, then by the second within that, and so on, so each additional column adds more fidelity to the sort order.

The way this can come in handy: if a table with two columns, columna and columnb, is queried with queries that always filter on columna but only sometimes on columnb, a single index on (columna, columnb) is sufficient to cover both cases. However, if columna and columnb are used independently, at least two indexes are needed to cover those usages, as the sketch below shows.
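As a sketch (the names are hypothetical), an index sorted on (columna, columnb) can serve any query that filters on its leading column, but not one that filters on columnb alone:

create index index1 on table1 (columna, columnb)

-- both of these can seek on index1, because columna leads the sort order
select * from table1 where columna = 1
select * from table1 where columna = 1 and columnb = 2

-- this one cannot seek on index1; it needs its own index on (columnb)
select * from table1 where columnb = 2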

4. Creating new indexes can re-use old indexes.

SQL Server can be clever sometimes: if you create a filtered index that is a subset of an existing index, it can use that existing index to build the new one without doing a table scan.

For example, if you have an index like:

create index index1 on table1 (column1, column2)

And then you create a new filtered index like:

create index index2 on table1 (column1) where column1 = 'foo'

Because of the shared dependency on column1, the existing index1 can be used to build index2, avoiding a table scan.

5. Weird strategies for updating many records in a giant table.

Updating very large tables in SQL Server often cannot be done in a single transaction, because the transaction log fills up (or the server runs out of memory) before the transaction can commit.

To get around this, we want to break the update into multiple smaller statements.

Let's say, for example, we wish to set column1 to 'bar' in every row of table1 where its value is 'foo'. The naive approach would be to run something like this repeatedly:

update top (10000) table1
set column1 = 'bar'
where column1 = 'foo'

The first time we run this statement, it executes relatively quickly, but each subsequent run gets slower and slower. This is because every run has to scan past all of the previously updated rows before it finds any that still match. We seem to be caught between a rock and a hard place: if we try to update everything at once, the query hangs, but if we run it repeatedly, we suffer an O(n^2) slowdown.

There is a solution! One option is to create a filtered index that covers the records we wish to transform, for example:

create index index1 on table1 (column1) where column1 = 'foo'

But!

  1. Creating the index can itself take too long and time out.

  2. Even once the index is created, the queries still get slower and slower. Why is that? It could be the ghost record problem mentioned above: each updated row leaves a ghost behind in the filtered index, and subsequent batches still have to skip over them.

So what can we do? We can use a column with a unique index as a watermark and update the table in sections.

e.g. we can repeatedly run a query like the one below, incrementing the @start and @end parameters in batches of 10,000 as we go along.

update table1
set column1 = 'bar'
where column1 = 'foo'
and sequence_column between @start and @end

This also works with unique indexes on uniqueidentifier columns. Put together, the whole thing might look like the sketch below.
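A minimal sketch, assuming sequence_column is a monotonically increasing integer with a unique index (the table and column names are carried over from the examples above; the loop itself is hypothetical):

declare @start int = 1
declare @batch int = 10000
declare @max int = (select max(sequence_column) from table1)

while @start <= @max
begin
    -- each iteration commits on its own, so the transaction log
    -- never has to hold the whole update at once
    update table1
    set column1 = 'bar'
    where column1 = 'foo'
    and sequence_column between @start and @start + @batch - 1

    set @start = @start + @batch
end

Because the batches walk the unique index in order, each statement seeks straight to its range instead of rescanning previously updated rows.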