Use of Checksums and SCDs in Data Warehousing

I’ve come across a number of articles that discuss the benefits and dangers of using the CHECKSUM or BINARY_CHECKSUM functions as an alternative method for SCD column comparisons. One blogger looks at these functions at a very basic level and shows that, for single letters of the alphabet, there is a high degree of collisions. Another blogger finds problems with BINARY_CHECKSUM collisions. An interesting alternative idea, suggested by David, is to perform a checksum against both the input value and its reverse, effectively doubling the key length.

By collisions, one means that the function generates the same hash output value for two different input values. This is true, but I feel these functions can still be used effectively for SCDs.

By way of introduction to these SQL Server system functions: the CHECKSUM and BINARY_CHECKSUM functions produce a 32-bit hash value from one or more input columns. The HASHBYTES function produces a hash value from a single input expression, with the length of the hash depending on the algorithm chosen – for example, MD5 gives a 128-bit hash and SHA1 a 160-bit hash.
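A minimal T-SQL illustration of the three functions and their output sizes (the literal inputs are arbitrary examples):

    -- Minimal sketch: the literal values are arbitrary; actual hash values vary by input and collation.
    SELECT
        CHECKSUM(N'Hello', 123)        AS checksum_value,         -- 32-bit integer
        BINARY_CHECKSUM(N'Hello', 123) AS binary_checksum_value,  -- 32-bit integer
        HASHBYTES('MD5',  N'Hello')    AS md5_hash,               -- 16 bytes = 128 bits
        HASHBYTES('SHA1', N'Hello')    AS sha1_hash;              -- 20 bytes = 160 bits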

Function restrictions:

  • HASHBYTES accepts only a single varchar, nvarchar or varbinary expression, but other columns can be cast to these data types and concatenated (see the sketch after this list).
  • CHECKSUM(*) and BINARY_CHECKSUM(*) return an error if any column is of a non-comparable data type: text, ntext, image, cursor, xml, or a non-comparable common language runtime (CLR) user-defined type.
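As a sketch of the casting point above (table and column names here are purely hypothetical), non-character columns can be converted and concatenated into a single expression for HASHBYTES, with a delimiter so adjacent values cannot run together:

    -- Hypothetical dimension: HASHBYTES takes a single expression, so cast and concatenate.
    -- (NULL handling, e.g. ISNULL(..., N''), is omitted here for brevity.)
    SELECT  CustomerKey,
            HASHBYTES('MD5',
                CONVERT(nvarchar(100), CustomerName)      + N'|' +
                CONVERT(nvarchar(30),  DateOfBirth, 126)  + N'|' +
                CONVERT(nvarchar(30),  CreditLimit)
            ) AS row_hash
    FROM    dbo.DimCustomer;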

Dangers:

  • CHECKSUM follows the column collation and, under the default case-insensitive collation, ignores the case of strings, thus CHECKSUM('HELLO') = CHECKSUM('hello') (see the example after this list).
  • BINARY_CHECKSUM and CHECKSUM are dependent on the order of columns, so you shouldn’t use CHECKSUM(*); always list the columns explicitly, e.g. CHECKSUM(column_A, column_B, column_C), in case the column order changes.
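The case-sensitivity point is easy to demonstrate, assuming the default case-insensitive collation (BINARY_CHECKSUM, which works on the underlying bytes, does register the change):

    -- Under a case-insensitive collation CHECKSUM treats the two strings as equal,
    -- whereas BINARY_CHECKSUM operates on the bytes and does not.
    SELECT
        CASE WHEN CHECKSUM('HELLO') = CHECKSUM('hello')
             THEN 'same' ELSE 'different' END AS checksum_result,         -- 'same'
        CASE WHEN BINARY_CHECKSUM('HELLO') = BINARY_CHECKSUM('hello')
             THEN 'same' ELSE 'different' END AS binary_checksum_result;  -- 'different'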

When using these functions to improve the performance of an SCD in a DWH, most people either reject them and use a full column-by-column SCD comparison (if they are of a more cautious disposition) or use just one of CHECKSUM or BINARY_CHECKSUM (the more courageous, and perhaps foolhardy, choice).

I’m suggesting that, due to the differing nature of their calculations, using both of these functions together will improve the correctness of change detection in an SCD, thus decreasing the collision rate. Further to that, one would use the unique keys of rows in the table to merge and compare last-load checksum values with current-load checksum values. In our real-life data we found, with a moderately wide dimension table (70 attribute columns) of 10 million rows, that there were no collisions.
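A sketch of what this double comparison might look like (the staging and dimension tables, their key, and the stored last-load hash columns are all hypothetical names):

    -- Sketch only: a row is flagged as changed if EITHER hash differs from the
    -- value stored at the last load; both must match for the row to be skipped.
    SELECT  stg.CustomerKey
    FROM    stg.Customer    AS stg
    JOIN    dbo.DimCustomer AS dim
            ON dim.CustomerKey = stg.CustomerKey          -- unique business key
    WHERE   CHECKSUM(stg.Col_A, stg.Col_B, stg.Col_C)
                <> dim.LastLoad_Checksum
       OR   BINARY_CHECKSUM(stg.Col_A, stg.Col_B, stg.Col_C)
                <> dim.LastLoad_BinaryChecksum;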

The hash method chosen obviously depends on how much risk of missing changed rows you are willing to accept.

As a side note, with regard to SSIS, Darren Green has written a Checksum SSIS transform, and Keith Martin has implemented MHASH, a data flow transformation supporting multiple hash types. I’ve not looked at either of these in detail; I just came across them in my research.