Surrogate Key and Its Role in Data Warehouse or Data Mart

Last modified date:

Surrogate key is term mostly used in data warehouse environments. We might not come by this term in application production databases simply because there is no need to generate these keys.

We can do just fine with primary keys. It is much needed to create surrogate keys in data warehouses for many reasons, read further.

What is Surrogate Key?

Surrogate keys are primary and foreign keys generated on the data warehouse level instead of being taken out of the source system (business key). Primary (business) key on the source can be e.g. customer number, number of order or product etc.

There are many options how to generate surrogate key:

  • IDENTITY – option where the key is generated automatically. It is numeric format starting from zero. Data types can be INTEGER (TINYINT, SMALLINT, INT, BIGINT) or NUMERIC, respectively DECIMAL
  • SEQUENCE – another way of generating primary/surrogate key is by creating sequence-type object in a database. As with IDENTITY, SEQUENCE can use data type INT, NUMERIC or DECIMAL. This type opf object can provide sequential numbers on demand.
  • NONSEQUENTIAL GRID – yet another option is to generate unique identifier of nonsequential GUID type of UNIQUEIDENTIFIER data type using function NEWID()
  • SEQUENTIAL GUID – sequential GUID can be generated using T-SQL function NEWSEQUENTIALID()
  • Other solution – last possible way is to use some primary key generator or create your own logic

Why is it useful to use surrogate keys in data warehouse?

1) Surrogate keys are independent on changes in source systems

When building a data warehouse it is good to know that it is long-term and strategical matter. It can and most probably will overpass lifetime of some source system. It is also not good to create dependencies between data warehouse and primary system.

2) Surrogate keys have lower storage requirements

Regarding surogate keys, I mostly use IDENTITY AND SEQUENCE. I am not a fan of GUID surrogate keys for performance reasons:

  • UNIQUEIDENTIFIER = 16 bytes
  • INT = 4 bytes
  • BIGINT = 8 bytes

3) JOIN optimization using surrogate keys

It is crucial for data warehouse to have their architecture optimized mainly for queries. We use multiple techniques as schematic type to achieve that: (star scheme, snowflake scheme), structure of fact and dimensional tables etc. One of the influencers of data warehouse performance is also structure of referential integrity – e.g. primary and foreign keys. Surrogate keys of IDENTITY AND SEQUENCE type are extremely worth using.

4) Possibility of using SCD (Slowly changing dimensions)

If we use natural or business key from primary system as a primary key we complicate possibility of record historization application = application SCD (type 2 and higher), so-called slowly changing dimension. The principle of historization is creation of a new record in data wareouse in case when change of some attribute is made on the source system.

5) Unknown value and Surrogate key

You will probably find unknown value as “NULL” in the primary system. It however should not appear in precisely designed data warehouses and data marts. It is suitable to replace NULL values via Surrogate keys. Using surrogate keys in data warehouse comes with yet another advantage – NULL values can be further divided into:

  • Empty value being error condition (-1)
  • Empty value being allowed value (-2)

Interesting article on Surrogate keys written by Ralph Kimball can be found on Kimball Group web, here

My name is Jan Zedníček and I work as a BI Developer at Kentico Software in Brno. Mostly you can see me there working in the office, but I also work partly as a freelancer. I have been working as a freelancer for many companies for more than 5 years, but Kentico is the matter of my heart. I used to work as a financial controller at companies like Aero Vodochody or Sberbank and I also used to be a bond program manager in Unicapital Investment group. When I am not at work, I like playing volleyball, chess, doing a workout in the gym and I enjoy tasting of best quality rums. I am trying to summarize all my knowledge on this website not to forget them (because of the rum effect, you know =) and to put them forward to anyone. Don´t worry about asking for help or writing some comments.

Category: SQL Server Tags: ,

About Ing. Jan Zedníček - BI Developer, Finance controller

My name is Jan Zedníček and I work as a BI Developer at Kentico Software in Brno. Mostly you can see me there working in the office, but I also work partly as a freelancer. I have been working as a freelancer for many companies for more than 5 years, but Kentico is the matter of my heart. I used to work as a financial controller at companies like Aero Vodochody or Sberbank and I also used to be a bond program manager in Unicapital Investment group. When I am not at work, I like playing volleyball, chess, doing a workout in the gym and I enjoy tasting of best quality rums. I am trying to summarize all my knowledge on this website not to forget them (because of the rum effect, you know =) and to put them forward to anyone. Don´t worry about asking for help or writing some comments.

Leave a Reply

Your email address will not be published.

This site uses Akismet to reduce spam. Learn how your comment data is processed.