Wednesday, October 21, 2009

Surrogate vs. Natural Keys

There's no shortage of topics relating to data and data management to write about (only a shortage of time in which to do so!), and whenever a potentially interesting topic occurs to me I note it down with the intention of writing about it here. One such topic which I jotted down many months ago was the evergreen subject of Surrogate vs. Natural Keys, but I avoided it for a long time because it always arouses such passions (Yes! Database designers have feelings too!) and life is too short for arguments.

However, a recent and decent article on the subject on SQLTeam.com prompted me to finally break cover and nail my colours to the proverbial mast. What's the controversy about? Put simply, you always need a means of uniquely identifying each record in your data set, so each record needs a key.

A natural key is an identifier that has a "real-world" meaning, a functional value independent of its use in a database. So, if you have (for instance) a list of employees in a payroll database table, you could choose to use their Social Security number as a key. It's unique per person, and you have to record it anyway, so you might as well re-use it as your database table key, right?

It's certainly possible, yes, but in my opinion almost always a bad idea. Firstly, you'll need to record a (probably large/long and thus costly) digit sequence (or, worse still, string) throughout your database. The key will be used to identify the employee in pay history log tables, for instance, and department lists, and so on and so forth. More importantly though, because it's externally created and assigned, it's out of your control, so should the government ever decide to change the format or content of the number you're automatically forced to update your entire database structure and content, or risk the integrity of your data.

The answer then is to use a surrogate key. This is a purely technical key, used only by and for the database itself, probably invisible to the user and ideally automatically assigned by the system. It's unchanging, can (nearly always) be a simple integer for performance reasons and has no functional meaning in and of itself. It's merely a number, unique to the data entity, that is used throughout the database to denote a single record. It's my strong opinion that surrogate keys should be used by default for all data entities unless very good reasons exist not to do so.

What might those good reasons be? Well, one fairly obvious one is when data from a system needs to be interfaced with other systems. Perhaps you're in the fortunate position of being able to dictate standards for data management in your own company, but you may need to share data with suppliers or customers. They've all got their own systems with their own databases with their own set of keys, so you might choose to identify books by ISBN number to facilitate communication with your suppliers instead of using your own system-generated keys.

Opponents of surrogate keys also cite the abstractness of the data as a reason to choose natural keys instead. For instance, a "Countries of the World" table might use "FR" as a natural key for the "France" record, whereas a table with a surrogate key would theoretically allow two or more records to exist for "France", one with key 50 and the other with key 29. Because the numbers don't "mean" anything, it's harder to see that they refer to the same entity.

To my mind, though, this is a misuse of keys. Functional controls should be implemented as constraints or unique indexes, independent from the technical underpinnings of the database structure. The business meaning of data is obviously sacred, and every element of the design and development of the database should be aimed at preserving that meaning. This is best achieved, in this writer's humble opinion, by keeping the technical relationships of database objects strictly separate from the functional content of line-of-business data  by using surrogate keys.

Labels: , , ,