What is Surrogate keys and how can we handle during data warehouse migration?

What is surrogate key?

Surrogate key is nothing but unique identifier assigned to each row in a dimension table. Isn’t simple? Yes.

For one, this might raise few questions, because what about primary key, its also unique in nature and assigned to each row. Then,

how it differs from primary key of a table, what is the difference between these two (surrogate key and primary key) or both are same?

Will the dimension table have both primary key and surrogate key or will it have only surrogate key?

Both surrogate keys and primary keys serve the purpose of uniquely identifying each record in a database table, but they differ in their origin and purpose:

Origin:

Primary key: Can be a natural key, meaning it directly corresponds to a meaningful attribute of the entity represented by the table (e.g., customer ID, order number). Can be a composite key, formed by combining two or more attributes for unique identification.

Surrogate key: Externally generated identifier with no inherent meaning beyond identifying the record.

Purpose:

Primary key: Enforces data integrity: ensures each record is distinct and identifiable for querying and relationships with other tables. Often carries business meaning and can be used for user interaction (e.g., searching by customer ID).

Surrogate key: Primarily optimizes database performance: allows for faster indexing and joins due to its simple, efficient structure. Decouples the identifier from potential changes in the natural key, preventing issues like duplicate or invalid values.

Generation of Surrogate keys in Legacy Datawarehouse

Usually, Surrogate keys are generated many ways in legacy data warehouse. One of the widely used approach is to generate sequence number whenever the new record is being added in the table. y Many databases support this through, default column options like Oracle supports this through CREATE SEQUENCE. Below is one of the example

CREATE SEQUENCE customers_seq

START WITH 1000

INCREMENT BY 1

NOCACHE

NOCYCLE;

Teradata supports with IDENTITY attribute for a COLUMN.

Example:

CREATE TABLE customers

(

customers_seq INT GENERATED ALWAYS AS IDENTITY

(START WITH 1000

INCREMENT BY 1

MINVALUE 0

MAXVALUE 99999

NO CYCLE),

first_name VARCHAR(50) NOT NULL,

last_name VARCHAR(50) NOT NULL,

gender CHAR(1) NOT NULL

);

Apart from using RDBMS support, most of the time unique sequence numbers are generated through window functions like ROW_NUMBER(),CSUM(),RANK()..etc.

Example:

SELECT ROW_NUMBER (ORDER BY customer_id) as customers_seq from customers_table.

Common Challenges during surrogate key generation:

1. If there are duplicate record arrives for a given natural key in delta records, then row_number() generates two different sequence numbers. That will result failures when these sequence numbers are being referenced/populated into fact table. Two possible errors:

a. Mostly dimension table used as one of the join as lookup with natural key while populating fact tables, in this case, the join results in multiplication of duplicated records

b. In case if the update is being used to populate surrogate key in fact, then it will fail with “ Target row updated by multiple source rows .”

2. A prevalent challenge lies in inconsistencies arising from sequence number corrections discovered at a later stage. To illustrate, a data integrity issue was identified within a set of delta records populated three months prior, characterized by the omission of numerous records. Specifically, only 5,000 records were processed instead of the expected 10,000. Resolution necessitates the retrieval of absent records and the subsequent updating of the existing record batch to reflect these changes. This calls for a comprehensive impact analysis and the development of a distinct ETL process dedicated to remediation.

Challenges during Datawarehouse Migration

Surrogate keys, as unique identifiers for each row in a table, play a crucial role in maintaining referential integrity and consistency.

Key Strategies:

Coordinated Key Generation:

Centralized Mechanism: Employ a centralized key generation service or sequence generator accessible to both systems for consistent key assignment.
Synchronized Ranges: Allocate distinct key ranges to each system to prevent collisions and simplify reconciliation.

Mapping and Cross-Referencing:

Map Natural Keys: Establish a clear mapping between natural keys (e.g., customer IDs) and their corresponding surrogate keys in both systems.
Cross-Reference Tables: Create tables to maintain these mappings for efficient data retrieval and comparison.

Data Loading and Staging:

Load with Surrogates: If feasible, load data into staging tables with surrogate keys already assigned.
Generate During Load: If not, generate keys during the load process, adhering to coordinated allocation strategies.

Fact Table Handling:

Temporary Surrogates: Use temporary surrogate keys during parallel runs to maintain referential integrity within fact tables.
Remap After Validation: Upon successful validation, replace temporary keys with final surrogate keys from the new system.

Thorough Validation and Reconciliation:

Compare Key Mappings: Rigorously validate the consistency of key mappings and data integrity between both systems.
Resolve Discrepancies: Address any inconsistencies or errors promptly to ensure accurate data migration.

There are no issues, while populating history data to target system. Issue arises only during parallel runs, where both old and new systems operate simultaneously. Here careful management of surrogate keys is essential to ensure seamless integration and accurate data comparison. Its definetly crucial to choose right key strategies during datawarehouse migration.

Share this:

Related

Leave a comment Cancel reply