Monday, 13 June 2016

Design and Implement Slowly Changing Dimension Type 2 (SCD Type2) in SQL

Dimensions in data warehousing contain relatively static data about entities such as customers, stores, locations etc. Slowly changing dimensions commonly known as SCD, usually captures the data that changes slowly but unpredictably, rather than regular bases.  Slowly changing dimension is most popular method used in dimensional modelling to preserve historical data.
For example, lets take the example of patient details. The fact table may contains the information about patient expense details. The fact and dimensions are always linked by means of foreign keys. One of the dimension may contain the information about patient (say, patient dimension ) . The patient information will not change on day to day bases. It’ll change once in a while (slowly changes).

On In depth Implementation read here: Design and Implement Slowly Changing Dimension Type 2 (SCD Type2) in SQL 

If you want to implement the SCD type 2 in SQL (without ETL tools), it’s gonna take bit longer route but you’ll end up with best feeling in world of implementing SCD Type 2.

Okay!. Let’s get started with building SCD Type 2 on patient dimension table.

Let us consider the patient details table which has attributes like SK or PAT_ID, Phone number, name etc. Here we need to capture the updated phone number or name and store that historical data in same table.

In data  warehouse environment, usually tables are divided into three tables viz. LOAD (just  a data dump) , ITERMEDIATE (transformed data) and TARGET (replica of TARGET) tables, will be referred as LOAD, INT and TGT henceforth.  In this SCD type 2 implementation , we will be using all these three tables.

Below are the content of LOAD and TARGET table at the time building SCD Type 2.

LOAD : PAT_DTLS_LOAD -  Hold new and updated data. Consider we received data as below and we have to load this data into patient dimension which has SCD Type2 implemented on it. 

SK
PHONE_NO
NAME
123
987654321
Jhoney
234
876543210
Stuart
567
345678901
Max Stuart
456
765432101
Jeff

TARGET: PAT_DTLS_TGT- Hold actual patient data – This is actual initial load i.e. there are no patient who updated their data yet.

SK
PHONE_NO
NAME
EFF_FR_DT
EFF_TO_DT
FLAG
345
345678901
Jhoney Stuart
2016-06-01
NULL
1
234
234567890
Stuart
2016-06-01
NULL
1
123
123456789
Jhoney
2016-06-01
NULL
1
567
345678901
Max Stuart
2016-06-01
NULL
1

INT: PAT_DTLS_LOAD_INT- Table to hold intermediate results

Now we are all set with tables and data. Let’s get started with detailed steps...

Step 1: Load expired records from TGT to INT table

 Copy all expired records INT table . This step just copying all records with non-null EFF_TO_DT and flag = 0 records. Since there are no expired records already in database, hence there won’t be any records in intermediate table.

Query: Below query can be used to perform this task

INSERT INTO PAT_DTLS_LOAD_INT
SELECT * FROM PAT_DTLS_TGT
WHERE FLAG = 0 and EFF_TO_DATE is not null;

Result: In first run there will not be any expired records

SK
PHONE_NO
NAME
EFF_FR_DT
EFF_TO_DT
FLAG







Step 2: Get all records which are going to expire

These are the records that are updated  and are recieved in LOAD table. Since these records are expired, You also need set the EFF_TO_DT to the ‘current_date-1’ that means record was active till yesterday and flag as ‘0’. The updated records will be active from the moment we load them so EFF_FR_DT will be current date

Query: Below query is very helpful to perform this task

INSERT into PAT_DTLS_LOAD_INT
SELECT TGT.SK ,
TGT.PHONE_NO,
TGT.NAME,
EFF_FR_DT,
now() - 1 as EFF_TO_DT,
'0' as flag
FROM PAT_DTLS_TGT TGT
WHERE TGT.flag = 1
AND EXISTS (SELECT 1 FROM
 PAT_DTLS_LOAD SRC
WHERE TGT.sk = src.sk
and (
 TGT.PHONE_NO <> src.PHONE_NO
 or  TGT.NAME <> src.NAME )
);

Result: Results shows expired records. Updated version of these records are available in LOAD table

SK
PHONE_NO
NAME
EFF_FR_DT
EFF_TO_DT
FLAG
123
123456789
Jhoney
2016-06-01
2016-06-12
0
234
234567890
Stuart
2016-06-01
2016-06-12
0

Step 3: Copy active records from TGT to INT table

These are the records which are active and not updated. Since target table is going to be the truncate and load, You will have to copy all active records to intermediate table.  

Query: Below query can perform the mentioned task

INSERT INTO PAT_DTLS_LOAD_INT
SELECT TGT.SK ,
TGT.PHONE_NO,
TGT.NAME,
EFF_FR_DT,
EFF_TO_DT,
FLAG
FROM PAT_DTLS_TGT TGT
WHERE FLAG = 1
AND  NOT EXISTS (SELECT 1 FROM
                  PAT_DTLS_LOAD SRC
                 WHERE TGT.SK = SRC.SK )
UNION ALL --union all is used to includeany unchaged records we got inpresent cycle that are already present in target table
SELECT TGT.SK ,
TGT.PHONE_NO,
TGT.NAME,
EFF_FR_DT,
EFF_TO_DT,
FLAG
FROM PAT_DTLS_TGT TGT
WHERE TGT.FLAG = 1
AND EXISTS (SELECT 1 FROM
 PAT_DTLS_LOAD SRC
WHERE TGT.SK = SRC.SK
AND (
 TGT.PHONE_NO = SRC.PHONE_NO
 AND  TGT.NAME = SRC.NAME )
);

Results: Highlighted records are active records copied from TGT table.These are the records which not updated in this load  cycle

SK
PHONE_NO
NAME
EFF_FR_DT
EFF_TO_DT
FLAG
123
123456789
Jhoney
2016-06-01
2016-06-12
0
234
234567890
Stuart
2016-06-01
2016-06-12
0
345
345678901
Jhoney Stuart
2016-06-01
NULL
1
567
345678901
Max Stuart
2016-06-01
NULL
1

Step 4: Copy only updated records from LOAD table

These are records which are updated in this load cycle. Since updated records are already in the LOAD table, you have  to compare the LOAD and TGT records on Patient ID or SK column and copy the record which is updated. You should also keep in mind to set EFF_FROM_DT and flag to indicate active record. The updated records will be active from the moment we load them so EFF_FR_DT will be current date and EFF_TO_DT will be either NULL or future date.

Query: Below query can be used to perform task

SELECT SRC.SK ,
SRC.PHONE_NO,
SRC.NAME,
now(),
null,
1
FROM PAT_DTLS_LOAD src
WHERE EXISTS (SELECT 1
                       FROM   PAT_DTLS_LOAD_INT INT1
                       WHERE  src.sk = INT1.sk
                       AND    flag = 0
                       AND NOT EXISTS
                                                                                   (SELECT 1
                       FROM   PAT_DTLS_LOAD_INT INT2
                       WHERE  INT1.sk = INT2.sk
                       AND   flag = 1 )
                  );

Results: Highlighted records are updated records. You can also see the expired records in below intermediate results

SK
PHONE_NO
NAME
EFF_FR_DT
EFF_TO_DT
FLAG
123
123456789
Jhoney
2016-06-01
2016-06-12
0
123
987654321
Jhoney
2016-06-13
NULL
1
234
234567890
Stuart
2016-06-01
2016-06-12
0
234
876543210
Stuart
2016-06-13
NULL
1
345
345678901
Jhoney Stuart
2016-06-01
NULL
1
567
345678901
Max Stuart
2016-06-01
NULL
1

Step 5: Copy fresh records from LOAD to INT

At  any given load cycle , you may receive either updated or fresh records. In this step you are going to load the fresh records. This is the simple step to compare the LOAD and TGT table on patient ID or SK and copy the record which is new.

Query:Below query is useful to perform mentioned task

INSERT INTO PAT_DTLS_LOAD_INT
SELECT SRC.SK ,
SRC.PHONE_NO,
SRC.NAME,
NOW(),
NULL,
1
FROM PAT_DTLS_LOAD SRC
WHERE NOT EXISTS     (SELECT 1
                       FROM   PAT_DTLS_LOAD_INT INT
                       WHERE  SRC.SK = INT.SK);

Results: The fresh record is highlighted.

SK
PHONE_NO
NAME
EFF_FR_DT
EFF_TO_DT
FLAG
123
123456789
Jhoney
2016-06-01
2016-06-12
0
123
987654321
Jhoney
2016-06-13
NULL
1
234
234567890
Stuart
2016-06-01
2016-06-12
0
234
876543210
Stuart
2016-06-13
NULL
1
345
345678901
Jhoney Stuart
2016-06-01
NULL
1
456
765432101
Jeff
2016-06-13
NULL
1
567
345678901
Max Stuart
2016-06-01
NULL
1

Step 6: Truncate and load TGT table.

Finally, you just need to truncate the TGT and copy records from INT table.

Query: Below query can be used for this task

TRUNCATE TABLE PAT_DTLS_TGT;
INSERT INTO PAT_DTLS_TGT
SELECT * FROM PAT_DTLS_LOAD_INT;

Results: Since TGT is replica of INT table. It hold all records which are available in INT table

SK
PHONE_NO
NAME
EFF_FR_DT
EFF_TO_DT
FLAG
123
123456789
Jhoney
2016-06-01
2016-06-12
0
123
987654321
Jhoney
2016-06-13
NULL
1
234
234567890
Stuart
2016-06-01
2016-06-12
0
234
876543210
Stuart
2016-06-13
NULL
1
345
345678901
Jhoney Stuart
2016-06-01
NULL
1
456
765432101
Jeff
2016-06-13
NULL
1
567
345678901
Max Stuart
2016-06-01
NULL
1

Hope you have enjoyed reading this blog. Please drop comment for feedback and any queries.


Related Articles:



No comments:

Post a Comment