Sunday, 19 June 2016

Skew in Netezza tables

This is the word you will hear a lot about if you are developing data warehouse on Netezza , Teradata , hive or Impala database.

The performance of the system is directly linked to uniform distribution of the user data  across all of the data slices in the system. When you create a table and then load the data into the system, the rows of the table should be distributed uniformly among all the data slices. If some data slices have more rows of a table than others this scenarios is called skew.

Read More about Netezza Skew : Netezza Skew and how to avoid it

Skew can happen while distributing or loading the data into the following types of tables:

 Base tables — Database administrators define the schema and create tables.
 Intra-session tables — Applications or SQL users create temp tables.

On Netezza server,  data in the table should be evenly distributed among all the data slices. If the data slices with more row and its associated SPU and FPGA will have to work hard, longer and need more resource  to process the data. These data slices and SPUs become the performance bottleneck for queries that are being processed on the NPS.  This type of the skew is caused by the bad distribution and is called table or explicit skew.

Below is how skewed table looks like when checked in Administrator tool

Skews in Netezza tables

Fig 1. Skewed table

Specifying Distribution Keys

Netezza uses the table’s distribution key to determine how to distribute (or stripe) the  table’s data across all active data slices in the system. The Netezza system requires that all tables have a distribution method, either hash or random.

Read: Importance of right Distribution key in Netezza 

When you use the commands CREATE TABLE or CREATE TABLE AS, you can either specify the method or allow the Netezza to select one.
With the DISTRIBUTE ON (hash) command, you can specify up to four columns as the distribution key.
 If there is no obvious group of columns that can be combined as the distribution key, you can specify random distribution. Random distribution means that the Netezza distributes  the data evenly (in a  round-robin format) across all the data slices.

Random distribution results in the following:

·         Avoiding skew when loading data.
·         Eliminating the need to pick a distribution key when loading a large database that has many tables with a small number of rows. In such cases picking a good distribution key may have little performance  benefit, but it gains the advantage of equal distribution of data.
·         Allowing you to verify a good distribution key by first loading the data round-robin, then using the GENERATE STATISTICS command, and running selects on the database columns to get the min/max and counts. With this information, you can better choose which columns to use for the distribution key.
·         If you do not specify a distribution when you create a table, the system chooses a distribution key and there is no guarantee what that choice will be. If a particular key is used largely in Equi-join clauses, then that key is a good choice for the distribution key.

A good distribution key should always distributes data uniformly. Good distributed table looks like below when checked in Admin tool

Netezza perfect distibuted table

Fig 2. Perfect distribution

 Implicit skew in Netezza

Implicit skew is occurs when data get redistributed or broadcasted on some other column to perform join operations. Data will be redistributed or broadcasted to perform co-located joins. The column on which data get redistributed or joined could be skewed that is, most of the redistributed data get inserted to in few or single disk (SPU) on the system. These types of skews can only occur in run-time when we join the two large tables using SQL. 



Let’s understand the actual situation with the help of an example. In the following tables PAT ID refers to patient ID and DIVN refers to hospital division. In the below example DIVN number is same for all the patients indicating they belong to same DIVN in hospital (e.g. cardiac, urology etc)
PAT_ID
DIVN_ID
100
2001
101
2001
102
2001
103
2001
104
2001

Below is the DIVN table:

DIVN_ID
DIVN_NAME
2001
Cardiac
2002
Urology

If you write query that that take Patients belongs to particular division just like below

SELECT PAT_ID
FROM PATIENT PAT
JOIN DIVISION DIV
ON (PAT.DIVN_ID = DIV.DIVN_ID)


In order to join above two tables, data should have to be co-located. Data in one of the table will have to be redistributed or broadcasted to make co-located tables. Say DIVN table get re-distributed on DIVN_ID. This makes the Netezza to place all data belong to the DIVN_ID 2001 to single disk causing skew.  These types of skew are called implicit or intermediate skew. You cannot easily identify the implicit skew without proper knowledge of data and table structures.

Related Articles:


No comments:

Post a Comment