This is the
word you will hear a lot about if you are developing data warehouse on Netezza ,
Teradata , hive or Impala database.
The performance of the system is directly linked to
uniform distribution of the user data across
all of the data slices in the system. When you create a table and then load the
data into the system, the rows of the table should be distributed uniformly
among all the data slices. If some data slices have more rows of a table than
others this scenarios is called skew.
Read More about Netezza Skew : Netezza Skew and how to avoid it
Read More about Netezza Skew : Netezza Skew and how to avoid it
Skew can happen while distributing or loading the
data into the following types of tables:
Base tables — Database administrators define the
schema and create tables.
Intra-session tables —
Applications or SQL users create temp tables.
On Netezza server, data in the table should be evenly
distributed among all the data slices. If the data slices with more row and its
associated SPU and FPGA will have to work hard, longer and need more resource to process the data. These data slices and
SPUs become the performance bottleneck for queries that are being processed on
the NPS. This type of the skew is caused
by the bad distribution and is called table or explicit skew.
Below is how skewed table looks
like when checked in Administrator tool
Fig 1. Skewed table
Specifying Distribution Keys
Netezza uses the table’s distribution key to
determine how to distribute (or stripe) the
table’s data across all active data slices in the system. The Netezza
system requires that all tables have a distribution method, either hash or
random.
Read: Importance of right Distribution key in Netezza
Read: Importance of right Distribution key in Netezza
When you use the commands CREATE TABLE or CREATE
TABLE AS, you can either specify the method or allow the Netezza to select one.
With the DISTRIBUTE ON (hash) command, you can
specify up to four columns as the distribution key.
If there is no
obvious group of columns that can be combined as the distribution key, you can
specify random distribution. Random distribution means that the Netezza
distributes the data evenly (in a round-robin format) across all the data
slices.
Random distribution results in the following:
·
Avoiding skew when loading data.
·
Eliminating the need to pick a distribution key
when loading a large database that has many tables with a small number of rows.
In such cases picking a good distribution key may have little performance benefit, but it gains the advantage of equal distribution
of data.
·
Allowing you to verify a good distribution key
by first loading the data round-robin, then using the GENERATE STATISTICS
command, and running selects on the database columns to get the min/max and
counts. With this information, you can better choose which columns to use for
the distribution key.
·
If you do not specify a distribution when you
create a table, the system chooses a distribution key and there is no guarantee
what that choice will be. If a particular key is used largely in Equi-join
clauses, then that key is a good choice for the distribution key.
A good distribution key should
always distributes data uniformly. Good distributed table looks like below when
checked in Admin tool
Fig 2. Perfect distribution
Implicit skew in Netezza
Implicit skew is
occurs when data get redistributed or broadcasted on some other column to
perform join operations. Data will be redistributed or broadcasted to perform
co-located joins. The column on which data get redistributed or joined could be
skewed that is, most of the redistributed data get inserted to in few or
single disk (SPU) on the system. These types of skews can only occur in
run-time when we join the two large tables using SQL.
Let’s understand
the actual situation with the help of an example. In the following tables PAT
ID refers to patient ID and DIVN refers to hospital division. In the below
example DIVN number is same for all the patients indicating they belong to same
DIVN in hospital (e.g. cardiac, urology etc)
PAT_ID
|
DIVN_ID
|
100
|
2001
|
101
|
2001
|
102
|
2001
|
103
|
2001
|
104
|
2001
|
Below is the DIVN table:
DIVN_ID
|
DIVN_NAME
|
2001
|
Cardiac
|
2002
|
Urology
|
If you write query that that take Patients belongs to particular division just like below
SELECT PAT_ID
FROM PATIENT PAT
JOIN DIVISION DIV
ON (PAT.DIVN_ID =
DIV.DIVN_ID)
In
order to join above two tables, data should have to be co-located. Data in one
of the table will have to be redistributed or broadcasted to make co-located
tables. Say DIVN table get re-distributed on DIVN_ID. This makes the Netezza to
place all data belong to the DIVN_ID 2001 to single disk causing skew.
These types of skew are called implicit or intermediate skew. You cannot
easily identify the implicit skew without proper knowledge of data and table
structures.
Related Articles:
No comments:
Post a Comment