This post is all about how data
is distributed on a Netezza server. Feel free to make comments or suggestions to
improve it, or pass it on if you like.
Read: Importance of right Distribution key in Netezza
Let’s first understand how NPS stores the data
on disk drives. Each snippet processor in the Snippet Processing Unit (SPU) has
its own CPU, FPGA, and RAM, plus a dedicated hard drive, and the data stored on
that drive is called a data slice.
Each snippet processing unit
(SPU) disk is divided into three partitions.
Primary Partition: This partition contains the SPU's own set of data, its primary data slice.
Mirror Partition: This partition contains another SPU's primary
partition data; it automatically mirrors another SPU's primary data slice.
This process is called mirroring. In a multi-rack configuration, an SPU's
mirror is located in another rack, which is what gives Netezza its fault
tolerance: if a rack fails, the data is already mirrored in the other rack and
can be recovered. This process is also called replication.
Swap Partition: The swap partition is used for aggregating, sorting, and
other query operations. It is not mirrored; in other words, it holds temporary
or intermediate processing data, and Netezza uses this temp partition to
perform SQL operations.
Fig.1. Data distribution and mirroring
Tables are split across multiple SPUs and data slices; the data is stored
row-wise in groups in the same or nearby data slices, while it is compressed
based on identical column values, i.e. columnar compression.
Now the question is: how is the data distributed?
The actual distribution of data across disks is
determined by the distribution key(s) provided in the table definition.
Netezza offers two distribution methods: hash and random. You should specify
the distribution column(s); if the DISTRIBUTE ON clause is not specified, the
system uses the first column as the default distribution column with the hash
algorithm. The maximum number of columns you can specify in the DISTRIBUTE ON
clause is four. When the system creates records in the Netezza database, it
assigns them to logical slices based on their distribution key(s), and the
records are then pushed to the physical storage identified by those logical
slices.
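For example, here is a minimal DDL sketch of both methods (the table and column names are invented for illustration):

-- Hash distribution: rows are assigned to data slices by hashing the key
CREATE TABLE web_events
(
    session_id INTEGER,
    event_type VARCHAR(50),
    event_ts   TIMESTAMP
)
DISTRIBUTE ON (session_id);

-- Random (round-robin) distribution: no key, rows are spread evenly
CREATE TABLE audit_log
(
    log_id  BIGINT,
    log_msg VARCHAR(500)
)
DISTRIBUTE ON RANDOM;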
Fig.2. IBM Netezza storage design
Data should be distributed uniformly
across all the data slices; processing performance is directly dependent on
it. How evenly the data spreads over all data slices is directly related
to the key column(s) used in the DISTRIBUTE ON clause. Bad distribution key(s)
can cause Netezza to place data on the same slices, and that will cause data
skew, a major performance bottleneck.
For example, consider a table
distributed on a column that contains only flags such as Y or N. When data is
inserted into the table, the hashed values point to only two data slices, and
all the data will be placed on those slices.
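As a sketch of that mistake, and one way to spot it (the table is invented; DATASLICEID is Netezza's built-in pseudo-column that reports which data slice a row lives on):

-- Bad choice: a two-value flag hashes to only two data slices
CREATE TABLE orders_bad
(
    order_id     BIGINT,
    shipped_flag CHAR(1)   -- only 'Y' or 'N'
)
DISTRIBUTE ON (shipped_flag);

-- Count rows per data slice; a few heavy slices means skew
SELECT datasliceid, COUNT(*) AS row_count
FROM orders_bad
GROUP BY datasliceid
ORDER BY row_count DESC;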
A distribution method that distributes
data evenly across all data slices is the single most important factor for
overall system performance. Bad distribution key(s) can result in uneven
distribution of a table across data slices and SPUs; the resulting skew forces
data to be redistributed or broadcast at query time, which of course hampers
system performance. It is very important to identify the correct and proper
distribution key when creating the table definition, and that requires
extremely good knowledge of the data and the system.
Keep the table structure simple;
that helps to attain maximum co-location, so that Netezza can use its MPP power
to give you results as fast as possible.
Distribution on multiple keys
The maximum number of columns
that can be specified in the DISTRIBUTE ON clause is four. Keep in mind that
you should not use multiple keys to distribute data if a single chosen key
already provides good data distribution. Distribution keys should also be used
in join conditions in order to achieve co-location, and if you use multiple
distribution keys, then all of those columns must appear in the join
conditions. That is a very rare case, so it is better to stick to a single
distribution key that provides good data distribution. The distribution is
not an index; it is based on a hash algorithm. In multi-key distributions, the
join does not look at one column and then the next; it looks at all the key
columns at once, because they are hashed together for the distribution (see
the sketch below).
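As a sketch of that rare case (the tables are invented): both tables are distributed on the same two keys, and the join uses both of them, so the join is co-located.

-- Both distribution keys are hashed together as one value
CREATE TABLE sales
(
    store_id INTEGER,
    item_id  INTEGER,
    amount   NUMERIC(12,2)
)
DISTRIBUTE ON (store_id, item_id);

CREATE TABLE returns
(
    store_id    INTEGER,
    item_id     INTEGER,
    return_date DATE
)
DISTRIBUTE ON (store_id, item_id);

-- Co-located join: every distribution key column appears in the join
SELECT s.store_id, s.item_id, r.return_date
FROM sales s
JOIN returns r
  ON s.store_id = r.store_id
 AND s.item_id  = r.item_id;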
Leaving out ‘Distribution on’ clause
If a distribution key is not
specified, the system will by default distribute the data on the first column
(just like Teradata) using the hash method. This can sometimes cause skew if
the first column is something like a flag. The best approach is to distribute
the data on a well-chosen column, or on random if you are not able to identify
the best column, as sketched below.
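To make that concrete (the table is invented): with no DISTRIBUTE ON clause, Netezza hashes on the first column, which here is a low-cardinality flag and would skew badly.

-- No DISTRIBUTE ON clause: the system hashes on status, the first column
CREATE TABLE tickets
(
    status    CHAR(1),   -- few distinct values: a poor implicit key
    ticket_id BIGINT,
    opened_on DATE
);

-- Safer when no good key exists: ask for random distribution explicitly
CREATE TABLE tickets_random
(
    status    CHAR(1),
    ticket_id BIGINT,
    opened_on DATE
)
DISTRIBUTE ON RANDOM;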
What are the factors when choosing a distribution key?
When
choosing which column(s) should be the distribution key for a table, your goal
should be uniform distribution of the rows and optimal access to the data. The
main goal of distribution is to get related data stored co-located.
Consider
the following factors when choosing distribution keys:
· The more distinct the distribution key values, the better.
· The system distributes rows with the same distribution key value to the same data slice.
· Parallel processing is more efficient when you have distributed table rows evenly across the data slices.
· Tables used together should use the same columns for their distribution key. For example, in an order system application, use the Customer ID as the distribution key for both the customer table and the order table (see the sketch after this list).
· If a particular key is used largely in equi-join clauses, then that key is a good choice for the distribution key.
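To illustrate the customer/order point above, a minimal sketch (names are invented): both tables share the distribution key, so matching rows land on the same data slice and the join is co-located.

CREATE TABLE customer
(
    customer_id   INTEGER,
    customer_name VARCHAR(100)
)
DISTRIBUTE ON (customer_id);

CREATE TABLE orders
(
    order_id    BIGINT,
    customer_id INTEGER,
    order_date  DATE
)
DISTRIBUTE ON (customer_id);

-- This equi-join on the shared key runs without redistribution
SELECT c.customer_id, c.customer_name, o.order_id
FROM customer c
JOIN orders o ON c.customer_id = o.customer_id;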