Monday, 20 June 2016

Distribution key in Netezza. Importance of right Distribution key

This post is all about how data is distributed in Netezza server. Feel free to make comments or suggestions to improve it, or pass it on if you like.

Read: Importance of right Distribution key in Netezza 

Let’s first understand how NPS stores the data on disk drives. Each Snippet Processor in the Snippet Processing Unit (SPU) has a dedicated hard drive has its separate CPU, FPGA, separate RAM memory, hard disks  and the data on stored on drive is called a data slice.
Each snippet processing unit (SPU) disk is divided into three partitions.

Primary Partition: This partition contains its own set of data.

Mirror Partition: This partition contains another SPU’s primary partition data. The mirror partition automatically mirrors another SPU's primary data slice. This process is called mirroring. In multi-rack configuration, a SPU mirror is located in another rack. This enables the Netezza to handle the fault tolerance. In case of racks fails, data is mirrored in other rack and can be recovered. This process is also called replication.

Swap Partition: Swap partition is used for aggregating, sorting and other query operations. Swap partition is not mirrored in other word it is called temp or intermediate processing data. Netezza makes use of this temp partition to perform the SQL operations.

How data is distributed in Netezza

Fig.1. Data distribution and mirroring


Tables are split across multiple SPUs, data slices and the data is stored in groups in same or nearby data slices according to rows, while data is compressed according to identical column values i.e. columnar compression.

Now the question is, how the data is distributed?

The actual distribution of data across disks is determined by the distribution key(s) provided in table definition.
In fact, in Netezza there are two types of distribution methods available, a Hash and Random distribution. There should be a distribution column(s) specified, if the DISTRIBUTE ON clause is not specified, the system uses first column as the default Distribution column using the hash algorithm. The maximum number of columns that can specify in distribution on clause is four. When the  system creates record in the Netezza data base it assign those records to logical slices based on their distribution key(s) and then pushed to physical stored identified in logical slices.


 Fig.2. IBM Netezza storage design

Data should uniformly distribute across all the data slices, the processing performance is directly dependent on it. The even distribution of the data over all data slices is directly related to key column(s) used in distribution on clause. Bad distribution key(s) can cause the Netezza to place data on same slices and that will cause the data skew, a major performance bottleneck.

For example, consider table is distributed on the column that contains only flags such as Y or N. When data is inserted into table, hashed value will be pointing to only two data slices and data will be place on those slices.

A distribution method that distributes data evenly across all data slices is the single most important factor that can increase overall system performance. Bad distribution key(s) can result in uneven distribution of a table across data slices and SPUs that will cause skew, causing data to be redistributed or broadcasted, of course that will hamper the system performance. It is very important to identify the correct and proper distribution key when creating table definition and that require the extremely good knowledge on data and system.
Keep the table structure simple that helps to attain maximum co-locations so that Netezza can use its MPP power to give you result as fast as possible.

Distribution on multiple keys

The maximum number of columns that can specify in distribution on clause is four. Keep in mind that you should not use the multiple keys to distribute data if the chosen key provides good data distribution. However, distribution keys should be used in join condition in order to achieve co-location. If we use multiple distribution keys, then those columns must be used in joining conditions. That is very rare case so it is better to stick to single distribution key that provides good data distribution.  The distribution is not an index - it is based on hash algorithm. In multi-key distributions, the join does not look at one column then the next; it looks at all three combinations at once because they are hashed together for the distribution.

Leaving out ‘Distribution on’ clause

If distribution key is not specified, system will, by default distribute the data on first column (just like Teradata) using hash partition method. This process sometime can cause the skew if the first column is something like flag.  The best method is to distribute data on column or on random if you are not able identify the best column.

What are factors when choosing distribution key?

When choosing which columns should be the distribution key for a table, your goal should be uniform distribution of the rows and optimal access to the data. The main goal of the distribution is to get data stored co-located.

Consider the following factors when choosing distribution keys:

·         The more distinct the distribution key values, the better.
·         The system distributes rows with the same distribution key value to the same data slice.
·         Parallel processing is more efficient when you have distributed table rows evenly across the data slices.
·         Tables used together should use the same columns for their distribution key. For example, in an order system application, use the Customer ID as the distribution key for both the customer table and the order table.
·         If a particular key is used largely in Equi-join clauses, then that key is a good choice for the distribution key.

Related Articles:

No comments:

Post a Comment