Thursday, 9 June 2016

Netezza Interview Questions and Answers - Part-1

1.       How execution plan is used to increase performance in Netezza?

Netezza uses a cost-based optimizer to determine the best method for scan and join operations, join order, and data movement between SPUs that is, redistribute or broadcast operations. The system may redistribute data for joins, grouping aggregates, create tables, or when loading. Decisions about redistribution are made by the planner and are based on costs like expected table sizes.

Read More Details on Netezza here:

The optimizer uses statistics to determine the optimal execution plan for queries. The statistics include the following:

·         The number of rows in the table
·         The number of unique or distinct values of each column
·         The number of NULLs in each column
·         The minimum and maximum of each column
To obtain an EXPLAIN plan, add the EXPLAIN VERBOSE keyword before the SQL statement.

2.       How do you Analyse Query Performance in Netezza?

To evaluate query performance, use the NZAdmin tool to determine what is running on your system.

·         View active queries.
·         Check if any queries are queued.
·         Check if there are any long-running queries.
·         Use the EXPLAIN command to display the execution plan for specific queries.
·         Analyse the query plan.
·         Review the estimated costs. Do they seem reasonable?
·        Is the system performing table broadcasts or distributes? If there is a broadcast, is it on a small table or small result set? If there is distribute, validate the distribution for the tables.
·         Review the scan and join order. Is the largest table scanned last?
·         Make sure that the optimizer has current statistics. If not, generate statistics.
·     Evaluate table distributions. Is there a more optimal distribution strategy? If so, change the distribution method.
·         Run the query again after you have updated the statistics and changed the distribution.

 Use EXPLAINS VERBOSE or the NZAdmin tool to review any changes in the query plan.

3.    What is distribution key? How to choose proper distribution key(s)?

Distribution key is the Netezza table column on which the data is distributed in SPU’s.  When choosing which columns should be the distribution key for a table, your goal should be uniform distribution of the rows and optimal access to the data.
Consider the following factors when choosing distribution keys:

·      The more distinct the distribution key values, the better.
·      The system distributes rows with the same distribution key value to the same data slice.
·   Parallel processing is more efficient when you have distributed table rows evenly across the data slices.
·     Tables used together should use the same columns for their distribution key. For example, in an order system application, use the Customer ID as the distribution key for both the customer table and the order table.
·   If a particular key is used largely in Equi-join clauses, then that key is a good choice for the distribution key.

4.    What is implicit skew in Netezza?

Implicit skew is occurs when data get redistributed or broadcasted on some other column to perform join operations. Data will be redistributed or broadcasted to perform co-located joins. The column on which data get redistributed or joined could be skewed that is, most of the redistributed data get inserted to in few or single disk (SPU) on the system. These types of skews can only occur during run-time when we join the two large tables using SQL. 

Check out Skew in Netezza tables for more information.

Let’s understand the actual situation with the help of an example. In the following tables PAT ID refers to patient ID and DIVN refers to hospital division. In the below example DIVN number is same for all the patients indicating they belong to same DIVN in hospital (e.g. cardiac, urology etc)
PAT_ID
DIVN_ID
100
2001
101
2001
102
2001
103
2001
104
2001

Below is the DIVN table:

DIVN_ID
DIVN_NAME
2001
Cardiac
2002
Urology

If you write query that that take Patients belongs to particular division just like below

SELECT PAT_ID
FROM PATIENT PAT
JOIN DIVISION DIV
ON (PAT.DIVN_ID = DIV.DIVN_ID)

In order to join above two tables, data should have to be co-located. Data in one of the table will have to be redistributed or broadcasted to make co-located tables. Say DIVN table get re-distributed on DIVN_ID. This makes the Netezza to place all data belong to the DIVN_ID 2001 to single disk causing skew.  These types of skew are called implicit or intermediate skew. You cannot easily identify the implicit skew without proper knowledge of data and table structures.
5.    What are joins in Netezza?
Equi-Join
In this form of join, there will be a equality comparison and each term is a column reference instead of expressions
SELECT ...
FROM tableA A
   JOIN tableB B
   ON A.col1 = B.col1

Base algorithms: Netezza resolves the equi-joins using Hash join, where the hashed values to joining columns compared. In certain case, Netezza may use the Merge sort join if data is already in sorted order. That make usually make expensive merge join to inexpensive. Netezza may use the materialized views for these kind of merge sort joins.

Expression Joins

Expression join is the join where comparison condition anything but equality operator. Since each records in one table is compared with each column in other table, these type of joins are very expensive in database and one should avoid using as much as possible. Expression join can never be a co-located join in Netezza
SELECT ...
FROM tableA a
   JOIN tableB b
   ON a.col1 < b.col1;

Base algorithms: Netezza handles these kind of joins with a Nested Loop approach. Each row in tableA is evaluated individually against each row in tableB. The Nested Loop approach is very expensive.

Cross Joins

Cross joins are join where the join or comparison criteria doesn’t include any of columns from at least one of the table.
SELECT ...
FROM tableA a
   JOIN tableB b
   ON a.col1 = 0

Base algorithms: Netezza will take cartesion product approach to resolving these, which can be quite expensive, and also cannot be a co-located join as every row in every data slices must be joined to every row in every other data slice.

Implict IN/EXISTS Joins

These types of joins are not join in SQL, but may be processed as joins by the database server.
SELECT ...
FROM tableA a
WHERE col1 IN
   (
      SELECT col1
      FROM tableB
   )

Base algorithms: Netezza will resolve these statements with an Exists Join approach, which is same as the Hash Join approach, but since no joined data is required in the result set, it only has to tally existence in the process.

Consolidating all joins, typically Netezza prefers in below order:

A) Hash join in memory
B) Hash join on disk
C) Sort Merge join
D) Nested loops
E) Cross join

6.    What is equivalent of replication of a table in Netezza?

In data warehouse environment, there may be situation that requires us to join small dimension table with large fact tables. When we perform the join between tables, Netezza always tried to make those two tables as co-located so that it can use MPP power to process the data quickly. If the table is small, Netezza reads dimension table from all SPU and assembles in host. Then table is broadcasted to all SPU’s, resulting a copy of dimension table on each SPU thus making tables co-located.

In the Netezza database any table count beyond parameter factrel_size_threshold value is considered as a fact table. That is, if this parameter is set to 1 million, any table count more than 1 million rows is considered as fact table by  Netezza and will not result in this broadcast.
Show or set fact table size in Netezza is given below. You can run below command from any client tool or from NZSQL prompt.

Show factrel_size_threshold
Set factrel_size_threshold = 10000000

7.    What are environmental variables required to connect to  Netezza? And what are states of Netezza?

  Environment variables: NZ_HOST, NZ_DATABASE, NZ_USER and NZ_PASSWORD

-Online: Normal or usual state.
-Stopped: Netezza will shut down after completing current queries, no new queries allowed.
-Offline: Waits for completion of current queries, new or queries in queue receive error.
-Paused: Same as above, but no error displayed. Typically caused during Netezza boot up or start-up.
-Down: Just plain down, could be due to Netezza server problem or user initiated.

8.    How data is stored in Netezza?

In Netezza data is stored based on a field(s) which are select as distribution key(s).

Typical data storage steps involved in Netezza are

Data (A) à Hash Function (B)à Logical SPU identifier list (C) à Physical SPU list (D) à Storage (E)

When data arrives to netezza server, it is hashed based on field(s) given in distribution key(s) and a hash function (B) is used for this purpose. For example, in 32 node system, logical SPU identifier list has 32 unique entries. If there are 500 hashed data items from (B), there are 500 entries in (C), all having only 32 SPU entries. For instance, (C) has values [2,20,30,25,1,4,12...]. This way, 500 data entries are mapped. (D) has physical IP address of both primary and failover SPU. This is the way netezza handles to data storage.




1 comment: