1.
How execution plan is used to increase performance in Netezza?
Netezza uses a cost-based optimizer to determine the best method
for scan and join operations, join order, and data movement between SPUs that
is, redistribute or broadcast operations. The system may redistribute data for
joins, grouping aggregates, create tables, or when loading. Decisions about
redistribution are made by the planner and are based on costs like expected
table sizes.
Read More Details on Netezza here:
Read More Details on Netezza here:
- Netezza Data Loading Guide
- FPGA-Accelerated Streaming Technology (FAST) Framework in Netezza
- Netezza Architecture
- Slowly Changing Dimensions (SCD)
- Netezza Skew and how to avoid it
- Importance of right Distribution key in Netezza
- Design and Implement Slowly Changing Dimension Type 2 (SCD Type2) in SQL
- Netezza Zone Maps
The optimizer uses statistics to determine the optimal execution
plan for queries. The statistics include the following:
·
The number of rows in
the table
·
The number of unique or
distinct values of each column
·
The number of NULLs in
each column
·
The minimum and maximum
of each column
To
obtain an EXPLAIN plan, add the EXPLAIN VERBOSE keyword before the SQL
statement.
2.
How do you Analyse Query Performance in Netezza?
To evaluate query performance, use the NZAdmin tool to determine
what is running on your system.
·
View active queries.
·
Check if any queries are
queued.
·
Check if there are any
long-running queries.
·
Use the EXPLAIN command
to display the execution plan for specific queries.
·
Analyse the query plan.
·
Review the estimated
costs. Do they seem reasonable?
· Is the system performing
table broadcasts or distributes? If there is a broadcast, is it on a small
table or small result set? If there is distribute, validate the distribution for
the tables.
·
Review the scan and join
order. Is the largest table scanned last?
·
Make sure that the
optimizer has current statistics. If not, generate statistics.
· Evaluate table
distributions. Is there a more optimal distribution strategy? If so, change the
distribution method.
·
Run the query again
after you have updated the statistics and changed the distribution.
Use EXPLAINS VERBOSE or the NZAdmin tool to
review any changes in the query plan.
3. What is distribution
key? How to choose proper distribution key(s)?
Distribution key is the Netezza table column on which the data
is distributed in SPU’s. When choosing
which columns should be the distribution key for a table, your goal should be
uniform distribution of the rows and optimal access to the data.
Consider the following factors when choosing distribution keys:
· The more distinct the
distribution key values, the better.
· The system distributes
rows with the same distribution key value to the same data slice.
· Parallel processing is
more efficient when you have distributed table rows evenly across the data
slices.
· Tables used together
should use the same columns for their distribution key. For example, in an
order system application, use the Customer ID as the distribution key for both the
customer table and the order table.
· If a particular key is
used largely in Equi-join clauses, then that key is a good choice for the distribution
key.
4.
What is implicit skew in Netezza?
Implicit
skew is occurs when data get redistributed or broadcasted on some other column
to perform join operations. Data will be redistributed or broadcasted to
perform co-located joins. The column on which data get redistributed or joined
could be skewed that is, most of the redistributed data get inserted to in
few or single disk (SPU) on the system. These types of skews can only
occur during run-time when we join the two large tables using SQL.
Check out Skew in Netezza tables for more information.
Check out Skew in Netezza tables for more information.
Let’s
understand the actual situation with the help of an example. In the following
tables PAT ID refers to patient ID and DIVN refers to hospital division. In the
below example DIVN number is same for all the patients indicating they belong
to same DIVN in hospital (e.g. cardiac, urology etc)
PAT_ID
|
DIVN_ID
|
100
|
2001
|
101
|
2001
|
102
|
2001
|
103
|
2001
|
104
|
2001
|
Below is the DIVN table:
DIVN_ID
|
DIVN_NAME
|
2001
|
Cardiac
|
2002
|
Urology
|
If you write query that that take Patients belongs to particular division just like below
SELECT PAT_ID
FROM PATIENT PAT
JOIN DIVISION DIV
ON (PAT.DIVN_ID = DIV.DIVN_ID)
In
order to join above two tables, data should have to be co-located. Data in one
of the table will have to be redistributed or broadcasted to make co-located
tables. Say DIVN table get re-distributed on DIVN_ID. This makes the Netezza to
place all data belong to the DIVN_ID 2001 to single disk causing skew. These types of skew are called implicit or
intermediate skew. You cannot easily identify the implicit skew without proper
knowledge of data and table structures.
5. What are joins in
Netezza?
Equi-Join
In this form of join, there will
be a equality comparison and each term is a column reference instead of
expressions
SELECT
...
FROM
tableA A
JOIN tableB B
ON A.col1 = B.col1
Base algorithms: Netezza resolves the
equi-joins using Hash join, where the hashed values to joining columns compared.
In certain case, Netezza may use the Merge sort join if data is already in
sorted order. That make usually make expensive merge join to inexpensive. Netezza
may use the materialized views for these kind of merge sort joins.
Expression Joins
Expression join is the join where comparison
condition anything but equality operator. Since each records in one table is compared
with each column in other table, these type of joins are very expensive in
database and one should avoid using as much as possible. Expression join can
never be a co-located join in Netezza
SELECT
...
FROM
tableA a
JOIN tableB b
ON a.col1 < b.col1;
Base algorithms: Netezza
handles these kind of joins with a Nested Loop approach. Each row in
tableA is evaluated individually against each row in tableB. The Nested Loop
approach is very expensive.
Cross Joins
Cross joins are join where the join or comparison
criteria doesn’t include any of columns from at least one of the table.
SELECT
...
FROM
tableA a
JOIN tableB b
ON a.col1 = 0
Base algorithms: Netezza
will take cartesion product approach to resolving these, which can be quite
expensive, and also cannot be a co-located join as every row in every data
slices must be joined to every row in every other data slice.
Implict IN/EXISTS Joins
These types of joins are not join in SQL, but may
be processed as joins by the database server.
SELECT
...
FROM
tableA a
WHERE
col1 IN
(
SELECT col1
FROM tableB
)
Base algorithms: Netezza
will resolve these statements with an Exists Join approach, which is same
as the Hash Join approach, but since no joined data is required in the result
set, it only has to tally existence in the process.
Consolidating
all joins, typically Netezza prefers in below order:
A)
Hash join in memory
B)
Hash join on disk
C) Sort
Merge join
D) Nested
loops
E)
Cross join
6. What is equivalent of replication
of a table in Netezza?
In data warehouse
environment, there may be situation that requires us to join small dimension
table with large fact tables. When we perform the join between tables, Netezza
always tried to make those two tables as co-located so that it can use MPP
power to process the data quickly. If the table is small, Netezza reads dimension
table from all SPU and assembles in host. Then table is broadcasted to all SPU’s,
resulting a copy of dimension table on each SPU thus making tables co-located.
In the Netezza database
any table count beyond parameter factrel_size_threshold
value is considered as a fact table. That is, if this parameter is set to 1
million, any table count more than 1 million rows is considered as fact table
by Netezza and will not result in this broadcast.
Show
or set fact table size in Netezza is given below. You can run below command
from any client tool or from NZSQL prompt.
Show
factrel_size_threshold
Set
factrel_size_threshold = 10000000
7. What are environmental
variables required to connect to Netezza?
And what are states of Netezza?
Environment
variables: NZ_HOST, NZ_DATABASE, NZ_USER
and NZ_PASSWORD
-Online: Normal or usual state.
-Stopped:
Netezza will shut down after completing current queries, no new queries
allowed.
-Offline: Waits
for completion of current queries, new or queries in queue receive error.
-Paused:
Same as above, but no error displayed. Typically caused during Netezza boot up
or start-up.
-Down: Just plain down, could
be due to Netezza server problem or user initiated.
8. How data is stored in
Netezza?
In
Netezza data is stored based on a field(s) which are select as distribution key(s).
Typical
data storage steps involved in Netezza are
Data
(A) à Hash Function (B)à Logical SPU identifier list (C)
à Physical SPU list (D) à Storage (E)
When
data arrives to netezza server, it is hashed based on field(s) given in
distribution key(s) and a hash function (B) is used for this purpose. For
example, in 32 node system, logical SPU identifier list has 32 unique entries.
If there are 500 hashed data items from (B), there are 500 entries in (C), all
having only 32 SPU entries. For instance, (C) has values [2,20,30,25,1,4,12...].
This way, 500 data entries are mapped. (D) has physical IP address of both
primary and failover SPU. This is the way netezza handles to data storage.
Related Articles:
Very nice vithal sir... :)
ReplyDelete