Sunday, 26 June 2016

nzload Command and Usage in Netezza

How the nzload Command Works

The nzload command is a SQL CLI client tool that loads data from a local or remote client on all supported client platforms (Linux/Windows). It processes the command-line load options, sends queries to the host to create an external table definition, runs an INSERT ... SELECT query to load the data, and drops the external table when the load completes.

The nzload command connects to a Netezza database with a user name and password for that server. Before using any of the Netezza tools, first install the nzodbc or nzjdbc drivers on the client or remote machine.

Required Privileges

To run the nzload command, you must have the CREATE EXTERNAL TABLE privilege as well as access privileges (LIST, INSERT, SELECT) on the target table and database.

Parallel loading and transactions


While an nzload job is running on the host or client machine, it sends the records to be loaded to the SPUs along with the current transaction ID. On receiving those records, each SPU immediately allocates resources and writes them to the database.

You can run multiple nzload jobs in parallel, with all jobs inserting records into the same table. While a load is in progress, you can execute concurrent queries, inserts, updates, and deletes against all committed records in the target tables.

The nzload command performs all insertions into the target table within a single transaction and commits it at the end of the job, provided no fatal errors were detected. If nzload cannot commit the transaction, the storage and resources allocated for the load remain in use. To free this disk space, run the GROOM TABLE command on the affected table.
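As a sketch, reclaiming that space might look like the following (the table name is illustrative):

```sql
-- Reclaim storage left behind by an aborted, uncommitted load
-- ("sales" is an illustrative table name)
GROOM TABLE sales RECORDS ALL;
```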

Order of command execution


The nzload command is a command-line program that accepts input values from multiple sources.
The precedence order, highest first, is:

·         Command line
·         Control file
·         Environment variables (used only for user, password, database, and host)
·         Built-in defaults

Option names are case insensitive.

Command syntax


The following command loads a flat file into the Netezza data warehouse:

nzload -host <host IP/name> -db <database> -u <username> -pw <password> -t <table> -df '/home/dw/data/flat_file.csv' -bf '/home/dw/data/flat_file.bad' -lf '/home/dw/data/flat_file.log' [optional args]

The default delimiter in nzload is the pipe (|). The syntax above is the basic form and can be combined with the various nzload command-line options.

Command line options


Option              Description
-host               Database host address or DNS name
-port               Database port to connect to
-db                 Netezza database name
-u                  Database user name
-pw                 Database password
-t                  Database table to insert data into
-df                 Path of the data file to load; a named pipe can be used in place of a regular file
-delim              The field delimiter used in the data file. The default is the pipe (|).
-cf                 Path of a control file that supplies the load parameters
-requireQuotes      Require all column values to be wrapped in quotes. The default is "FALSE".
-quotedValue        The type of quote used: SINGLE or DOUBLE
-lf                 Name of the log file to generate
-bf                 Name of the bad file to generate in case of bad records
-outputDir          Directory in which the log and bad files are generated. If not specified, the files are created in the current working directory of the job.
-logFileSize        Maximum size of the log file, in MB. The default is 2000 (2 GB).
-compress           Specify whether the data file is compressed. Valid values are "TRUE" and "FALSE"; the default is "FALSE".
-skipRows <n>       Number of rows to skip from the beginning of the data file
-maxRows <n>        Maximum number of rows to load from the data file
-maxErrors <n>      Maximum number of error records allowed before the load terminates. The default is 1.
-ignoreZero         Binary zero bytes in the input data generate errors. Set to "NO" to generate errors or to "YES" to ignore zero bytes; the default is "NO".
-nullValue <token>  Token that indicates a null value in the data file. The default is "NULL". To slightly improve performance, set it to an empty token by passing two single quotes: ''.
-fillRecord         Treat missing trailing input fields as null
-ctrlChars          Accept control characters in char/varchar fields (NUL, CR, and LF must be escaped)
-crInString         Accept unescaped CR in char/varchar fields (LF alone then marks the end of a row)
-truncString        Truncate any string value that exceeds its declared char/varchar length
-dateStyle          Format of date values in the input data. Valid values: "YMD", "Y2MD", "DMY", "DMY2", "MDY", "MDY2", "MONDY", "MONDY2". The default is "YMD".
-dateDelim          Delimiter character between date parts. The default is "-".
-y2Base             First year expressible using a two-digit-year (Y2) dateStyle
-timeStyle          Format of time values in the input data. Valid values: "24HOUR" and "12HOUR".
-timeDelim          Delimiter character between time parts. The default is ":".
-timeRoundNanos     Allow, but round, non-zero digits smaller than microsecond resolution
-boolStyle          Format of Boolean values in the data. Valid values: "1_0", "T_F", "Y_N", "TRUE_FALSE", "YES_NO".
-allowReplay <n>    Allow the load to continue after one or more SPUs reset or fail over, for up to <n> continuations. By default continuation is not allowed; when the option is given, the default for <n> is 1.
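The -cf option reads the load parameters from a control file instead of the command line. A minimal sketch, assuming the standard DATAFILE block format (the paths, database, and table names are illustrative):

```
DATAFILE /home/dw/data/flat_file.csv
{
    Database    dw
    TableName   customer
    Delimiter   '|'
    MaxErrors   10
    Logfile     /home/dw/data/flat_file.log
    Badfile     /home/dw/data/flat_file.bad
}
```

You would then invoke the load with nzload -cf followed by the control file path.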

nzload exit codes


The nzload command exits with the following codes:

·         0 – Successful, all input records were inserted.
·         1 – Failed, no records were inserted due to an error or errors found during the load.

Saturday, 25 June 2016

FPGA-Accelerated Streaming Technology (FAST) Framework in Netezza

Companies around the globe use the Netezza data warehouse appliance to meet their near real-time, complex analytics and data warehousing needs.

Read more details: FPGA FAST framework

NPS performance is in large part a result of the FPGA-Accelerated Streaming Technology (FAST) Engines framework. This framework acts like a turbo-charger on an already powerful engine, and it is what makes Netezza such a powerful server. Netezza has also introduced a FAST Engine called Compress, which delivers another 100-200% increase in streaming analytic performance. Unlike the CPU-intensive compression efforts employed by other vendors, which reduce disk usage at the cost of performance, the Compress Engine is designed primarily to accelerate performance.


The FAST Engines Framework

The FPGA is at the heart of Netezza's patented streaming architecture. Netezza uses the FPGA to handle important functionality and to filter out unwanted data during the early stages of the disk read. This early elimination of unnecessary data improves performance because it frees up downstream resources such as I/O, CPU, memory, and the network, enabling NPS to process only the required data and answer client requests as quickly as possible. The primary functions built into the FPGA in the NPS system are referred to as "engines" and compose the FAST Engines framework.
The FPGA-Accelerated Streaming Technology (FAST) Engines Framework is based on three main concepts:
·         FAST Engines are basic analytic functions electronically programmed into the FPGA to accelerate query performance
·         FAST Engines are dynamically reconfigurable
·         FAST Engines are customized at run time for each snippet executed on the SPU

Dynamic reconfigurability allows each of the engines embedded in the FPGA to be modified, disabled, or extended by the NPS system at any point in time. Run-time customization enables the FPGA to incorporate parameters passed to each engine, optimizing the FPGA's behavior for a particular query snippet. With this snippet-by-snippet customization, one could think of the NPS platform as providing an ‘optimized hardware configuration for each snippet’.

The five current FAST Engines in the framework include Control, Parse, Visibility, Project and Restrict. These engines work in a combined serial and parallel manner.

The overall FAST framework is depicted in the diagram below.

FPGA FAST framework in Netezza

Fig.1. Netezza FAST engine framework

Control Engine

The Control Engine controls the hard disk and manages direct memory access (DMA) data flows from the disk to SPU memory. It manages reads and writes to the disk drive and data flows to memory. Storage management performance-enhancement functions, such as Zone Maps, are enforced through the Control Engine.

Parse Engine

The Parse Engine performs two key roles in the FAST Engines framework.

First, it provides Error-Correcting Code (ECC) checking and correction as data streams out of the disk drive.

Second, it parses the incoming data stream from the disk drive and passes the appropriate data to each of the three remaining downstream engines: Visibility, Project, and Restrict.

 Visibility Engine

The Visibility Engine provides ACID (Atomicity, Consistency, Isolation, and Durability) isolation enforcement for the NPS appliance. As data is streamed from the disk drive, the Visibility Engine removes rows that should not be visible to the particular query snippet being executed, either because the records were marked deleted by an earlier query or because they were added to the database after the start of the current query. Data that should not be visible for processing is immediately filtered out in this FPGA engine.

Project Engine

The Project Engine filters the column data read from the disk based on the "SELECT" clause of the SQL statement. Only columns included in the clause continue on for further processing; any columns not selected are filtered out before any records are sent to the CPU and memory. This step increases query performance because unwanted columns are eliminated early.

Restrict Engine

The Restrict Engine filters the records read from the disk based on the "WHERE" clauses of the SQL statement. Only records that satisfy the clauses continue on for further processing; any records not selected are filtered out before being sent to the on-board CPU and memory.

The Restrict Engine typically results in an exceptionally high degree of data record filtering, greatly reducing the follow-on snippet processing work required of the CPU and memory on each SPU.

Compress Engine

The Compress Engine boosts streaming query performance by 2-3x (100-200%) via a software enhancement to the NPS appliance. It compresses the data so that disk is used much more effectively, and it increases query performance as data flows from the disk drives. Netezza's Compress Engine is designed primarily for performance improvement, not just to reduce data sizes.

The Compress Engine, like all other FAST Engines, is dynamically configurable and optimized at run time. As data is written to disk (e.g., during data load, insert, or update operations), it is compressed into a compiled format, column by column, with the original data replaced by the Compress Engine instruction set for decompilation. As the data is read back from the disk, the Compress Engine reads its instruction set and reassembles the original data as it streams from the disk, effectively raising the streaming data rate by as much as 200%; this raises the effective scan rate per SPU node from over 60 MB/s to approximately 200 MB/s.

Read about how the FAST engines process a SQL query here: How FPGA Works When you run SQL query in Netezza?

Related Articles:


How FPGA Works When you run SQL query in Netezza?



The performance-multiplying effect of the FPGA FAST engines (Control, Parse, Visibility, Project, and Restrict) is illustrated with a small example of processing a SQL query in Netezza.

Consider following simple query in Netezza:

Select state, gender, age, count(*)
From Fact_table
Where dob > '12/31/1999'  and dob < '04/01/2000' And zip = 32605
Group by state, gender, age;

Below are the steps the FPGA FAST framework follows to process this query:

Step 1: The FPGA uncompresses the data by referring to the instruction set maintained during the compression process.

Step 2: The storage manager and FPGA use Zone Maps to limit the disk read to only those extents on each SPU containing dates of birth in the three-month period of January through March 2000, instead of scanning the entire table.

Step 3: As the data is read from the disk, the FPGA applies the Visibility Engine to restrict the records being examined to only those that meet the ACID isolation requirements.

Step 4: The Restrict Engine further limits the rows to records within the three-month range and with a zip code matching 32605.

Step 5: Finally, the Project Engine filters out all unneeded columns, so the data streamed to memory and the CPU is limited to the state, gender, and age of each record; only a small fraction (perhaps 2-3%) of the column data reaches the CPU and memory.


Netezza Data Loading Guide


Data loading in Netezza means transferring data into the Netezza appliance. There are several ways to transfer data into the appliance.

Read: Netezza Data Loading Guide

External tables

External tables are stored as flat files on the host or client system, not in the Netezza appliance database. You create an external table on top of the flat file that needs to be loaded, then use that external table to load the target table in the Netezza appliance.

An external table allows Netezza to treat an external flat file as a database table. An external table has a definition (a table schema) that matches the columns in the flat file, but the actual data exists outside of the Netezza appliance database. External tables can be used to access files stored on the Netezza host server or, in the case of a remote external table, files on a client system. To use external tables, you need the CREATE EXTERNAL TABLE privilege.

After you have created the external table definition, you can use INSERT INTO statements to load data from the external file into a database table, or SELECT FROM statements to query the external table. [ Read about external table command here ]

Command Syntax

The CREATE EXTERNAL TABLE command has the following syntax.
1.       To create an external table based on another table:

CREATE EXTERNAL TABLE table_name
SAMEAS table_name
USING external_table_options

2.       To create an external table by defining columns:

CREATE EXTERNAL TABLE table_name
({ column_name type
[ column_constraint [ ... ] ]} [, ... ]
)
[USING external_table_options]
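Putting the pieces together, a sketch of a load through an external table (the file path and table names are illustrative):

```sql
-- Define an external table over a pipe-delimited flat file,
-- borrowing the column layout of the target table
CREATE EXTERNAL TABLE ext_customer
SAMEAS customer
USING (DATAOBJECT('/home/dw/data/customer.csv') DELIMITER '|');

-- Load the target table from the external table
INSERT INTO customer SELECT * FROM ext_customer;

-- Drop the external table definition once the load completes
DROP TABLE ext_customer;
```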

nzload


nzload is the bulk copy command available in Netezza. It provides an easy method for using external tables to get data into the Netezza appliance.

The nzload command is a SQL CLI client tool that loads data from a local or remote client on all supported client platforms (Linux/Windows). It processes the command-line load options, sends queries to the host to create an external table definition, runs an INSERT ... SELECT query to load the data, and drops the external table when the load completes. Before using any of the Netezza tools, first install the nzodbc or nzjdbc drivers on the client or remote machine.

Because nzload uses external tables, you must have the CREATE EXTERNAL TABLE privilege in order to use the nzload command.

Command Syntax

The following command loads a flat file into the Netezza data warehouse:

nzload -host <host IP/name> -db <database> -u <username> -pw <password> -t <table> -df '/home/dw/data/flat_file.csv' -bf '/home/dw/data/flat_file.bad' -lf '/home/dw/data/flat_file.log' [optional args]

The default delimiter in nzload is the pipe (|). The syntax above is the basic form and can be combined with the various nzload command-line options.

nz_migrate


This is a separate tool, not part of the Netezza software package. The utility is a script that can migrate (copy) a database or table from one Netezza appliance to another, or make a copy of a database or table on the same server. If you are upgrading to another version of the Netezza appliance, this is the best available tool to sync the data between the two servers.

Command syntax

nz_migrate  -sdb <dbname> -tdb <dbname> -thost <name/IP>  [optional args]

The command can be run from either the source or the target host. Make sure all the parameter values are correct; otherwise, you can end up corrupting the existing data.

 Backup and Restore


There are different methods for doing backups and restores to transfer data between systems. One method is to create external tables and use nzload.

Wednesday, 22 June 2016

Working with Zone Maps in Netezza


Before getting to zone maps, let's first understand extents. An extent is the smallest unit of storage allocation on the disk; each extent is 3 MB in size. Netezza maintains the maximum and minimum value of each zone-mappable column stored in an extent.

Read in more details: Netezza Zone Maps

A zone map in Netezza is an internal mapping structure over extents that takes advantage of the internal ordering of the data. Zone maps are automatically generated internal table structures that the Netezza system uses to improve SQL query response times when the queried data consists of largely grouped or nearly ordered date, timestamp, byteint, smallint, integer, and bigint columns. Netezza does not create zone maps for numeric types larger than 8 bytes.

Zone maps transparently reduce disk scan operations by eliminating unreferenced data. A zone map records the MIN and MAX values of the data in each extent. Whenever a SQL query is issued with a WHERE clause, the Netezza system first scans the zone maps to find which extents can contain matching column values, then scans only those extents in sequential order, returning results as matches are found.
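As a sketch, consider a query with a range predicate on a zone-mapped column (the table and column names are illustrative):

```sql
-- With a zone map on order_date, only extents whose [MIN, MAX]
-- range overlaps the predicate are read from disk; every other
-- extent is skipped without any I/O
SELECT COUNT(*)
FROM orders
WHERE order_date BETWEEN '2016-01-01' AND '2016-03-31';
```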

Netezza creates or updates zone maps in the following scenarios:

·         When the GENERATE STATISTICS command is run
·         When data is loaded using the nzload command
·         When data is inserted into or updated in a table -- as rows are inserted or updated, the zone map entries are updated for the new ranges
·         When a GROOM operation is performed to remove deleted records

Netezza does not create zone maps for very small tables (smaller than 10 MB, the default threshold). Zone maps give excellent results when a table is ordered on its most restrictive column (i.e., the column used in the WHERE clause). This reduces I/O operations, since the matching rows are grouped in the same or nearby extents and the rest of the extents need not be scanned.

To see all the columns on which Netezza can create zone maps, use nz_zonemap from the SQL toolkit. If you don't have the toolkit installed, install it as soon as possible; it contains a lot of useful scripts that will make your life much easier.

Query performance and data slice issues

If the table size on a data slice exceeds 192 GB, you can get an error. Follow the steps below to handle this situation.

·      Perform GROOM on the table to remove deleted records; this frees up space on the data slice.
·      Redistribute the data on a different column if the existing column is causing high skew.
·      Create smaller tables out of the big table and create a view over them with a UNION ALL operation.
·      Use a TRUNCATE TABLE statement instead of DELETE.

Even the query planner does not show whether zone maps are used, but the zone map debug tools are very helpful in getting this information. Use set enable_zmap_debug=1 in an nzsql session to report zone map usage for a SQL statement. The setting enables zone map debugging for the current session only, and it is disabled automatically once the SQL execution completes.
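A sketch of such a session in nzsql (the query is illustrative, and the debug output format varies by release):

```sql
-- Enable zone-map debug output for the current session only
SET enable_zmap_debug = 1;

-- The debug output now reports zone-map usage for this scan
SELECT COUNT(*) FROM orders WHERE order_date >= '2016-01-01';
```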


When query performance is poor, always verify whether zone maps are disabled, the data is unordered, or deleted records are still present in the database (forgot to run the GROOM command?). The GENERATE STATISTICS command disables zone maps while it runs, so if the command happens to be interrupted, the zone maps are left disabled. Netezza will never scan disabled zone maps; re-run GENERATE STATISTICS on the column to re-enable them.

Related Articles:

Monday, 20 June 2016

Distribution key in Netezza. Importance of right Distribution key

This post is all about how data is distributed in the Netezza server. Feel free to make comments or suggestions to improve it, or pass it on if you like.

Read: Importance of right Distribution key in Netezza 

Let's first understand how NPS stores the data on disk drives. Each snippet processor in the Snippet Processing Unit (SPU) has its own CPU, FPGA, and RAM, plus a dedicated hard drive; the data stored on that drive is called a data slice.
Each SPU disk is divided into three partitions.

Primary Partition: This partition contains its own set of data.

Mirror Partition: This partition contains a copy of another SPU's primary partition data; the mirror partition automatically mirrors another SPU's primary data slice. This process is called mirroring (also called replication). In a multi-rack configuration, a SPU's mirror is located in another rack, which gives Netezza fault tolerance: if a rack fails, the data mirrored in the other rack can be used for recovery.

Swap Partition: The swap partition is used for aggregating, sorting, and other query operations. It is not mirrored; in other words, it holds temporary or intermediate processing data, and Netezza uses it to perform SQL operations.

How data is distributed in Netezza

Fig.1. Data distribution and mirroring


Tables are split across multiple SPUs and data slices. Rows are stored in groups in the same or nearby data slices, while the data is compressed on identical column values, i.e., columnar compression.

Now the question is, how the data is distributed?

The actual distribution of data across the disks is determined by the distribution key(s) provided in the table definition.
In fact, Netezza offers two distribution methods: hash and random. A distribution column (or columns) should be specified; if the DISTRIBUTE ON clause is omitted, the system uses the first column as the default distribution column with the hash algorithm. At most four columns can be specified in the DISTRIBUTE ON clause. When the system creates a record in the Netezza database, it assigns the record to a logical slice based on its distribution key(s) and then pushes it to the physical storage identified by that logical slice.


 Fig.2. IBM Netezza storage design
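The two distribution methods described above can be sketched in table definitions like the following (the table and column names are illustrative):

```sql
-- Hash distribution on a high-cardinality column
CREATE TABLE customer
(
    customer_id INTEGER,
    state       CHAR(2),
    gender      CHAR(1)
)
DISTRIBUTE ON (customer_id);

-- Random (round-robin) distribution when no good key exists
CREATE TABLE audit_log
(
    event_time  TIMESTAMP,
    message     VARCHAR(1000)
)
DISTRIBUTE ON RANDOM;
```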

Data should be distributed uniformly across all the data slices; processing performance depends directly on it. How evenly the data spreads over the data slices is directly related to the key column(s) used in the DISTRIBUTE ON clause. A bad distribution key can cause Netezza to place data on the same slices, which causes data skew, a major performance bottleneck.

For example, consider a table distributed on a column that contains only flags such as Y or N. When data is inserted into the table, the hashed values point to only two data slices, and all of the data is placed on those slices.

A distribution method that distributes data evenly across all data slices is the single most important factor for increasing overall system performance. A bad distribution key results in uneven distribution of a table across data slices and SPUs, which causes skew and forces data to be redistributed or broadcast, and that of course hampers system performance. It is very important to identify the correct distribution key when creating the table definition, and that requires extremely good knowledge of the data and the system.
Keep the table structure simple; that helps to attain maximum co-location so that Netezza can use its MPP power to return results as fast as possible.

Distribution on multiple keys

The maximum number of columns that can be specified in the DISTRIBUTE ON clause is four. Keep in mind that you should not use multiple keys if a single key already provides good data distribution. Distribution keys should also appear in the join condition in order to achieve co-location; if you use multiple distribution keys, all of those columns must be used in the joining conditions. That is a rare case, so it is usually better to stick to a single distribution key that distributes well. The distribution is not an index; it is based on a hash algorithm. In a multi-key distribution, the join does not look at one column and then the next; it considers all the key columns at once, because they are hashed together for the distribution.

Leaving out ‘Distribution on’ clause

If a distribution key is not specified, the system by default distributes the data on the first column (just like Teradata) using the hash method. This can sometimes cause skew if the first column is something like a flag. The best approach is to distribute on a well-chosen column, or on random if you cannot identify a good one.

What are factors when choosing distribution key?

When choosing the distribution key columns for a table, your goal should be uniform distribution of the rows and optimal access to the data. The main goal of distribution is to store related data co-located.

Consider the following factors when choosing distribution keys:

·         The more distinct the distribution key values, the better.
·         The system distributes rows with the same distribution key value to the same data slice.
·         Parallel processing is more efficient when you have distributed table rows evenly across the data slices.
·         Tables used together should use the same columns for their distribution key. For example, in an order system application, use the Customer ID as the distribution key for both the customer table and the order table.
·         If a particular key is used largely in Equi-join clauses, then that key is a good choice for the distribution key.
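The customer/order example from the list above can be sketched as follows (the names are illustrative):

```sql
-- Both tables hashed on the join key, so matching rows land on the
-- same data slice and the join is co-located (no redistribution)
CREATE TABLE customer
(
    customer_id INTEGER,
    name        VARCHAR(100)
)
DISTRIBUTE ON (customer_id);

CREATE TABLE orders
(
    order_id    INTEGER,
    customer_id INTEGER,
    amount      NUMERIC(10,2)
)
DISTRIBUTE ON (customer_id);

SELECT c.name, SUM(o.amount)
FROM customer c
JOIN orders o ON o.customer_id = c.customer_id
GROUP BY c.name;
```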

Related Articles: