DataStage (DS) provides stages that allow you to use FastExport, MultiLoad, FastLoad and TPump. In addition, you can use the Teradata (TD) API stage and the ODBC stage for extraction, loading, lookup and manipulation of data. With IBM Information Server (DataStage 8.x, the latest version of DataStage), the much-awaited Teradata Connector for the Teradata Parallel Transporter (TPT / Teradata PT) joined the fleet of TD stages. There is more good news coming with IBM Information Server:
>> Supports TD stored procedures
>> Supports TD macros
>> Supports restart capability and reject links for bulk loads
Utilities which can be used in DataStage:
--> FAST EXPORT
--> FAST LOAD
--> MULTI LOAD (MLOAD)
--> BTEQ (Basic Teradata Query facility)
--> TPUMP
--> TPT / Teradata PT (Teradata Parallel Transporter) [With IBM Information Server / DataStage 8x onwards only]
DataStage Teradata (TD) native connection stages available:
--> TD Enterprise Edition (EE)
--> TD TPUMP
--> TD MLOAD
--> TD API
--> TD Connector for TPT / Teradata PT (Teradata Parallel Transporter) [With IBM Information Server / DataStage 8x onwards only]
DataStage generic ODBC connection available:
--> DataStage Generic ODBC
--> DataStage DRS (DRS is Dynamic RDBMS Stage. It was introduced to allow the likes of PeopleSoft to deploy canned DataStage jobs to their customers: with a DRS you dynamically choose the database connection).
TD API as Source/Lookup/Target:
Uses the TD CLI native programming interface (API). This API lets network-attached systems connect and process SQL statements in the native TD environment. It submits whatever statements you define, exactly as you define them, on a per-incoming-row basis, with a COMMIT sent after each "Transaction Size" of rows.
Advantages:
--> Better performance and speed (rows / sec) over generic ODBC stage.
--> Support for TD client
--> Simplified configuration on UNIX platforms.
--> Advanced support for target table DDL (i.e., create and drop)
--> Native metadata import support (Ability to import table metadata and store in DS repository)
--> Reject rows handling
Disadvantages:
--> Does not support non-ANSI SQL statements in stage-generated SQL.
--> Does not support byte data types.
--> Does not generate TD version-specific SQL in stage-generated statements.
--> Does not support 'LIKE' in user-defined SQL when used as a lookup.
The TD API stage can also run in parallel mode. Writing/inserting records with the TD API in parallel mode gives much better performance than running it sequentially. When used in parallel mode, you must set the partitioning property to HASH and select all the columns of the target table's Primary Index for partitioning. You might get occasional blocks due to Teradata hash synonyms, but this should be minimal on most low-volume loads. If you plan to implement a restart strategy in a job that uses the TD API for loading, it is advisable to write as UPSERT, since there is no cleanup. The API will generate upsert code automatically, but as two SQL statements, not an atomic upsert. If you want an ATOMIC upsert, you have to take the auto-generated code and modify it slightly to the ATOMIC upsert syntax.
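As a sketch of that last step, Teradata's atomic upsert is a single UPDATE ... ELSE INSERT statement. The table and column names below are illustrative, not taken from any stage-generated code:

```sql
-- Two-statement upsert collapsed into Teradata's atomic form:
-- the UPDATE runs first; if no row matches the PI value, the INSERT fires.
UPDATE mydb.target
SET col2 = :col2, col3 = :col3
WHERE col1 = :col1          -- col1 = Primary Index column
ELSE INSERT INTO mydb.target (col1, col2, col3)
VALUES (:col1, :col2, :col3);
```

Because it is one statement, the upsert is applied atomically per row, which is what makes the restart strategy safe.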
Note: Upsert strategy - if most incoming rows already exist in the DB, it is faster to update first (update existing, else insert new) than to insert first (insert new, else update existing); and vice versa.
Findings:
-->Data loading modes supported - Insert/Delete/Upsert/Append
-->Uses a Teradata Utility - No
-->If Yes, Utility used - NA
-->Parallel Features of DataStage Supported - Yes (Conditional)
-->Runs in sequential or Parallel Mode - Both
-->Reject rows handling inside DataStage - Yes
-->Before and After SQL execution facility - Yes
-->Lock strategy (Row/Page/Table) - Row
-->Uses Temp/Work tables - No
-->Uses Error Tables - No
-->Ability to control job based on rows rejected to error table - No
-->Ability to write output to files - No
-->Uses named pipes - No
-->Check Point restart support - No
-->Direct loading support - No (No TD Utility used)
-->Can be used as look up stage? - Yes
-->Can be used for sparse lookup - No
Tips:
* For best performance, TD client should NOT be the same machine where TD server is installed.
* Take note of the timestamp format used in TD - "%yyyy-%mm-%dd %hh:%nn:%ss" - while doing date conversions.
* The ODBC stage is the only stage that allows you to do sparse lookups on Teradata tables.
TD EE as Source/Lookup/Target:
When used as a source, it invokes FASTEXPORT (TD utility); when used as a target, it invokes FASTLOAD (TD utility). The number of sessions created is governed by the RequestedSessions and SessionsPerPlayer options in the stage.
TD EE (source/lookup) invokes FastExport, which produces one answer set. All the parallel processing is performed on the TD server, not in DataStage. FastExport cannot handle SELECTs that use a unique-key constraint and would return only one row, so be careful with unique-key constraints; these are set on the TD side, not by DataStage. If you use this stage two or three times for source tables/lookups in a job and run a couple of such jobs at once, you will have invoked too many FastExports. For explicit exports or big data pulls the TD EE stage works fine, or you can use the MLOAD stage (which runs FastExport when used as a source stage).
TD EE (target) invokes FastLoad for bulk loading into TD tables. FastLoad does not support secondary indexes in Teradata. The TD EE stage creates a work table when an append to a Teradata table is selected for the job, because FastLoad must load into an empty table, which is impossible with an append operation. To get around this, DataStage FastLoads into a generated empty work table and then runs an insert-select from the work table on the database side. The Append option creates an empty TD table with ALL fields but NO defaults. It generates this empty table from the Teradata metadata, NOT your DataStage job metadata. Also, unknown columns are replaced with null.
Ex:
Incoming columns are col1, col2, col3
Target table columns are col1, col2, col3, col4, col5 with col4 default value as 0 and col5 as current_timestamp.
Step 1:
Creation of orch_work table with:
CREATE TABLE ORCH_WORK_xxxxx AS ( SELECT * FROM TargetTable ) WITH NO DATA PRIMARY INDEX (col1)
Note: The orch_work table doesn't keep default values, i.e. it also has col4 and col5, but with no default values.
Step 2:
Incoming records are loaded in orch_work with values col1, col2, col3,null,null.
INSERT INTO ORCH_WORK_xxxxx VALUES (:col1, :col2, :col3, null, null)
Step 3:
Append using Insert command into Target table:
INSERT INTO TargetTable SELECT * FROM ORCH_WORK_xxxxx
Caution: Step 3 will fail if col4 and col5 (or either of them) are defined as NOT NULL in the target table. To avoid this you need to pass col4 and col5 with default values inside the job itself.
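In other words, have the job populate col4 and col5 itself, so that the Step 2 load becomes equivalent to the following sketch (the literal values are the defaults from the example above):

```sql
-- Work-table load with the defaults supplied explicitly by the job,
-- so the final insert-select cannot hit a NOT NULL violation.
INSERT INTO ORCH_WORK_xxxxx
VALUES (:col1, :col2, :col3, 0, CURRENT_TIMESTAMP);
```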
FastLoad creates two error tables, first executing drop statements for tables of the same names, as part of "BEGIN LOADING". It drops these two tables if they are empty as part of the "END LOADING" transaction. In case of a problem (a duplicate primary-index key, for example), error table number 2 must be removed manually. The names of the two tables are ERR_cookie_1 and ERR_cookie_2, where cookie is found in the terasync table; in that table, the start-time and end-time (integer) fields help you find the last ones inserted. You cannot modify these error table names.
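A hedged cleanup sketch of that manual step; the cookie value and the terasync column layout below are placeholders to be resolved against your own installation:

```sql
-- Inspect terasync to locate the most recent cookie via its integer
-- start-time / end-time fields (actual column names vary by install).
SELECT * FROM terasync;

-- Then manually drop the leftover second error table, substituting
-- the cookie value found above (12345 here is purely hypothetical).
DROP TABLE ERR_12345_2;
```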
FastLoad delivers very high performance with only two constraints: duplicate rows are silently dropped, and error detection/correction is weak.
Findings:
-->Data loading modes supported - Insert/Delete/Upsert/Append
-->Uses a Teradata Utility - Yes
-->If Yes, Utility used - FastExport and FastLoad
-->Parallel Features of DataStage Supported - Yes
-->Runs in sequential or Parallel Mode - Parallel
-->Reject rows handling inside DataStage - No(7x) & Yes(8x)
-->Lock strategy (Row/Page/Table) - Table
-->Uses Temp/Work tables - Yes
-->Uses Error Tables - Yes
-->Ability to control job based on rows rejected to error table - Yes
-->Ability to write output to files - Yes
-->Uses named pipes - Yes
-->Any related APT or DS Parameters that can change/control the functionality - Yes
-->Direct loading support - Yes (TD Utility used)
-->Can be used as lookup stage - Yes
-->Before and after sql (open and close command) option available - Yes
-->Can be used for sparse lookup - No
Tips:
1) The TD EE stage creates a special terasync table in the source database; if you don't have create and write privileges in that database, you will encounter an error. The workaround is to point the TD EE stage at another place in the database where you do have sufficient privileges, by setting a couple of environment parameters:
APT_TERA_SYNC_DAEMON_PROG = [server name for terasync table]
APT_TERA_SYNC_DATABASE = [db name for terasync table]
APT_TERA_SYNC_USER = [user name used for terasync table]
APT_TERA_SYNC_PASSWORD = [password for user used to access terasync table]
2) There is a relation between SessionsPerPlayer, the number of nodes and the resulting Teradata sessions generated. For example, with SessionsPerPlayer=2 you will have 16 Teradata sessions on a 32-AMP Teradata system. RequestedSessions is completely independent of the number of nodes in the DataStage configuration file; it depends on the number of VPROCs (AMPs) in the Teradata system. A 168-AMP system will create 84 sessions using the defaults (SessionsPerPlayer = 2), regardless of the EE configuration file. In a high-volume environment, try to balance performance and Teradata sessions by tuning SessionsPerPlayer.
3) If TENACITY is not set with a timeout limit, a process that cannot get a Teradata session will abort immediately.
4) Restart on TD EE basically works by starting over from the beginning. It also does not clean up the old FastLoad tables, so an insert strategy should include provisions for this scenario.
5) If you have a large TD instance (a lot of VPROCs), set RequestedSessions to something more manageable to keep repartitioning to a minimum. Repartitioning the data is expensive and can cause various resource issues.
Set the right RequestedSessions in the stage's advanced connection options to keep this to a minimum. The TD operator uses two values to determine how many players to start: RequestedSessions and SessionsPerPlayer. By default, if no value is set for RequestedSessions, it looks at the number of VPROCs used for the table, and SessionsPerPlayer defaults to 2 (one read session and one write session). Here's a rule of thumb:
RequestedSessions / SessionsPerPlayer = number of player processes (which works best when it equals the number of nodes in the config file; e.g. RequestedSessions=16 with SessionsPerPlayer=2 gives 8 players, a good fit for an 8-node configuration file).
Setting these values will definitely improve startup time and runtime.
6) For some actions the Teradata Enterprise stage lets you put in open and close commands that run at the start or end of the DataStage job.
Teradata FastLoad features
--> Target table must be empty
--> Insert only (no updates)
--> Only one target table
--> Must drop secondary indexes before loading and recreate them afterward
Also note: for FastLoad, IDENTITY columns are unsupported
TD Mload as Source/Lookup/Target
Multiload (as source) uses the FastExport utility.
Multiload (as target) is very efficient when you are doing maintenance activities on multiple large tables. The Multiload stage can perform inserts/updates on up to 5 different tables in one pass. Work tables and error tables are created each time you perform a Multiload operation; they are dropped automatically once the job has run successfully. However, if the job aborts, the work tables have to be dropped manually before the job is run again.
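A hedged sketch of that manual cleanup, using MultiLoad's default table-name prefixes (WT_/ET_/UV_); the target table name is illustrative and the actual table names should be verified in your environment:

```sql
-- Release the MLOAD lock left on the target table after the abort
RELEASE MLOAD mydb.target;     -- append "IN APPLY" if it aborted in the apply phase

-- Drop the leftover MultiLoad tables (default naming shown)
DROP TABLE mydb.WT_target;     -- work table
DROP TABLE mydb.ET_target;     -- acquisition-phase error table
DROP TABLE mydb.UV_target;     -- uniqueness-violation error table
```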
Mload Options - Load Method:
Invoke Multiload - MLoad is invoked automatically when the job runs. The stage creates named pipes to transmit data to MLoad and then starts the MLoad process. The stage allows 720 seconds (12 minutes) for writing to the named pipe (i.e. the acquisition phase) and then kicks off the load process; otherwise it fails the job. MultiLoad places locks on the entire table.
You can always change this phase time by changing:
DS_TDM_PIPE_OPEN_TIMEOUT
Manual - Data is stored in .dat file with name you specified in the stage. You can execute an Mload script independent of the job in this case which points to this .dat file.
Teradata MultiLoad features
--> Target table can be populated whether or not it is empty
--> Allows inserts, updates, deletes, and upserts
--> Loads into multiple target tables
--> Allows non-unique secondary indexes; automatically rebuilds them after loading
TD TPUMP as Source/Lookup/Target
TPump is a highly parallel utility designed to continuously move data from data sources into Teradata. It is typically used for loading a small quantity of records relative to the size of the target table. TPump works at the row level, whereas MultiLoad and FastLoad update whole blocks of data. TPump allows loading data into tables with referential integrity, which MultiLoad doesn't. TPump only needs to take row-level locks; in other words, it only locks the row it is modifying, whereas MultiLoad places locks on the entire table. If you need multiple processes updating a table simultaneously, TPump may be the better solution. TPump also uses fewer system resources, so you can run it concurrently with user queries without impacting system performance. TPump lets you control the rate at which updates are applied: you can dynamically "throttle down" when the system is busy and throttle up when activity is lighter.
TPump does not take a utility slot. It is designed for "trickle feed", taking individual row-level locks. If you use TPump, make sure you follow normal TPump standards (set the KEY statements equal to the PI and turn SERIALIZE on). TPump is typically for processes that are constantly retrieving and processing data, such as from a message queue. Using TPump offers a controllable transition to updating that is closer to real time. It is the best fit in "active", "real-time" or "closed-loop" data warehousing.
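A minimal standalone TPump script sketch showing those standards (KEY on the Primary Index column, SERIALIZE ON); the logon string, database, table, field layout and tuning values are all illustrative assumptions, not prescriptions:

```sql
.LOGTABLE mydb.tpump_log;
.LOGON tdpid/user,password;
.BEGIN LOAD SESSIONS 4 SERIALIZE ON PACK 20 ROBUST ON
       ERRORTABLE mydb.tpump_err;
.LAYOUT lay1;
  .FIELD col1 * INTEGER KEY;    /* KEY on the Primary Index column */
  .FIELD col2 * CHAR(10);
.DML LABEL ins;
  INSERT INTO mydb.target (col1, col2) VALUES (:col1, :col2);
.IMPORT INFILE input.dat LAYOUT lay1 APPLY ins;
.END LOAD;
.LOGOFF;
```

Marking the PI fields with KEY plus SERIALIZE ON ensures all operations on the same row go through the same session, in order, which is what prevents self-blocking on hash synonyms.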
Findings:
-->Data loading modes supported - Insert/Delete/Upsert/Append
-->Uses a Teradata Utility - No (TPump does not consume a utility slot)
-->If Yes, Utility used - NA
-->Parallel Features of DataStage Supported - Yes
-->Direct loading support - No
-->Can be used for sparse lookup - No
Tips:
* Feeding clean data from your transformation processes into TPump is important for overall performance: data with errors makes TPump slow (runtime +40% or longer). Fallback also lengthens runtime (+45%), SERIALIZE adds 30% and ROBUST adds 15%.
* Ask your DBA to assign the TPump user to a higher priority performance group when the TPump job runs at the same time as decision support queries, if the TPump completion time is more critical than the other work active in the system.
* It uses time based checkpoints not count based.
* Does not support MULTISET tables.
* Can have Secondary Indexes and RI on tables.
TD Connector for TPT / Teradata PT(Teradata Parallel Transporter)
TPT is an object-oriented client suite that executes multiple instances of data extraction, transformations, and loading functions in a scalable, high-speed, parallel-processing environment.
Main features:
--> All-in-one! Single infrastructure for all loading/unloading needs using single scripting language.
--> Greatly reduces the amount of file I/O and significantly improve performance.
--> Push up and Push down features.
--> Provides unlimited symbolic substitution for the script language and application programming interface (API).
--> Combines functionality of TD FastLoad, MultiLoad, FastExport, TPump and API.
--> Respective modules or operators for the protocols of fastload, multiload, tpump and fastexport are named as Load, Update, Stream and Export.
-->Apart from the four operators, on API & ODBC front, there are operators like Selector, Inserter and more.
Think of TD, think of one stage - TD Connector for TPT stage.
Two access methods/modes of execution are available in IBM Information Server (DataStage 8) for this stage:
1) Immediate - For Selector, Inserter... operators. Same as CLI interface/API. (No Load/Unload Utilities).
2) Bulk - For Load, Update, Stream and Export operators (i.e. fastload, multiload, tpump and fastexport respectively).
Remember to install this utility (TPT) on the DataStage server in order to use the range of operators it supports (the same applies to most of the other TD utilities).
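To give a feel for the single scripting language mentioned above, here is a minimal TPT script sketch wiring an Export operator into a Load operator. Job, schema, operator, database and table names are illustrative, and the attribute lists are heavily abbreviated (real operators also need logon attributes such as TdpId/UserName/UserPassword):

```sql
DEFINE JOB copy_table
DESCRIPTION 'Export from one table, bulk load into another'
(
  DEFINE SCHEMA src_schema (col1 INTEGER, col2 VARCHAR(10));

  /* FastExport protocol */
  DEFINE OPERATOR exp TYPE EXPORT SCHEMA src_schema
    ATTRIBUTES (VARCHAR SelectStmt = 'SELECT col1, col2 FROM mydb.src;');

  /* FastLoad protocol */
  DEFINE OPERATOR ld TYPE LOAD SCHEMA *
    ATTRIBUTES (VARCHAR TargetTable = 'mydb.tgt');

  /* Stream rows from the export operator straight into the load operator:
     no intermediate flat file, which is where the I/O savings come from. */
  APPLY ('INSERT INTO mydb.tgt (:col1, :col2);')
  TO OPERATOR (ld)
  SELECT * FROM OPERATOR (exp);
);
```

Swapping the Load operator for Update or Stream switches the same pipeline to the MultiLoad or TPump protocol, which is the "all-in-one" point of TPT.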