Monday, September 1, 2014

DataStage Job Execution Flow

When you execute a job, the generated OSH and the contents of the configuration
file ($APT_CONFIG_FILE) are used to compose a "score". This is similar to a SQL
query optimization plan.
At runtime, IBM InfoSphere DataStage identifies the degree of parallelism and
node assignments for each operator, and inserts sorts and partitioners as
needed to ensure correct results. It also defines the connection topology (virtual
data sets/links) between adjacent operators/stages, and inserts buffer operators
to prevent deadlocks (for example, in fork-joins). It also defines the number of
actual OS processes. Multiple operators/stages are combined within a single OS
process as appropriate, to improve performance and optimize resource
requirements.
The job score is used to fork processes with communication interconnects for
data, messages, and control. Processing begins after the job score and
processes are created. Job processing ends when either the last row of data is
processed by the final operator, a fatal error is encountered by any operator, or
the job is halted by DataStage Job Control or human intervention such as
DataStage Director STOP.
Job scores are divided into two sections — data sets (partitioning and collecting)
and operators (node/operator mapping). Both sections identify sequential or
parallel processing.


The execution environment (the "orchestra") manages control and message flow across processes
and consists of the conductor node and one or more processing nodes as shown
in Figure 1-6. Actual data flows from player to player — the conductor and
section leader are only used to control process execution through control and
message channels.
- Conductor is the initial framework process. It creates the Section Leader (SL)
processes (one per node), consolidates messages to the DataStage log, and
manages orderly shutdown. The Conductor node has the start-up process.
The Conductor also communicates with the players.
Note: You can direct the score to a job log by setting $APT_DUMP_SCORE.
To identify the Score dump, look for "main program: This step....".
- Section Leader is a process that forks player processes (one per stage) and
manages up/down communications. SLs communicate between the
conductor and player processes only. For a given parallel configuration file,
one section leader will be started for each logical node.
- Players are the actual processes associated with the stages. They send stderr
and stdout to the SL, establish connections to other players for data flow,
and clean up on completion. Each player has to be able to communicate
with every other player. There are separate communication channels
(pathways) for control, errors, messages and data. The data channel does
not go through the section leader/conductor as this would limit scalability.
Data flows directly from upstream operator to downstream operator.

How to Release a Job Lock in DataStage from the UNIX Backend?

1. Log in as the DataStage server administrator user on the UNIX machine where DataStage Server is installed.
2. cd into the DataStage home directory. You can enter "cat /.dshome" to find it.
3. Enter ". ./dsenv" (i.e. .space./dsenv) to source the dsenv file.
4. Enter "bin/uvsh". It will bring you into UV Administration menu. Hit ESC key to get out of the menu and it bring you to ">" TCL prompt.
5. At the TCL prompt, enter "LOGTO project name" to log into the project account. The project name is case sensitive.
6. To find out the job description number, enter "LIST DS_JOBS job name".
7. Enter "LIST.READU EVERY" at TCL. Check active record locks under "Item Id" column for job name or RT_CONFIG# or RT_LOG# (# matches the job description number.
8. Write down the inode number and user number for the lock that is not a valid lock.
9. Enter "LOGTO UV" to log into UV account. The UNLOCK command lives in UV account.
10. Enter "UNLOCK INODE inode# USER user# ALL". This will unlock the lock hold on this file (inode#) and hold by this user (user#) for file locks, group locks and record locks.
If you need to see all the locks again, enter "LIST.READU EVERY". Alternatively, enter "UNLOCK USER user# ALL".
11. Type Q to log out of UniVerse.

DataStage Tips for Fast Coding

Ever needed to trim a lot of VarChar columns?
In the Transformer stage you can effect a global change to selected output column derivation expressions. Select the columns whose derivations you want to change, then right-click. From the menu choose Derivation Substitution. The dialog that opens allows you to modify the entire expression or some part of it, identified by a regular expression.
For example, to apply a Trim() function to all the selected derivation expressions, type in Trim($1) in the bottom panel, then click OK.  After acknowledging that you do want to make this change, all the selected derivations will be modified and a message displayed indicating how many were changed.  To modify just the same part of a selection of derivation expressions, choose "partial" rather than whole expression, and supply a regular expression that identifies just that part of the expression to be modified.

Making JMS Connection from DataStage ETL Jobs

This example shows a sample Java class that is integrated into an ETL job using the Java input stage and sends messages to JMS.
Step 1: Create a Java class using the code shown in the example below.
Step 2: Place the compiled Java class file on the DataStage server.
Step 3: Configure the Java input stage in the ETL job (create a row and send a sample record for testing).
Step 4: Ensure the JMS connection factory and provider URL are correctly set in the code.


Example:

import com.ascentialsoftware.jds.Row;
import com.ascentialsoftware.jds.Stage;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import javax.jms.ConnectionFactory;
import javax.jms.Connection;
import javax.jms.Session;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.TextMessage;
//Import the classes to use JNDI.
import javax.naming.*;

import java.util.*;

public class JmsTest extends Stage
{
public void initialize() {
trace("UpperCaseTarget.initialize");
_rowCount = 0;
_resultWriter = null;
String userProperties = getUserProperties();
try {
_resultWriter = new PrintWriter(new FileWriter(userProperties), true);
} catch (IOException eIO) {
fatal("Cannot open '" + userProperties + "': " + eIO.getMessage());
}
}
public void terminate() {
trace("UpperCaseTarget.terminate");
}
public int process() {
// Read a row from the input link and, for each row received, send a
// test text message to a JMS queue looked up through JNDI. A small log
// of the messages sent is written to a file whose path is given
// in the user's properties of the stage.
_rowCount++;
Row inputRow = readRow();
if (inputRow == null) {
// No row currently available or end of data.
// The function must return but it could be called again later on.
// The stage actually ends when "terminate" is called.
return OUTPUT_STATUS_END_OF_DATA;
}

try {

    ConnectionFactory myConnFactory;
    Queue myQueue;
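        // JNDI names under which the connection factory and queue were registered (adjust these to your environment).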
        String MYCF_LOOKUP_NAME = "Test_ConnectionFactory";
        String MYQUEUE_LOOKUP_NAME = "CDC_Queue";

        Hashtable env;
        Context ctx = null;

        env = new Hashtable();

        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.webmethods.jms.naming.WmJmsNamingCtxFactory");
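        // Replace the <Host Name> and <Port Number> placeholders below with the actual webMethods Broker host and port.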
        env.put(Context.PROVIDER_URL, "wmjmsnaming://Broker #1@<Host Name>:<Port Number>");

        ctx = new InitialContext(env);

        myConnFactory = (javax.jms.ConnectionFactory) ctx.lookup(MYCF_LOOKUP_NAME);

        myQueue = (javax.jms.Queue)ctx.lookup(MYQUEUE_LOOKUP_NAME);

    Connection myConn = myConnFactory.createConnection();

    Session mySess = myConn.createSession(false, Session.AUTO_ACKNOWLEDGE);
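    // Note: false + AUTO_ACKNOWLEDGE (above) means a non-transacted session whose messages are acknowledged automatically.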

    MessageProducer myMsgProducer = mySess.createProducer(myQueue);

    TextMessage myTextMsg = mySess.createTextMessage();
    myTextMsg.setText("Hello World From Eclipse");
    _resultWriter.println("Sending Message: " + myTextMsg.getText());
    myMsgProducer.send(myTextMsg);

    mySess.close();
    myConn.close();

} catch (Exception jmse) {
    System.out.println("Exception occurred : " + jmse.toString());
    jmse.printStackTrace();
}
// No rows were written since there is no output link.
return OUTPUT_STATUS_NOT_READY;
}
private int _rowCount;
private PrintWriter _resultWriter;
}

Configuration of DataStage to SVN Repository for Source code Version Control

The steps below describe how to configure the IBM Information Server Manager tool with an SVN repository for automatic check-in and check-out of DataStage jobs (applicable from version 8.5 onwards).
1)      Download Subclipse from its website onto your PC. This is not an executable; there are only plug-in JAR files.
2)      Unzip it to a location on your local PC.
3)      Then use the software updates option in the Help menu of the 'Information Server Manager' client tool.
4)      Note: when selecting which Subclipse features to install, de-select the optional Mylyn feature.
5)      Successful installation can be verified after restarting Information Server Manager and switching to the SVN Repository Exploring perspective.



6)      From here you can add an SVN repository location and connect to your existing SVN host.

DataStage Environment Variables at a Glance

General variables

LD_LIBRARY_PATH - specifies the location of dynamic libraries on Unix
PATH - Unix shell search path
TMPDIR - temporary directory

Parallel properties

APT_CONFIG_FILE - the parallel job configuration file. It points to the active configuration file on the server. Please refer to the DataStage EE configuration guide for more details on creating a config file.
APT_DISABLE_COMBINATION - prevents operators (stages) from being combined into one process. Used mainly for benchmarks.
APT_ORCHHOME - home path for parallel content.
APT_STRING_PADCHAR - defines a pad character which is used when a varchar is converted to a fixed length string

Operator specific
The operator-specific variables under parallel properties are stage-specific settings, usually set during installation. The settings apply to the supported parallel database engines (DB2, Oracle, SAS and Teradata).

APT_DBNAME - default DB2 database name to use
APT_RDBMS_COMMIT_ROWS - RDBMS commit interval

Reporting
The reporting variables control logging options and take True/False values only.

APT_DUMP_SCORE - shows operators, datasets, nodes, partitions, combinations and processes used in a job.
APT_RECORD_COUNTS - helps detect and analyze load imbalance. It prints the number of records consumed by getRecord() and produced by putRecord()
OSH_PRINT_SCHEMAS - shows unformatted metadata for all stages (interface schema) and datasets (record schema). Set this variable to verify that runtime schemas match the job design column definitions (especially from Oracle).
OSH_DUMP - shows an OSH script and produces a verbose description of a step before executing it
APT_NO_JOBMON - disables performance statistics and process metadata reporting in Designer.
APT_PM_PLAYER_TIMING - Shows how much CPU time each stage uses.
APT_PM_SHOW_PIDS - Shows the process ID of each stage.
Compiler

APT_COMPILER - path to the C++ compiler needed to compile transformer stages.

DataStage Error Code debugging


What should you do if DataStage does not provide a proper error message?

The APT_CombinedOperatorController is part of the internal working of DataStage. When you hit compile, DataStage is going to try to tune your code by combining several of your stages into one section of code. This is the APT_CombinedOperator. Normally you see a more meaningful error in your log. It could be before or after this error.

If you don't see anything more meaningful, add the environment variable $APT_DISABLE_COMBINATION and set it to True. Recompile and run again. This will keep all of your stages separate and should yield a more meaningful error message.

$APT_DISABLE_COMBINATION - this will turn off DataStage's attempt at optimizing your code for better performance, and will give you better diagnostic information.

How to read Score Dump format in DataStage

Reading a Score Dump


The dump score contains two sections -- the data sets (DS) and the operators (OP).
Data sets - The data sets that are listed in the score are the same type of data sets that you create with the Data Set stage -- in this context, they are temporary memory and/or disk storage during the job's run.
Operators - Operators are individual parallel engine stages that you might see on the user interface.

In a typical job flow, operators are end-points, and data sets are the links between the operators. (An exception is when data sets are used to actually output to a file.)

Each and every link on the job design is potentially a data set. However, unlike the data set stage which you put in your resource disk group by using the specified node pool within the job's configuration file (APT_CONFIG_FILE), these data sets are in memory. These temporary data sets are only placed in the scratch disk space when an imposed limit is reached. A limit can be imposed due to environment settings, or due to physical memory limitations.

Each operator listed in the score spawns a number of processes that depend on:
--> the job's configuration file (APT_CONFIG_FILE), constrained by the node pool settings
--> the operator configuration in the parallel engine code
--> several environment variables, such as APT_DISABLE_COMBINATION, being set/unset

First, let us focus on the operators, which are listed after the data sets in the score:

op0[1p] {(sequential PacifBaseMCES)
on nodes (
node1[op0,p0]
)}
op1[4p] {(parallel RemDups.IndvIDs_in_Sort)
on nodes (
node1[op1,p0]
node2[op1,p1]
node3[op1,p2]
node4[op1,p3]
)}

In the preceding example, the two operators are op0 and op1. The operator name is prefixed with the code name "op" and appended with an incremental numeric value starting with zero (0). Next to the operator name there is a bracketed value followed by the letter "p", for example "[1p]". The value indicates the number of partitions given to that operator by the engine. The first operator is given only one (1) partition, and the second operator is given four (4) partitions.

Within the curly brackets, the execution mode ("parallel" or "sequential") and the name of that operator is provided. The operator name is based on the name shown on the parallel canvas in the Designer client. The operator name is not the same as the operator type.

In the preceding example, the first operator is listed as "PacifBaseMCES", which is the stage name in its entirety. The second operator, however, is listed as "RemDups.IndvIDs_in_Sort". The stage name "IndvIDs" is renamed to indicate that the sort triggered by the Remove Duplicates stage (RemDups) occurred.

Following each operator name are the specific nodes that the operator is tagged to run on. In the preceding example, node1 is used for the first operator, and node1, node2, node3, and node4 are used for the second operator. The node names are defined in your configuration file (APT_CONFIG_FILE).


Now let us focus on the data sets:

ds0: {op0[1p] (sequential PacifBaseMCES)
eOther(APT_ModulusPartitioner { key={ value=MBR_SYS_ID }
})<>eCollectAny
op1[4p] (parallel RemDups.IndvIDs_in_Sort)}
ds1: {op1[4p] (parallel RemDups.IndvIDs_in_Sort)
[pp] eSame=>eCollectAny
op2[4p] (parallel RemDups)}

The name of the data set is provided first. Within the curly brackets, three parts are shown:
--> the source of the data set - operator 0, sequential PacifBaseMCES
--> the activity of the data set - operator 1, parallel RemDups.IndvIDs_in_Sort
--> the target of the data set - operator 2, parallel RemDups

In the example for the first data set, you see "eOther" and "eCollectAny". These are the input (partitioning) method and the target read method. The second method indicates how the receiving operator collects the data.

In this example, "eOther" is the originating or input method for op0. It is an indication that something else is being imposed outside the expected partitioning option (and that you need to observe the string within the parenthesis -- APT_ModulusPartitioner in this example -- Modulus partitioning is imposed.)

"eCollectAny" is the target read method. Any records that are fed to this data set are collected in a round robin manner. The round robin behavior is less significant than the behavior that occurs for input partitioning method, which is eOther(APT_ModulusPartitioner) for ds0.

In the first example in this document, where the operator and stage use the APT_SortedMergeCollector for ds9, the "eCollectOther" method indicates where actual partitioning occurs and is usually specified when you are referencing a sequential flat file. Shown again, in part, is the example:

ds8: {op8[4p] (parallel APT_TransformOperatorImplV22S14_ETLTek_HP37FMember_PMR64262_Test1_SplitTran2 in SplitTran2)
eSame=>eCollectAny
op9[4p] (parallel buffer(1))}
ds9: {op9[4p] (parallel buffer(1))
>>eCollectOther(APT_SortedMergeCollector { key={ value=MBR_SYS_ID,
subArgs={ asc }

The symbols between the originating partitioning method and the target read method translate to the parallelism of the partitioning. The following is a list of the symbols and their definitions:

-> Sequential to Sequential
<> Sequential to Parallel
=> Parallel to Parallel ( SAME)
#> Parallel to Parallel ( NOT SAME)
>> Parallel to Sequential
> No Source or Target 

DataStage to Teradata Connector - Briefly explained for beginners


DataStage (DS) has stages that allow you to use FastExport, MultiLoad, FastLoad and TPump. In addition, you can use the Teradata (TD) API stage and ODBC stage for extraction, loading, lookup and manipulation of data. With IBM Information Server (DataStage 8x, the latest version of DataStage), the long-awaited Teradata Connector for the Teradata Parallel Transporter (TPT / Teradata PT) stage joined the TD stage fleet. There is more good news coming with IBM Information Server:
>> Supports TD stored procedures
>> Supports TD macros
>> Supports restart capability and reject links for bulk loads

Utilities which can be used in DataStage:
--> FAST EXPORT
--> FAST LOAD
--> MULTI LOAD (MLOAD)
--> BTEQ (Basic Teradata Query facility)
--> TPUMP
--> TPT / Teradata PT (Teradata Parallel Transporter) [With IBM Information Server / DataStage 8x onwards only]

DataStage Teradata (TD) Native connection stages available
--> TD Enterprise Edition (EE)
--> TD TPUMP
--> TD MLOAD
--> TD API
--> TD Connector for TPT / Teradata PT (Teradata Parallel Transporter) [With IBM Information Server / DataStage 8x onwards only]

DataStage generic ODBC connection available:
--> DataStage Generic ODBC
--> DataStage DRS (DRS is Dynamic RDBMS Stage. It was introduced to allow the likes of PeopleSoft to deploy canned DataStage jobs to their customers: with a DRS you dynamically choose the database connection).

TD API as Source/Lookup/Target:
Uses the TD CLI native programming interface (API). This API lets network-attached systems connect and process SQL statements in the native TD environment. It submits whatever statements you define exactly as you define them, on a per-incoming-row basis, with a COMMIT sent after each "Transaction Size" of rows.

Advantages:
--> Better performance and speed (rows / sec) over generic ODBC stage.
--> Support for TD client
--> Simplified configuration on UNIX platforms.
--> Adv. support for target table DDL (i.e. For create and drop)
--> Native metadata import support (Ability to import table metadata and store in DS repository)
--> Reject rows handling

Disadvantages:
--> Does not support Non-ANSI SQL statements in stage generated SQL statements.
--> Does not support byte data types.
--> Does not generate TD version specific SQLs as stage generated SQL statements.
--> Does not support 'like' in the user defined sql when used as lookup.

TD API also has the option to run in parallel mode. One can write/insert records using TD API in parallel mode, and it gives much better performance than using it in sequential mode. When used in parallel mode, you must set the partitioning property to HASH and select all the columns of the target table's Primary Index for partitioning. You might get occasional blocks due to Teradata hash synonyms only, but this should be pretty minimal on most low-volume loads. If you plan to implement a restart strategy in a job that uses TD API for loading data, it is advised to write as UPSERT, since there is no cleanup. The API will generate upsert code automatically, but it will be 2 SQL statements, not an atomic upsert. If you want an ATOMIC upsert you will have to take the auto-generated code and modify it slightly to the ATOMIC upsert syntax (see the sketch after the note below).
Note: Upsert strategy - if most incoming rows already exist in the DB, it is faster to update first (update existing, else insert new) than to insert first (insert new, else update existing), and vice versa.
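
To make the difference concrete, here is a minimal JDBC sketch - an illustration only, not what the TD API stage runs internally. It performs the generated-style two-statement upsert row by row, issues a COMMIT after every "Transaction Size" rows, and notes the atomic Teradata UPDATE ... ELSE INSERT alternative in a comment. The table TGT, its columns, the connection details and the 1000-row commit interval are assumptions made for this example.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class UpsertSketch {
    private static final int TRANSACTION_SIZE = 1000;   // hypothetical commit interval

    public static void main(String[] args) throws SQLException {
        // Hypothetical connection details.
        Connection con = DriverManager.getConnection(
                "jdbc:teradata://tdhost/DATABASE=mydb", "user", "password");
        con.setAutoCommit(false);

        // Two-statement upsert (the auto-generated style): UPDATE, then INSERT if nothing matched.
        PreparedStatement upd = con.prepareStatement("UPDATE TGT SET COL2 = ? WHERE COL1 = ?");
        PreparedStatement ins = con.prepareStatement("INSERT INTO TGT (COL1, COL2) VALUES (?, ?)");

        Object[][] rows = { { 1, "a" }, { 2, "b" } };    // stand-in for rows from the input link
        int uncommitted = 0;
        for (Object[] r : rows) {
            upd.setObject(1, r[1]);
            upd.setObject(2, r[0]);
            if (upd.executeUpdate() == 0) {              // no existing row, so fall back to INSERT
                ins.setObject(1, r[0]);
                ins.setObject(2, r[1]);
                ins.executeUpdate();
            }
            if (++uncommitted >= TRANSACTION_SIZE) {     // COMMIT after each "Transaction Size" rows
                con.commit();
                uncommitted = 0;
            }
        }
        con.commit();

        // Atomic alternative: a single Teradata statement does the same work in one round trip:
        //   UPDATE TGT SET COL2 = :v WHERE COL1 = :k ELSE INSERT INTO TGT (COL1, COL2) VALUES (:k, :v);
        con.close();
    }
}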

Findings:
-->Data loading modes supported - Insert/Delete/Upsert/Append
-->Uses a Teradata Utility - No
-->If Yes, Utility used - NA
-->Parallel Features of DataStage Supported - Yes (Conditional)
-->Runs in sequential or Parallel Mode - Both
-->Reject rows handling inside DataStage - Yes
-->Before and After SQL execution facility - Yes
-->Lock strategy (Row/Page/Table) - Row
-->Uses Temp/Work tables - No
-->Uses Error Tables - No
-->Ability to control job based on rows rejected to error table - No
-->Ability to write output to files - No
-->Uses named pipes - No
-->Check Point restart support - No
-->Direct loading support - No (No TD Utility used)
-->Can be used as look up stage? - Yes
-->Can be used for sparse lookup - No

Tips:
* For best performance, TD client should NOT be the same machine where TD server is installed.
* Take note of timestamp format used in TD - "%yyyy-%mm-%dd %hh%nn%ss" while doing date conversions.
* The ODBC stage is the only stage which allows you to do a sparse lookup on Teradata tables.

TD EE as Source/Lookup/Target:
When used as Source, it calls FASTEXPORT (TD Utility) and when used as Target, it calls FASTLOAD (TD Utility). Number of sessions created is governed by the RequestedSessions and SessionsPerPlayer options in the stage.
TD EE (Source/Lookup) invokes FastExport, which produces one answer set. All the parallel processing is performed on the TD server, not in DataStage. FastExport cannot handle selects that use a unique key constraint that would return only one row, so you have to be careful when using unique key constraints. This is set on the TD side and not by DataStage. If you use this 2 or 3 times for source tables/lookups in a job and you run a couple of jobs at once, then you have invoked too many FastExports. For explicit exports or big data pulls, TD EE can work fine, or you can use the MLOAD stage (which runs FastExport when used as a source stage).

TD EE (Target) invokes FastLoad for bulk loading into TD tables. FastLoad does not support secondary indexes in Teradata. The TD EE stage will create a work table when an append to a Teradata table is selected for the job. This is because FastLoad must load into an empty table, which is impossible with an append operation. To get around this, DataStage FastLoads to a generated empty work table and then does an insert into (select * from work table) on the database side. The Append option will create an empty TD table with ALL fields but NO defaults. It generates this empty table from the Teradata metadata, NOT your DataStage job metadata. Also, unknown columns are replaced with null.

Ex:
Incoming columns are col1, col2, col3
Target table columns are col1, col2, col3, col4, col5 with col4 default value as 0 and col5 as current_timestamp.
Step 1:
Creation of orch_work table with:

CREATE TABLE ORCH_WORK_xxxxx AS ( SELECT * FROM TargetTable ) WITH NO DATA PRIMARY INDEX (col1)
Note: The orch_work table doesn't keep default values, i.e. the orch_work table also has col4 and col5, but with no default values.

Step 2:

Incoming records are loaded into orch_work with the values col1, col2, col3, null, null:
INSERT INTO ORCH_WORK_xxxxx VALUES (:col1, :col2, :col3, null, null)

Step 3:
Append using Insert command into Target table:
INSERT INTO TargetTable SELECT * FROM ORCH_WORK_xxxxx
Caution: Step 3 will fail if col4 and col5 (or either of them) are set as NOT NULL in the target table. To avoid this, you need to pass col4 and col5 with default values inside the job itself.

FastLoad creates 2 error tables as part of "BEGIN LOADING" (executing drop table statements first for tables of the same names). FastLoad drops these 2 tables if they are empty as part of the "END LOADING" transaction. In case of a problem (a duplicate primary index key, for example), error table number 2 must be removed manually. The names of the two tables are ERR_cookie_1 and ERR_cookie_2; the cookie is found in the terasync table. In that table, you may use the start time and end time (integer) fields to help find the last ones inserted. You cannot modify these error table names.
Fastloading delivers very high performance with only two constraints: duplicate rows are silently dropped and error detection/correction is weak.
Findings:
-->Data loading modes supported - Insert/Delete/Upsert/Append
-->Uses a Teradata Utility - Yes
-->If Yes, Utility used - FastExport and FastLoad
-->Parallel Features of DataStage Supported - Yes
-->Runs in sequential or Parallel Mode - Parallel
-->Reject rows handling inside DataStage - No(7x) & Yes(8x)
-->Lock strategy (Row/Page/Table) - Table
-->Uses Temp/Work tables - Yes
-->Uses Error Tables - Yes
-->Ability to control job based on rows rejected to error table - Yes
-->Ability to write output to files - Yes
-->Uses named pipes - Yes
-->Any related APT or DS Parameters that can change/control the functionality - Yes
-->Direct loading support - Yes (TD Utility used)
-->Can be used as lookup stage - Yes
-->Before and after sql (open and close command) option available - Yes
-->Can be used as for sparse lookup - No

Tips:
1) The TD EE stage creates a special terasync table in the source database, and if you don't have create and write privileges in that database, you will encounter an error. The workaround is to point the TD EE stage to write the terasync table somewhere else in the database where you have enough privileges.

This is done by setting a couple of environment parameters:
APT_TERA_SYNC_DAEMON_PROG = [server name for terasync table]
APT_TERA_SYNC_DATABASE = [db name for terasync table]
APT_TERA_SYNC_USER = [user name used for terasync table]
APT_TERA_SYNC_PASSWORD = [password for user used to access terasync table]

2) **There is a relation between SessionsPerPlayer, the number of nodes, and the resulting Teradata sessions generated. For example, with SessionsPerPlayer=2 you will have 16 Teradata sessions on a 32-AMP Teradata system. RequestedSessions is completely independent of the number of nodes in the DataStage configuration file; it depends on the number of vprocs or AMPs in the Teradata system. A 168-AMP system will create 84 sessions using the defaults (SessionsPerPlayer = 2) regardless of the EE configuration file. In a high-volume environment, try to balance performance and Teradata sessions by tuning SessionsPerPlayer.

3) If TENACITY is not set with a timeout limit, a process that cannot get a Teradata session will abort immediately.

4) Restart on TD EE basically works by starting over from the beginning. It does not clean up the old FastLoad tables either, so an insert strategy should include steps to take care of this scenario.

5) If you have a large TD instance (a lot of VPROCs), set RequestedSessions to something more manageable to keep repartitioning to a minimum. Repartitioning the data is expensive and can cause various resource issues.
Set the right RequestedSessions in the Advanced connection options of the stage to keep this to a minimum. The TD operator uses two values to determine how many players to start - RequestedSessions and SessionsPerPlayer. By default, if no value is set for RequestedSessions, it looks at the number of VPROCs used for the table, and SessionsPerPlayer defaults to 2 (one read session and one write session). Here's a rule of thumb:

RequestedSessions / SessionsPerPlayer = number of player processes (which works best when it is the same value as the number of nodes in the config file)
Setting up these values will definitely improve startup time and runtime.

6) For some actions the Teradata Enterprise stage lets you put in open and close commands that run at the start or end of the DataStage job.

Teradata FastLoad features
--> Target table must be empty
--> Insert only (no updates)
--> Only one target table
--> Must drop secondary indexes before loading and recreate them afterward
**Also note: For FastLoad, IDENTITY columns are unsupported

TD Mload as Source/Lookup/Target

Multiload (as source) uses Fast export utility.

Multiload (as target) is very efficient when you are doing maintenance activities on multiple large tables. In one pass, the Multiload stage can perform inserts/updates on up to 5 different tables. Work tables and error tables are created each time you perform an operation using Multiload; they are automatically dropped once the job has run successfully. However, if the job aborts, the work tables have to be dropped manually before the job is run again.

Mload Options - Load Method:

Invoke Multiload - Mload is invoked automatically when the job runs. The stage creates named pipes to transmit data to Mload and then starts the Mload process. The stage allows 720 seconds (12 minutes) to write to the named pipe (i.e., the acquisition phase) and then kicks off the load process; otherwise it fails the job. MultiLoad places locks on the entire table.
You can always change this phase time by changing:

DS_TDM_PIPE_OPEN_TIMEOUT

Manual - Data is stored in a .dat file with the name you specified in the stage. In this case you can execute an Mload script independent of the job, pointing to this .dat file.

Teradata MultiLoad features
--> Target tables can be populated regardless of whether or not they are empty
--> Allows inserts, updates, deletes, and upserts
--> Loads into multiple target tables
--> Allows non-unique secondary indexes — automatically rebuilds them after loading

TD TPUMP as Source/Lookup/Target
TPump is a highly parallel utility designed to continuously move data from data sources into Teradata. TPump is typically used for loading a small quantity of records relative to the size of your target table. TPump works at the row level, whereas MultiLoad and FastLoad update whole blocks of data. TPump allows us to load data into tables with referential integrity, which MultiLoad doesn't. TPump only needs to take row-level locks; in other words, TPump only places a lock upon the row it is modifying. In contrast, MultiLoad places locks on the entire table. If you need multiple processes updating a table simultaneously, TPump may be the better solution. TPump also uses fewer system resources, so you can run it concurrently with user queries without impacting system performance. TPump lets you control the rate at which updates are applied. You can dynamically "throttle down" when the system is busy and throttle up when activity is lighter.
TPump does not take a utility slot. TPump is designed for "trickle feed", taking individual row-level locks. If you use TPump, you need to make sure that you follow normal TPump standards (set the KEY statements equal to the PI and turn SERIALIZE on). TPump is typically used for processes that are constantly retrieving and processing data, such as from a message queue. Using TPump offers a controllable transition to updating that is closer to real time. It is the best fit for "active", "real-time" or "closed-loop" data warehousing.

Findings:
-->Data loading modes supported - Insert/Delete/Upsert/Append
-->Uses a Teradata Utility - TPump does not take a utility slot.
-->If Yes, Utility used - NA
-->Parallel Features of DataStage Supported - Yes
-->Direct loading support - No
-->Can be used for sparse lookup - No

Tips:
* Feeding clean data from your transformation processes into TPump is important for overall performance; data with errors makes TPump slow (roughly 40% longer runtime). Fallback also lengthens runtime (roughly +45%), SERIALIZE adds 30% and ROBUST adds 15%.
* Ask your DBA to assign the TPump user to a higher priority performance group when the TPump job runs at the same time as decision support queries, if the TPump completion time is more critical than the other work active in the system.
* It uses time-based checkpoints, not count-based.
* Does not support MULTI-SET tables.
* Can have Secondary Indexes and RI on tables.

TD Connector for TPT / Teradata PT(Teradata Parallel Transporter)
TPT is an object-oriented client suite that executes multiple instances of data extraction, transformations, and loading functions in a scalable, high-speed, parallel-processing environment.

Main features:
--> All-in-one! Single infrastructure for all loading/unloading needs using single scripting language.
--> Greatly reduces the amount of file I/O and significantly improves performance.
--> Push up and Push down features.
--> Provides unlimited symbolic substitution for the script language and application programming interface (API).
--> Combines functionality of TD FastLoad, MultiLoad, FastExport, TPump and API.
--> Respective modules or operators for the protocols of fastload, multiload, tpump and fastexport are named as Load, Update, Stream and Export.
-->Apart from the four operators, on the API & ODBC front there are operators like Selector, Inserter and more.

Think of TD, think of one stage - TD Connector for TPT stage.
Two access methods/modes of execution are available in IBM Information Server (DataStage 8) for this stage:

1) Immediate - For Selector, Inserter... operators. Same as CLI interface/API. (No Load/Unload Utilities).
2) Bulk - For Load, Update, Stream and Export operators (i.e. fastload, multiload, tpump and fastexport respectively).

Remember to install this utility (TPT) on the DataStage server for using the range of operators it supports [Same case as most of the other TD utilities].

DataStage to Database Connector Stages



As a general guideline, new projects should give preference to Connector stages, and take advantage of existing Enterprise equivalents in specific cases where these have an edge in terms of performance.

a.     ODBC Connector:
The ODBC Connector has the following features:

1.     Source, target and lookup context:
The ODBC Connector can be used like the typical ODBC stage for extracting (in the source context), loading (in the target context) and lookups (in the lookup context). The lookup context is similar to the source context, with the difference that the SELECT SQL statements are parameterized. The parameter values are provided dynamically on the input link by the other stages in the job.

2.     Reject links:
The ODBC Connector supports a special type of link called reject link. The Connector can be configured to direct the data that it cannot process to the reject link, from which it can be sent to any stage in the job.

3.     Passing LOBs by reference:
The Connector allows the option of passing LOBs by reference, rather than by extracting the data and passing it inline into the job flow. When configured to pass LOB values by reference, the ODBC Connector assembles a special block of data, called a locator or reference, that it passes into the job dataflow. Other Connectors are placed at the end of the job dataflow. When the LOB locator arrives at the other Connector stage, the Connector framework initiates the retrieval of the actual ODBC data represented by the reference and provides the data to the target Connector so that it can be loaded into the represented resource. This way it is possible to move LOB data from one data resource to another where the size of the data is measured in megabytes and gigabytes, without having to move the actual data through the job. The drawback is that the data passed this way cannot be altered by the intermediate stages in the job.

4.     Arrays:
The Connector supports array insert operations in the target context. The Connector buffers the specified number of input rows before inserting them into the database in a single operation. This provides better performance when inserting large numbers of rows (see the JDBC-style sketch after this feature list).

5.     SQL Builder
Users design the SQL statements with the SQL Builder tool.

6.     Pre/post run statements:
The Connector provides a mechanism for executing SQL statements that typically are used to initialize the database objects (for example to create a new table or truncate the existing table before inserting data into the table).

7.     Metadata import:
The Connector supports metadata operations, such as the discovery of database objects and describing these objects.
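
As a rough analogy (not the Connector's actual implementation), the array behaviour of item 4 above resembles a client-side JDBC batch insert: rows are buffered and then sent to the database in one operation. In the sketch below, the table EMP, its columns, the connection details and the array size of 500 are assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class ArrayInsertSketch {
    public static void main(String[] args) throws SQLException {
        final int arraySize = 500;                       // analogous to the Connector's array size
        // Hypothetical connection details.
        Connection con = DriverManager.getConnection("jdbc:some:database", "user", "password");
        con.setAutoCommit(false);

        PreparedStatement ins = con.prepareStatement(
                "INSERT INTO EMP (EMP_ID, EMP_NAME) VALUES (?, ?)");

        int buffered = 0;
        for (int i = 1; i <= 2000; i++) {                // stand-in for rows arriving on the input link
            ins.setInt(1, i);
            ins.setString(2, "name-" + i);
            ins.addBatch();                              // buffer the row on the client side
            if (++buffered == arraySize) {
                ins.executeBatch();                      // send the whole buffered array in one operation
                buffered = 0;
            }
        }
        if (buffered > 0) {
            ins.executeBatch();                          // flush the final, partially filled array
        }
        con.commit();
        con.close();
    }
}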



b.    WebSphere MQ Connector:
WebSphere MQ Connector provides access to message queues in the WebSphere MQ enterprise messaging system. It provides the following types of support:
WebSphere MQ versions 5.3 and 6.0, and WebSphere Message Broker 6.0 for the publish/subscribe mode of work.
1.     MQ client and MQ server connection modes. The choice can be made dynamically through the special connection property in the Connector’s stage editor.
2.     Filtering of messages based on various combinations of message header fields. Complex filtering conditions might be specified.
3.     Synchronous (request/reply) messaging. This scenario is configured by defining both input link (for request messages) and output link (for reply messages) for the Connector’s stage.
4.     Publish/Subscribe mode of work. Both the WebSphere MQ broker (with MQRFH command messages) and WebSphere Message Broker (with MQRFH2 command messages) are supported. The Connector stage can be configured to run as a publisher or as a subscriber. Dynamic registration and deregistration of publisher/subscriber is supported.
5.     MQ dynamic queues, name lists, transmission queues and shared cluster queues for remote queue messaging.
6.     Designated error queues and standard reject links.

c.     Teradata Connector:
The Teradata Connector includes the following features similar to the ODBC Connector:
Source, target and lookup context
Reject links
Passing LOBs by reference
Arrays
SQL Builder
Pre/post run statements
Metadata import
The Teradata Connector uses the CLIv2 API for immediate operations (SELECT, INSERT, UPDATE, DELETE) and the Parallel Transporter Direct API (formerly TEL-API) for bulk load and bulk extract operations. Parallel bulk load is supported through the LOAD, UPDATE, and STREAM operators in Parallel Transporter. This corresponds to the functionality provided by the FastLoad, MultiLoad, and TPump Teradata utilities, respectively. When the UPDATE operator is used, it supports the option for deleting rows of data (MultiLoad delete task). Parallel bulk export is supported through the EXPORT operator in Parallel Transporter. This corresponds to the functionality provided by the FastExport Teradata utility. The Connector persists the bulk-load progress state and provides sophisticated support for restarting failed bulk-load operations. The Connector uses a designated database table for synchronization of distributed Connector instances in the parallel bulk load. Limited support for stored procedures and macros is also available.
a.     Teradata Connector advantages:
The following list details the Teradata Connector advantages:
Parallel MultiLoad capability including MPP configurations
Parallel immediate lookups and writes
Array support for better performance of immediate writes
Reject link support for lookups and writes on DS Enterprise Edition
Reject link support for bulk loads
Cursor lookups (lookups that return more than one row)
Restart capability for parallel bulk loads
MultiLoad delete task support
Support for BLOB and CLOB data types
Reject link support for missing UPDATE or DELETE rows
Error message and row count feedback for immediate lookups/writes.

b.    Parallel synchronization table:
The following list details the Parallel Synchronization Table properties:
Used for coordination of player processes in parallel mode
Now optional, connector runs sequentially if not specified
Can be used for logging of execution statistics
Connector stage can use its own sync table or share it
Primary key is SyncID, PartitionNo, StartTime
Each player updates its own row, no lock contention
Management properties for dropping, deleting rows.

c.      Comparison with existing Teradata stages:
The following list details comparisons with Teradata stages:
Limited support for stored procedures and macros, but the Stored Procedure plug-in is still better suited for it.
No more utilities, named pipes, control scripts, or report files.
Interface with the Parallel Transporter is through a direct call-level API.
Error messages are reported in the DataStage Director log.
MultiLoad plug-in jobs that use advanced custom script features cannot be migrated to use the Teradata Connector.
Number of players is determined by the PX Engine config file.
The following table lists advantages and disadvantages of the Parallel Transporter operators.

Operator | Equivalent Utility | Advantages | Disadvantages
Export   | FastExport         | Fastest export method. | Uses utility slot; no single-AMP SELECTs.
Load     | FastLoad           | Fastest load method. | Uses utility slot; INSERT only; locks table; no views; no secondary indexes.
Update   | MultiLoad          | INSERT, UPDATE, DELETE; views; non-unique secondary indexes. | Uses utility slot; locks table; no unique secondary indexes; table inaccessible on abort.
Stream   | TPump              | INSERT, UPDATE, DELETE; views; secondary indexes; no utility slot; no table lock. | Slower than the UPDATE operator.

d.    DB2 Connector
The DB2 Connector includes the following features similar to the ODBC Connector:
Source
Target and lookup context
Reject links
Passing LOBs by reference
Arrays
SQL Builder
Pre/post run statements
Metadata import
Supports DB2 version V9.1

The Connector is based on the CLI client interface. It can connect to any database cataloged on the DB2 client. The DB2 client must be collocated with the Connector, but the actual database might be local or remote to the Connector.
Separate sets of connection properties are provided for the job setup phase (conductor) and the execution phase (player nodes), so the same database might be cataloged differently on conductor and player nodes.

The Connector provides support for the following tasks:
1.     Specifying DB2 instance dynamically (through connection properties), which overrides the default environment settings (DB2INSTANCE environment variable).

2.     XML data type in DB2 V9.1.
3.     DB2 DPF. A job with a DB2 Connector target stage might be configured to assign one execution player node to each DB2 partition, and to write data to the partitioned database in parallel, providing a dramatic performance improvement over sending the data to the same partition node and forcing DB2 to redirect data to the corresponding partitions.
4.     DB2 bulk load functionality. The invocation of bulk load is done through the CLI interface. Parallel bulk load is also supported for DB2 with DPF.
New features
In terms of functionality, the DB2 Connector offers more capabilities than all of
the existing stages. Specifically it provides support for the following elements:

1.     XML data type
2.     LOBs, with data passed either inline, or by a reference mechanism. The latter allows for LOB data of any size to be moved from a source to a target.
3.     Client-server access to DB2 server. This overcomes a limitation with the EE stage that can only be used in homogenous environments where DataStage and DB2 are on identical (or the same) servers.
4.     More options to control bulk load than the EE operator.
5.     Design time capabilities, such as metadata import to the common model, and enumeration of server-side information.
Using rejects with user-defined SQL
Using rejects with user-defined SQL (UserSQL) is described in this section.


The following list describes the use of UserSQL without the reject link:
All statements in the UserSQL property are either passed all of the input records in the current batch (as specified by the Array size property) or none. In other words, events in previous statements do not control the number of records passed to the statements that follow.
If FailOnError=Yes, the first statement that fails causes the job to fail and the current transaction, as depicted by the Record Count property, is rolled back. No more statements from the UserSQL property are executed after that. For example, if there are three statements in the property, and the second one fails, it means that the first one has already executed, and the third one is not executed. None of the statements have its work committed because of the error.
If FailOnError=No, all statements still get all the records but any statement errors are ignored and statements continue to be executed. For example, if there are three statements in the UserSQL property and the second one fails, all three are executed and any successful rows are committed. The failed rows are ignored.

The following list describes the use of UserSQL with the reject link:
All statements in the UserSQL property are either passed all of the input records in the current batch (as specified by the Array size property) or none. In other words, events in previous statements do not control the number of records passed to the statements that follow.
All the rows in each batch (as specified by the Array size property) are either successfully consumed by all statements in the UserSQL property, or are rejected as a whole. This is important to preserve the integrity of the records processed by multiple statements that are expected to be atomic and committed in the same transaction. In other words, the connector tries to eliminate the possibility of having each statement successfully consume a set of rows.
If any of the rows in any statement in the UserSQL property are not consumed successfully, the processing of the current batch is aborted and the whole batch of records is sent to the reject link. The statements that follow the failed statement are not executed with the current batch. The processing resumes with the next batch.
To preserve the consistency of records in each batch, the connector forces a commit after every successful batch and forces a rollback after every failed batch. This means the connector overrides the transactional behavior specified by the Record Count property.
e.     Oracle Connector
The Oracle Connector includes these features similar to the ODBC Connector:
Source
Target and lookup context
Reject links
Passing LOBs by reference
Arrays
SQL Builder
Pre/post run statements
Metadata import
In addition, it supports bulk loads and Oracle partitioning.
The Connector works with Oracle versions 10g and 11g. It supports connecting to an Oracle database through Oracle Full Client or Oracle Instant Client.
The following list details improvements made to the Oracle Connector:

·         Distributed transactions:
Support for guaranteed delivery of transactions arriving in form of MQ messages. In case of success, the messages are processed by the job and the data is written to the target Oracle database. In case of a failure the messages are rolled back to the queue.
To use the Distributed Transaction stage, you need MQ 6.0 and Oracle 10g R2.

·         Built-in Oracle scalar data types are supported, including BLOB, CLOB,NCLOB, BFILE, LONG, and LONG RAW data types.

·         XMLType columns and object tables are supported.
·         PL/SQL anonymous blocks with bind parameters are supported.
·         Support for a rich set of options for configuring reject links; reject links are also supported in bulk load mode.

·         Pre- and Post- SQL operations are supported:
SQL statement, multiple SQL statements or PL/SQL anonymous block might be specified.
The statements might be configured to run at the job or at the node level.
The statements might be specified in the stage UI or in external files.

·         Rich metadata import functionality:
Table selection based on table type (table, view, IOT, materialized view, external table, synonym), the table owner or the table name pattern.
Supports importing PK, FK, and index information.

·         Table action:
Performing Create, Replace, or Truncate table actions in the job is supported before writing data to the table.
Input link column definitions are automatically used to define target table columns.

·         Fast read, write, and bulk load operations:
Performance improvement over Oracle Enterprise stage.
Rich set of options for configuring parallel read and write operations.
Provides control over the Transparent Application Failover (TAF) mechanism in environments such as Oracle RAC.
Includes storing TAF notifications in the job log to inform the user about the failover progress.
·         Oracle or OS authentication is supported for the connector at job runtime.
f.      Essbase connector
The Essbase connector supports the extraction, delta extraction, and load of data to and from Essbase databases. It performs hierarchical to relational mapping of cube data. The connector supports parallel read and writes.
It is implemented in C++ using a 3rd party interface library from a partner, and uses DMDI to allow selection from cube data.