Monday, September 1, 2014

How to read Score Dump format in DataStage

Reading the score dump


The dump score contains two sections -- the data sets (DS) and the operators (OP).
Data sets - The data sets that are listed in the score are the same type of data sets that you create with the Data Set stage -- in this context, they are temporary memory and/or disk storage during the job's run.
Operators - Operators are individual parallel engine stages that you might see on the user interface.

In a typical job flow, operators are end-points, and data sets are the links between the operators. (An exception is when data sets are used to actually output to a file.)
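If you do not yet have a score to read, the usual way to get one is to set APT_DUMP_SCORE before the job runs (for example, as a project-level or job-level environment variable); the score is then written to the job log at run time.

APT_DUMP_SCORE=True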

Each and every link on the job design is potentially a data set. However, unlike the Data Set stage, which writes to the resource disk group selected by the node pool specified in the job's configuration file (APT_CONFIG_FILE), these data sets live in memory. They are placed in scratch disk space only when an imposed limit is reached; a limit can be imposed by environment settings or by physical memory limitations.
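For reference, the following is a minimal sketch of a one-node entry in a configuration file; the host name and paths are hypothetical. The resource disk entry is where persistent data sets are written, and the resource scratchdisk entry is where these temporary data sets spill when a limit is reached.

{
  node "node1" {
    fastname "etl_host"
    pools ""
    resource disk "/data/datasets" {pools ""}
    resource scratchdisk "/data/scratch" {pools ""}
  }
}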

Each operator listed in the score spawns a number of processes that are dependent on:

- the job's established configuration file (APT_CONFIG_FILE), constrained by the node pool settings
- the operator configuration in the parallel engine code
- several environment variables, such as APT_DISABLE_COMBINATION, being set or unset (example settings are shown below)
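As an illustration, these are typical job-level environment settings that change how many processes the score reports; the configuration file path is hypothetical, while the variable names are standard parallel engine variables.

APT_CONFIG_FILE=/opt/IBM/InformationServer/Server/Configurations/4node.apt
APT_DISABLE_COMBINATION=True

When APT_DISABLE_COMBINATION is set to True, stages are not combined into shared processes, so the score usually lists more operators, and therefore more processes, than it would otherwise.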
First, let us focus on the operators, which are listed after the data sets in the score:

op0[1p] {(sequential PacifBaseMCES)
on nodes (
node1[op0,p0]
)}
op1[4p] {(parallel RemDups.IndvIDs_in_Sort)
on nodes (
node1[op1,p0]
node2[op1,p1]
node3[op1,p2]
node4[op1,p3]
)}

In the preceding example, the two operators are op0 and op1. The operator name is the prefix "op" followed by an incremental numeric value starting with zero (0). Next to the operator name is a bracketed value followed by the letter "p", for example "[1p]". The value indicates the number of partitions that the engine gives to that operator. The first operator is given only one (1) partition, and the second operator is given four (4) partitions.

Within the curly brackets, the execution mode ("parallel" or "sequential") and the name of that operator are provided. The operator name is based on the name shown on the parallel canvas in the Designer client. The operator name is not the same as the operator type.

In the preceding example, the first operator is listed as "PacifBaseMCES", which is the stage name in its entirety. The second operator, however, is listed as "RemDups.IndvIDs_in_Sort". Here the name "IndvIDs" is renamed to "IndvIDs_in_Sort" to indicate that a sort process triggered by the Remove Duplicates stage (RemDups) occurred.

Following each operator name are the specific nodes that the operator is tagged to run on. In the preceding example, node1 is used for the first operator, and node1, node2, node3, and node4 are used for the second operator. The node names are defined in your configuration file (APT_CONFIG_FILE).
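One more pattern worth recognizing: when operator combination is left enabled (APT_DISABLE_COMBINATION is not set), several stages can run in a single process, and the score groups them under an APT_CombinedOperatorController entry. The fragment below is a sketch of that general shape; the stage names are hypothetical.

op2[4p] {(parallel APT_CombinedOperatorController:
      (Filter_Claims)
      (Trans_Claims)
    ) on nodes (
      node1[op2,p0]
      node2[op2,p1]
      node3[op2,p2]
      node4[op2,p3]
    )}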


Now let us focus on the data sets:

ds0: {op0[1p] (sequential PacifBaseMCES)
eOther(APT_ModulusPartitioner { key={ value=MBR_SYS_ID }
})<>eCollectAny
op1[4p] (parallel RemDups.IndvIDs_in_Sort)}
ds1: {op1[4p] (parallel RemDups.IndvIDs_in_Sort)
[pp] eSame=>eCollectAny
op2[4p] (parallel RemDups)}

The name of the data set is provided first. Within the curly brackets, there are three stages:
- the source of the data set - operator 0, sequential PacifBaseMCES
- the activity of the data set - operator 1, parallel RemDups.IndvIDs_in_Sort
- the target of the data set - operator 2, parallel RemDups

In the example for the first data set, you see "eOther" and "eCollectAny". These are the input (partitioning) method and the target read method. The second method indicates how the receiving operator collects the data.

In this example, "eOther" is the originating or input method for op0. It indicates that something outside the expected partitioning options is being imposed, and that you need to look at the string within the parentheses -- APT_ModulusPartitioner in this example, meaning modulus partitioning is imposed.

"eCollectAny" is the target read method. Any records that are fed to this data set are collected in a round robin manner. The round robin behavior is less significant than the behavior that occurs for input partitioning method, which is eOther(APT_ModulusPartitioner) for ds0.

In the first example in this document, where the operator and stage use APT_SortedMergeCollector for ds9, the "eCollectOther" method indicates where the actual collecting occurs; it is usually seen when the data is being funneled down to a sequential target, such as a flat file. Shown again, in part, is the example:

ds8: {op8[4p] (parallel APT_TransformOperatorImplV22S14_ETLTek_HP37FMember_PMR64262_Test1_SplitTran2 in SplitTran2)
eSame=>eCollectAny
op9[4p] (parallel buffer(1))}
ds9: {op9[4p] (parallel buffer(1))
>>eCollectOther(APT_SortedMergeCollector { key={ value=MBR_SYS_ID,
subArgs={ asc }

The symbols between the originating partitioning method and the target read method translate to the parallelism of the partitioning. The following is a list of the symbols and their definitions:

-> Sequential to Sequential
<> Sequential to Parallel
=> Parallel to Parallel (SAME)
#> Parallel to Parallel (NOT SAME)
>> Parallel to Sequential
> No Source or Target
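Applied to the fragments above: in ds0, the "<>" between eOther(APT_ModulusPartitioner) and eCollectAny marks a sequential operator (op0, [1p]) feeding a parallel one (op1, [4p]); in ds1, the "=>" marks parallel-to-parallel flow that keeps the same partitioning ([pp] eSame); and in ds9, the ">>" marks parallel data being collected down to a sequential consumer through APT_SortedMergeCollector.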
