Monday, September 1, 2014

DataStage Job Execution Flow

When you execute a job, the generated OSH and the contents of the configuration
file ($APT_CONFIG_FILE) are used to compose a "score". This is similar to a SQL
query optimization plan.
At runtime, IBM InfoSphere DataStage identifies the degree of parallelism and
node assignments for each operator, and inserts sorts and partitioners as
needed to ensure correct results. It also defines the connection topology (virtual
data sets/links) between adjacent operators/stages, and inserts buffer operators
to prevent deadlocks (for example, in fork-joins). It also defines the number of
actual OS processes. Multiple operators/stages are combined within a single OS
process as appropriate, to improve performance and optimize resource
requirements.
The job score is used to fork processes with communication interconnects for
data, messages, and control. Processing begins after the job score and
processes are created. Job processing ends when the last row of data is
processed by the final operator, a fatal error is encountered by any operator, or
the job is halted by DataStage Job Control or by human intervention such as
DataStage Director STOP.
Job scores are divided into two sections — data sets (partitioning and collecting)
and operators (node/operator mapping). Both sections identify sequential or
parallel processing.
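
As a rough mental model only, the two sections of a score can be pictured as two lists: one describing the links between operators and how rows are partitioned or collected, and one mapping each operator to the nodes it runs on. The Python sketch below is not the format of a real score dump. The class names, field names, stage names, and node names are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataSet:
    """A link between two operators (data sets section). Hypothetical model."""
    producer: str
    consumer: str
    flow: str  # illustrative label: whether rows are partitioned or collected


@dataclass
class Operator:
    """An operator and the nodes it is mapped to (operators section)."""
    name: str
    nodes: List[str]
    parallel: bool


@dataclass
class Score:
    data_sets: List[DataSet] = field(default_factory=list)
    operators: List[Operator] = field(default_factory=list)

    def summary(self) -> List[str]:
        """Describe each operator as sequential or parallel."""
        lines = []
        for op in self.operators:
            mode = "parallel" if op.parallel else "sequential"
            lines.append(f"{op.name}: {mode} on {len(op.nodes)} node(s)")
        return lines


# Hypothetical three-stage job on a two-node configuration.
score = Score(
    data_sets=[
        DataSet("read_input", "transform", "partition"),
        DataSet("transform", "write_output", "collect"),
    ],
    operators=[
        Operator("read_input", ["node1"], parallel=False),
        Operator("transform", ["node1", "node2"], parallel=True),
        Operator("write_output", ["node1"], parallel=False),
    ],
)

for line in score.summary():
    print(line)
```

Reading a real score dump follows the same idea: check the data sets section for where partitioning or collecting happens, then check the operators section for which nodes each operator landed on.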


The execution framework (orchestra) manages control and message flow across
processes and consists of the conductor node and one or more processing nodes,
as shown in Figure 1-6. Actual data flows from player to player; the conductor
and section leaders are used only to control process execution through control
and message channels.
- Conductor is the initial framework process. It creates the Section Leader (SL)
processes (one per node), consolidates messages to the DataStage log, and
manages orderly shutdown. The Conductor node has the start-up process.
The Conductor also communicates with the players.
Note: You can direct the score to the job log by setting $APT_DUMP_SCORE.
To identify the score dump, look for "main program: This step....".
- Section Leader is a process that forks player processes (one per stage) and
manages up/down communications. SLs communicate between the
conductor and player processes only. For a given parallel configuration file,
one section leader is started for each logical node.
- Players are the actual processes associated with the stages. Each player sends
its stderr and stdout to the SL, establishes connections to other players for
data flow, and cleans up on completion. Each player has to be able to communicate
with every other player. There are separate communication channels
(pathways) for control, errors, messages and data. The data channel does
not go through the section leader or the conductor, as this would limit scalability.
Data flows directly from the upstream operator to the downstream operator.
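
The process hierarchy described above (one conductor, one section leader per logical node, and players per stage) lends itself to a quick back-of-the-envelope count. The sketch below is a simplified estimate, not something DataStage provides. The function name `count_processes` and the sample stage names are hypothetical. It assumes that every parallel operator runs on all logical nodes, that a sequential operator runs on exactly one node, and that the operator list is given after any operator combination has been applied.

```python
from typing import Dict, List, Tuple


def count_processes(node_count: int, operators: List[Tuple[str, bool]]) -> Dict[str, int]:
    """Estimate OS processes for a job under the simplifying assumptions above.

    operators: (name, is_parallel) pairs, listed after any operator combination.
    """
    if node_count < 1:
        raise ValueError("a configuration needs at least one logical node")
    # A parallel operator gets one player per node; a sequential one gets one.
    players = sum(node_count if is_parallel else 1 for _, is_parallel in operators)
    counts = {"conductor": 1, "section_leaders": node_count, "players": players}
    counts["total"] = sum(counts.values())
    return counts


# Hypothetical job: sequential read, parallel transform, sequential write,
# run against a four-node configuration file.
job = [("read_input", False), ("transform", True), ("write_output", False)]
print(count_processes(4, job))
# {'conductor': 1, 'section_leaders': 4, 'players': 6, 'total': 11}
```

In practice the real count can be lower or higher than this estimate, because the engine may combine operators into one process and may insert sorts, partitioners, and buffer operators of its own. The score dump remains the authoritative source.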
