1. How to parameterize if/else conditions in DataStage?
Scenario:
A user has a set of condition statements that the job must pick up at run time to produce its output. The conditions can vary from run to run.
Answer: Load the conditions into a sequential file. Look up that file at run time and use the retrieved conditions in the Modify stage to apply the conditions requested for that run.
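The idea of keeping the conditions in a file and resolving them at run time can be illustrated outside DataStage with a small Python sketch. This is not DataStage code. The file layout (`column,operator,value,result`), the function names and the sample thresholds are all assumptions made for the illustration.

```python
import csv
import io
import operator

# Operators allowed in the (hypothetical) conditions file.
OPS = {"=": operator.eq, "<": operator.lt, ">": operator.gt}

def load_conditions(text):
    """Parse lines of the assumed form: column,operator,value,result."""
    reader = csv.reader(io.StringIO(text))
    return [(col, OPS[op], float(val), result) for col, op, val, result in reader]

def evaluate(row, conditions, default="OTHER"):
    """Return the result of the first matching condition (if / else if / else)."""
    for col, op, val, result in conditions:
        if op(row[col], val):
            return result
    return default

# Stand-in for the sequential file; a different file can be supplied on each run.
conditions_file = "amount,>,1000,HIGH\namount,>,100,MEDIUM\n"
conditions = load_conditions(conditions_file)
labels = [evaluate({"amount": a}, conditions, default="LOW") for a in (5000, 500, 50)]
print(labels)
```

Changing the contents of the conditions file changes the outcome without touching the logic, which is the behaviour the scenario asks for.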
2. How to improve reading of sequential files (fixed-width format only)?
Scenario:
Most DataStage implementations today run parallel jobs against a multi-node configuration file. However, running a job on more nodes does not by itself make a sequential file read any faster.
Answer:
When handling huge volumes of data, the Sequential File stage can itself become one of the major bottlenecks, because reading from and writing to this stage is slow. Do not use sequential files for intermediate storage between jobs: this adds performance overhead, because the data must be converted before it is written to the file and again when it is read back. Use Data Set stages for intermediate storage between jobs instead.
Datasets are key to good performance in a set of linked jobs. They help in achieving end-to-end parallelism by writing data in partitioned form and maintaining the sort order. No repartitioning or import/export conversions are needed.
To read faster from the Sequential File stage, increase the number of readers per node (the default value is one). This means, for example, that a single file can be partitioned as it is read, even though the stage is constrained to run sequentially on the conductor node.
Number of Readers Per Node is an optional property and applies only to files containing fixed-length records. It provides a way of partitioning the data contained in a single file: each node reads a single file, but the file can be divided according to the number of readers per node and written to separate partitions. This method can result in better I/O performance on an SMP (symmetric multiprocessing) system.
Alternatively, a single file can be read by multiple nodes. This is also an optional property and applies only to files containing fixed-length records. Set the Read From Multiple Nodes option to "Yes" to allow individual files to be read by several nodes. This can improve performance on cluster systems.
IBM DataStage knows the number of nodes available and, using the fixed record length and the actual size of the file to be read, allocates to the reader on each node a separate region of the file to process. The regions are of roughly equal size.
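The arithmetic behind this allocation can be sketched in Python. This is an illustration of the principle only, not DataStage's actual implementation: the function name is hypothetical, and the record length is assumed to include any record delimiter so that the file size is an exact multiple of it.

```python
def reader_regions(file_size, record_length, readers):
    """Split a fixed-width file into contiguous (offset, length) byte regions.

    One region is produced per reader, and every region starts and ends
    on a record boundary.
    """
    if record_length <= 0 or readers <= 0:
        raise ValueError("record_length and readers must be positive")
    if file_size % record_length != 0:
        raise ValueError("file size is not a whole number of records")

    total_records = file_size // record_length
    # Give each reader the same number of records; the first `extra`
    # readers take one more so that all records are covered.
    base, extra = divmod(total_records, readers)

    regions = []
    start_record = 0
    for i in range(readers):
        count = base + (1 if i < extra else 0)
        regions.append((start_record * record_length, count * record_length))
        start_record += count
    return regions

# 10 records of 50 bytes each, split across 4 readers.
print(reader_regions(500, 50, 4))
# [(0, 150), (150, 150), (300, 100), (400, 100)]
```

Because every record has the same length, each reader can seek directly to its offset without scanning for delimiters, which is why these options are restricted to fixed-length records.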
The options "Read From Multiple Nodes" and "Number of Readers Per Node" are mutually exclusive.
3. Considerations for improving Sort stage performance
Scenario: I have seen many job designs in which developers, not understanding the basics of how DataStage sorts data, hinder performance instead of improving it. The tips below should help in those scenarios.
Answer:
A sort done in a database is usually much faster than a sort done in DataStage. So, where possible, sort the data as it is read from the database (for example, with an ORDER BY clause in the source query) instead of using a Sort stage or sorting on the input link. This can yield a large performance gain in the job, although a Sort stage cannot always be avoided.
When the database cannot do the sorting, careful job design can still improve the performance of sort operations, both in standalone Sort stages and in on-link sorts specified in other stage types.
If data has already been partitioned and sorted on a set of key columns, specify the "don't sort, previously sorted" option for those key columns in the Sort stage. This reduces the cost of sorting and makes better use of pipeline parallelism. When writing to parallel data sets, sort order and partitioning are preserved. When reading from these data sets, try to maintain this sorting by using the Same partitioning method where possible.
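Why a previously sorted key is cheaper can be shown with a short Python sketch. It is an analogy rather than DataStage code: when rows already arrive ordered by one key, only the rows inside each key group need to be sorted on the remaining key, so no group has to be compared against another. The column names and the helper function are assumptions made for the illustration.

```python
from itertools import groupby
from operator import itemgetter

def sort_within_groups(rows, presorted_key, sort_key):
    """Rows arrive ordered by presorted_key; order each group by sort_key only."""
    result = []
    # groupby relies on the input already being ordered by presorted_key.
    for _, group in groupby(rows, key=itemgetter(presorted_key)):
        result.extend(sorted(group, key=itemgetter(sort_key)))
    return result

# Input is already sorted on "cust" but not on "date".
rows = [
    {"cust": 1, "date": "2020-03-01"},
    {"cust": 1, "date": "2020-01-15"},
    {"cust": 2, "date": "2020-02-10"},
    {"cust": 2, "date": "2020-01-05"},
]
ordered = sort_within_groups(rows, "cust", "date")
print(ordered)
```

Each group can be emitted as soon as its last row has been seen, instead of waiting for the whole input, which is how the option helps pipeline parallelism.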
The stable sort option is much more expensive than non-stable sorts, and should only be used if there is a need to maintain row order other than as needed to perform the sort.
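What a stable sort guarantees can be seen in a minimal Python example. Python's built-in `sorted` is always stable, so this shows only the stable behaviour; a non-stable sort is free to return rows with equal keys in any order. The sample rows are made up for the illustration.

```python
# Each row is (key, arrival_order); the sort uses only the key.
rows = [("B", 1), ("A", 2), ("B", 3), ("A", 4)]

# Rows with equal keys keep their original relative order.
stable = sorted(rows, key=lambda r: r[0])
print(stable)  # [('A', 2), ('A', 4), ('B', 1), ('B', 3)]
```

If nothing downstream depends on the arrival order of rows that share a key, that guarantee is not needed and the cheaper non-stable sort is sufficient.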
The performance of individual sorts can be improved by increasing the memory available per partition using the Restrict Memory Usage (MB) option of the Sort stage. The default setting is 20 MB per partition. Note that sort memory usage can be specified only for standalone Sort stages; it cannot be changed for inline (on-link) sorts.