Thursday, January 29, 2015

Converting DataStage jobs into services via remote SSH invocation

In this post I will explain how to use a DataStage job as a service via remote SSH invocation.


Task:

Decompress the files, read the data, transform it, and write it to an output file.

The file itself should not be transmitted over the network, but the data can be streamed.

Here is a simple implementation:

On the remote DataStage server:
1) Write a job that takes an input file name and an output file name as parameters.
2) Each record in the input file is just a single string.
3) Apply the transformation rules and write the results to the output file.

Also on the DataStage server, create the following wrapper script (it will be invoked remotely over SSH from the local UNIX server):

piped_ds_job.sh
============

#!/bin/bash
# Stream data through a DataStage job using named pipes, so no
# intermediate file is ever written to disk.

dshome=`cat /.dshome`
. $dshome/dsenv
export PATH=$PATH:$DSHOME/bin

pid=$$
fifodir=/data/datastage/tmp
infname=$fifodir/infname.$pid
outfname=$fifodir/outfname.$pid

# Named pipes that the job will use as its "input file" and "output file".
mkfifo "$infname"
mkfifo "$outfname"

# Start the job invocation, passing the pipes as the file parameters.
# Without -wait, dsjob returns once the run request is accepted, so the
# exit status below tells us whether the job could be started at all.
dsjob -run -param inputFile="$infname" \
    -param outputFile="$outfname" dstage1 ds_sort.$pid 2> /dev/null

if [ $? -ne 0 ]; then
    echo "error calling DataStage job." >&2
    rm "$infname"
    rm "$outfname"
    exit 1
fi

# Stream the job's output pipe to stdout, then remove the pipe.
(cat "$outfname"; rm "$outfname") &

# Feed stdin (or the file given as $1) into the job's input pipe.
if [ -z "$1" ]; then
    cat > "$infname"
else
    cat "$1" > "$infname"
fi

rm "$infname"
 
 
======================
 
On the local UNIX server, decompress the file and stream it through the above script on the DataStage server:

zcat compressedfile.gz | ssh dsadm@victory.ibm.com piped_ds_job.sh >> outputfilename
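
Because both ends of the pipeline are plain streams, the output can be compressed on the way back as well. A minimal variation of the same call (assuming gzip is available on the local server; the host and script name are the ones used above):

zcat compressedfile.gz | ssh dsadm@victory.ibm.com piped_ds_job.sh | gzip > outputfilename.gz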
 
 
 

Difference between Data Set and Lookup File Set during Lookup Operations

A Data Set stores all of its information in the Data Set's internal format.

A Lookup File Set holds the same kind of information in files.

So why does DataStage provide a Lookup File Set at all?

There is a small difference between a Data Set and a Lookup File Set when they are used for a lookup.

Here is the explanation:

Characteristics of the Lookup File Set stage:
When you use a Lookup File Set stage as a source for lookup data, there are special considerations about column naming. If you have columns of the same name in both the source and lookup data sets, the source data set column will go to the output data. If you want this column to be replaced by the column from the lookup data source, you need to drop the source data column before you perform the lookup.

This characteristic helps the lookup run faster when the reference data is huge. When a Data Set is used, the entire lookup data is loaded into memory and the lookup is performed there; in contrast, when a Lookup File Set is used, only the key columns are loaded into memory and the remaining columns are taken from the file by a built-in Copy operator. This makes the lookup faster with a Lookup File Set than with a Data Set.

However, with the improvements in the Join and Sort stages, the Lookup File Set has seen very little use in recent years.

DataStage Oracle Connector Bulk Load performance improvements

Improve Oracle loading performance using the bulk load options below; a sample environment setup for the variable-related items follows the list.

--> Set APT_PM_PLAYER_TIMING to true to find out whether the Oracle Connector is consuming more CPU cycles than expected.

--> Disable the Oracle triggers, indexes, and constraints on the target table (before-SQL and after-SQL statements can be used to disable them before the load and re-enable them afterwards).

--> Use the Oracle date cache for faster handling of Date and Timestamp columns.

--> Disable redo logging during the bulk load.

--> Ensure the array size and buffer size are set to optimal values.

--> Load into partitioned tables where possible.

--> Handle data types properly, as data type conversions can be costly.

--> Set NLS_LANG correctly, matching the Oracle database, in the environment variables.

--> Set CC_MSG_LEVEL to the lowest practical level of logging.
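
As a rough illustration of the environment-variable items above, they can be exported in dsenv or in a wrapper script before dsjob is invoked. The values below are illustrative assumptions only, not tuning recommendations:

# Report per-player CPU timing in the job log, to see whether the
# Oracle Connector operators are the ones consuming the CPU.
export APT_PM_PLAYER_TIMING=True

# Match the client character set to the database character set to avoid
# implicit conversions (the value shown is an assumed example).
export NLS_LANG=AMERICAN_AMERICA.AL32UTF8

# Reduce connector message logging during the bulk load (the numeric
# level shown is an assumed example).
export CC_MSG_LEVEL=4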

Further readings

http://www.ibm.com/developerworks/data/library/techarticle/dm-1304datastageoracleconnector2/

Tuesday, January 27, 2015

Distributed Transaction stage -- A handy tool for Real Time data processing

This post is basically for people who are interested in processing real-time messages into data processing systems using IBM DataStage.

DataStage versions earlier than 8.5 do not guarantee delivery of data from a source system to a target system. Each instance of a stage manages its own transaction, which is independent of the other stages, and no control signals are sent between stages to coordinate how transactions are started and committed. However, some scenarios require the data to remain in the source system until it has been written to the target system (for example, messages from MQ or JMS).

From version 8.5 onwards, the Distributed Transaction stage provides this delivery guarantee between source and target.

Here is the transaction flow described for your reference:

When the job is run, the following actions take place:
1. A local transaction is started with the connector.
2. The target is updated.
3. If the updates are successful, the local transaction is committed on reaching end-of-wave.
4. If the local transaction is successfully committed:
a. A WebSphere MQ transaction is started.
b. The messages about the current transaction are deleted from the source or work queue.
c. The WebSphere MQ transaction is committed.
5. If the actions in step 3 or step 4 are not successful, and a reject queue is defined:
a. A WebSphere MQ transaction is started.
b. Messages about the current transaction are moved from the source or work queue to the reject queue.
c. The WebSphere MQ transaction is committed.


Cons:

1) Requires write access to the MQ queues (either the source or the work queues).
2) Message processing can slow down depending on the response time from the MQ server (especially when network latency is involved).

Environment variables for consideration (sample settings are sketched after the list):

CC_DTS_COMMIT_ON_EOF: specifies whether the Distributed Transaction stage commits when the end of data is reached.

CC_TRUNCATE_NSTRING_WITH_NULL: truncates string data that contains the string 0x00.

CC_USE_EXTERNAL_SCHEMA_ON_MISMATCH: uses an external schema rather than the design schema when the schemas do not match.
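
As an illustration, these variables can be set at the project level or exported in the job's environment (for example in dsenv); the values shown below are assumptions for illustration only:

# Commit the open transaction when the end of data is reached
# (value shown is an assumed example).
export CC_DTS_COMMIT_ON_EOF=1

# Truncate string data at an embedded 0x00 instead of failing
# (value shown is an assumed example).
export CC_TRUNCATE_NSTRING_WITH_NULL=1

# Fall back to the external schema when it does not match the design
# schema (value shown is an assumed example).
export CC_USE_EXTERNAL_SCHEMA_ON_MISMATCH=1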
 

Friday, January 9, 2015

Data Vault - The next step in data modelling

Today I was going through the learndatavault.com website to understand more about Data Vault modelling. It was very interesting and exciting, and it covers facts that I have practically come across during project implementations.

It clearly addresses the major problems involved in maintaining a data warehouse and gives organizations the flexibility to maintain their data sets over a longer period of time.

Today, the major problems with data warehouse designs are:

1) Flexibility to modify
2) Fast query processing
3) Maintaining the sanity of the data sets when business rules change
4) Processing huge volumes (star schemas have the problem of summing up, and 3NF has a storage problem due to the huge number of tables; Teradata FSLDM, for example, has hundreds of tables)

Data Vault, a simple Hub-Link-Satellite model, provides a genetic-helix-like structure which can adapt and modify itself as business rules change. This is an interesting design paradigm which I plan to learn and write more about on this blog. Stay tuned.

Some references for your reading (also available on the wiki page):

http://prudenza.typepad.com/files/damhof_dbm0508_eng-1.pdf

Tuesday, January 6, 2015

ETL Architecture and Design - Rounding up requirements

From an ETL architecture point of view, gathering requirements is the key to a design that remains successful and sustainable over a long period of time.

Apart from designing the data warehouse database model, it is very important to focus on the set of requirements below up front, in order to avoid re-work during the testing or post-implementation phases.

Below are the categories that act as the key decision drivers for any ETL design.

--> Business Needs

1) What information content does the business user need?
2) How would the business user like to drill down and drill across the data?
3) The list of KPIs (facts across dimensions) that the business would like to report on.

--> Compliance

This is an important aspect when designing ETL for the financial or telecommunication sectors, due to external regulatory requirements.

1) The list of data points subject to legal and compliance requirements (e.g. credit card numbers, client information).
2) The data transformation steps for which you must maintain the "chain of custody", showing and proving that final reports were derived from the original data delivered from your data sources.
3) Security-related aspects of the data.
4) The archival mechanism and the expected usable lifetime of those archives.

--> Data Quality

1) List all data sources and identify those data elements whose quality is known to be unacceptable.
2) Note whether an agreement has been reached with the source systems to correct the data before extraction.
3) Consolidate all the data elements discovered during data profiling, as they should be continuously monitored during the ETL process.

--> Security

1) List all the security requirements from the security manager.
2) List all the compliance requirements from the compliance team, as most of them will overlap with the security requirements.
3) Brainstorm with senior management to identify the security aspects the ETL design should carry with it.
4) Most importantly, accessibility of the data available in the warehouse has to be taken care of (for example, BI tools' access to the data, users' access to backups, etc.).

--> Data Integration

1) List all the data sources as per the business needs questionnaire.
2) Understand the meaning of each data element from the transaction system owners.
3) Understand the business processes so that common dimensions can be inter-related across the various sources, helping business users drill across (conformed dimensions).
4) Understand the meaning of the KPIs and their derivation mechanisms across the various sources.
5) List the methods for integrating data from each of the sources using a bus matrix representation.
6) Prepare a metadata mapping, using the bus matrix, from each source to its conformed dimensions / facts.

--> Data Latency

This is the most crucial aspect of an ETL design, as the business needs data on time to make the necessary decisions.

1) Understand the SLA for data delivery from the data warehouse to the business users.
2) Understand the availability of data from the sources; the decision between batch, micro-batch, or stream-oriented processing has to be finalized.
3) Estimate the data processing time based on the available infrastructure, the complexity of the transformations, network capacity and availability, etc.
4) Recommend the optimal infrastructure capacity to achieve the SLA.
5) Get an agreement with the source systems on the SLA for data availability for processing (including transfer over the network).

--> BI Delivery Interfaces

I feel that the ETL team, working closely with the modelling team, must take responsibility for the content and structure of the data that makes the BI applications simple and fast.

1) List all fact and dimension tables that will be directly exposed to your BI tools. This should come directly from the dimensional model specification.
2) List all OLAP cubes and special database structures required by BI tools.
3) List all known indexes and aggregations you have agreed to build to support BI performance.

Reference: The Data Warehouse Toolkit, 3rd edition (Kimball)

DataStage Job Backup via Client and Server

I would like to provide information on taking a DataStage backup from UNIX (server side) and from Windows (client side).

Server (UNIX):

<IBM Server installation Path>/InformationServer/server/DSEngine/istool export -domain <Domain:port> -username <Login User Name> -password <Login Password> -archive "<Path to store the exported file>" -datastage ' "Domain:port/$PROJECTNAME/Jobs/*/*.*" '
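
For regular backups, the same istool command can be wrapped in a small script that produces a date-stamped archive. A minimal sketch, assuming istool is on the PATH and that the host, credentials, and project name below are placeholders to be replaced:

#!/bin/bash
# Export all jobs of one project into a date-stamped .isx archive
# (myhost:9080, dsadm/secret and MYPROJECT are placeholder values).
STAMP=`date +%Y%m%d`
istool export -domain myhost:9080 -username dsadm -password secret \
    -archive "/data/backups/MYPROJECT_$STAMP.isx" \
    -datastage ' "myhost:9080/MYPROJECT/Jobs/*/*.*" '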

Client (Windows):

@Echo off

REM ===== Location of the dscmdexport client =====
SET DSExportCMD="C:\Program Files\Ascential\DataStage7.5.1\dscmdexport.exe"
echo DSExportCMD=%DSExportCMD%
REM ===== Connection parameters and project name =====
SET DParams=/H=Server /U=username /P=password ProjectName
REM ===== Target export file =====
SET DSExportFileName=C:\DSExports\Dev\filename.dsx
echo DSExportFileName=%DSExportFileName%
REM ===== Run the backup =====
%DSExportCMD% %DParams% %DSExportFileName%

Save this as a .bat file and run it on the Windows PC to take a backup of the project.