Tuesday, October 13, 2015

Generic ETL DataStage Framework for loading any source file

This article provides details of an ETL framework that loads any source file without having to create a new set of ETL jobs. The jobs are designed using IBM InfoSphere DataStage v8.5 or later.

The process of adding a new source file follows a standard workflow.

To add a new source file:

- Import file metadata through DataStage Import Table Definition/Sequential File Definitions

- Create the OSH schema definition

- Create and execute the target table DDL

- Import table metadata through DataStage shared table definition

- Place source and schema files in server directories (a brief placement sketch follows this list)
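
As an illustration of the final placement step, copying the files to the server might look like the commands below. This is only a sketch: the user name, server name and directory paths are placeholders and depend on how the generic job's parameters are configured in your environment.

scp SourceFileName.txt etl@<datastage-server>:/data/etl/landing/
scp SourceFileName.osh etl@<datastage-server>:/data/etl/schemas/
ssh etl@<datastage-server> "chmod 640 /data/etl/landing/SourceFileName.txt /data/etl/schemas/SourceFileName.osh"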

Import file metadata through DataStage Import Table Definition/Sequential File Definitions

Use DataStage import table definition to get file metadata.

Define the file column names, data types and nullability.



Edit file specifications in the file definition properties.


In the Columns tab, make sure to remove all field-level properties such as null_field=''.



In the Parallel tab, edit the properties as required for the input data.



Create the OSH schema definition
In the table definition properties for the file, open the Layout tab and select Parallel. This is the OSH schema definition.



Copy and paste this definition into a file and save it as "SourceFileName.osh". The resulting definition may need further editing.

As a general rule, consider the following when creating an OSH schema definition file:
- For delimited files, remove all "{prefix = 2}" properties
- Carefully specify timestamp and date format strings
- Although the length of file columns may be variable, specify a maximum column length, as this is used to define the target table structure. Specifying column lengths also means the actual lengths are validated during file import, avoiding truncation of data
- Match schema data types to the corresponding Vertica database types (see the target table creation section for more on this)
It is good practice to test schema file definitions before using them in the actual jobs (for example, through "View Data" in a Sequential File stage configured with the schema file).
Create and execute the target table DDL
For the target table, create the structure according to the source schema definition, and add the CDC control columns (EFF_TS, END_TS, EFF_RUN_ID, END_RUN_ID) as shown below.

OSH schema:
record
{final_delim=end, delim='\t', null_field='', quote=none, timestamp_format='%dd %mmm %yyyy %hh:%nn:%ss'}
(
RecType:string[1];
RDM_PUBLICATION_ID:int64;
RDM_GNRA_PUBLICATION_ID:nullable int64;
NM_RDM_PUBDIVNNAME:nullable string[max=50];
NM_RDM_PUBREGNNAME:nullable string[max=50];
NM_RDM_PUBOFFICENAME:nullable string[max=50];
NM_RDM_MASTHEADNAME:nullable string[max=50];
NM_RDM_GNRA_PUBNAME:nullable string[max=50];
FL_RDM_PUB_ACTIVE:nullable string[1];
DT_RDM_PUB_CREATEDDATE:nullable timestamp;
NM_RDM_PUB_PROPRIETOR:nullable string[max=50];
)

Target table DDL:

CREATE TABLE sdm_gnra_test.STG_RDMS_GNRA_PUBLICATION_TEST
(
RecType char(1) NOT NULL,
RDM_PUBLICATION_ID int NOT NULL,
NK varchar(255) NOT NULL,
RDM_GNRA_PUBLICATION_ID int,
NM_RDM_PUBDIVNNAME varchar(50),
NM_RDM_PUBREGNNAME varchar(50),
NM_RDM_PUBOFFICENAME varchar(50),
NM_RDM_MASTHEADNAME varchar(50),
NM_RDM_GNRA_PUBNAME varchar(50),
FL_RDM_PUB_ACTIVE char(1),
DT_RDM_PUB_CREATEDDATE timestamp(0),
NM_RDM_PUB_PROPRIETOR varchar(50),
EFF_TS timestamp,
END_TS timestamp,
EFF_RUN_ID int,
END_RUN_ID int
);
Matching common OSH schema data types to Vertica database data types:

- int64 -> INTEGER, INT, BIGINT, INT8, SMALLINT, TINYINT
- string[max=length] -> VARCHAR(length)
- string[length] -> CHAR(length)
- timestamp / timestamp[precision] -> TIMESTAMP(0) / TIMESTAMP(precision)
- ...
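
As a quick sanity check after running the DDL, the Vertica catalog can be queried and the reported column types compared against the OSH schema. This is only a sketch; the vsql connection details below are placeholders.

vsql -h <host> -d <database> -U <dbuser> -c "
  select column_name, data_type
  from v_catalog.columns
  where table_schema ilike 'sdm_gnra_test'
    and table_name ilike 'STG_RDMS_GNRA_PUBLICATION_TEST'
  order by ordinal_position;"
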
Import table metadata through DataStage shared table definition.

IBM Infosphere Master Data Management Source system specific delete

This article describes how to delete source-system-specific data from an IBM InfoSphere Master Data Management virtual-style implementation.

The code snippet below assumes that the MDM server is installed on a DB2 database, but similar steps can be executed for other database implementations as well.

Steps:

-> Connect to the database.
-> Extract all MEMIDs for the specific source system.
-> For each MEMID, run the mpidrop command to delete that member record.

db2 +O "connect to <MDMdatabase> user <userid> using <password>"
db2 +O "SET SCHEMA <DB SCHEMA NAME>"
db2 -x "select '<Source System Code>:'||MEMIDNUM from mpi_memhead where srcrecno = <Source id>" | xargs -r -n 1 $MAD_HOMEDIR/bin/mpidrop -conn 'http|<hostname>|<port>|en' -usrLogin <server user id> -usrPass <server password> -rec

IBM Infosphere Master Data Management Incremental Load script

IBM Information Management includes Master Data Management as part of its suite, originally inherited from the acquisition of Initiate Systems. The product has evolved considerably over time, and today it provides far more integration capabilities with the rest of the IBM product suite.

This article describes one aspect of the IBM Master Data Management tool that is not well documented on the IBM website.

Problem Statement:

The initial load of data into the MDM server can be done using various methods (job sets, web services, and so on). The problem comes when an additional data source has to be added without impacting the existing enterprise IDs in a virtual implementation style. Although the IBM documentation says this can be done in IXM mode, very little documentation is provided on how to implement the solution.

The script below loads data in the background using IXM mode without impacting the existing enterprise IDs.

Solution Script & Explanation.

export DATABASE_HOME=<Path of Bin directory for your specific database implementation>
export DATABASE_NAME=<Database Instance Name>
export DATABASE_HOST=<Database host name>
export WAS_CELL=<WAS cell described during implementation>
export WAS_PROFILE=<WAS profile described during implementation>
export MAD_DBNAME=$DATABASE_NAME
export MAD_DBUSER=md<DB username>
export MAD_DBPASS=<DB password>
export MAD_DSNNAME=<DSN name provided in MDM configuration file>
export WAS_HOME=/opt/IBM/WebSphere/AppServer
export JAVA_HOME=$WAS_HOME/java/jre
export WAS_SECURITY=1
export MAD_TIMER=1
export MAD_ROOTDIR=<MDM Application Root Directory>
export MAD_HOMEDIR=$MAD_ROOTDIR/<Directory path for Configuration Project>
export MAD_WORKDIR=$MAD_HOMEDIR/work
export MAD_UNLDIR=$MAD_WORKDIR/unl
export MAD_BXMDIR=$MAD_WORKDIR/bxm
export MAD_DDLFILE=$MAD_HOMEDIR/sql/mpihub.ddl
export MAD_STOFILE=$MAD_HOMEDIR/sql/mpihub.sto
export MAD_LOGDIR=$MAD_WORKDIR
export MAD_CONNSTR="DSN=MDM11DB_E001;UID=mdmins11;PWD=mdmins11;"
export MAD_CTXLIB=ODBC
export MAD_DBTYPE=DB2
export PATH=$PATH:$MAD_ROOTDIR/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$MAD_ROOTDIR/lib
export CLASSPATH=$JAVA_HOME:$CLASSPATH
export CLASSPATH=$WAS_HOME/runtimes/com.ibm.ws.admin.client_8.5.0.jar:$WAS_HOME/runtimes/com.ibm.ws.ejb.thinclient_8.5.0.jar:$CLASSPATH
export CLASSPATH=$WAS_HOME/lib/j2ee.jar:$CLASSPATH
export CLASSPATH=$WAS_HOME/lib/naming.jar:$CLASSPATH
export CLASSPATH=$WAS_HOME/lib/ras.jar:$CLASSPATH
export CLASSPATH=$WAS_HOME/lib/wsexception.jar:$CLASSPATH
export CLASSPATH=$WAS_HOME/lib/bootstrap.jar:$CLASSPATH
export CLASSPATH=$WAS_HOME/lib/emf.jar:$CLASSPATH
export CLASSPATH=$WAS_HOME/lib/ecutils.jar:$CLASSPATH
export CLASSPATH=$WAS_HOME/lib/iwsorb.jar:$CLASSPATH
export CLASSPATH=$WAS_HOME/lib/idl.jar:$CLASSPATH
export CLASSPATH=$WAS_HOME/lib/ffdc.jar:$CLASSPATH
export CLASSPATH=$WAS_HOME/lib/namingclient.jar:$CLASSPATH
export CLASSPATH=$WAS_HOME/lib/ejbcontainer.jar:$CLASSPATH
export CLASSPATH=$WAS_HOME/properties:$CLASSPATH
mpxprep -entType mdmper -bxmOutDir $MAD_HOMEDIR/work/dev_config/work/bxm -ixmMode -nbktparts 1  -srcRecno 292
mpxrule -entType mdmper -bxmOutDir $MAD_HOMEDIR/work/dev_config/work/bxm
mpxxeia -entType mdmper -bxmOutDir $MAD_HOMEDIR/work/dev_config/work/bxm
mpxxtsk -entType mdmper -bxmOutDir $MAD_HOMEDIR/work/dev_config/work/bxm
mpxcomp -entType mdmper -bxmInpDir $MAD_HOMEDIR/work/dev_config/work/bxm -bxmOutDir $MAD_HOMEDIR/work/dev_config/work/bxm -ixmMode -nbktParts 1 -nthreads 2
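
The exports and mpx* commands above can be saved into a single script and run in the background so that the IXM load survives a dropped terminal session. The script name below is just an example:

chmod +x ixm_incremental_load.sh
nohup ./ixm_incremental_load.sh > ixm_load_$(date +%Y%m%d).log 2>&1 &
tail -f ixm_load_$(date +%Y%m%d).log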

Tuesday, October 6, 2015

Australian Address Quality Framework


The need to provide quality data within any organization is becoming mandatory for analytics and reporting. An organization with high-quality data can get better insights and find unknown location-specific patterns that can increase revenue.
But the question is how to implement such a solution. An organization needs enough wisdom to make wise decisions when implementing data quality solutions. This article outlines an implementation methodology to either improve existing data quality or capture quality data on an ongoing basis. It only covers the methodology for ADDRESS data quality, but it can be extended to any data point.


The approach below is drawn from my experience with Australian-based organizations, but it can be extended to any organization across the globe.


Any organization that would like to improve the quality of its Australian address data can follow the approach below.


Step 1:


Identify all the systems within the organization where address data is captured (manual or electronic).


Step 2 :


Australia Post is the most trustworthy and accurate source for verifying address quality.
Engage with third-party vendors accredited under AMAS (Address Matching Approval System) to obtain address reference data.
Raw address file providers:
http://auspost.com.au/business-solutions/raw-address-data.html


Step 3:


Define the match strategy required from the solution.


Examples:
  1. Find only exact matches.
  2. Find exact matches plus relevant matches where an alternate street name exists.
  3. Find all exact matches and map possible values for incorrect matches.
  4. Fuzzy matching vs. non-fuzzy matching.


Step 4:


Define the data testing process (availability of address match data from the third party, availability of current system data, data stewards' time to verify and approve matching data, and so on).


Step 5:


Consider the software for matching data (ETL tools vs. self-coded tools, etc.).


Step 6:


Define the data quality process within the organization. For example:
1. Who is responsible for certifying correct data?
2. Who will take corrective action if incorrect data is identified?
3. How will the action be taken (using a GUI, a manual process, etc.)?


All these factors go into the solution design, so take care to answer such questions correctly.


Step 7:


Design a solution that satisfies steps 3, 4, 5 and 6.


Detailed Match process Hierarchy


  1. Standardize the source address to its lowest level. All Australian addresses follow the hierarchy below. (Different ETL tools can help lower the development cost; I would prefer IBM DataStage QualityStage.)


• Flat/Unit Type; • Flat/Unit Number; • Floor Level Type; • Floor Level Number; • Building Property Name 1; • Building Property Name 2; • Lot Number; • House Number 1; • House Number 1 Suffix; • House Number 2; • House Number 2 Suffix; • Postal Delivery Number; • Postal Delivery Number Prefix; • Postal Delivery Number Suffix; • Street Name; • Street Type; • Street Suffix; • Postal Delivery Type; • Locality; • State; and • Postcode.


2. Perform an exact match against the supplier-provided reference data (a simple command-line sketch of this step follows this list).
3. Perform a fuzzy match (misspelt data, abbreviated data, etc.).
4. Transform unmatched records using alternate street names.
5. Repeat steps 2 and 3 with the alternate street names.
6. Match using bordering localities.
7. Match using any specific rules defined by the business (tie-breaking rules).
8. Assign the DPID for each match.
9. Take the appropriate action as per step 3 (the match strategy).
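
As a simple command-line sketch of the exact-match pass in step 2, the standardized source records can be joined against the supplier reference on a concatenated address key. The file names, delimiter and column layout here are assumptions; a real implementation would normally use QualityStage match stages instead.

# both extracts assumed pipe-delimited, standardized address key in column 1,
# DPID in column 2 of the reference file
sort -t'|' -k1,1 source_addresses_std.txt  > source_sorted.txt
sort -t'|' -k1,1 auspost_reference_std.txt > reference_sorted.txt

# exact matches: key present in both files, output key + DPID
join -t'|' -1 1 -2 1 -o 1.1,2.2 source_sorted.txt reference_sorted.txt > exact_matches.txt

# records left unmatched, carried forward to the fuzzy / alternate-street-name passes
join -t'|' -v 1 source_sorted.txt reference_sorted.txt > unmatched.txt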


FOR ANY CONSULTATION SERVICES YOU CAN CONTACT ME @ cevvavijay@gmail.com

Thursday, March 19, 2015

DataStage Checksum Stage: Do's and Don'ts

DataStage provides many powerful stages for data integration and loading into a data warehouse. In this post I give a brief overview of the Checksum stage and present the do's and don'ts of using it.

Checksum Stage: This stage is used to calculate a checksum for a given string or set of columns. It can be used to generate an identifier for a composite key, or to detect changes between source and target records during the CDC process while loading a data warehouse.

Do's:

1) Ensure the column name specification is the same between the source and target. The order of the columns need not be the same, but the spelling of the column names must be identical (see the sketch at the end of these do's).

As the Checksum stage automatically aligns the columns in ascending alphabetical order, it is very important that the spelling of the column names be the same between source and target.

2) Ensure the column length and scale are the same between the source and target, so that the same identifier is generated.

3) Ensure the column exclusion or inclusion lists are the same for both the source and target columns.

4) For easier debugging, always use the Buffer Output Column Name option to store the buffer value, which identifies the columns and data used to build the checksum.
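
A quick command-line illustration of the point about column names and ordering (this only mimics the behaviour; it is not the Checksum stage itself): hashing alphabetically sorted name=value pairs gives the same result regardless of the order in which the columns arrive, but any difference in column spelling changes the hash.

# same columns, different arrival order -- identical hash after sorting by column name
printf 'CUST_ID=42\nDOB=1970-01-01\nSURNAME=Smith\n' | sort | tr '\n' '|' | md5sum
printf 'SURNAME=Smith\nCUST_ID=42\nDOB=1970-01-01\n' | sort | tr '\n' '|' | md5sum

# renaming a column (DOB -> BIRTH_DT) changes the hash even though the data is unchanged
printf 'BIRTH_DT=1970-01-01\nCUST_ID=42\nSURNAME=Smith\n' | sort | tr '\n' '|' | md5sum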


Dont:

If using the Checksum stage in a job with runtime column propagation, it is advisable not to use column exclusion, as the job becomes difficult to maintain when future changes arrive; column inclusion is the better approach in that scenario. The reverse applies for jobs without runtime column propagation.

Avoid performing a checksum on an existing checksum field, as this will not always produce the same result.


Let me know your views and thoughts about this post.

Windows batch Script to Export Datastage jobs using command line

The script below automates the export of DataStage jobs from the server to the desktop, for use with code version control systems.
 
@ECHO OFF
:: -----------------------------------------------------------------
:: -----------------------------------------------------------------
:: Name: export.bat
:: -----------------------------------------------------------------
:: Description: To extract datastage components using Command line.
:: -----------------------------------------------------------------
:: VERSION CREATED BY DATE  DESCRIPTION
:: 1.0  Vijay Reddy  16/02/2015  Initial Version
:: -----------------------------------------------------------------
:: -----------------------------------------------------------------
:: Required Components:
::     dsexport.exe (Windows Version)
::     dssearch.exe (Windows Version)
::     exportRec.bat (Windows Batch Script file )
:: -----------------------------------------------------------------
:: Command Line Parameters:
:: 1. Single Mode or Multi Mode
:: 2. Job Name or File list all the job names for export
:: 3. Project Name
:: 4. Target Path to save the exported files
:: 5. Recursive extraction mode to extracted all the referenced jobs.
:: -----------------------------------------------------------------
:: -----------------------------------------------------------------
:: -----------------------------------------------------------------
:: SETTING GLOBAL VARIABLES
:: -----------------------------------------------------------------
SET OPTION=%1
SET JOBNAME=%2
SET JOBLISTFILENAME=%2
SET PROJECTNAME=%3
SET TARGETPATH=%4
SET RECUREXTRACT=%5

:: -----------------------------------------------------------------
:: SETTING DATASTAGE ENVIRONMENT VARIABLES
:: -----------------------------------------------------------------
SET DOMAIN=
SET USER=
SET PWD=
SET SERVER=

:: -----------------------------------------------------------------
:: SETTING CLIENT AND UTILITY PATHS
:: -----------------------------------------------------------------
SET CLIENTPATH=C:\IBM\InformationServer87\Clients\Classic
SET UTILPATH=C:\IBM\InformationServer87\JobExport\ExportUtil
:: -----------------------------------------------------------------
:: VALIDATING PARAMETERS
:: -----------------------------------------------------------------
if  "%1"==""   ( goto usage ) else (if NOT %OPTION% == M ( if NOT %OPTION% == S goto ERROR1 ) )
if  "%2"==""   ( goto usage ) else (if %OPTION% == M (if not exist %JOBLISTFILENAME% goto ERROR2 ))
if  "%3"==""   ( goto usage )
if  "%4"==""   ( goto usage ) else ( if not exist %TARGETPATH% goto ERROR3 )
if  "%5"==""   ( SET RECUREXTRACT=N )

:: -----------------------------------------------------------------
:: EXPORTING JOBS (SINGLE OR MULTI MODE)
:: -----------------------------------------------------------------
IF %OPTION% == S (
echo Exporting %JOBNAME% from Project: %PROJECTNAME%
%CLIENTPATH%\dsexport.exe /D=%DOMAIN% /U=%USER% /P=%PWD% /JOB=%JOBNAME% /NODEPENDENTS %SERVER%/%PROJECTNAME% %TARGETPATH%\%JOBNAME%.dsx
IF %RECUREXTRACT% == R (
CALL %UTILPATH%\exportRec.bat %JOBNAME% %PROJECTNAME% %TARGETPATH%
)

) else (

 for /F "tokens=1" %%i in (%JOBLISTFILENAME%) do (
        echo Exporting %%i from Project: %PROJECTNAME%
        %CLIENTPATH%\dsexport.exe /D=%DOMAIN% /U=%USER% /P=%PWD% /JOB=%%i /NODEPENDENTS %SERVER%/%PROJECTNAME% %TARGETPATH%\%%i.dsx
        IF %RECUREXTRACT% == R (
        CALL %UTILPATH%\exportRec.bat %%i %PROJECTNAME% %TARGETPATH%
        )
    )
)

cd /D  %~dp0
EXIT /B
:usage
echo  Options 1-4 are Mandatory
echo  Command Options : Option 1 - "S" for Single Job extraction Mode and "M" for Multi Job Extraction Mode 
echo  Option 2 - In single job extraction Mode - Job Name to extract 
echo  Option 2 - In Multi job extraction Mode - Path and Filename of the list of jobs for extraction 
echo  Option 3 - DataStage Project Name for extraction 
echo  Option 4 - Target path to store exported dsx files
echo  option 5 - Recursive export of components ( Downward recursive only ). This parameter is optional.
cd /D  %~dp0
EXIT /B
:ERROR1
echo  Not a valid Option, Please use letters S - for Single job extraction or M - for Multi Job extraction.
echo  Please find the options summary below
goto  usage

:ERROR2
echo  %JOBLISTFILENAME% not found
cd /D  %~dp0
EXIT /B
:ERROR3
echo Not a valid directory %TARGETPATH%
cd /D  %~dp0
EXIT /B
 
 
============================================
 
@ECHO OFF
:: ------------------------------------------------------------------------------------
:: ------------------------------------------------------------------------------------
:: Name: exportRec.bat
:: ------------------------------------------------------------------------------------
:: Description::: To extract datastage components using Command line in recursive mode
:: ------------------------------------------------------------------------------------
:: VERSION CREATED BY DATE  DESCRIPTION
:: 1.0  Vijay Reddy  16/02/2015  Initial Version
:: ------------------------------------------------------------------------------------
:: ------------------------------------------------------------------------------------
:: Required Components:
::     dsexport.exe (Windows Version)
::     dssearch.exe (Windows Version)
:: ------------------------------------------------------------------------------------
:: Command Line Parameters:
:: 1. Job Name or File list all the job names for export
:: 2. Project Name
:: 3. Target Path to save the exported files
:: ------------------------------------------------------------------------------------
:: ------------------------------------------------------------------------------------
SET JOBNAME=%1
SET PROJECTNAME=%2
SET TARGETPATH=%3
SET WORKDIR=C:\IBM\InformationServer87\JobExport\ExportUtil
SET RECURJOBLIST=tmprecurjoblist.txt
SET DOMAIN=
SET USER=
SET PWD=
SET SERVER=

SET CLIENTPATH=C:\IBM\InformationServer87\Clients\Classic
echo Exporting Recursive jobs
%CLIENTPATH%\dssearch.exe -domain %DOMAIN% -user %USER% -password %PWD% -server %SERVER% -ljobs -uses %PROJECTNAME% %JOBNAME% > %WORKDIR%\%RECURJOBLIST%
:: ------------------------------------------------------------------------------------
:: CHECKING ERROR CODES
:: ------------------------------------------------------------------------------------
IF %ERRORLEVEL% EQU 0 (
echo No recursive jobs found for %JOBNAME% in Project: %PROJECTNAME%
EXIT /B 0
)
:: ------------------------------------------------------------------------------------
:: EXPORTING in RECURSIVE MODE
:: ------------------------------------------------------------------------------------
IF %ERRORLEVEL% NEQ 1 (
 for /F "tokens=1" %%i in (%WORKDIR%\%RECURJOBLIST%) do (
        echo Exporting %%i from Project: %PROJECTNAME%
        %CLIENTPATH%\dsexport.exe /D=%DOMAIN% /U=%USER% /P=%PWD% /JOB=%%i /NODEPENDENTS %SERVER%/%PROJECTNAME% %TARGETPATH%\%%i.dsx
)
)
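
Example invocations, based on the parameters documented in the header of export.bat (the job, project and path names below are placeholders):

:: export a single job, including a downward-recursive export of the jobs it references
export.bat S MyParallelJob MyProject C:\exports R

:: export every job listed (one job name per line) in joblist.txt
export.bat M C:\exports\joblist.txt MyProject C:\exports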
 
A full usage manual will be added later. For more information, please contact me.

Thursday, January 29, 2015

Converting DataStage jobs as services via remote SSH invocation

In this post I will explain how to use a DataStage job as a service via remote SSH invocation.


Task :

Decompress the file, read the data, transform it, and write it to an output file.

The file itself should not be transmitted over the network, but the data can be streamed.

Here is the simple implementation :

On Remote DataStage Server:
1) Write a job that takes an input file name and an output file name as parameters.
2) One record in the input file is just a single string.
3) Apply the transformation rules and write the result to the output file.

On Local Unix Server:

Write the below script:

piped_ds_job.sh
============

#!/bin/bash
# Locate the DataStage installation and source its environment
dshome=`cat /.dshome`
. $dshome/dsenv
export PATH=$PATH:$DSHOME/bin

# Named pipes used to stream data into and out of the DataStage job
pid=$$
fifodir=/data/datastage/tmp
infname=$fifodir/infname.$pid
outfname=$fifodir/outfname.$pid
mkfifo $infname
mkfifo $outfname

# Run the job in the background (project dstage1, multi-instance invocation id = $pid),
# pointing its file parameters at the named pipes
dsjob -run -param inputFile=$infname \
      -param outputFile=$outfname dstage1 ds_sort.$pid 2> /dev/null &

if [ $? -ne 0 ]; then
    echo "error calling DataStage job."
    rm $infname $outfname
    exit 1
fi

# Stream the job's output pipe to stdout, then remove the pipe
(cat $outfname; rm $outfname) &

# Feed stdin (or the file given as $1) into the job's input pipe
if [ -z "$1" ]; then
    cat > $infname
else
    cat "$1" > $infname
fi

rm $infname
 
 
======================
 
Decompressing the file locally and streaming it through the above script on the DataStage server:

zcat compressedfile.gz | ssh dsadm@victory.ibm.com piped_ds_job.sh >> outputfilename
 
 
 

Difference of using DataSet and Lookup File Set during Lookup Operations

A Data Set stores all of its information in DataStage's internal Data Set format.

A Lookup File Set also holds the same kind of information in files.

So the question is: why is the Lookup File Set available in DataStage at all?

There is a minor difference between a Data Set and a Lookup File Set when they are used for a lookup.

Here goes the explanation:

Characteristic of the Lookup File Set stage:
When you use a Lookup File Set stage as a source for lookup data, there are special considerations about column naming. If you have columns of the same name in both the source and lookup data sets, the source data set column will go to the output data. If you want this column to be replaced by the column from the lookup data source, you need to drop the source data column before you perform the lookup.

This characteristic helps in faster processing of lookup data when the data is huge. When a Data Set is used, the whole lookup data is loaded into memory and the lookup is performed there; in contrast, when a Lookup File Set is used, only the key columns are loaded into memory for the lookup and the remaining columns are taken from the file using the built-in copy operator. This makes lookups faster with a Lookup File Set than with a Data Set.

However, with the improvements in the Join and Sort stages, usage of the Lookup File Set has almost disappeared in recent years.

DataStage Oracle Connector Bulk Load performance improvements

Improve Oracle loading performance using the bulk load options:

--> Set APT_PM_PLAYER_TIMING to true to find out whether the Oracle Connector is consuming more CPU cycles.

--> Disable the Oracle triggers, indexes and constraints on the target table (the connector's Before SQL and After SQL statements can be used to disable them before the load and re-enable them afterwards; see the example after this list).

--> Use the Oracle date cache for faster handling of Date and Timestamp columns.

--> Disable redo logging during the bulk load.

--> Ensure the array size and buffer size are set to optimal values.

--> Where possible, load data using partitioned tables.

--> Handle data types properly, as data type conversions can be costly.

--> Set NLS_LANG correctly, matching the Oracle database, in the environment variables.

--> Set CC_MSG_LEVEL so that message logging is kept to a minimum.
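
As an example of the triggers/indexes/constraints tip above, the connector's Before SQL and After SQL statements could run something along these lines. The table, index, trigger and constraint names are hypothetical, and the exact statements depend on the target schema:

-- Before SQL (run prior to the bulk load)
ALTER TABLE stg.sales_fact NOLOGGING;
ALTER TABLE stg.sales_fact DISABLE CONSTRAINT fk_sales_fact_cust;
ALTER TRIGGER stg.trg_sales_fact_audit DISABLE;
ALTER INDEX stg.ix_sales_fact_custid UNUSABLE;

-- After SQL (run after the bulk load)
ALTER INDEX stg.ix_sales_fact_custid REBUILD NOLOGGING;
ALTER TRIGGER stg.trg_sales_fact_audit ENABLE;
ALTER TABLE stg.sales_fact ENABLE NOVALIDATE CONSTRAINT fk_sales_fact_cust;
ALTER TABLE stg.sales_fact LOGGING;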

Further reading:

http://www.ibm.com/developerworks/data/library/techarticle/dm-1304datastageoracleconnector2/

Tuesday, January 27, 2015

Distributed Transaction stage -- A handy tool for Real Time data processing

This post is basically for people who are interested in real-time processing of messages into data processing systems using IBM DataStage capabilities.

DataStage versions earlier than 8.5 do not guarantee delivery of data from a source to a target system. Each instance of a stage manages its own transaction, which is independent of other stages. There are no control signals sent between stages to coordinate how transactions are started and committed. However, some user scenarios require the data to remain in the source system until it is written to a target system (for example, messages from MQ or JMS).

From version 8.5 onwards, the Distributed Transaction stage provides this guaranteed data delivery between source and target.

Here is the transaction flow described for your reference:

When the job is run, the following actions take place:
1. A local transaction is started with the connector.
2. The target is updated.
3. If the updates are successful, the local transaction is committed on reaching end-of-wave.
4. If the local transaction is successfully committed:
   a. A WebSphere MQ transaction is started.
   b. The messages about the current transaction are deleted from the source or work queue.
   c. The WebSphere MQ transaction is committed.
5. If the actions in step 3 or step 4 are not successful, and a reject queue is defined:
   a. A WebSphere MQ transaction is started.
   b. Messages about the current transaction are moved from the source or work queue to the reject queue.
   c. The WebSphere MQ transaction is committed.


Cons:

1) Requires write access to the MQ queues (either the source or work queues).
2) Message processing can slow down depending on the response from the MQ server (especially due to network latency).

Environment Variables for consideration

CC_DTS_COMMIT_ON_EOF  : specify whether Distributed Transaction stage
commits on the end of data.

CC_TRUNCATE_NSTRING_WITH_NULL  : to truncate string data that includes the string 0x00.

CC_USE_EXTERNAL_SCHEMA_ON_MISMATCH : to use an external schema rather than a design
schema when the schemas do not match
 

Friday, January 9, 2015

DataVault - Next step in data Modelling

Today I was going through the learndatavault.com website to understand more about Data Vault modelling. It covers many interesting and exciting facts that I have practically come across during project implementations.

It clearly addresses the major problems involved in maintaining a data warehouse and gives organizations the flexibility to maintain their data sets over a longer period of time.

The major problems with data warehouse designs today:

1) Flexibility to modify
2) Fast query processing
3) Maintaining the sanity of the data sets when business rules change
4) Processing huge volumes (star schemas have the problem of summing up, and 3NF models have a storage problem due to the huge number of tables, e.g. Teradata FSLDM has hundreds of tables)

Data Vault, a simple Hub-Link-Satellite model, provides a genetic-helix-like structure that can adapt and modify itself as business rules change. This is an interesting design paradigm that I intend to learn and write more about on my blog. Stay tuned.

Some references for your reading (also available on the wiki page):

http://prudenza.typepad.com/files/damhof_dbm0508_eng-1.pdf