Tuesday, October 13, 2015

Generic ETL DataStage Framework for loading any source file

This article describes an ETL framework that can load any source file without having to create a new set of ETL jobs. The jobs are designed using IBM InfoSphere DataStage v8.5 or later.

The process of adding a new source file follows a standard workflow.

To add a new source file:

- Import file metadata through DataStage Import Table Definition/Sequential File Definitions

- Create the OSH schema definition

- Create and execute the target table DDL

- Import table metadata through DataStage shared table definition

- Place source and schema files in server directories

Import file metadata through DataStage Import Table Definition/Sequential File Definitions

Use the DataStage Import Table Definition function to obtain the file metadata.

Define the file column names, data types and nullability.



Edit file specifications in the file definition properties.


In the Columns tab, make sure to remove all field-level properties such as null_field=''.



In the Parallel tab, edit the properties as required for the input data.



Create the OSH schema definition
In the table definition properties for the file, open the Layout tab and select Parallel. This is the OSH schema definition.



Copy and paste this definition into a file and save it as “SourceFileName.osh”. The resulting definition may need further editing.

As a general rule, consider the following when creating an OSH schema definition file:
- For delimited files, remove all “{prefix = 2}” properties
- Carefully specify timestamp and date format strings
- Although the length of file columns may be variable, specify a maximum column length, as it is used to define the target table structure. Specifying column lengths also means the actual lengths are validated during file import, which avoids data truncation
- Match schema data types to the corresponding Vertica database types (see the target table creation section for more on this)
It is good practice to test schema file definitions before using them in the actual jobs (for example, through “View Data” in a Sequential File stage after the schema file has been created).
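
As a further check outside the Designer, the schema file can also be exercised from the engine tier with the osh command. This is only a sketch, not part of the framework itself: it assumes shell access to the engine tier with the parallel environment (dsenv, APT_CONFIG_FILE) already set up, and the paths shown are placeholders.

# Read the source file through the schema and print sample records to the log (paths are placeholders)
osh "import -schemafile /schemas/SourceFileName.osh -file /data/SourceFileName.txt | peek"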

Create and execute the target table DDL
For the target table, create the structure according to the source schema definition. Additionally, add the CDC control columns to the table structure, as shown below.
OSH schema:
record
{final_delim=end, delim='\t', null_field='', quote=none, timestamp_format='%dd %mmm %yyyy %hh:%nn:%ss'}
(
RecType:string[1];
RDM_PUBLICATION_ID:int64;
RDM_GNRA_PUBLICATION_ID:nullable int64;
NM_RDM_PUBDIVNNAME:nullable string[max=50];
NM_RDM_PUBREGNNAME:nullable string[max=50];
NM_RDM_PUBOFFICENAME:nullable string[max=50];
NM_RDM_MASTHEADNAME:nullable string[max=50];
NM_RDM_GNRA_PUBNAME:nullable string[max=50];
FL_RDM_PUB_ACTIVE:nullable string[1];
DT_RDM_PUB_CREATEDDATE:nullable timestamp;
NM_RDM_PUB_PROPRIETOR:nullable string[max=50];
)

Target table:
CREATE TABLE sdm_gnra_test.STG_RDMS_GNRA_PUBLICATION_TEST
(
RecType char(1) NOT NULL,
RDM_PUBLICATION_ID int NOT NULL,
NK varchar(255) NOT NULL,
RDM_GNRA_PUBLICATION_ID int,
NM_RDM_PUBDIVNNAME varchar(50),
NM_RDM_PUBREGNNAME varchar(50),
NM_RDM_PUBOFFICENAME varchar(50),
NM_RDM_MASTHEADNAME varchar(50),
NM_RDM_GNRA_PUBNAME varchar(50),
FL_RDM_PUB_ACTIVE char(1),
DT_RDM_PUB_CREATEDDATE timestamp(0),
NM_RDM_PUB_PROPRIETOR varchar(50),
EFF_TS timestamp,
END_TS timestamp,
EFF_RUN_ID int,
END_RUN_ID int
);

Matching common OSH schema data types to Vertica database data types

OSH data type                        Vertica data type
int64                                INTEGER, INT, BIGINT, INT8, SMALLINT, TINYINT
string[max=length]                   VARCHAR(length)
string[length]                       CHAR(length)
timestamp / timestamp[precision]     TIMESTAMP(0) / TIMESTAMP(precision)
...                                  ...

Import table metadata through DataStage shared table definition.

IBM InfoSphere Master Data Management source-system-specific delete

This article describes how to delete source-system-specific data from an IBM InfoSphere Master Data Management virtual-style implementation.

The code snippet below assumes that the MDM server is installed on a DB2 database, but similar steps can be executed for other database implementations as well.

Steps:

-> Connect to the database.
-> Extract all MEMIDNUM values for the specific source system.
-> For each MEMIDNUM, trigger the mpidrop command to delete the record.

db2 +O "connect to <MDMdatabase> user <userid> using <password>"
db2 +O "SET SCHEMA <DB SCHEMA NAME>"
db2 -x "select '<Source System Code>:'||MEMIDNUM from mpi_memhead where srcrecno = <Source id>" | xargs -r -n 1 $MAD_HOMEDIR/bin/mpidrop -conn 'http|<hostname>|<port>|en' -usrLogin <server user id> -usrPass <server password> -rec
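
For repeated use, the same three steps can be wrapped in a small script. The sketch below is only an illustration: the variable names are hypothetical, the angle-bracket placeholders must be replaced with environment-specific values, and MAD_HOMEDIR is assumed to already be set.

#!/bin/sh
# Hypothetical wrapper around the commands above; replace all placeholders.
# MAD_HOMEDIR is assumed to be exported already (MDM configuration project home).
MDM_DB=<MDMdatabase>            # MDM database alias
MDM_SCHEMA=<DB SCHEMA NAME>     # MDM database schema
SRC_CODE=<Source System Code>   # source system code expected by mpidrop
SRC_RECNO=<Source id>           # srcrecno of the source system in mpi_memhead
MDM_CONN='http|<hostname>|<port>|en'

# 1. Connect to the MDM database and set the schema
db2 +O "connect to $MDM_DB user <userid> using <password>" || exit 1
db2 +O "SET SCHEMA $MDM_SCHEMA"

# 2. Extract all MEMIDNUM values for the source system and
# 3. drop each one through mpidrop
db2 -x "select '${SRC_CODE}:'||MEMIDNUM from mpi_memhead where srcrecno = ${SRC_RECNO}" \
  | xargs -r -n 1 $MAD_HOMEDIR/bin/mpidrop -conn "$MDM_CONN" \
      -usrLogin <server user id> -usrPass <server password> -rec

db2 +O "connect reset"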

IBM InfoSphere Master Data Management incremental load script

IBM Information Management includes Master Data Management as part of its suite, originally inherited from the acquisition of Initiate Systems. The product has evolved considerably over time and today provides far more integration capabilities with the rest of the IBM product suite.

This article describes one aspect of the IBM Master Data Management tool that is not documented on the IBM website.

Problem Statement:

The initial load of data into the MDM server can be done using various methods, such as job sets or web services. The problem arises when an additional data source must be added without impacting existing enterprise IDs in a virtual implementation style. Although IBM documentation says this can be done in IXM mode, very little documentation is provided on how to implement the solution.

The script below loads data in the background using IXM mode without impacting existing enterprise IDs.

Solution Script & Explanation.

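# --- Database client settings (values are deployment specific) ---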
export DATABASE_HOME=<Path of Bin directory for your specific database implementation>
export DATABASE_NAME=<Database Instance Name>
export DATABASE_HOST=<Database host name for MDM>
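# --- WebSphere Application Server topology ---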
export WAS_CELL=<WAS cell defined during implementation>
export WAS_PROFILE=<WAS profile defined during implementation>
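# --- MDM engine database connection settings ---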
export MAD_DBNAME=$DATABASE_NAME
export MAD_DBUSER=<DB username>
export MAD_DBPASS=<DB password>
export MAD_DSNNAME=<DSN name provided in the MDM configuration file>
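# --- WebSphere and Java runtime settings and engine flags ---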
export WAS_HOME=/opt/IBM/WebSphere/AppServer
export JAVA_HOME=$WAS_HOME/java/jre
export WAS_SECURITY=1
export MAD_TIMER=1
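# --- MDM installation, working and log directories, connection string and engine paths ---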
export MAD_ROOTDIR=<MDM Application Root Directory>
export MAD_HOMEDIR=$MAD_ROOTDIR/<Directory path for Configuration Project>
export MAD_WORKDIR=$MAD_HOMEDIR/work
export MAD_UNLDIR=$MAD_WORKDIR/unl
export MAD_BXMDIR=$MAD_WORKDIR/bxm
export MAD_DDLFILE=$MAD_HOMEDIR/sql/mpihub.ddl
export MAD_STOFILE=$MAD_HOMEDIR/sql/mpihub.sto
export MAD_LOGDIR=$MAD_WORKDIR
export MAD_CONNSTR="DSN=$MAD_DSNNAME;UID=$MAD_DBUSER;PWD=$MAD_DBPASS;"
export MAD_CTXLIB=ODBC
export MAD_DBTYPE=DB2
export PATH=$PATH:$MAD_ROOTDIR/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$MAD_ROOTDIR/lib
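# --- WebSphere thin-client JARs and properties required on the classpath ---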
export CLASSPATH=$JAVA_HOME:$CLASSPATH
export CLASSPATH=$WAS_HOME/runtimes/com.ibm.ws.admin.client_8.5.0.jar:$WAS_HOME/runtimes/com.ibm.ws.ejb.thinclient_8.5.0.jar:$CLASSPATH
export CLASSPATH=$WAS_HOME/lib/j2ee.jar:$CLASSPATH
export CLASSPATH=$WAS_HOME/lib/naming.jar:$CLASSPATH
export CLASSPATH=$WAS_HOME/lib/ras.jar:$CLASSPATH
export CLASSPATH=$WAS_HOME/lib/wsexception.jar:$CLASSPATH
export CLASSPATH=$WAS_HOME/lib/bootstrap.jar:$CLASSPATH
export CLASSPATH=$WAS_HOME/lib/emf.jar:$CLASSPATH
export CLASSPATH=$WAS_HOME/lib/ecutils.jar:$CLASSPATH
export CLASSPATH=$WAS_HOME/lib/iwsorb.jar:$CLASSPATH
export CLASSPATH=$WAS_HOME/lib/idl.jar:$CLASSPATH
export CLASSPATH=$WAS_HOME/lib/ffdc.jar:$CLASSPATH
export CLASSPATH=$WAS_HOME/lib/namingclient.jar:$CLASSPATH
export CLASSPATH=$WAS_HOME/lib/ejbcontainer.jar:$CLASSPATH
export CLASSPATH=$WAS_HOME/properties:$CLASSPATH
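# --- Run the MDM cross-match utilities in IXM mode for the new source (srcRecno 292) ---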
mpxprep -entType mdmper -bxmOutDir $MAD_HOMEDIR/work/dev_config/work/bxm -ixmMode -nbktparts 1  -srcRecno 292
mpxrule -entType mdmper -bxmOutDir $MAD_HOMEDIR/work/dev_config/work/bxm
mpxxeia -entType mdmper -bxmOutDir $MAD_HOMEDIR/work/dev_config/work/bxm
mpxxtsk -entType mdmper -bxmOutDir $MAD_HOMEDIR/work/dev_config/work/bxm
mpxcomp -entType mdmper -bxmInpDir $MAD_HOMEDIR/work/dev_config/work/bxm -bxmOutDir $MAD_HOMEDIR/work/dev_config/work/bxm -ixmMode -nbktParts 1 -nthreads 2

Tuesday, October 6, 2015

Australian Address Quality Framework


The need to provide quality data within any organization is becoming mandatory for analytics and reporting. An organization with high-quality data can gain better insights and find previously unknown location-specific patterns that can increase revenue.
But the question is how to implement such a solution, and an organization needs to make well-informed decisions when implementing data quality solutions. This article provides an outline of an implementation methodology to either improve existing data quality or capture quality data on an ongoing basis. It only covers the methodology for ADDRESS data quality, but the approach can be extended to any data point.


The approach below is drawn from my experience with an Australian-based organization, but it can be extended to any organization around the globe.


Any organization that would like to improve the quality of its Australian address data can follow the approach below.


Step 1:


Identify all the systems within the organization where address data is captured (manual or electronic).


Step 2:


Australia Post data is the most trusted and accurate reference for verifying address quality.
Engage with third-party vendors to extract address-specific data (AMAS, the Address Matching Approval System).
Raw address file providers:
http://auspost.com.au/business-solutions/raw-address-data.html


Step 3:


Define the match strategy required from the solution.


Ex:
  1. Find only exact matches.
  2. Find exact matches plus relevant matches where an alternate street name exists.
  3. Find all exact matches and map possible values for incorrect matches, and so on.
  4. Fuzzy matching vs. non-fuzzy matching.


Step 4:


Define the data testing process (availability of address match data from the third party, availability of current system data, data stewards' time to verify and approve matched data, and so on).


Step 5:


Consider the software options for matching data (ETL tools vs. custom-coded tools, and so on).


Step 6:


Define the data quality process within the organization. For example:
1. Who is responsible for certifying correct data?
2. Who will take corrective action when incorrect data is identified?
3. How will the action be taken (using a GUI, a manual process, and so on)?


All these factors feed into the solution design, so take care to answer these questions correctly.


Step 7:


Design a solution that satisfies the decisions made in Steps 3, 4, 5 and 6.


Detailed Match process Hierarchy


1. Standardize the source address to its lowest level. All Australian addresses follow the hierarchy below. (Different ETL tools can help lower the development cost; I would prefer IBM InfoSphere QualityStage.)


- Flat/Unit Type
- Flat/Unit Number
- Floor Level Type
- Floor Level Number
- Building Property Name 1
- Building Property Name 2
- Lot Number
- House Number 1
- House Number 1 Suffix
- House Number 2
- House Number 2 Suffix
- Postal Delivery Number
- Postal Delivery Number Prefix
- Postal Delivery Number Suffix
- Street Name
- Street Type
- Street Suffix
- Postal Delivery Type
- Locality
- State
- Postcode


2. Perform an exact match against the supplier-provided reference data (see the sketch after this list).
3. Perform a fuzzy match (misspelt data, abbreviated data, and so on).
4. Transform unmatched records using alternate street names.
5. Repeat steps 2 and 3 with the alternate street names.
6. Match using bordering localities.
7. Match using any specific rules defined by the business (tie-breaking rules).
8. Assign the DPID (Delivery Point Identifier) for each match.
9. Take the appropriate action as per Step 3 (the match strategy).
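
To make the exact-match step concrete, the sketch below shows one minimal way of performing it with standard Unix tools. It is only an illustration: the file names and layouts are assumptions (pipe-delimited files whose first field is a single standardized, upper-cased address key), and a real implementation would typically use QualityStage match stages instead.

#!/bin/sh
# Hypothetical exact-match sketch; file names and layouts are assumptions.
# Both inputs are pipe-delimited, with a standardized address key in field 1.
SRC=standardized_source_addresses.psv   # key|source_record_id
REF=reference_address_extract.psv       # key|DPID

# join requires both inputs to be sorted on the join key
sort -t'|' -k1,1 "$SRC" > src.sorted
sort -t'|' -k1,1 "$REF" > ref.sorted

# Exact matches: source_record_id|DPID for every source key found in the reference data
join -t'|' -1 1 -2 1 -o 1.2,2.2 src.sorted ref.sorted > exact_matches.psv

# Source records with no exact match are carried forward to the fuzzy-match step
join -t'|' -v 1 src.sorted ref.sorted > unmatched.psv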


FOR ANY CONSULTATION SERVICES YOU CAN CONTACT ME @ cevvavijay@gmail.com