Tuesday, October 13, 2015

Generic ETL DataStage Framework for loading any source file

This article describes an ETL framework that can load any source file without having to create a new set of ETL jobs for each file. The jobs are designed using IBM InfoSphere DataStage v8.5 or later.

The process of adding a new source file follows a standard workflow:

- Import file metadata through DataStage Import Table Definition/Sequential File Definitions

- Create the OSH schema definition

- Create and execute the target table DDL

- Import table metadata through DataStage shared table definition

- Place source and schema files in server directories

Import file metadata through DataStage Import Table Definition/Sequential File Definitions

Use DataStage import table definition to get file metadata.

Define the file column names, data types, and nullability.

Edit file specifications in the file definition properties.

In the Columns tab, make sure to remove all field-level properties such as null_field=''.

In the Parallel tab, edit the properties as required for the input data.

Create the OSH schema definition
In the table definition properties for the file, open the Layout tab and select Parallel. The layout displayed is the OSH schema definition.

Copy and paste this definition into a file and save it as "SourceFileName.osh". The resulting definition may need further editing.

As a general rule, consider the following when creating an OSH schema definition file:
- For delimited files, remove all "{prefix = 2}" field properties
- Carefully specify timestamp and date format strings (for example, timestamp_format='%dd %mmm %yyyy %hh:%nn:%ss')
- Even where the length of file columns is variable, specify a maximum column length, as it is used to define the target table structure. Specifying column lengths also means the actual lengths are validated during file import, avoiding truncation of data
- Match schema data types to the corresponding Vertica database types (see the target table creation section for more on this)

It is good practice to test schema file definitions before using them in the actual jobs; after creating the schema file, this can be done through "View Data" in a Sequential File stage. A before-and-after sketch of a typical edit follows.
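For illustration, here is a minimal sketch of these edits on a hypothetical three-column delimited file (the field names, delimiter, and formats are assumptions, not from an actual source). An imported definition might initially look like this:

record
{final_delim=end, delim='|', null_field='', quote=none}
(
CUSTOMER_ID:int64 {prefix = 2};
CUSTOMER_NAME:nullable string {prefix = 2, null_field=''};
CREATED_TS:nullable timestamp;
)

After applying the guidelines above, the field-level properties are removed, a maximum string length is set, and the timestamp format is pinned to match the incoming data:

record
{final_delim=end, delim='|', null_field='', quote=none, timestamp_format='%yyyy-%mm-%dd %hh:%nn:%ss'}
(
CUSTOMER_ID:int64;
CUSTOMER_NAME:nullable string[max=100];
CREATED_TS:nullable timestamp;
)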
Create and execute the target table DDL
For the target table, create the structure according to the source schema definition. Additionally, add the CDC control columns to the table structure as shown below (in this example, NK, EFF_TS, END_TS, EFF_RUN_ID and END_RUN_ID, which do not come from the source schema).
OSH schema:
record
{final_delim=end, delim='\t', null_field='', quote=none, timestamp_format='%dd %mmm %yyyy %hh:%nn:%ss'}
(
RecType:string[1];
RDM_PUBLICATION_ID:int64;
RDM_GNRA_PUBLICATION_ID:nullable int64;
NM_RDM_PUBDIVNNAME:nullable string[max=50];
NM_RDM_PUBREGNNAME:nullable string[max=50];
NM_RDM_PUBOFFICENAME:nullable string[max=50];
NM_RDM_MASTHEADNAME:nullable string[max=50];
NM_RDM_GNRA_PUBNAME:nullable string[max=50];
FL_RDM_PUB_ACTIVE:nullable string[1];
DT_RDM_PUB_CREATEDDATE:nullable timestamp;
NM_RDM_PUB_PROPRIETOR:nullable string[max=50];
)
Target table DDL:
CREATE TABLE sdm_gnra_test.STG_RDMS_GNRA_PUBLICATION_TEST
(
RecType char(1) NOT NULL,
RDM_PUBLICATION_ID int NOT NULL,
NK varchar(255) NOT NULL,
RDM_GNRA_PUBLICATION_ID int,
NM_RDM_PUBDIVNNAME varchar(50),
NM_RDM_PUBREGNNAME varchar(50),
NM_RDM_PUBOFFICENAME varchar(50),
NM_RDM_MASTHEADNAME varchar(50),
NM_RDM_GNRA_PUBNAME varchar(50),
FL_RDM_PUB_ACTIVE char(1),
DT_RDM_PUB_CREATEDDATE timestamp(0),
NM_RDM_PUB_PROPRIETOR varchar(50),
EFF_TS timestamp,
END_TS timestamp,
EFF_RUN_ID int,
END_RUN_ID int
);
Matching common OSH schema data types to Vertica database data types:

OSH data type                        Vertica data type
int64                                INTEGER, INT, BIGINT, INT8, SMALLINT, TINYINT
string[max=length]                   VARCHAR(length)
string[length]                       CHAR(length)
timestamp / timestamp[precision]     TIMESTAMP(0) / TIMESTAMP(precision)
...                                  ...
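For example, under these mappings a hypothetical field declared in the OSH schema as

EVENT_TS:nullable timestamp[3];

would be created in Vertica as

EVENT_TS timestamp(3)

while a fixed-length field such as string[10] maps to char(10) rather than varchar(10). (The EVENT_TS name is illustrative only.)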
Import table metadata through DataStage shared table definition