Search This Blog

Thursday, 31 March 2016

IBM Infosphere Datastage - Performance Tuning

Performance Tuning will have special considerations when tuning the overall system versus individual job. The approach we need to take to tune individual job will be different to tuning overall batch/application.
When we need to tune individual job,
1.     We need to first isolate the performance issue i.e. which stage is running longer and what operation within the stage is slowing down the job. In order to identify stage impacting the performance, we could use one of following techniques
·        Use “Record Job Performance data” option. This can be checked in job properties (under execution tab) or while running the job from designer , in the job run options and general tab. Then we can analyze this performance data from designer “Fileà Performance Analysis” or from director “Jobà Performance Analysis”. The location of the performance files is defined in the environment variable APT_PERFORMANCE_DATA
·        Use the job monitor function from Director. The job monitor provides job's performance data at a moment of execution. Note: Automatic insertion of sort or similar operations are not considered in this performance data.
·        Set the APT_PM_PLAYER_TIMING environment variable, this will give each operation and CPU time by operator in the job log. This information could be used to identify the stage slowing down the job

·       We can also use Performance Tab in the job details in the Operations Console to view the performance data

2.     Once we know which staging is running longer, we need to investigate further into details by using score dump (APT_DUMP_SCORE) to analyze number of operators, parallelism for each stage and take stage specific performance improvement action. Simple example is, Stage might be running sequentially, we could change this to run parallel to improve the performance.
When we need to tune overall batch/application,
3.      When overall batch is running slow, this mean we either have serious design issues or CPU/Memory/Configuration issue
4.      We need to first check if CPU is utilized near to 100%.  Ideally a good job flow should consume as much CPU as possible. We can check CPU and Memory usage in the IBM InfoSphere Information Server Operation Console à Dashboard or Activity/Resources. If CPU is not utilitsed as much available, this indicate bottlenecks are elsewhere may be at I/O.. so we could increase number of nodes in configuration files and increate number of partitions to increase CPU load.
5.      Unix command IOSTAT could be used to understand the CPU utilization and through put of disk devices. If some disks are used too much and some are not used at all, then we could as unutilized disks as “resource disks” in the configuration file, this would distribute the i/o operations across the disks, could reduce the network latency
6.      If both CPU and Disk resources are properly utilized and no further scope for improvement, then we could manage work load using “workload management” in the IBM InfoSphere Information Server Operation console, to run only certain number of jobs at any given point in time by queuing the jobs (adjust Maximum running jobs in WLM)
7.      Balanced optimization could be used to push the load down to DBMS database servers, this will reduce workload on datastage server improving overall performance
8.      We could Segregate jobs as priority vrs non-priority, and use different configuration files for priority jobs to consume maximum resources and for non-priority jobs to consume less resources and let these non-priority jobs run longer
9.      Despite all the above efforts or increasing the number of nodes by extending resources, if the jobs execution time is not reduced then it must be serious design issue in multiple jobs. In this case we need to tune the jobs or redesign the jobs.

Wednesday, 30 March 2016

IBM Infosphere DataStage - Hadoop Connectivity Configuration


BigData File Stage or BDFS stage was introduced as part of IBM Infosphere DataStage V8.7

In Version 9.1, Balanced Optimization features of InfoSphere DataStage are extended to Hadoop BDFS also. I.e. from V9.1 onwards, we can push sets of data integration processing and related data I/O into a Hadoop cluster.
 
In Version 11.3, the Big Data File stage compatibility is extended to Hortonworks 2.1, Cloudera 4.5, and InfoSphere BigInsights 3.0. 
 
 
In Version 11.5, all features to assess and monitor the quality of your data are now supported executing directly inside a Hadoop cluster. You can run the same profiling and data rules inside or outside your Hadoop environment,
 
 
Before using BDFS Stage, we need configure the Hadoop connectivity as below

  •  The shared library libhdfs.so should have correct level of access so that BDFS stage can access this library

  • For the correct functioning of the libhdfs.so library, we need to add the HDFS .jar files path to CLASSPATH variables

  • add HDFS configuration file hdfs-site.xml  path to CLASSPATH variable

  • add the directory that contains the libhdfs.so file to the LD_LIBRARY_PATH variable

  • Add the JDK bin directory to the PATH variable

  • add fs.hdfs.impl property in the core-site.xml file if this property not present.

  • <property>
        <name>fs.hdfs.impl</name>
        <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
    </property>

  • Set environment variable HADOOP_CONF_DIR pointing to Hadoop configuration files.
 

What is data governance?

 
Data governance is nothing but controlling how data is created, transferred, accessed while ensuring data completeness, consistency, integrity and security.

Data governance board will have different roles who work with predefined industry standard policies and procedures.

Some companies makes it mandatory to get the data governance broad approvals every time

·        New data elements/attributes created (To ensure, metadata standards are followed and these new attributes actually new not duplicates some other attributes , new data access is restricted only to people who need it etc.)

·        New data exchange or new interface to be implemented (to ensure data transfer is in compliance with regulatory and company policies)

·        New application development (all the data models should be approved by data governance board)

·        Decommission of existing application (To ensure, important information is not lost)

·        Using production data for testing (to ensure, personal information attributes are properly masked)

Data Governance board will have following roles

·        DBAs

o   Provide guidance on creating new data, data access control, data modeling, metadata creation etc.

·        Data Stewards (data owners)

o   Ensures data creating and management is in accordance to the business needs. Highlights critical attributes where more attention required. Creates business glossary, reviews the data quality score cards produced by data quality analysts, etc..

·        Data Quality Analysts

o   Do the data profiling, highlight data quality issues and record DQs in the backlog and work with data stewards to priorities the DQs.. Create data quality score cards, data cleansing rules and business transformation rules etc..

·        Data integration Lead

o   Data Integration lead works with Data Quality Analysts, DBAs and sometimes  with data stewards in order to translate the data integration, data interfacing and data validation business requirements to technical specification

o   Investigate and resolve cross application/cross functional data integration issues

o   Develop solutions based on data analysis to increase business efficiency.

o   Develop integration strategy (ex. ELT or ETL)

o   Normally an application/solution architect or a technical lead will play as data integration lead role in data governance board and is responsible for implementing the business requirement with scalability, portability, security, reusability as basis for technical solution.

·        Leads managing the  Data Governance board

o   Organize frequent meetings with team and monitor the work

Monday, 28 March 2016

IBM Infosphere Datastage Merge Stage is same as Oracle Merge SQL statement

 

If we know how Oracle merge statement works, its easy to work with Datastage merge as both do similar things though not exactly same.

If we want to update/insert records to table A based on selection criteria from multiple tables its better to write ORACLE MERGE SQL instead of individual UPDATE/INSERT statements. But this may have performance impact if we are writing ORACLE Merge.

On other hand, DataStage MERGE stage will only do the update.

MERGE works with the concept that we have 1 master table and we want to update master table based on the selection criteria from update tables using key fields.

Below table compares ORACLE MERGE and DataStage MERGE assuming DataStage merge is processing tables (DataStage Merge could also process files, datasets)
 

 ORACLE MERGEDATASTAGE MERGE
Master TableCan have only one master tableCan have only one master table
Update TableCan have many update tablesCan have many update tables
Output record countORACLE MERGE can do INSERT, UPDATE and DELETE.. Therefore record count could increate, decrease or same after merge operationwill remain same if unmatched master records are kept otherwise will be less if unmatched master records are dropped
Output column countOracle MERGE can not add columns to master table therefore column count will remain unchanged after merge operation. Note: both master table and target tables are same in ORACLE Mergemaster and target tables are different therefore new columns from update tables can be added
Can we capture unmatched update tables recordsNoYes, using the reject links
What happens to records unmatched in master tableswill remain unchanged and delivered to targetcan be kept in master or dropped from master table
When matchedwe can perform insert, update or deleteonly update
when unmatchedwe can perform insert, update or deletewe can only drop or keep records from master table
deterministic?cannot update the same row of the target table multiple times in the same MERGE statementSame row can be updated multiple time is master record is matched with different update table records
Duplicates?No, if duplicate records in master or update tables present for the match fields (key fields) then oracle thinks its updating same record again and again so it will failNo, duplicates must be removed from master and update tables before processing merge

Sunday, 27 March 2016

Memory Structures


Buffer is a reserved area of RAM. This area is used to pre-fetch or preload data while processor busy with current data so that processor could process pre-fetched data immediately after current data process completed which would avoid lag caused by I/O operation. This process is called buffering.

Virtual memory is a concept to compensate insufficient RAM for running a process. There is a temporary space on hard disk called paging file, this will be used as virtual memory. When RAM is not sufficient to run a process, then virtual memory management of processor keeps data in paging file and treats paging files as if its a RAM.

Paging
Its good to have all the data required to execute a process in RAM, but this is not mandatory. What is mandatory is to have only the data required to execute a task within a process of multiple tasks. Therefore processor will try to get all the data required to execute a process from disk and if there is no enough RAM then process will get the data required to process a task and then again retrieve the data to process another task.. this will increase number of I/Os and impact performance but the process would not be failed. What if there is no enough RAM to process even a task then it will use the virtual memory.. i.e. it will keep all the required data for a task in a file called paging file in hard disk and split the data into equal chunks called pages, then get a page, process, flush and get another page... this is called Paging...  Then the question is how would this not impact performance.. of course it would not be as fast as having all the data in RAM but still its better than searing for data location in every I/O because processor now has to look in only one specific location called paging file.
 

Different types of SQL statements

 
DDL -  Data Definition Language
These are the statements that are used to define the database structure or schema
i.e. CREATE, ALTER, DROP, TRUNCATE, COMMENT, RENAME

DML - Data Manipulation Language 
These are the statements are used for managing data within schema objects.
i.e. SELECT , INSERT, DELETE , UPDATE, MERGE

DCL -  Data Control Language
These are GRANT and REVOKE

TCL - Transaction Control Language
These are COMMIT, ROLLBACK,  COMMIT, SAVEPOINT, SET TRANSACTION

SCL - Session Control Language
i.e ALTER SESSION statements

Oracle: Benefits of Packages over standalone functions and procedures (Procedures vs packages)

Oracle: Benefits of Packages over standalone functions and procedures (Procedures vs packages)
Most us know oracle packages as convenient way of grouping functions and procedures .i.e. having all at one place. There oracle packages offer more than this

Overloading: We have multiple functions/procedures with same name..  during the execution, based on number of parameters and data types will determine the correct version to be invoked.
This way, every time we need additional functionality on top of existing, we do not need to modify existing functions/procedures.. we can simply create new without impacting existing applications...

Scope of Variable :  We can declare a variable in package specification .. i.e. global variable... to have the variable accessible during the session in order to pass info between the packages/procedures/functions during a session. or declare with in body to be private to the package to be accessible only by the package, not to be accessible to other packages in the same session.

Global and Private variables are persistent between multiple calls of package during the session.. but if want it to reset for every call then we need have PRAGMA SERIALLY_REUSABLE; in both specification and body.. Note: PRAGMA is compiler directive .

Initialization: Package Body can have BEGIN and END; (not a function or procedure)  block. This is executed only once , very first time any one of procedure or function referenced from package and information retrieved/set by this pl/sql block will be available for rest of the session.

Information Hiding: Only the package specification is visible to people who have execute privilege on the package. We can also control only specific procedures or functions are accessible by creating other procedures calling specific package procedures.. Therefore packages provide very good encapsulation technique.

Insulation from dependency crisis:   If we are creating standalone procedures and are interlinked (called one from other).. a change in one procedure will invalidate all the parent procedures referencing changed procedure. Though oracle will automatically recompile when the invalid procedure called, it will be costly in terms of resources and performance. This problem is not there with packages as any parent procedure/function calling one of package procedure/function is only dependent on package specification and do not require recompilation when package body change.

Note: All the dependencies are present in the system table user_dependencies

Enabling recursion: In the standalone procedures or functions, the recursion is not allowed example.. have to procedures X, Y... calling Y from X and also calling X from Y is not allowed.. the procedures would not compile at all.. but if same procedures are created in Packages then its allowed.


When Not to use packages

Packages are not good fit to create functions for function-based index. Because of isolation from dependency chain, the function body changes will not disable or enforce rebuild for function-based index which could yield wrong results. Therefore its always better to use standalone function for function-based index.

 











 

Friday, 25 March 2016

Project Management Metrics in non project manger terminology

Project Management Metrics in non project manger terminology:
Every work needs to be tracked , measured to see where we are and what are the next actions required to meet the target/goal. Its same in IT Project Management as well. That's why projects will have metrics, milestones and critical paths.

Metrics are nothing but measurements or facts. They will tell us if we are on track or behind schedule or ahead of schedule or spending the money more than budgeted or have sufficient number of resources etc..

The more important measurements are Schedule Variable and Cost Variance. As the name suggest, these tell us the variance i.e. difference.. If these values are zero that means there is no difference or variance therefore the project is on track... But how do we measure these? and at what point we measure them?

More important point is at what point we measure them.. if we measure them at the end of the project, do we have any value of these.. no right.. because its already too late. Therefore we have to measure them in very frequent intervals.. or at every project milestone at least.. then look back what went right or wrong and forecast if we are going to complete the project in budget and in time ... otherwise take necessary actions...

No how do we measure them...

Lets assume that we have project estimated as 1000 man days... in 100 days.. i.e. we need 10 employees to complete 1000 days work in 100 days.
Lets assume again that we check every 10 days to see if we are on track.

So after 10 days, we checked. By this time, we should have completed 1/10 of the work right.. but we completed 2/10 of work... this is obviously mean we are ahead.. because we completed more that we thought we will complete.. with this phase we would complete total work in 50 days with 10 employees .. I,e, total cost of 500 days...  because we assumed simple number, the calculation is so easy but can we follow same process for complex task.. no right.. so need a help of mathematics... so lets put this into project management metrics terminology..

Our total estimation = 1000 man days .. this is called total planned value or Budget At Complete, known as BAC

Our schedule is 10 days

at our first check point, (i.e. after 10 days).. we planned to completed 1/10th of work... so this is called Planned value at this check point.. known as PV.  But
 actually we completed 2/10th of work... because this is already completed , we earned this... so called as earned value... know as EV...

Now in terms of the cost, the planned cost is 10*10 (10 employees times 10 days)  = 100 days.. But we completed 200days worth... i.e. we spent only 50 days to complete 100 days worth.. so the actual cost is only 50 days for the planned value of 1/10th of work.. so the actual cost 50 days also know as AC.

              Note: All AC, EV and PV are calculated as number of hours.

Now, how do we calculate schedule variance.. we planned to complete 1/10th worth but completed 2/10th.. i.e. 2/10th - 1/10th = 1/10th variance.. because we are ahead of schedule, the difference is positive... so that formula is SV =  EV - PV

Similarly, cost variance CV = EV - AC








 

Thursday, 24 March 2016

Oracle : Aggregate cum Analytical Functions

 

I always remembered Max, Min, Avg, Sum, Count as aggregate functions. But actually these are also analytical functions

analytical functions will have following syntax

{analytical function} (arguments) OVER (analytical clause)

Aggregate functions with OVER clause will become analytical functions.. could become more useful thank they already are.

for example AVG could be used as
SELECT AVG(salary) "Average"
  FROM employees;
also as

SELECT manager_id, last_name, hire_date, salary,
       AVG(salary) OVER (PARTITION BY manager_id ORDER BY hire_date
  ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS c_mavg
  FROM employees
  ORDER BY manager_id, hire_date, salary;

Some of the analytical functions used in my experience are

  1. LEAD  - provides access to a row at a given physical offset beyond that position.. for example, we can use this function to display current row value and next row value of same field in single row

  2. LAG -- Similar to LEAD, but gives access to previous row information

  3. ROW_NUMBER

  4. RANK

  5. DENSE_RANK


 

IBM Infosphere Datastage - Processing Non-ASCII Character set between Oracle and Datastage

 
In order to process non-ASCII character set, we need to understand how different technologies behave when a character set is different between source and target.

For example, if we are receiving a file to Unix with characters like ‎, then file must be encoded in UTF or ISO8859-15 (Latin-9) or any other character set in which euro character is valid but it can not be ASCII encoded as euro is not part of basic 8bit ASCII.

Tip: in Unix, Use file command to see the file encoding and using iconv to convert from one to another encoding, and use locale -l to list all installed character sets

If we are creating a file with euro using VI editor in Unix then Unix locale must be set to ISO8859-15 before creating the file otherwise euro symbol might get corrupted in the file.

Unix LANG:  This is the locate environment language/character set setting. Many applications running from unix rely on LANG to assume their locale environment in absence of explicit character code encoding. This is same for Datastage also. If there is no NLS map mentioned in datastage for a job/stage, then unix LANG will be used as default.

Unix NLS_LANG: Oracle database use this environment variable to determine the client character code set.

More details about Oracle NLS_LANG can be found in below link


 Oracle converts the characters data if Oracle Character set and client character set is not same.  Similarly datastage. But Netezza will never convert the data, it will always pass Latin 9 and assumes that it always receives Latin 9 . 

Therefore we need to make sure the character encoding is appropriate depending on the interface we are working on.

For example,
Oracle NLS_CHARACTERSET = UTF8
Unix NLS_LANG = "" (null)

Then if we are loading oracle varchar field with text containing euro using sqlldr, then euro in file will be written as ¿ to oracle field. Because oracle tries to convert text from default ASCII in unix  to UTF8 but euro symbol is not part of ASCII character set. To resolve this issue, we need to set NLS_LANG=AMERICAN_AMERICA.WE8ISO8859P15



 

IBM Infosphere Datastage : Exception Handler Activity in Sequence Jobs


When does the Exception Handler Activity gets executed?
    There is a possibility of misunderstanding that Exception Handler handles Job failures and job exceptions.
    Exception Handler does not handle the job failures and job exceptions. Job failures and job exceptions should be handled with trigger "Failed - (Conditional)"
    Exception Handler gets triggered only when any of the activities in sequence job failed to run. For example, if Job Activity is trying to execute a job that is not yet compiled. Then Job Activity would not be able to trigger the job , therefore will raise an exception.
   Similarly, when  Execute Command trying to execute a invalid command, then execute command activity will raise an exception that will trigger Exception Handler activity.

IBM Infosphere Datastage : Unlock the Job Locked by user

 
When we disconnect the designer abnormally, some times we will get an error "job is locked by another user" while trying to open the job.

Use cleanup_abandoned_locks utility to clear any abandoned locks. The cleanup_abandoned_locks utility deletes session locks from the Information Server repository that were left over from some usage of an Information Server suite application (e.g. DataStage).

Login to domain layer using either root or Administrator user (for Windows)

cd /opt/IBM/InformationServer/ASBServer/bin

./cleanup_abandoned_locks.sh (on Unix/Linux)
./cleanup_abandoned_locks.bat (on Windows)

usage: cleanup_abandoned_locks
-P,--password Password
-U,--user User name
-h,--help Print this message.
 

for this Issue, full resolution steps are documented in below link
 

IBM Infosphere Datastage - Working with Job Status and User Status

 
In Datastage, every job can have Job Status and User Status.  We can query Job Status but user can not set the job status to custom string. The user status in other hand, user have to set and then can be queried in the sequence jobs. 
In the following link, Timm Ruger provided very good information about both UserStatus and JobStatus

The user status can be set using two ways
1. Using the BASIC transformer
2. Writing BASIC functions in the Job Control

Call DSSetUserStatus("sales job done")
Call DSLogInfo("JobControl Executed","JobControl")

As you can see in the about screen shot, I have also added a info message to log in order to test when do this Job Control get executed (Before the job itself or after). as a result I found that the Job Control is executed before the job itself

Once the UserStatus is set it can be retrieve from the Sequence jobs. We can control the branching using this userstatus in the triggers tab for job activity only.  or can retrieve this user status from the next activities (this is one of the activity parameter , the other two are $JobStatus and $JobName)