Search This Blog

Wednesday 6 April 2016

IBM Infosphere QualityStage simplified


IBM Infosphere server quality stage is part of information server suit and its available with additional license.  Buying QualityStage will enable us to use more stages (plus match specification and rule designer, for use in these stages).. that’s it.. Everything else is same and part of DataStage..
Do not confuse with Information Analyzer and Data Rules stage is not part of QualityStage, its part of Information Analyzer.

What are the additional stages comes with QualityStage

1.      Investigate

2.      Standardize

3.      One-source Match

4.      Two-source Match

5.      Match Frequency

6.      Survive



 

If we understand steps in the data quality assurance process, then understanding these stages will be easy. Data quality process steps and corresponding QualityStage Stages are


While working on any DataQuality stages, its important to understand single domain column and free form columns

Single Domain columns represent one specific business attribute… for example first name or customer id…

Free Form columns contains free text, combination of many attributes.. for example name, it could contain first name and last name or first, last and middle name… similarly address, it could contain only the house number and street name or entire address along with post code.

Analyzing single domain columns is easy because we upfront know what data the field contains. But analyzing free form fields is not that easy because we need to first set some rules expecting what free form field might contains..  i.e. to analyses the free form fields, we need to define set of rules (using rule designer).  The rules work on the fact that free form field is nothing but bunch of tokens and then we map each token to a pattern, then decide output action when token is as expected (i.e. conformed to the pattern)

To standardise single domain columns, we may need rule set in some cases for example if we want to lookup correct value or spell correct or change description to acronym or we could just use modify, transformer stage to deal with nulls or change the format etc...  

DataQuality Standardise stage normally used to standardise the free form column to split the free form columns into single domain columns. We need rule sets for this purpose.

Following quality stage use the Rule sets

·        Investigate

·        Standardize

·        Multinational Standardize (MNS)

Rule set contains

·        Classifications

·        Output Columns

·        Rules

o   Rules basically define the input patterns and the actions when pattern matches. As part of this pattern matching we could use pattern specification, classifications tables, look up tables

o   We can specify what action to be taken when pattern matches and what output to be written to output columns

 

When it comes to matching process,

        Matching is nothing but comparing two records and checking if both records are duplicates are not. Then we can use survive stage to keep one record and drop other record in the duplicate pair.

Match stages take match specification as input and match specification could be created using match specification designer.

Match specification tells,

·        On which keys we need to match

·        What are group by columns in order to divide total records into blocks to operate on blocks first instead of operating on entire data set. Dividing entire data into blocks in very important. Because we are not trying to identify only 100% matched records. Two records match by 20% may be good enough for us to decide these records may be duplicates and need to be reviewed by someone before confirming these are duplicates. So we say 20% match are clerical match type. Therefore in order to identify which one is matching which, we need to first divide data into blocks where we are likely to find the duplicates

·        How many passes we need to run the match process before finalizing the match weightage?

More details about quality stage could be found at below links



 

3 comments:

  1. I really appreciate information shared above. It’s of great help. If someone want to learn Online (Virtual) instructor lead live training in IBM Information Analyzer.kindly contact us http://www.maxmunus.com/contact
    MaxMunus Offer World Class Virtual Instructor led training on IBM Information Analyzer. We have industry expert trainer. We provide Training Material and Software Support. MaxMunus has successfully conducted 100000+ trainings in India, USA, UK, Australlia, Switzerland, Qatar, Saudi Arabia, Bangladesh, Bahrain and UAE etc.

    For Free Demo Contact us:
    Name : Arunkumar U
    Email : arun@maxmunus.com
    Skype id: training_maxmunus
    Contact No.-+91-9738507310
    Company Website –http://www.maxmunus.com


    ReplyDelete
  2. Very simple but, clear to grab details quickly about Quality Stage...Thank you

    ReplyDelete