
Friday 1 April 2016

IBM InfoSphere DataStage – Job Design Considerations for Better Performance


1.      When data needs to be exchanged between parallel jobs, use parallel data sets instead of sequential files or database temporary tables. Data sets preserve partitioning and sort order, which avoids the overhead of re-partitioning and re-sorting downstream.

2.      Filter rows and columns as early as possible in the job flow. Consider using a WHERE clause (and an explicit select list) in database reads so that only the required data is pulled from the database.

3.      Keep data validations and data lookups separate (combining multiple lookups where possible), and perform each of them only once in the job or process flow.

4.      Split the overall process into separate jobs and choose partitioning keys for each job that distribute data evenly across partitions (reducing data skew), avoiding unnecessary re-partitioning. The environment variable APT_RECORD_COUNTS can be used to check whether data is evenly partitioned, and APT_PARTITION_COUNT reports the number of partitions a stage runs with (see the sketch below).
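
A minimal sketch of how these reporting variables might be enabled, assuming they are exported in $DSHOME/dsenv or defined as job-level environment variables (the values are illustrative, not recommendations):

    # Log the number of records consumed/produced per operator per partition,
    # which makes skewed partitions visible in the job log.
    export APT_RECORD_COUNTS=True

    # APT_PARTITION_COUNT is populated by the parallel framework at run time;
    # it is read, not set, to confirm how many partitions a stage executed with.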

5.      When working with Sort

a.      Sort is memory intensive and expensive. Therefore sort the data at the source, or once early in the flow, and reuse that sort order across downstream stages instead of re-sorting.

b.      Avoid re-sorting data that is already sorted. Use sub-sorting when some of the keys are already sorted: for example, if the data must be sorted on keys A, B and C and is already sorted on A, set the key mode to “Don’t Sort (Previously Sorted)” for A and “Sort” for B and C.

c.      Avoid stable sort unless there is a genuine requirement to preserve the original relative order of rows with equal keys. The default is Stable Sort = True, which guarantees that rows already in the correct order are not rearranged. Since the key columns already determine the ordering, a stable sort additionally has to track the incoming order of the non-key columns for rows that tie on the keys, which adds overhead and hurts performance.

d.      By default the Sort stage uses a maximum of 20 MB of internal memory per partition. Increase this with the “Restrict Memory Usage” (RMU) option when working with large data volumes. If the size of the data to be sorted is known, an optimal RMU value that reduces I/O can be estimated as RMU = (total data size / node count); when estimating the total data size, allow roughly 2x the current volume to leave room for growth. The default of 20 MB per node can be raised up to a maximum of about 1.6 GB; if the formula yields more than 1.6 GB, it is better to increase the number of nodes than the RMU value (see the worked sketch after this list).

e.      For link sorts (the sort option on a stage’s input partitioning tab) there is no way to specify RMU in the Designer. To control memory usage in that case, the amount of memory used by all tsort operators can be changed globally by setting APT_TSORT_STRESS_BLOCKSIZE=[MB].

f.       Sort spills blocks of data to scratch disk (the block size corresponds to the RMU) and then merges those blocks into a final sorted result per partition. It is important that the scratch disk has enough space and is not shared by too many applications or processes, which could cause I/O bottlenecks. If possible, use different scratch disks for different nodes so that the I/O is parallelized.

g.      If there are many variable-length fields in the input that are not sort keys, setting APT_OLD_BOUNDED_LENGTH=True can reduce the overall memory requirement. This does not guarantee a performance improvement, but it reduces I/O and may help in some cases.

h.      Pipeline parallelism employs buffering automatically. Sort initially does a lot of work ordering the data within each chunk, which normally takes longer than the rest of its processing. So if the upstream stage has passed 1,000 records to Sort, it may be ready to supply another 3,000 while Sort is still busy with the first batch; if there is not enough buffer space to hold those records, the upstream stage has to stop sending and the whole job slows down. In this scenario, increasing the buffer size on the Sort stage’s input link can improve performance.

i.       When reading from Linux file systems, performance can be improved by increasing the file system’s read-ahead value (see the OS-level sketch after this list).

j.       Use APT_NO_SORT_INSERTION to prevent DataStage from automatically inserting sort operators (see also point 8).

k.      Use APT_SORT_INSERTION_CHECK_ONLY so that automatically inserted sorts only verify that the input is sorted, instead of actually sorting it (see also point 8).
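
As a worked sketch of points d, e and g, assume roughly 4 GB of data today on a 4-node configuration (all figures are illustrative) and that global settings go into $DSHOME/dsenv:

    # Point d: 4 GB today, doubled to 8 GB for growth, spread over 4 nodes
    # gives 8 GB / 4 = 2 GB per node, which exceeds the ~1.6 GB ceiling, so
    # add nodes (8 nodes -> 1 GB per node) rather than pushing RMU higher.
    # The chosen value is entered in the Sort stage's "Restrict Memory Usage" option.

    # Point e: link sorts have no RMU option, so cap their memory globally;
    # the value is in MB and is only an example.
    export APT_TSORT_STRESS_BLOCKSIZE=512

    # Point g: reduce memory and I/O for variable-length fields that are not sort keys.
    export APT_OLD_BOUNDED_LENGTH=True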
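
For point i, a minimal OS-level sketch, run as root; the device name /dev/sdb and the value are assumptions for illustration only:

    # Show the current read-ahead value (in 512-byte sectors) for the device
    # holding the scratch/data file systems.
    blockdev --getra /dev/sdb

    # Raise read-ahead to 8192 sectors (about 4 MB).
    blockdev --setra 8192 /dev/sdb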

6.      When working with Sequential files

a.      Never use sequential files to stage intermediate data between parallel jobs (use parallel data sets instead, as in point 1).

b.      When working with fixed-length files, explore the “Number of Readers per Node” and “Read from Multiple Nodes” options to read the files in parallel.

7.      Disk I/O

a.      Reducing the number of disk I/O operations can significantly improve performance.

          i.     Some disk arrays have read-ahead caches that are only effective when data is read repeatedly in like-sized chunks. Set APT_CONSISTENT_BUFFERIO_SIZE=N to force stages to read data in chunks that are a multiple of N, which enables the read-ahead feature on these arrays.

          ii.     Understand whether the I/O is memory-mapped I/O (MMIO) or port-mapped I/O (PMIO). Memory-mapped I/O against a remote disk mounted via NFS can cause significant performance problems; setting the environment variables APT_IO_NOMAP and APT_BUFFERIO_NOMAP to true turns memory mapping off and sometimes gives better performance (see the sketch below).
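
A minimal sketch of these settings, assuming they are added to $DSHOME/dsenv or as job-level environment variables (the chunk size is illustrative):

    # Read in chunks that are a multiple of 1 MB so array read-ahead caches stay effective.
    export APT_CONSISTENT_BUFFERIO_SIZE=1048576

    # Turn off memory-mapped I/O, which can hurt performance on NFS-mounted storage.
    export APT_IO_NOMAP=True
    export APT_BUFFERIO_NOMAP=True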

8.      Tweaking sort insertion settings

a.      DataStage automatically inserts a sort operator when a stage needs sorted input but the input is not sorted. This can happen in the following scenarios:

          i.     In the Aggregator stage, when the method is Sort but the input is not sorted on all of the group-by key columns.

          ii.     The Remove Duplicates stage requires its input to be sorted on the key columns; if the input is not already sorted, a sort operator is inserted automatically.

          iii.     Sorts are also inserted automatically when partitioning (for example, key-based partitioning with the sort option enabled) and when collecting with the Sort Merge collection method.

          iv.     Other important points to note in this context:

·        Difference stage – requires both input data sets to be sorted on the keys; if they are not, the stage fails (a sort operator is not inserted automatically).

·        Sort stage – if the key mode is set to “Don’t Sort (Previously Sorted)” and the input is not actually sorted, the stage fails (a sort operator is not inserted automatically).

b.      Automatic sort insertion should be avoided, because it leaves developers little scope for tuning performance. It can be controlled by:

          i.     Setting the environment variables APT_NO_PART_INSERTION (to prevent automatic partitioner insertion) and APT_NO_SORT_INSERTION (to prevent automatic sort insertion).

          ii.     Setting APT_SORT_INSERTION_CHECK_ONLY so that automatically inserted sorts only check that the order is correct rather than actually sorting the data (see the sketch below).
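
A minimal sketch of these settings in $DSHOME/dsenv (or as job parameters); whether to use them is a project-level decision:

    # Stop DataStage from inserting partitioners and sorts automatically,
    # so partitioning and sorting stay explicit in the job design.
    export APT_NO_PART_INSERTION=True
    export APT_NO_SORT_INSERTION=True

    # Alternatively, let automatically inserted sorts only verify the existing
    # order instead of re-sorting the data.
    export APT_SORT_INSERTION_CHECK_ONLY=True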

9.      Choose the correct stage for the operation

a.     Transformer vs Merge, Copy, Aggregator, etc. – each stage in DataStage is optimized for a specific purpose, and stages such as Merge and Copy are lightweight compared with the Transformer. The Transformer stage is compiled into C++, whereas other stages run as native OSH operators, and every Transformer function call has to go through the generated C++ function library, which adds overhead. Therefore avoid the Transformer stage where another stage can do the job.

b.      Similarly, Lookup vs Join: use the Lookup stage only when the reference table/file is small enough to fit in memory and is not sorted; otherwise choose the Join stage for joining large, sorted tables/files.

c.      In the Aggregator stage, use the Sort method when working with large volumes of pre-sorted data.

10.   Environment Variables that could improve performance

a.      Setting APT_DISABLE_COMBINATION might improve performance when combined operators create a CPU bottleneck, since combination runs several operators in a single process.

b.      Use OSH_PRINT_SCHEMAS to print the schemas and check for implicit data type conversions; removing implicit conversions improves performance (see the sketch below).
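
A minimal sketch, assuming these are switched on in $DSHOME/dsenv or per job while investigating a bottleneck:

    # Run each operator in its own process instead of combining operators,
    # which can relieve a CPU bottleneck caused by combined operators.
    export APT_DISABLE_COMBINATION=True

    # Print data set and operator schemas to the job log so implicit data
    # type conversions can be spotted and removed.
    export OSH_PRINT_SCHEMAS=True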
