Technical excerpts: IBM Infosphere Datastage - Processing Non-ASCII Character set between Oracle and Datastage

In order to process non-ASCII character set, we need to understand how different technologies behave when a character set is different between source and target.

For example, if we are receiving a file to Unix with characters like ‎€, then file must be encoded in UTF or ISO8859-15 (Latin-9) or any other character set in which euro character is valid but it can not be ASCII encoded as euro is not part of basic 8bit ASCII.

Tip: in Unix, Use file command to see the file encoding and using iconv to convert from one to another encoding, and use locale -l to list all installed character sets

If we are creating a file with euro using VI editor in Unix then Unix locale must be set to ISO8859-15 before creating the file otherwise euro symbol might get corrupted in the file.

Unix LANG: This is the locate environment language/character set setting. Many applications running from unix rely on LANG to assume their locale environment in absence of explicit character code encoding. This is same for Datastage also. If there is no NLS map mentioned in datastage for a job/stage, then unix LANG will be used as default.

Unix NLS_LANG: Oracle database use this environment variable to determine the client character code set.

More details about Oracle NLS_LANG can be found in below link

http://www.oracle.com/technetwork/products/globalization/nls-lang-099431.html

Oracle converts the characters data if Oracle Character set and client character set is not same. Similarly datastage. But Netezza will never convert the data, it will always pass Latin 9 and assumes that it always receives Latin 9 .

Therefore we need to make sure the character encoding is appropriate depending on the interface we are working on.

For example,

Oracle NLS_CHARACTERSET = UTF8

Unix NLS_LANG = "" (null)

Then if we are loading oracle varchar field with text containing euro using sqlldr, then euro in file will be written as ¿ to oracle field. Because oracle tries to convert text from default ASCII in unix to UTF8 but euro symbol is not part of ASCII character set. To resolve this issue, we need to set NLS_LANG=AMERICAN_AMERICA.WE8ISO8859P15

Technical excerpts

Search This Blog

Thursday, 24 March 2016

IBM Infosphere Datastage - Processing Non-ASCII Character set between Oracle and Datastage

1 comment: