
Monday 22 January 2018

BigData - Sqoop Import exercise

In this article, we will do the following:
1.       Download the consumer complaints dataset from https://www.data.gov/consumer/
2.       Load the consumer complaints dataset into a table in a MySQL database
3.       Import the MySQL table into Hive using Sqoop

Download consumer complaints dataset
1.       Download the consumer complaints dataset from https://www.data.gov/consumer/ to the Unix box using “wget https://data.consumerfinance.gov/api/views/s6ew-h6mp/rows.csv?accessType=DOWNLOAD”
2.       Then rename the file to Consumer_Complaints.csv using
“mv "rows.csv?accessType=DOWNLOAD" Consumer_Complaints.csv”

3.       The downloaded file is comma separated and ready to be loaded into a MySQL table.



Load consumer complaints dataset to a table in MySQL database
1.       Connect to MySQL from a terminal on the Linux server using “mysql --user="username" --password”

2.       Type “use databasename” to switch to the database in which we will create the table


3.       Create a table in MySQL using the following create table statement
create table if not exists consumer_complaints
( date_received     varchar(  10),
  product           varchar( 100),
  sub_product       varchar(  50),
  issue             varchar( 100),
  sub_issue         varchar( 100),
  narrative         varchar(5000),
  company_Public_response  varchar( 150),
  company           varchar( 100),
  state             varchar(   2),
  zipcode           varchar(  12),
  tags              varchar(  50),
  consent_provided  varchar(  20),
  submitted_channel varchar(  25),
  date_sentto_company varchar(  10),
  company_consumer_response  varchar(  150),
  timely_response   varchar(10),
  consumer_disputed varchar( 10),
  complaint_id      varchar(  50)
 );


4.       load data local infile 'Consumer_Complaints.csv' into table consumer_complaints fields terminated by "," optionally enclosed by "\"" ignore 1 lines;
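A quick sanity check after the load (only a sketch; it assumes the load above completed without errors) is to count the loaded rows and compare the result with the line count of the CSV file minus its header:

select count(*) from consumer_complaints;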




Import MySQL database table into Hive using Sqoop
1.       Check whether the connection to the target database is working by listing its tables
a.       List tables in the target database using “sqoop list-tables …..” (a full command sketch follows this list)

b.       Check whether the target Hive database and table exist

c.        Since the table does not yet exist in Hive, I will create it using Sqoop during the import
d.       sqoop import --connect "jdbc:mysql://192.168.56.102/rajendra" --username="cloudera" --password="cloudera" --table "consumer_complaints" --split-by "complaint_id" --create-hive-table --hive-database "rajendra" --hive-import

e.       Check that the table was created in Hive

f.        Check the Hive table metadata; it will match the MySQL table metadata
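For reference, here is a minimal sketch of the connectivity check from step (a) and the metadata check from step (f). It reuses the JDBC URL, username, password, database, and table names from the import command in step (d); these will differ in your environment:

sqoop list-tables --connect "jdbc:mysql://192.168.56.102/rajendra" --username "cloudera" --password "cloudera"

hive -e "use rajendra; show tables; describe formatted consumer_complaints;"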



Saturday 20 January 2018

BigData - Spark Fundamentals

1.       Spark architecture consists of a Driver and Workers. The Driver acts as the master and manages tasks, scheduling, data locality, etc.
2.       The driver (master) maintains the context. Each application needs to instantiate its own context and operate within that context
3.       RDD – Resilient Distributed Dataset – a collection of elements partitioned across the nodes of a cluster that can be operated on in parallel
4.       Transformations (like map, filter, etc.) are evaluated lazily, while actions (collect, count, reduce, etc.) are evaluated immediately
5.       This lazy evaluation allows Spark to record the functional instructions in a DAG for later use
6.       This DAG (Directed Acyclic Graph) continuously grows with the functional lineage, and when an action is called, tasks are distributed to the workers using this lineage from the DAG
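As a small illustration (typed in spark-shell, where sc is already available), the first three lines below only add steps to the DAG; nothing runs on the cluster until the action on the last line:

// transformations: recorded in the DAG, nothing executes yet
val numbers = sc.parallelize(1 to 100)
val evens   = numbers.filter(n => n % 2 == 0)
val doubled = evens.map(n => n * 2)

// action: triggers execution of the whole lineage on the workers
doubled.count()      // Long = 50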
7.       Loading Data
a.       Spark supports Amazon S3, HDFS, and many databases… as well as many data serialization techniques and file formats like Avro and Parquet
b.       SparkContext (sc) is the starting point for loading data into an initial RDD
c.       sc.parallelize(1 to 100) distributes a range sequence from 1 to 100
d.       res0.collect brings the sequence back to the driver (for example, to display it on the console)
e.       To see Spark methods starting with the first few characters in spark-shell, press Tab once for auto-completion and Tab again to see the method signature
f.        sc.makeRDD and sc.range are a few other in-memory loading methods
g.       textFile, wholeTextFiles, sequenceFile, and objectFile are some file-loading methods built on the generic hadoopFile method
h.       An even more generic load method is hadoopRDD. The difference between hadoopFile and hadoopRDD is that hadoopRDD accepts a job configuration parameter and does not accept a path; the input file/format is supplied via that job configuration
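A few loading calls from spark-shell tying points (c) through (g) together; the file paths are placeholder examples and assume the files exist on HDFS:

// in-memory loading
val seqRDD   = sc.parallelize(1 to 100)     // distribute a local range sequence
val rangeRDD = sc.range(1, 100)             // RDD[Long] of 1 to 99 (end is exclusive)

// file loading (placeholder paths)
val lines = sc.textFile("/data/Consumer_Complaints.csv")    // one element per line
val files = sc.wholeTextFiles("/data/complaints/")          // (filePath, fileContent) pairs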
8.       Transformations
a.       Transformations are lazily evaluated. They are the collection of methods (as opposed to actions) that run a set of functions on the data to transform it into the target format
b.       A transformation returns another RDD; Spark lazily builds up the graph of operations (the DAG) and acts on it only when an action is performed
c.       Some transformations are map, filter, etc.
d.       RDDs can be combined using transformations like rddname.union, .intersection, .subtract, .cartesian, or rddname1 ++ rddname2
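A small sketch of combining two RDDs in spark-shell (element order in the results is not guaranteed):

val left  = sc.parallelize(Seq(1, 2, 3, 4))
val right = sc.parallelize(Seq(3, 4, 5, 6))

left.union(right).collect()          // Array(1, 2, 3, 4, 3, 4, 5, 6) -- duplicates are kept
left.intersection(right).collect()   // Array(3, 4)
left.subtract(right).collect()       // Array(1, 2)
left.cartesian(right).count()        // 16 pairs
(left ++ right).count()              // 8, ++ is an alias for union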
9.       Actions
a.       Actions do not return an RDD and are evaluated immediately
b.       .collect gathers the entire RDD into the driver (master node); this can cause out-of-memory errors if the final RDD does not fit in the driver’s memory
c.       .take(5) returns only the first 5 items; similarly, we can take any specific number of items by passing that number to the take action
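A sketch of these actions; take is the safer choice when the final RDD may be large:

val big = sc.parallelize(1 to 1000000)

big.take(5)      // Array(1, 2, 3, 4, 5) -- only five elements are sent to the driver
// big.collect() // would pull all 1,000,000 elements into the driver's memory
big.count()      // Long = 1000000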

BigData - Scala Fundamentals

1.       Scala is a functional language. Functions in Scala are first-class objects, much like objects in Java.
2.       Idiomatic Scala makes little use of for loops; instead, data is held in collections and operations are carried out using the methods of those collections
3.       Scala encourages immutable data.
4.       Immutable data items are declared with the val keyword; similarly, mutable data items are declared with the var keyword
5.       Expressions are units of code that return a value. Statements are units of code that do not return a value. Expressions can be chained but statements cannot be chained.
6.       Scala favors expressions over statements
7.       Anything that evaluates to a value is an expression; for example, the string "Hello World" is an expression, and similarly a list literal like List("1","2","a","b") is an expression
8.       When you assign an expression, the assignment becomes a statement; for example, val t_list = List("1","2","a","b") is a statement and does not evaluate to a value. Similarly, printing to the screen or writing to a file are statements
9.       Expressions can be passed into functions
10.   Multiple expressions can be enclosed in {}; this is called an expression block. The last expression in the block serves as the return value. Expression blocks can be nested, can contain both statements and expressions, and behave like anonymous, parameterless functions.
11.   If/else, for loops (with yield), and pattern matching (match, like case/switch) are expressions in Scala, whereas they are statements in Java; that is, they can be chained in Scala
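A small sketch of points 10 and 11: the last expression of a block is its value, and if/else itself evaluates to a value:

val radius = 2.0

// expression block: the last expression (pi * radius * radius) is the value of the block
val area = {
  val pi = 3.14
  pi * radius * radius
}

// if/else is an expression, so its result can be assigned (and chained further)
val size = if (area > 10) "large" else "small"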
12.   Functions can be passed as parameters to other functions, and functions can also return functions. Functions can be stored in a variable or a value.
13.   Type inference in Scala is rooted in three classes: Any, AnyVal, and AnyRef. Any is the type inferred for expressions that combine value types like Int, Double, Char, and Boolean with reference types like List
14.   Variables can be used in string interpolation, e.g. s"$var1 + $var2 is $var3", i.e. var3 is the sum of var1 and var2
15.   If/else blocks can be used just like other languages
16.   For loops can be written as statements or expressions. A for loop written as in other languages is a statement, but if we add the yield keyword before the loop body it becomes an expression and results in a List (reference type)
17.    While loops cannot be written as expressions in Scala
18.   Pattern matching is similar to switch/case, but there is no fall-through, i.e. only one case will match, so no break statement is required
a.       val dayofweek = "Sat"
b.       val matchday = dayofweek match {
c.                                                                case "Sat" => "Saturday"
d.                                                                case "sun" => "notsaturday"
e.                                                                }
19.   Functions vs methods
a.       Functions are objects and expression blocks that can be parameterized
b.       Methods, on the other hand, are not objects but can also be parameterized
c.       Methods must be associated with a class, but functions can be independent
d.       Method example
                                                               i.      def getArea(radius:Double):Double =
                                                             ii.      {
                                                           iii.          val PI = 3.14
                                                           iv.          PI * radius * radius
                                                             v.      }
                                                           vi.      This above method takes an input “radius” of type Double and returns a Double
e.       Functions Example
                                                               i.      val getArea = (radius:Double) =>
                                                             ii.      {
                                                           iii.         val PI = 3.14
                                                           iv.         PI * radius * radius
                                                             v.      } :Double
                                                           vi.      This above function takes an input “radius” of type Double and returns a Double
f.        We can convert a method to a function as below
                                                               i.      val getvalFunc: (Double) => Double  = getArea
                                                             ii.      or val getvalFunc = getArea(_)
                                                           iii.      _ is a placeholder indicating that the function takes the same input type and returns the same type as the method. The placeholder shorthand does not work when the method returns a function type (a closure), but the explicit conversion works on closures as well
20.   First-class and higher-order functions
a.       A function can be stored in a variable or a value
b.       The return value of a function can be a function
c.       A parameter of a function can be a function
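A sketch of the three points above: storing a function in a value, passing it as a parameter, and returning one from a method (the names are illustrative):

// a function stored in a value
val double: Int => Int = x => x * 2

// a method that takes a function as a parameter
def applyTwice(f: Int => Int, x: Int): Int = f(f(x))
applyTwice(double, 3)      // 12

// a method that returns a function
def multiplier(factor: Int): Int => Int = x => x * factor
val triple = multiplier(3)
triple(5)                  // 15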
21.   Closures
a.       Closures retain access to variables from the enclosing (outer) function’s scope even after the outer function has finished executing
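A minimal closure sketch: the returned function keeps using greeting even though makeGreeter has already returned (the names are illustrative):

def makeGreeter(greeting: String): String => String =
  name => s"$greeting, $name"     // greeting is captured from the enclosing scope

val hello = makeGreeter("Hello")
hello("Scala")                    // "Hello, Scala"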
22.   Collections
a.       Tuples
                                                               i.      Not iterable in Scala; these are ordered collections of items (possibly of different types)
b.       Lists
                                                               i.      val listval = List("a","A","ac")
                                                             ii.      val listval = "a" :: "A" :: "ac" :: Nil
c.       Maps
                                                               i.      Key-value pairs, defined using the "key" -> "value" syntax
                                                             ii.      These are effectively a list of tuples
d.       Options
                                                               i.      Used for error handling
                                                             ii.      Option in combination with None can be used to return None when no valid value can be returned from a function; for example, division by zero can return None since it cannot return an Int or Double
e.       Arrays
                                                               i.      val arrname = Array("a","A","ac")
                                                             ii.      Same as a List, but the items (not the length) are mutable
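One-line sketches of the collection types listed above (the values are arbitrary examples):

val tup = ("a", 1, true)                      // tuple: ordered, mixed types, not iterable
val letters = List("a", "A", "ac")            // immutable list
val ages = Map("ram" -> 30, "sam" -> 25)      // map built from key -> value pairs
val arr = Array("a", "A", "ac")               // fixed length, but elements are mutable
arr(0) = "z"

// Option/None for a value that may not exist
def safeDivide(a: Int, b: Int): Option[Int] =
  if (b == 0) None else Some(a / b)

safeDivide(10, 2)    // Some(5)
safeDivide(10, 0)    // None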
23.   Higher order methods
a.       map, foreach, filter
                                                               i.      Act on one element at a time
b.       scan, fold, reduce
                                                               i.      Act on multiple items at a time
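A sketch of these higher-order methods on a simple list:

val nums = List(1, 2, 3, 4, 5)

// one element at a time
nums.map(_ * 2)              // List(2, 4, 6, 8, 10)
nums.filter(_ % 2 == 0)      // List(2, 4)
nums.foreach(println)        // prints each element

// aggregate across elements
nums.reduce(_ + _)           // 15
nums.fold(100)(_ + _)        // 115, starts from an initial value
nums.scan(0)(_ + _)          // List(0, 1, 3, 6, 10, 15), running totals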

BigData - Getting Started with HBase

1.       Hadoop Limitations
a.       Unstructured data: Hadoop mostly holds unstructured data. Even when the data conforms to a schema, Hadoop/HDFS itself does not provide a way to associate a schema with the data
b.       No random access: Hadoop supports bulk and batch processing but does not support random access to data, i.e. accessing a particular record is not supported
c.       High latency: Hadoop is not well suited as a transactional database. Processing is intended for very large datasets; latency is high on small datasets, but on huge datasets Hadoop is faster than a traditional database or file system
d.       Not ACID compliant: the HDFS file system does not guarantee data integrity
2.       These Hadoop limitations can be overcome with a database like HBase
3.       HBASE
a.       HBase is a NoSQL database
b.       It supports a loose data structure
c.       HBase supports row keys, which allow random access and updates to a single record
d.       Some operations in HBase have ACID properties
4.       Properties of HBase
a.       Columnar Store database
b.       Denormalized storage
                                                               i.      Queries spanning multiple tables are not allowed in HBase, therefore we should denormalize and store all related data together
                                                             ii.      No secondary indexes and no constraints are supported
c.       Only CRUD operations (Create, Read, Update and Delete)
                                                               i.      Operations like ORDER BY, JOIN, and GROUP BY are not available in HBase
                                                             ii.      HBase does not support SQL; it is a NoSQL database
d.       ACID at the row level
5.       HBase Data Model
a.       HBase follows a 4-dimensional data model; each row has four kinds of data:
                                                               i.      Row key
1.       The row key is like a primary key or index in an RDBMS. It does not have to be specified at table creation time; it is supplied when data is inserted or loaded
                                                             ii.      Column family
1.       Column families must be defined at table creation time
2.       All related columns can be grouped together into a column family
3.       Every row in a table has the same set of column families
4.       One column family is stored in one file, i.e. a column family is fetched together even if we query only one column in that family
                                                           iii.      Column
1.       A column is the actual column that contains the data
2.       Columns are not specified at table creation time; we can add columns dynamically
3.       Each column must belong to a column family
4.       No data type is associated with a column; internally everything is a byte array
5.       A column is referenced as “columnfamily:columnname”
                                                           iv.      Timestamp
1.       Each value in a column is associated with a timestamp, which identifies when the value was last changed
2.       Every time a value is modified, a new entry is created with a new timestamp; this allows versioning of values by timestamp
6.       HBASE Shell can be invoked by typing “hbase shell”
7.       To create a table, just type “create 'tablename', 'columnfam1', 'columnfam2'”
8.       To insert data, type “put 'tablename', 'rowkeyval', 'columnfam1:col1', 'col1val'”. We can run multiple puts with the same row key value to add columns to the same row; if we run put again with the same row key and same column, it updates the value
9.       Use “scan 'tablename'” to read table data
10.   Use “count 'tablename'” to count the records in a table
11.   Use “get 'tablename', 'rowkey'” to fetch a specific row in a table; get can be made more granular by specifying column names
12.   We can use scan to retrieve a range of records, and scan has more options; for example, “scan 'tablename', {COLUMNS => 'columnfam:columnname', STARTROW => '2', STOPROW => '4'}” retrieves one column from the rows with keys '2' and '3' (STOPROW is exclusive)
13.   “delete” takes arguments similar to get but deletes the value instead of reading it

14.   Disable a table before dropping it; “enable”, “disable”, “exists”, and “drop” are a few more HBase shell commands
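Putting the commands above together, a minimal hbase shell session might look like this (the table, column family, column, and row key names are only examples):

create 'customer', 'personal', 'address'
put 'customer', '1', 'personal:name', 'Ram'
put 'customer', '1', 'address:city', 'Hyderabad'
scan 'customer'
count 'customer'
get 'customer', '1'
get 'customer', '1', {COLUMN => 'personal:name'}
delete 'customer', '1', 'personal:name'
disable 'customer'
drop 'customer'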