Machine Learning Notepad: January 2016

YARN : Yet Another Resource Negotiator

YARN Architecture :

YARN Components:

Resource Manager :

Manages the use of resources across the Hadoop cluster
Tracks which resources are available and who is using them in the cluster
Negotiates resources
Grants containers for the Node Manager to execute

Node Manager :

Runs on each Data node
Replaces the Task Tracker of Hadoop 1.0 as the per-node agent that runs tasks
Monitors the node's resources and reports back to the Resource Manager

Container :

An execution container that runs on a worker/slave machine of the cluster
Holds the resources the Resource Manager allocated for the execution of a job, such as memory and CPU; resources are allocated according to priority

Application Master :

Coordinates and manages a submitted MapReduce job
Works with the Resource Manager to schedule tasks and obtain resources, then works with the Node Manager to execute the tasks on the machines
Manages the life cycle of the application
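
The way these four components interact can be sketched as a toy model. This is only an illustration under simplifying assumptions, not YARN's actual API: the class and method names are made up for this sketch, and placement is plain first-fit, which is far simpler than real scheduling.

```python
from dataclasses import dataclass, field


@dataclass
class Container:
    """A slice of one node's resources granted to an application."""
    node: str
    memory_mb: int
    vcores: int


@dataclass
class NodeManager:
    """Per-node agent: tracks free resources and launches containers."""
    name: str
    free_memory_mb: int
    free_vcores: int
    running: list = field(default_factory=list)

    def can_fit(self, memory_mb, vcores):
        return self.free_memory_mb >= memory_mb and self.free_vcores >= vcores

    def launch(self, memory_mb, vcores):
        # Reserve the resources on this node and record the container.
        self.free_memory_mb -= memory_mb
        self.free_vcores -= vcores
        container = Container(self.name, memory_mb, vcores)
        self.running.append(container)
        return container


class ResourceManager:
    """Cluster-wide view of resources; grants containers on request."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, memory_mb, vcores):
        # First-fit placement: an assumption of this sketch, not YARN's policy.
        for node in self.nodes:
            if node.can_fit(memory_mb, vcores):
                return node.launch(memory_mb, vcores)
        return None  # no node has enough free resources


class ApplicationMaster:
    """Per-application coordinator: asks the ResourceManager for containers."""

    def __init__(self, resource_manager):
        self.rm = resource_manager
        self.containers = []

    def run_tasks(self, task_count, memory_mb, vcores):
        for _ in range(task_count):
            container = self.rm.allocate(memory_mb, vcores)
            if container is None:
                break  # cluster is full; a real Application Master would wait
            self.containers.append(container)
        return len(self.containers)


# Hypothetical two-node cluster and a job that wants four 2 GB tasks.
nodes = [NodeManager("node1", 4096, 4), NodeManager("node2", 2048, 2)]
am = ApplicationMaster(ResourceManager(nodes))
started = am.run_tasks(task_count=4, memory_mb=2048, vcores=1)
```

With these made-up numbers the cluster only has memory for three of the four tasks, so the fourth request is refused; that refusal is the "negotiation" in miniature.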

Hadoop 1.0 to 2.0 :
Job Tracker = Resource Manager (resource management) plus Application Master (job life cycle)
Task Tracker = Node Manager

Hadoop : Hive

What is Hive?

Data warehousing package built on top of Hadoop
Used for data analytics
Targeted towards users comfortable with SQL
Its query language, HiveQL, is similar to SQL
For managing and querying structured data
Abstracts complexity of Hadoop
No need to learn Java and the Hadoop APIs
Developed by Facebook and contributed to the community
Facebook analyzed several terabytes of data every day using Hive

Where to use Hive:

When you have structured data
Data Mining
Document Indexing
Predictive Modelling, Hypothesis Testing
Customer Facing Business Intelligence
Log Processing

Hive Limitations:

Not designed for online transaction processing (OLTP)
Does not offer real-time queries or row-level updates
Aims only for acceptable latency in interactive data browsing
Latency for Hive queries is generally very high (minutes)

Hive Advantages:

Ability to filter rows from a table using “where” clause.
Ability to do equi-joins between two tables
Ability to store the results of a query in Hadoop DFS directory
Ability to manage tables and partitions (create, drop & alter)
Ability to store the result of a query in another table
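
Three of these abilities (filtering with a where clause, an equi-join, and storing a query result in another table) can be tried out in a rough sketch. Running Hive needs a cluster, so this example substitutes Python's built-in sqlite3 module as a stand-in; HiveQL differs from SQLite's dialect in places, and the table and column names here are hypothetical.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for a Hive warehouse.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE page_views (user_id INTEGER, url TEXT, view_time INTEGER);
    CREATE TABLE users (user_id INTEGER, country TEXT);
    INSERT INTO page_views VALUES (1, '/home', 10), (2, '/cart', 20), (1, '/cart', 30);
    INSERT INTO users VALUES (1, 'IN'), (2, 'US');
""")

# Filter rows from a table using a WHERE clause.
cart_views = conn.execute(
    "SELECT user_id, view_time FROM page_views "
    "WHERE url = '/cart' ORDER BY view_time"
).fetchall()

# Equi-join two tables and store the result of the query in another table.
conn.execute("""
    CREATE TABLE views_by_country AS
    SELECT u.country AS country, COUNT(*) AS views
    FROM page_views pv JOIN users u ON pv.user_id = u.user_id
    GROUP BY u.country
""")
by_country = conn.execute(
    "SELECT country, views FROM views_by_country ORDER BY country"
).fetchall()
```

The same three statements carry over to HiveQL in spirit, but on a real cluster each one would be compiled into MapReduce jobs rather than run in-process.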

Hive Architecture:

Hive Components:

User Interface : Hive is data warehouse infrastructure software that lets users interact with HDFS. The user interfaces that Hive supports are the Hive Web UI, the Hive command line, and Hive HDInsight (on Windows Server).
Meta Store : Hive uses a database server to store the schema (metadata) of tables, databases, and table columns, along with their data types and HDFS mapping.
HiveQL Process Engine : HiveQL is an SQL-like language for querying against the schema information in the Metastore. It is an alternative to the traditional approach of writing MapReduce programs: instead of writing a MapReduce program in Java, we write a query that is processed as a MapReduce job.
Execution Engine : The Hive Execution Engine links the HiveQL Process Engine to MapReduce. It processes the query and produces the same results a MapReduce job would, because it runs the query as MapReduce.
HDFS or HBase : The Hadoop Distributed File System (HDFS) or HBase is the storage layer that holds the data.

Hive Background:

Execute Query : A Hive interface such as the command line or Web UI sends the query to the driver (any database driver such as JDBC or ODBC) to execute.
Get Plan : The driver asks the query compiler to parse the query, check its syntax, and work out the query plan (the requirements of the query).
Get Metadata : The compiler sends a metadata request to the Metastore (any database).
Send Metadata : The Metastore sends the metadata back to the compiler.
Send Plan : The compiler checks the requirements and sends the plan back to the driver. At this point, parsing and compiling of the query are complete.
Execute Plan : The driver sends the execution plan to the execution engine.
Execute Job : Internally, the job is executed as a MapReduce job. The execution engine sends the job to the JobTracker on the Name node, which assigns it to the TaskTrackers on the Data nodes. There the query runs as a MapReduce job.
Metadata Ops : During execution, the execution engine can also perform metadata operations with the Metastore.
Fetch Result : The execution engine receives the results from the Data nodes.
Send Results : The execution engine sends the results to the driver.
Send Results : The driver sends the results to the Hive interfaces.
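
The order of these steps can be traced with a small toy model. This is a sketch under loud assumptions, not Hive's real code: every class and method name is invented for illustration, the "parser" only handles queries of the form SELECT * FROM table, the optional Metadata Ops step is left out, and no MapReduce job is actually run.

```python
class Metastore:
    """Holds table schemas (Get Metadata / Send Metadata)."""

    def __init__(self, schemas):
        self.schemas = schemas

    def get_schema(self, table):
        return self.schemas[table]


class Compiler:
    """Parses the query and builds a plan with help from the Metastore."""

    def __init__(self, metastore):
        self.metastore = metastore

    def compile(self, query, trace):
        trace.append("get_plan")
        # Toy parsing: take the first word after FROM as the table name.
        table = query.split(" FROM ")[1].split()[0]
        trace.append("get_metadata")
        columns = self.metastore.get_schema(table)
        trace.append("send_metadata")
        trace.append("send_plan")
        return {"table": table, "columns": columns}


class ExecutionEngine:
    """Runs the plan; here it just reads rows from an in-memory dict."""

    def __init__(self, data):
        self.data = data

    def execute(self, plan, trace):
        trace.append("execute_job")
        rows = self.data[plan["table"]]
        trace.append("fetch_result")
        return rows


class Driver:
    """Receives the query from the interface and coordinates the steps."""

    def __init__(self, compiler, engine):
        self.compiler = compiler
        self.engine = engine

    def run(self, query):
        trace = ["execute_query"]
        plan = self.compiler.compile(query, trace)
        trace.append("execute_plan")
        rows = self.engine.execute(plan, trace)
        trace.append("send_results")
        return rows, trace


# Hypothetical table "logs" with two rows.
metastore = Metastore({"logs": ["ts", "message"]})
engine = ExecutionEngine({"logs": [(1, "start"), (2, "stop")]})
driver = Driver(Compiler(metastore), engine)
rows, trace = driver.run("SELECT * FROM logs")
```

After the call, trace lists the step names in the order they happened, which can be compared line by line with the list above.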

Hive Complex Data Types :

Struct
Map
Array

Hadoop Architecture

Hadoop 1.0 Architecture :

Limitation of Hadoop 1.0 :

No horizontal scalability of NameNode

Metadata is stored in NameNode Memory(RAM)
Bottleneck after ~4000 nodes
Can result in cascading failure
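
To see why keeping all metadata in RAM limits scale, here is a back-of-the-envelope estimate. It assumes the commonly quoted rule of thumb of roughly 150 bytes of NameNode memory per namespace object (file, directory, or block); that figure and the file counts are assumptions of this sketch, not measurements.

```python
# Assumed rule of thumb: bytes of NameNode heap per file, directory, or block.
BYTES_PER_OBJECT = 150


def namenode_heap_bytes(num_files, blocks_per_file=1):
    """Estimate heap needed: one object per file plus one per block."""
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT


# Hypothetical cluster holding 100 million single-block files.
estimate = namenode_heap_bytes(100_000_000)
estimate_gb = estimate / 1e9
```

Under these assumptions the metadata alone needs about 30 GB of heap on a single machine, and since Hadoop 1.0 has only one NameNode, adding more nodes does not help.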

Does not support NameNode High Availability

The Secondary NameNode is not a hot standby for the NameNode
It connects to the NameNode regularly
It keeps a housekeeping backup of the NameNode metadata
The saved metadata can be used to rebuild a failed NameNode

Overburdened JobTracker

CPU : Spends a very significant portion of time and effort managing the life cycle of applications
Network : A single listener thread communicates with thousands of Map and Reduce jobs

Not possible to run non-MapReduce Big Data applications on HDFS

Only MapReduce processing can be achieved
Alternate Data Storage is needed for other processing such as Real-time or Graph Analysis

Does not support Multi-tenancy

Hadoop 2.0 Architecture :

Hadoop 2.0 Features:

HDFS Federation

Multiple NameNode and Namespaces

Support for NameNode High Availability
YARN – Yet Another Resource Negotiator

Better processing control
Support for non-MapReduce types of processing
Support for multi-tenancy
Resource Manager, Node Manager, App Master, Capacity Scheduler

Multi Tenancy :

Different types of jobs are organized in different queues (Batch, Streaming, Interactive)
Queue shares are set as percentages of the cluster
Each queue has an associated priority
FIFO scheduling within each queue
Security is ensured between applications
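
The queue rules above can be sketched in a few lines. This is a simplified illustration, not the Capacity Scheduler's real behavior or configuration: the class and function names are hypothetical, the shares and priorities are made up, and a lower priority number is assumed to mean a more important queue.

```python
from collections import deque


class Queue:
    """One tenant queue with a cluster share, a priority, and FIFO jobs."""

    def __init__(self, name, share_percent, priority):
        self.name = name
        self.share_percent = share_percent
        self.priority = priority  # assumption: lower number = served first
        self.jobs = deque()       # FIFO order within the queue

    def submit(self, job):
        self.jobs.append(job)


def capacity_per_queue(queues, cluster_memory_mb):
    """Split cluster memory between queues according to their shares."""
    return {q.name: cluster_memory_mb * q.share_percent // 100 for q in queues}


def next_job(queues):
    """Pick the oldest job from the highest-priority non-empty queue."""
    for q in sorted(queues, key=lambda q: q.priority):
        if q.jobs:
            return q.name, q.jobs.popleft()
    return None


# Hypothetical queues matching the three job types named above.
batch = Queue("batch", share_percent=50, priority=3)
streaming = Queue("streaming", share_percent=30, priority=1)
interactive = Queue("interactive", share_percent=20, priority=2)
queues = [batch, streaming, interactive]

capacity = capacity_per_queue(queues, cluster_memory_mb=100_000)

batch.submit("etl-1")
batch.submit("etl-2")
interactive.submit("query-1")
order = [next_job(queues) for _ in range(4)]
```

Here the interactive job runs before the two batch jobs even though it was submitted last, and the batch jobs keep their submission order, which is the FIFO rule within a queue.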

HDFS Snapshots
NFSv3 access to data in HDFS
Support for running Hadoop on MS Windows
Binary Compatibility for MapReduce applications built on Hadoop 1.0
Substantial amount of integration testing with the rest of the projects (such as Pig and Hive) in the Hadoop ecosystem.

Hadoop Application :

Tuesday, January 26, 2016
