Sunday, May 15, 2016

Apache Spark Introduction

Spark History:

Spark started in 2009 as a research project in the UC Berkeley RAD Lab, later to become
the AMPLab. The researchers in the lab had previously been working on Hadoop MapReduce,
and observed that MapReduce was inefficient for iterative and interactive computing jobs.
Thus, from the beginning, Spark was designed to be fast for interactive queries and
iterative algorithms, bringing in ideas like support for in-memory storage and efficient
fault recovery.


Unified Stack:

The Spark project contains multiple closely-integrated components. At its core, Spark
is a “computational engine” that is responsible for scheduling, distributing, and monitoring
applications consisting of many computational tasks across many worker machines,
or a computing cluster. Because the core engine of Spark is both fast and general purpose,
it powers multiple higher-level components specialized for various workloads,
such as SQL or machine learning.

Tight integration has several benefits:
1. All libraries and higher-level components in the stack benefit from improvements at the lower layers.
2. The costs associated with running the stack are minimized.
3. The ability to build applications that seamlessly combine different processing models.



Spark Core:
Spark Core contains the basic functionality of Spark, including components for task
scheduling, memory management, fault recovery, interacting with storage systems, and
more. Spark Core is also home to the API that defines Resilient Distributed Datasets
(RDDs), which are Spark’s main programming abstraction. RDDs represent a collection
of items distributed across many compute nodes that can be manipulated in parallel. Spark Core provides many APIs for building and manipulating these collections.
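To make the RDD abstraction concrete, here is a minimal sketch in Scala; the application name, master URL, and sample data are placeholders, not part of any particular Spark deployment. Transformations such as map and filter are lazy, and nothing runs on the cluster until an action such as reduce is called.

import org.apache.spark.{SparkConf, SparkContext}

object RddExample {
  def main(args: Array[String]): Unit = {
    // Placeholder app name and master URL; "local[*]" simply runs on the local machine
    val conf = new SparkConf().setAppName("RddExample").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Distribute a local collection across the cluster as an RDD
    val numbers = sc.parallelize(1 to 100)

    // Transformations are lazy; they only describe the computation
    val evenSquares = numbers.map(n => n * n).filter(_ % 2 == 0)

    // Actions such as reduce trigger the actual distributed computation
    println(s"Sum of even squares: ${evenSquares.reduce(_ + _)}")

    sc.stop()
  }
}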


Spark SQL:
Spark SQL provides support for interacting with Spark via SQL as well as the Apache Hive
variant of SQL (HiveQL). Spark SQL represents database tables as Spark RDDs and translates
SQL queries into Spark operations. Beyond providing a SQL interface to Spark, Spark SQL
allows developers to intermix SQL queries with the programmatic data manipulations
supported by RDDs in Python, Java, and Scala, all within a single application.
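As a rough illustration of this intermixing, here is a minimal sketch using the Spark 1.x SQLContext API; the input file people.json and its columns (name, age) are hypothetical.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SqlExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SqlExample").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)

    // Load a JSON file (hypothetical) as a DataFrame and expose it to SQL
    val people = sqlContext.read.json("people.json")
    people.registerTempTable("people")

    // Query with SQL ...
    val adults = sqlContext.sql("SELECT name, age FROM people WHERE age >= 18")

    // ... then continue with programmatic RDD-style manipulation of the result
    adults.rdd.map(row => row.getString(0)).collect().foreach(println)

    sc.stop()
  }
}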


Spark Streaming:
Spark Streaming is a Spark component that enables processing of live streams of data. Spark
Streaming provides an API for manipulating data streams that closely matches Spark Core's
RDD API, making it easy for programmers to learn the project and move between applications
that manipulate data stored in memory, on disk, or arriving in real time. Spark Streaming
was designed to provide the same degree of fault tolerance, throughput, and scalability
that Spark Core provides.
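The sketch below shows how closely the DStream API mirrors the RDD API, using a running word count over a socket stream; the host and port (localhost:9999, fed for example by nc -lk 9999) and the 10-second batch interval are illustrative assumptions.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingExample {
  def main(args: Array[String]): Unit = {
    // local[2]: one thread for the receiver, one for processing
    val conf = new SparkConf().setAppName("StreamingExample").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10)) // 10-second micro-batches

    // Each batch of lines is manipulated with RDD-like operations
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}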

MLlib:
Spark comes with a library containing common machine learning (ML) functionality
called MLlib. MLlib provides multiple types of machine learning algorithms, including
binary classification, regression, clustering and collaborative filtering, as well as supporting
functionality such as model evaluation and data import. It also provides some
lower-level ML primitives, including a generic gradient descent optimization algorithm.
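As a small example of the kind of functionality MLlib offers, the sketch below clusters a handful of points with K-means; the sample vectors and the parameters (2 clusters, 20 iterations) are made up for illustration.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object MllibExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("MllibExample").setMaster("local[*]"))

    // Feature vectors distributed as an RDD (made-up sample data)
    val points = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)
    ))

    // Train a K-means model with 2 clusters and 20 iterations, then score a new point
    val model = KMeans.train(points, 2, 20)
    println(s"Cluster for (0.2, 0.2): ${model.predict(Vectors.dense(0.2, 0.2))}")

    sc.stop()
  }
}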


GraphX:
GraphX provides an API for manipulating graphs and performing graph-parallel computations. Like
Spark Streaming and Spark SQL, GraphX extends the Spark RDD API, allowing us to create a directed graph with arbitrary properties attached to each vertex and edge. GraphX also provides a set of operators for manipulating graphs and a library of common graph algorithms.
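The sketch below builds a tiny property graph and runs PageRank, one of the algorithms in GraphX's library; the vertices, edges, and tolerance value are illustrative only.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object GraphxExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("GraphxExample").setMaster("local[*]"))

    // Vertices and edges each carry an arbitrary property (here, simple strings)
    val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(3L, 2L, "follows")))
    val graph = Graph(vertices, edges)

    // Run a built-in graph algorithm from GraphX's library
    graph.pageRank(0.001).vertices.collect().foreach(println)

    sc.stop()
  }
}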

Cluster Manager: 
Spark is designed to efficiently scale up from one to many thousands
of compute nodes. To achieve this while maximizing flexibility, Spark can run over a
variety of cluster managers, including Hadoop YARN, Apache Mesos, and a simple
cluster manager included in Spark itself called the Standalone Scheduler. If you are just
installing Spark on an empty set of machines, the Standalone Scheduler provides an easy
way to get started.
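The choice of cluster manager is expressed through the master URL passed to Spark, as in the sketch below; the host names and ports are placeholders, and the commented-out alternatives use the Spark 1.x-era master strings.

import org.apache.spark.{SparkConf, SparkContext}

object MasterUrlExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MasterUrlExample")
      .setMaster("spark://master-host:7077")  // Standalone Scheduler (placeholder host)
      // .setMaster("yarn-client")             // Hadoop YARN (Spark 1.x master string)
      // .setMaster("mesos://mesos-host:5050") // Apache Mesos (placeholder host)
      // .setMaster("local[*]")                // no cluster manager: run locally
    val sc = new SparkContext(conf)
    println(s"Running against master: ${sc.master}")
    sc.stop()
  }
}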
