Spark History:
Spark started in 2009 as a research project in the UC Berkeley RAD Lab, which later became the AMPLab. The researchers in the lab had previously been working on Hadoop MapReduce and observed that MapReduce was inefficient for iterative and interactive computing jobs. Thus, from the beginning, Spark was designed to be fast for interactive queries and iterative algorithms, bringing in ideas like support for in-memory storage and efficient fault recovery.
Unified Stack:
The Spark project contains multiple closely integrated components. At its core, Spark is a “computational engine” that is responsible for scheduling, distributing, and monitoring applications consisting of many computational tasks across many worker machines, or a computing cluster. Because the core engine of Spark is both fast and general purpose, it powers multiple higher-level components specialized for various workloads, such as SQL or machine learning.
This tight integration has several benefits:
1. All libraries and higher-level components in the stack benefit from improvements at the lower layers.
2. The costs associated with running the stack are minimized.
3. Applications can be built that seamlessly combine different processing models.
Spark Core:
Spark Core contains the basic functionality of Spark, including components for task scheduling, memory management, fault recovery, interacting with storage systems, and more. Spark Core is also home to the API that defines Resilient Distributed Datasets (RDDs), which are Spark’s main programming abstraction. RDDs represent a collection of items distributed across many compute nodes that can be manipulated in parallel. Spark Core provides many APIs for building and manipulating these collections.
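As a rough sketch of what working with RDDs looks like, the Scala snippet below builds an RDD from a text file and filters it in parallel; the file path and the "ERROR" keyword are made up for illustration, and local[*] simply runs Spark on the local machine.

import org.apache.spark.{SparkConf, SparkContext}

object RddExample {
  def main(args: Array[String]): Unit = {
    // Run locally with all available cores (illustrative setting only)
    val conf = new SparkConf().setAppName("RddExample").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Build an RDD from a text file (path is hypothetical)
    val lines = sc.textFile("data/input.txt")

    // Transformations are lazy; this filter only runs when an action is called
    val errors = lines.filter(_.contains("ERROR"))

    // count() is an action that triggers the distributed computation
    println(s"Lines containing ERROR: ${errors.count()}")

    sc.stop()
  }
}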
Spark SQL:
Spark SQL provides support for interacting with Spark via SQL, as well as via the Apache Hive variant of SQL (HiveQL). Spark SQL represents database tables as Spark RDDs and translates SQL queries into Spark operations. Beyond providing a SQL interface to Spark, Spark SQL allows developers to intermix SQL queries with the programmatic data manipulations supported by RDDs in Python, Java, and Scala, all within a single application.
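The sketch below, written against the newer SparkSession/DataFrame API rather than the raw RDD layer described above, shows the kind of mixing the text refers to: a SQL query over a registered view followed by a programmatic filter. The JSON file and its columns (name, age) are hypothetical.

import org.apache.spark.sql.SparkSession

object SqlExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SqlExample")
      .master("local[*]")   // local mode, for illustration
      .getOrCreate()

    // Read a JSON file of user records (path and schema are hypothetical)
    val users = spark.read.json("data/users.json")

    // Register the data as a temporary view so it can be queried with SQL
    users.createOrReplaceTempView("users")

    // Mix a SQL query with programmatic manipulation in the same application
    val adults = spark.sql("SELECT name, age FROM users WHERE age >= 18")
    adults.filter("age < 65").show()

    spark.stop()
  }
}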
Spark Streaming:
Spark Streaming is a Spark component that enables processing of live streams of data. Spark Streaming provides an API for manipulating data streams that closely matches Spark Core’s RDD API, making it easy for programmers to learn the project and to move between applications that manipulate data stored in memory, on disk, or arriving in real time. Spark Streaming was designed to provide the same degree of fault tolerance, throughput, and scalability that Spark Core provides.
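A minimal word-count sketch over a TCP socket illustrates how closely the streaming API mirrors the RDD API; the host, port, and 5-second batch interval are arbitrary choices for the example.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingExample {
  def main(args: Array[String]): Unit = {
    // Two local threads: one to receive data, one to process it
    val conf = new SparkConf().setAppName("StreamingExample").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))   // 5-second micro-batches

    // Listen on a TCP socket (host and port are placeholders)
    val lines = ssc.socketTextStream("localhost", 9999)

    // The DStream API mirrors the RDD API: flatMap, map, reduceByKey, etc.
    val counts = lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}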
MLlib:
Spark comes with a library containing common machine learning (ML) functionality, called MLlib. MLlib provides multiple types of machine learning algorithms, including binary classification, regression, clustering, and collaborative filtering, as well as supporting functionality such as model evaluation and data import. It also provides some lower-level ML primitives, including a generic gradient descent optimization algorithm.
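As one small illustration, the snippet below clusters a handful of made-up 2-D points with MLlib's RDD-based KMeans; the data and the choice of two clusters are purely illustrative.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object MLlibExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MLlibExample").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // A tiny in-memory dataset of 2-dimensional points (invented for the example)
    val points = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.2),
      Vectors.dense(9.0, 8.5), Vectors.dense(8.8, 9.1)
    ))

    // Cluster the points into 2 groups, running at most 20 iterations
    val model = KMeans.train(points, 2, 20)

    model.clusterCenters.foreach(center => println(s"Cluster center: $center"))

    sc.stop()
  }
}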
GraphX:
GraphX provides an API for manipulating graphs and performing graph-parallel computations. Like Spark Streaming and Spark SQL, GraphX extends the Spark RDD API, allowing us to create a directed graph with arbitrary properties attached to each vertex and edge. GraphX also provides a set of operators for manipulating graphs and a library of common graph algorithms.
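The sketch below builds a tiny directed graph with a property on each vertex and edge and runs PageRank, one of the bundled algorithms; the user names and "follows" relationships are invented for the example.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object GraphXExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("GraphXExample").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Vertices: (id, property) pairs; here the property is a user name (made up)
    val vertices = sc.parallelize(Seq(
      (1L, "alice"), (2L, "bob"), (3L, "carol")
    ))

    // Directed edges carrying a string property describing the relationship
    val edges = sc.parallelize(Seq(
      Edge(1L, 2L, "follows"),
      Edge(2L, 3L, "follows"),
      Edge(3L, 1L, "follows")
    ))

    val graph = Graph(vertices, edges)

    // Run PageRank until the given convergence tolerance is reached
    val ranks = graph.pageRank(0.0001).vertices
    ranks.collect().foreach { case (id, rank) => println(s"Vertex $id: $rank") }

    sc.stop()
  }
}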
Cluster Manager:
Spark is designed to scale efficiently from one to many thousands of compute nodes. To achieve this while maximizing flexibility, Spark can run over a variety of cluster managers, including Hadoop YARN, Apache Mesos, and a simple cluster manager included in Spark itself called the Standalone Scheduler. If you are just installing Spark on an empty set of machines, the Standalone Scheduler provides an easy way to get started.
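Which cluster manager an application runs on is normally selected through the master URL passed to Spark (for example via SparkConf or spark-submit). The sketch below shows this choice in code; the host names and ports are placeholders.

import org.apache.spark.{SparkConf, SparkContext}

object MasterUrlExample {
  def main(args: Array[String]): Unit = {
    // The master URL determines which cluster manager the application runs on.
    // Examples (hosts and ports are placeholders):
    //   "local[*]"                 - run locally, using all cores
    //   "spark://master-host:7077" - Spark's own Standalone Scheduler
    //   "yarn"                     - Hadoop YARN (cluster details come from the Hadoop config)
    //   "mesos://mesos-host:5050"  - Apache Mesos
    val conf = new SparkConf()
      .setAppName("MasterUrlExample")
      .setMaster("spark://master-host:7077")

    val sc = new SparkContext(conf)
    println(s"Running with ${sc.defaultParallelism} default partitions")
    sc.stop()
  }
}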