Starting from the bottom of the diagram above, Hadoop’s ecosystem consists of the following:
➤ HDFS — A foundational component of the Hadoop ecosystem is the Hadoop Distributed File System (HDFS). HDFS is the mechanism by which a large amount of data can be distributed over a cluster of computers; data is written once, but read many times for analytics. It provides the foundation for other tools, such as HBase. (A minimal Java sketch of HDFS file access follows this list.)
➤ MapReduce — Hadoop’s main execution framework is MapReduce, a programming model for distributed, parallel data processing that breaks jobs into map phases and reduce phases (thus the name). Developers write MapReduce jobs for Hadoop, using data stored in HDFS for fast data access. Because of the way MapReduce works, Hadoop brings the processing to the data and runs it in parallel across the cluster, which keeps processing fast even over large data sets. (See the word-count sketch after this list.)
➤ HBase — A column-oriented NoSQL database built on top of HDFS, HBase is used for fast read/write access to large amounts of data. HBase uses Zookeeper for its management to ensure that all of its components are up and running. (A short read/write sketch follows this list.)
➤ Zookeeper — Zookeeper is Hadoop’s distributed coordination service. Designed to run over a cluster of machines, it is a highly available service used for the management of Hadoop operations, and many components of Hadoop depend on it. (A small client sketch follows this list.)
➤ Oozie — A scalable workflow system, Oozie is integrated into the Hadoop stack and is used to coordinate the execution of multiple MapReduce jobs. It is capable of managing a significant amount of complexity, basing execution on external events such as time triggers and the presence of required data.
➤ Pig — An abstraction over the complexity of MapReduce programming, the Pig platform includes an execution environment and a scripting language (Pig Latin) used to analyze Hadoop data sets. Its compiler translates Pig Latin into sequences of MapReduce programs. (A short embedding sketch follows this list.)
➤ Hive — An SQL-like, high-level language used to run queries on data stored in Hadoop, Hive enables developers who are not familiar with MapReduce to write data queries that are translated into MapReduce jobs in Hadoop. Like Pig, Hive was developed as an abstraction layer, but one geared toward database analysts who are more familiar with SQL than with Java programming. (A JDBC query sketch follows this list.)
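To make the HDFS entry above concrete, here is a minimal Java sketch of the write-once/read-many pattern using Hadoop’s FileSystem API. The file path is hypothetical, and the sketch assumes a Configuration that picks up fs.defaultFS from the cluster’s core-site.xml.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteRead {
  public static void main(String[] args) throws Exception {
    // Reads fs.defaultFS and other settings from core-site.xml on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/user/demo/events.log");   // hypothetical path

    // Write once ...
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("first event\n".getBytes("UTF-8"));
    }

    // ... read many times for analytics (shown once here).
    try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)))) {
      System.out.println(in.readLine());
    }
  }
}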
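For MapReduce, the word-count sketch below illustrates the map and reduce phases described above. It is a minimal example rather than a tuned job; the input and output paths are supplied on the command line and are assumed to live in HDFS.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Map phase: emit (word, 1) for every word in a line of input.
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          ctx.write(word, ONE);
        }
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      ctx.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}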
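For HBase, the sketch below performs a single random write and read through the Java client API. The table name ("users"), column family ("info"), and row key are hypothetical, and the client is assumed to find the Zookeeper quorum in hbase-site.xml.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseReadWrite {
  public static void main(String[] args) throws Exception {
    // Reads hbase-site.xml (Zookeeper quorum, etc.) from the classpath.
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("users"))) { // hypothetical table

      // Fast random write: one row keyed by user id.
      Put put = new Put(Bytes.toBytes("user-42"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Ada"));
      table.put(put);

      // Fast random read of the same row.
      Result result = table.get(new Get(Bytes.toBytes("user-42")));
      System.out.println(Bytes.toString(
          result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
    }
  }
}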
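For Zookeeper, the sketch below publishes and reads a small piece of shared state (a znode), the basic primitive on which Hadoop components build their coordination. The quorum address and znode path are hypothetical.

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkCoordinationExample {
  public static void main(String[] args) throws Exception {
    CountDownLatch connected = new CountDownLatch(1);

    // Hypothetical quorum address; 3-second session timeout.
    ZooKeeper zk = new ZooKeeper("zk-host:2181", 3000, event -> {
      if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
        connected.countDown();
      }
    });
    connected.await();   // wait until the session is established

    // Publish a small piece of coordination state under a znode.
    String path = "/demo-config";
    if (zk.exists(path, false) == null) {
      zk.create(path, "v1".getBytes(),
          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }

    // Any client in the cluster can read (and watch) the same znode.
    System.out.println(new String(zk.getData(path, false, null)));
    zk.close();
  }
}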
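Pig Latin is usually run from scripts or the Grunt shell; to stay in Java like the other examples, the sketch below embeds a tiny script through the PigServer class instead. The input path and field layout are hypothetical.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigEmbeddedExample {
  public static void main(String[] args) throws Exception {
    // Run Pig Latin against the cluster; ExecType.LOCAL also works for testing.
    PigServer pig = new PigServer(ExecType.MAPREDUCE);

    // Hypothetical input: tab-separated (userid, bytes) records in HDFS.
    pig.registerQuery("logs = LOAD '/user/demo/access_log' AS (userid:chararray, bytes:long);");
    pig.registerQuery("by_user = GROUP logs BY userid;");
    pig.registerQuery("totals = FOREACH by_user GENERATE group, SUM(logs.bytes);");

    // Storing the result triggers compilation into a sequence of MapReduce jobs.
    pig.store("totals", "/user/demo/bytes_per_user");
  }
}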
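For Hive, the sketch below issues an SQL-like query over JDBC to a HiveServer2 endpoint; Hive turns the query into MapReduce jobs behind the scenes. The host, credentials, and access_log table are hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    // HiveServer2 JDBC driver; with JDBC 4 drivers this registration is automatic.
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    String url = "jdbc:hive2://hive-host:10000/default";   // hypothetical endpoint
    try (Connection con = DriverManager.getConnection(url, "demo", "");
         Statement stmt = con.createStatement();
         // Hive translates this query into one or more MapReduce jobs.
         ResultSet rs = stmt.executeQuery(
             "SELECT userid, SUM(bytes) FROM access_log GROUP BY userid")) {
      while (rs.next()) {
        System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
      }
    }
  }
}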
The Hadoop ecosystem also contains several frameworks for integration with the rest of the enterprise:
➤ Sqoop is a connectivity tool for moving data between relational databases, data warehouses, and Hadoop. Sqoop uses the database to describe the schema of the imported/exported data, and MapReduce for parallel operation and fault tolerance. (An import sketch follows this list.)
➤ Flume is a distributed, reliable, and highly available service for efficiently collecting, aggregating, and moving large amounts of data from individual machines to HDFS. It is based on a simple and flexible architecture, supports streaming data flows, and uses a simple, extensible data model, allowing you to move data from multiple machines within an enterprise into Hadoop.
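As a rough illustration of the Sqoop entry above, the sketch below drives an import programmatically, assuming Sqoop’s Java entry point (org.apache.sqoop.Sqoop.runTool); in practice Sqoop is usually invoked from the command line with the same arguments. The JDBC URL, credentials, table, and target directory are hypothetical.

import org.apache.sqoop.Sqoop;

public class SqoopImportExample {
  public static void main(String[] args) {
    // Programmatic equivalent of a `sqoop import` command line;
    // connection details below are hypothetical.
    String[] importArgs = {
        "import",
        "--connect", "jdbc:mysql://db-host/sales",
        "--username", "demo",
        "--password", "secret",
        "--table", "orders",
        "--target-dir", "/user/demo/orders",
        "--num-mappers", "4"   // degree of MapReduce parallelism
    };
    int exitCode = Sqoop.runTool(importArgs);
    System.exit(exitCode);
  }
}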