Monday 2 February 2015

Hadoop Core Components



Starting from the bottom of the diagram above, Hadoop’s ecosystem consists of the following components:

HDFS — A foundational component of the Hadoop ecosystem is the Hadoop Distributed File System (HDFS). HDFS is the mechanism by which large amounts of data are distributed across a cluster of computers; data is written once but read many times for analytics. It provides the foundation for other tools, such as HBase.

MapReduce — Hadoop’s main execution framework is MapReduce, a programming model for distributed, parallel data processing that breaks jobs into map and reduce phases (hence the name). Developers write MapReduce jobs for Hadoop, using data stored in HDFS for fast data access. Because of the way MapReduce works, Hadoop brings the processing to the data and runs it in parallel, resulting in fast processing.
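The canonical illustration of the model is word count: the map phase emits (word, 1) pairs and the reduce phase sums them per word. The sketch below uses the standard org.apache.hadoop.mapreduce API; the input and output paths come from the command line and are purely illustrative.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```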

HBase — A column-oriented NoSQL database built on top of HDFS, HBase is used for fast read/write access to large amounts of data. HBase relies on Zookeeper for coordination, ensuring that all of its components are up and running.
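A minimal sketch of a write followed by a random read through the HBase Java client is shown below; the table name "users", the column family "info", and the connection settings are illustrative assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        // Connection settings (including the ZooKeeper quorum) come from hbase-site.xml on the classpath.
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Write a cell: row key "user1", column family "info", qualifier "email".
            Put put = new Put(Bytes.toBytes("user1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"),
                    Bytes.toBytes("user1@example.com"));
            table.put(put);

            // Random read of the same row.
            Result result = table.get(new Get(Bytes.toBytes("user1")));
            byte[] email = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"));
            System.out.println(Bytes.toString(email));
        }
    }
}
```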

Zookeeper — Zookeeper is Hadoop’s distributed coordination service. Designed to run over a cluster of machines, it is a highly available service used for the management of Hadoop operations, and many components of Hadoop depend on it.
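A typical pattern built on Zookeeper is service registration through ephemeral znodes: the znode disappears when its owner's session ends, so other components can tell what is alive. The sketch below uses the standard ZooKeeper Java client; the ensemble addresses and znode paths are illustrative assumptions.

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkRegisterExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);

        // Connect to a ZooKeeper ensemble; addresses and timeout are illustrative.
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 5000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Make sure the parent znode exists.
        if (zk.exists("/services", false) == null) {
            zk.create("/services", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // An ephemeral, sequential znode vanishes when this client's session ends,
        // which is how components advertise that they are up and running.
        String path = zk.create("/services/worker-", "host:port".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        System.out.println("Registered at " + path);

        zk.close();
    }
}
```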

Oozie — A scalable workflow system, Oozie is integrated into the Hadoop stack and is used to coordinate the execution of multiple MapReduce jobs. It is capable of managing a significant amount of complexity, triggering execution based on external events such as time schedules and the presence of required data.
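Workflows are usually defined in an XML file stored in HDFS and submitted to the Oozie server. The sketch below uses the Oozie Java client to submit such a workflow and check its status; the server URL, HDFS paths, and the property names other than OozieClient.APP_PATH are illustrative assumptions.

```java
import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieSubmitExample {
    public static void main(String[] args) throws Exception {
        // URL of the Oozie server and the HDFS path of the workflow definition are illustrative.
        OozieClient client = new OozieClient("http://oozie-host:11000/oozie");

        Properties conf = client.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/demo/workflows/etl");
        conf.setProperty("nameNode", "hdfs://namenode:8020");
        conf.setProperty("jobTracker", "resourcemanager:8032");

        // Submit and start the workflow, then poll its status.
        String jobId = client.run(conf);
        WorkflowJob.Status status = client.getJobInfo(jobId).getStatus();
        System.out.println("Workflow " + jobId + " is " + status);
    }
}
```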

Pig — An abstraction over the complexity of MapReduce programming, the Pig platform
includes an execution environment and a scripting language (Pig Latin) used to analyze
Hadoop data sets. Its compiler translates Pig Latin into sequences of MapReduce programs.
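Pig Latin can be run from the Grunt shell, from a script, or embedded in Java through PigServer, as in the sketch below; the input/output paths and the field layout are illustrative assumptions.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigEmbeddedExample {
    public static void main(String[] args) throws Exception {
        // Run Pig Latin from Java against the cluster.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        // Each registered statement builds up a logical plan; Pig's compiler turns
        // the plan into one or more MapReduce jobs when a STORE (or DUMP) is reached.
        pig.registerQuery("logs = LOAD '/user/demo/access_logs' USING PigStorage('\\t') "
                + "AS (user:chararray, url:chararray, bytes:long);");
        pig.registerQuery("by_user = GROUP logs BY user;");
        pig.registerQuery("traffic = FOREACH by_user GENERATE group AS user, "
                + "SUM(logs.bytes) AS total_bytes;");

        pig.store("traffic", "/user/demo/traffic_by_user");
        pig.shutdown();
    }
}
```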

Hive — An SQL-like, high-level language used to run queries on data stored in Hadoop, Hive enables developers who are not familiar with MapReduce to write data queries that are translated into MapReduce jobs. Like Pig, Hive was developed as an abstraction layer, but one geared toward database analysts who are more comfortable with SQL than with Java programming.
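Queries are typically submitted through the Hive CLI or, as in the sketch below, through the HiveServer2 JDBC driver; the host name and the access_logs table are illustrative assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Connect to HiveServer2 over JDBC; host, database, and table are illustrative.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                "jdbc:hive2://hiveserver:10000/default", "", "");
             Statement stmt = con.createStatement()) {

            // A familiar SQL query; Hive translates it into MapReduce jobs behind the scenes.
            ResultSet rs = stmt.executeQuery(
                    "SELECT user, COUNT(*) AS visits FROM access_logs GROUP BY user");
            while (rs.next()) {
                System.out.println(rs.getString("user") + "\t" + rs.getLong("visits"));
            }
        }
    }
}
```
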
The Hadoop ecosystem also contains several frameworks for integration with the rest of the
enterprise:

Sqoop is a connectivity tool for moving data between relational databases or data warehouses and Hadoop. Sqoop uses the database to describe the schema of the imported/exported data, and MapReduce for parallelization and fault tolerance.
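A typical import looks like the command below, which copies a relational table into an HDFS directory using several parallel map tasks; the JDBC URL, credentials, table, and target directory are illustrative. Sqoop reads the table's schema from the database and then generates a MapReduce job in which each mapper imports a slice of the rows.

```bash
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username reporting \
  -P \
  --table orders \
  --target-dir /user/demo/orders \
  --num-mappers 4
```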

Flume is a distributed, reliable, and highly available service for efficiently collecting, aggregating, and moving large amounts of data from individual machines into HDFS. It is based on a simple and flexible architecture and provides streaming data flows. It uses a simple, extensible data model, allowing you to move data from multiple machines within an enterprise into Hadoop.
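A Flume agent is described by a properties file that wires a source to a sink through a channel. The sketch below tails an application log and writes the events into HDFS; the agent name, file paths, and HDFS URL are illustrative assumptions.

```properties
# Name the components of this agent (agent name "agent1" is illustrative).
agent1.sources  = applog
agent1.channels = mem
agent1.sinks    = tohdfs

# Source: tail a local application log file.
agent1.sources.applog.type     = exec
agent1.sources.applog.command  = tail -F /var/log/app/app.log
agent1.sources.applog.channels = mem

# Channel: buffer events in memory between source and sink.
agent1.channels.mem.type     = memory
agent1.channels.mem.capacity = 10000

# Sink: write events into HDFS, partitioned by day.
agent1.sinks.tohdfs.type                   = hdfs
agent1.sinks.tohdfs.channel                = mem
agent1.sinks.tohdfs.hdfs.path              = hdfs://namenode:8020/flume/events/%Y-%m-%d
agent1.sinks.tohdfs.hdfs.fileType          = DataStream
agent1.sinks.tohdfs.hdfs.useLocalTimeStamp = true
```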
