What is Pig?
Pig provides an engine for executing data flows in
parallel on Hadoop. It includes a language, Pig Latin, for expressing these
data flows. Pig Latin includes operators for many of the traditional data
operations (join, filter, sort, etc.), as well as the ability for users to
develop their own functions for reading, processing, and writing data.
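As a quick illustration, here is a minimal Pig Latin sketch of such a data flow. The file name, field names, and delimiter are assumptions made for the example, not part of any real dataset:

-- Load a tab-delimited file of click records (hypothetical schema).
clicks = LOAD 'clicks.txt' AS (user_id:chararray, url:chararray, time:long);

-- Filter: keep only clicks on a particular site.
site_clicks = FILTER clicks BY url MATCHES '.*example.com.*';

-- Group and count clicks per user, then sort by the count.
grouped = GROUP site_clicks BY user_id;
counts  = FOREACH grouped GENERATE group AS user_id, COUNT(site_clicks) AS n;
ordered = ORDER counts BY n DESC;

-- Write the result back out.
STORE ordered INTO 'click_counts';

Each statement defines a new relation from an earlier one, which is what makes the script a data flow rather than a sequence of imperative commands.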
Pig is an Apache open source project. This means users
are free to download it as source or binary, use it for themselves, contribute
to it, and—under the terms of the Apache License—use it in their products and
change it as they see fit.
Pig on Hadoop
Pig runs on Hadoop. It makes use of both the Hadoop
Distributed File System (HDFS) and Hadoop's processing system, MapReduce.
HDFS is a distributed filesystem that stores files
across all of the nodes in a Hadoop cluster. It handles breaking the files into
large blocks and distributing them across different machines, including making
multiple copies of each block so that if any one machine fails no data is lost.
It presents a POSIX-like interface to users. By default, Pig reads input files
from HDFS, uses HDFS to store intermediate data between MapReduce jobs, and
writes its output to HDFS.
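For example, the LOAD and STORE statements in a script can name HDFS locations directly; the paths below are hypothetical, and a relative path would simply be resolved against the user's HDFS home directory:

-- Read from and write to HDFS (paths are hypothetical).
raw = LOAD '/user/alice/input/access.log' AS (user_id:chararray, url:chararray);
STORE raw INTO '/user/alice/output/access_copy';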
MapReduce is a simple but powerful parallel
data-processing paradigm. Every job in MapReduce consists of three main phases:
map, shuffle, and reduce. In the map phase, the application has the opportunity
to operate on each record in the input separately. Many maps are started at
once so that while the input may be gigabytes or terabytes in size, given
enough machines, the map phase can usually be completed in under one minute.
Part of the specification of a MapReduce job is the key
on which data will be collected. For example, if you were processing web server
logs for a website that required users to log in, you might choose the user ID
to be your key so that you could see everything done by each user on your
website. In the shuffle phase, which happens after the map phase, data is
collected together by the key the user has chosen and distributed to different
machines for the reduce phase. Every record for a given key will go to the same
reducer.
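In Pig Latin terms, a GROUP BY statement is what chooses that key: when Pig compiles the script into MapReduce, the grouping field becomes the shuffle key, so every record for a given user ID lands on the same reducer. The relation and field names below are assumptions for the sketch:

-- Hypothetical web-server log with a user ID per record.
logs = LOAD 'access_logs' AS (user_id:chararray, url:chararray, time:long);

-- Grouping by user_id becomes the shuffle key of the underlying MapReduce
-- job: all records for a given user are collected onto one reducer.
by_user = GROUP logs BY user_id;
actions_per_user = FOREACH by_user GENERATE group AS user_id, COUNT(logs) AS actions;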