Introduction
Hadoop is a software platform for processing vast amounts of data with ease. Its main features are:
- Scalable: it can reliably store and process petabytes of data.
- Economical: it distributes the data and processing across clusters of commonly available commodity computers, numbering in the thousands.
- Efficient: by distributing the data, it can process it in parallel on the nodes where the data is located.
- Reliable: it automatically maintains multiple copies of the data and automatically redeploys computing tasks after failures.
Hadoop implements Google's MapReduce programming model and uses HDFS (the Hadoop Distributed File System) for storage.
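To make the MapReduce model concrete, here is a minimal word-count job sketched against Hadoop's Java MapReduce API. The class names and input/output paths are illustrative; a real job would be packaged into a jar and submitted to the cluster.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input line.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory on HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output written back to HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```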
Components overview
- Apache Hadoop
- Apache Hive
- Apache Pig
- Apache HBase
- Apache Zookeeper
- Flume, Hue, Oozie, and Sqoop
HDFS
HDFS is the file system responsible for storing data on the cluster. Data files are split into blocks and distributed across the nodes in the cluster, and each block is replicated multiple times.
- Written in Java, modeled on Google's GFS
- Provides redundant storage for massive amounts of data
- Works best with a smaller number of large files
- Files in HDFS are write-once, read-many
- Optimized for streaming reads of large files rather than random reads (a short read/write sketch follows this list)
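The write-once, read-many access pattern shows up directly in the HDFS client API: a file is created and streamed out once, then opened and read sequentially as often as needed. A minimal sketch using the Java FileSystem API, assuming the client configuration points at the cluster and using a hypothetical path:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    // Assumes fs.defaultFS in the loaded configuration points at the cluster's NameNode.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/user/demo/example.txt"); // hypothetical path

    // Write once: create the file and stream data into it.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read many: open the file and stream it back sequentially.
    try (BufferedReader reader =
             new BufferedReader(new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}
```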
File storage
- Files are split into blocks, and these blocks are distributed across many machines at load time
- Different blocks from the same file are stored on different machines
- Blocks are replicated across multiple machines
- The NameNode keeps track of which blocks make up a file and where they are stored (the sketch after this list queries that metadata from a client)
- A single namespace for the entire cluster
- The default replication factor is 3
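Because the NameNode tracks block and replication metadata, a client can ask for it directly. The sketch below, using a hypothetical file path, prints a file's replication factor, block size, and the hosts storing each block:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockReport {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/user/demo/example.txt"); // hypothetical path

    FileStatus status = fs.getFileStatus(file);
    System.out.println("Replication factor: " + status.getReplication());
    System.out.println("Block size: " + status.getBlockSize());

    // Ask the NameNode which hosts hold each block of the file.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("Block at offset " + block.getOffset()
          + " stored on " + String.join(", ", block.getHosts()));
    }
  }
}
```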
Hadoop Properties
- Fault Tolerance:
  - Worker failures are handled by re-executing the failed tasks on other nodes
  - Master failure is rare and can be recovered from checkpoints
- Disk Locality:
  - Leverages HDFS block placement information
  - Map tasks are scheduled close to the data they process
  - Conserves network bandwidth
- Task Granularity:
  - The number of map tasks is typically much larger than the number of worker nodes, which improves load balancing but increases the bookkeeping load on the master
  - The number of map tasks is usually chosen with respect to the block size of the file system
  - The number of reduce tasks is usually specified by the user (see the sketch after this list)
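A brief sketch of how these granularity choices surface in the Java API: the reduce count is set explicitly on the job, while the map count falls out of the input splits, which by default follow HDFS block boundaries and carry the locality hints the scheduler uses to place map tasks near their data. The input path and reducer count here are illustrative.

```java
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class GranularityDemo {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "granularity demo");
    FileInputFormat.addInputPath(job, new Path("/user/demo/input")); // hypothetical input directory

    // The number of reduce tasks is chosen explicitly by the user.
    job.setNumReduceTasks(4);

    // The number of map tasks follows from the input splits, which by default
    // track HDFS block boundaries; each split also reports the hosts storing
    // that block, and the scheduler uses these hints for locality.
    TextInputFormat inputFormat = new TextInputFormat();
    List<InputSplit> splits = inputFormat.getSplits(job);
    System.out.println("Map tasks: " + splits.size());
    for (InputSplit split : splits) {
      System.out.println(split + " -> hosts: " + String.join(", ", split.getLocations()));
    }
  }
}
```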