11 Hadoop
Slides: https://webeep.polimi.it/mod/resource/view.php?id=52305
Pages
Introduction
A software platform for easily processing vast amounts of data. The main features are: Scalable: it can reliably store and process petabytes. Economical: it distributes the data and processing across…
HDFS Architecture, Security and Configuration
HDFS Architecture. NameNode: manages the file system namespace; maps a file name to a set of blocks; maps a block to the DataNodes where it resides; cluster configuration management; replication engine for…
HDFS Commands
Shell Commands. There are two types of shell commands. User commands: hdfs dfs (runs filesystem commands on HDFS); hdfs fsck (runs an HDFS filesystem checking command). Administration commands: hdfs…
Hadoop 2.0
YARN splits up the two major functions of the JobTracker: a global Resource Manager (cluster resource management) and an Application Master (job scheduling and monitoring, one per application). The Application…
MapReduce
Introduction. MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster. A MapReduce program is…
Hadoop key components
Input Splitter: responsible for splitting your input into multiple chunks (default is 64 MB). These chunks are then used as input for your mappers. It splits on logical boundaries. Typically, you can…
Scheduling
By default, Hadoop uses FIFO to schedule jobs. Alternative scheduler options: capacity and fair. Capacity Scheduler: jobs are submitted to queues; jobs can be prioritized; queues are allocated a fraction…