
Hadoop key components

Input Splitter

Is responsible for splitting your input into multiple chunks (64 MB by default). These chunks are then used as input for your mappers. Splits fall on logical boundaries (e.g., line breaks), so no record is cut in half.

Typically, you can just use one of the built-in splitters, unless you are reading a specially formatted file.
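As a minimal sketch (the class name InputSetup and the path are ours, not Hadoop's), selecting the built-in line-oriented splitter in the org.apache.hadoop.mapreduce API looks like this:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputSetup {
    public static void configure(Job job) throws Exception {
        // TextInputFormat carves files into byte-range splits, and its record
        // reader snaps to line boundaries so no line is handed to two mappers.
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/data/input")); // hypothetical path
    }
}
```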

Mapper

Reads in an input pair <K, V> (one record from the section produced by the input splitter) and outputs zero or more pairs <K', V'>.
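A minimal sketch of a mapper using the classic word-count example (the class name TokenMapper is ours): here <K, V> is <byte offset, line of text> and <K', V'> is <word, 1>.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        // Called once per record; emit <K', V'> = <word, 1> for every token.
        for (String token : line.toString().split("\\s+")) {
            if (token.isEmpty()) continue;
            word.set(token);
            ctx.write(word, ONE);
        }
    }
}
```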

Reducer

Accepts the mapper output and groups the values by key. All inputs with the same key must go to the same reducer!

Input is typically sorted by key; output is written out exactly as produced.
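Continuing the word-count sketch (SumReducer is our name), a reducer that sums the counts collected for each word:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
            throws IOException, InterruptedException {
        // Every <word, 1> pair for this word arrives here, keys in sorted order.
        int sum = 0;
        for (IntWritable c : counts) sum += c.get();
        ctx.write(word, new IntWritable(sum));
    }
}
```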

Partitioner (Shuffler)

Decides which pairs are sent to which reducer. The default is simply: key.hashCode() % numOfReducers

Custom partitioning is often required.

It is important to choose a well-balanced partitioning function. If the partitioning is skewed, overloaded reduce tasks may delay the whole job's completion.
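A sketch of a custom partitioner, purely for illustration (routing by first letter is usually not well balanced, which is exactly the skew problem described above):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (key.getLength() == 0) return 0;
        // Mask the sign bit so the result is non-negative, as the built-in
        // HashPartitioner does with key.hashCode().
        return (key.charAt(0) & Integer.MAX_VALUE) % numPartitions;
    }
}
```

It is enabled with job.setPartitionerClass(FirstLetterPartitioner.class).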

Combiner

An optional intermediate reducer. It reduces the output of each mapper locally, cutting the bandwidth and sorting work that the shuffle must do.

Cannot change the data types: its input types must be the same as its output types (both match the mapper's output types), because the framework may apply it zero or more times.
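In the word-count sketch this constraint is already met: SumReducer consumes and produces <Text, IntWritable>, so it can double as the combiner.

```java
import org.apache.hadoop.mapreduce.Job;

public class CombinerSetup {
    public static void configure(Job job) {
        // Legal only because SumReducer's input and output types are equal;
        // the framework may run the combiner zero, one, or many times.
        job.setCombinerClass(SumReducer.class);
    }
}
```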

Output Committer

Is responsible for taking the reduce output and committing it to a file.

Typically, this committer needs a corresponding input splitter (so that another job can read the output back in as its input).

Again, the built-in committers are usually good enough, unless you need to output a special kind of file.
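A sketch for the common case (OutputSetup and the path are our names): TextOutputFormat commits each reducer's output as a part file of key<TAB>value lines, and KeyValueTextInputFormat is the matching splitter a follow-up job can use to read it back.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class OutputSetup {
    public static void configure(Job job) {
        // Each reducer commits a "key \t value" part file under the output dir.
        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path("/data/output")); // hypothetical path
    }
}
```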

Master

Responsible for scheduling & managing jobs (handled by the framework, no user code is necessary).

If a task fails to report progress (while reading input, writing output, etc.), crashes, or its machine goes down, it is assumed to be stuck; the framework kills it and re-launches the step with the same input.
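Putting the pieces together, the only user code the master needs is a driver that configures and submits the job (names reused from the sketches above); scheduling, monitoring, and retrying stuck tasks are then handled by the framework:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);   // optional local pre-reduce
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit and block; failed or stalled tasks are killed and
        // re-launched with the same input by the framework.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```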