Introduction
Hadoop is a software platform for processing vast amounts of data with ease. Its main features are:
- Scalable: it can reliably store and process petabytes of data.
- Economical: it distributes the data and processing across clusters of commonly available commodity computers, numbering in the thousands.
- Efficient: by distributing the data, it can process it in parallel on the nodes where the data is located.
- Reliable: it automatically maintains multiple copies of the data and automatically redeploys computing tasks after failures.
Hadoop implements Google's MapReduce programming model and uses HDFS (the Hadoop Distributed File System) for storage.
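To make the MapReduce model concrete, here is a minimal word-count job sketched against Hadoop's Java MapReduce API. The class names and input/output paths are illustrative; a real job would be packaged into a jar and submitted to the cluster.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input line.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory on HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output written back to HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```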
Components overview
- Apache Hadoop
- Apache Hive
- Apache Pig
- Apache HBase
- Apache Zookeeper
- Flume, Hue, Oozie, and Sqoop
HDFS
HDFS is the file system responsible for storing data on the cluster. Data files are split into blocks and distributed across the nodes in the cluster, and each block is replicated multiple times.
- Written in Java, modeled on Google's GFS
- Provides redundant storage for massive amounts of data
- Works best with a smaller number of large files
- Files in HDFS are write-once, read-many
- Optimized for streaming reads of large files rather than random reads (a short read/write sketch follows this list)
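The write-once, read-many access pattern shows up directly in the HDFS client API: a file is created and streamed out once, then opened and read sequentially as often as needed. A minimal sketch using the Java FileSystem API, assuming the client configuration points at the cluster and using a hypothetical path:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    // Assumes fs.defaultFS in the loaded configuration points at the cluster's NameNode.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/user/demo/example.txt"); // hypothetical path

    // Write once: create the file and stream data into it.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read many: open the file and stream it back sequentially.
    try (BufferedReader reader =
             new BufferedReader(new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}
```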
File storage
- Files are split into blocks, and these blocks are distributed across many machines at load time
- Different blocks from the same file are stored on different machines
- Blocks are replicated across multiple machines
- The NameNode keeps track of which blocks make up a file and where they are stored (the sketch after this list queries that metadata from a client)
- A single namespace for the entire cluster
- The default replication factor is 3
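Because the NameNode tracks block and replication metadata, a client can ask for it directly. The sketch below, using a hypothetical file path, prints a file's replication factor, block size, and the hosts storing each block:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockReport {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/user/demo/example.txt"); // hypothetical path

    FileStatus status = fs.getFileStatus(file);
    System.out.println("Replication factor: " + status.getReplication());
    System.out.println("Block size: " + status.getBlockSize());

    // Ask the NameNode which hosts hold each block of the file.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("Block at offset " + block.getOffset()
          + " stored on " + String.join(", ", block.getHosts()));
    }
  }
}
```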
Hadoop Properties
- Fault Tolerance:
  - Worker failures are handled by re-executing the failed tasks on other nodes
  - Master failure is rare and can be recovered from checkpoints
- Disk Locality:
  - Leverages HDFS block placement information
  - Map tasks are scheduled close to the data they process
  - Conserves network bandwidth
- Task Granularity:
  - The number of map tasks is typically much larger than the number of worker nodes, which improves load balancing but increases the bookkeeping load on the master
  - The number of map tasks is usually chosen with respect to the block size of the file system
  - The number of reduce tasks is usually specified by the user (see the sketch after this list)
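A brief sketch of how these granularity choices surface in the Java API: the reduce count is set explicitly on the job, while the map count falls out of the input splits, which by default follow HDFS block boundaries and carry the locality hints the scheduler uses to place map tasks near their data. The input path and reducer count here are illustrative.

```java
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class GranularityDemo {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "granularity demo");
    FileInputFormat.addInputPath(job, new Path("/user/demo/input")); // hypothetical input directory

    // The number of reduce tasks is chosen explicitly by the user.
    job.setNumReduceTasks(4);

    // The number of map tasks follows from the input splits, which by default
    // track HDFS block boundaries; each split also reports the hosts storing
    // that block, and the scheduler uses these hints for locality.
    TextInputFormat inputFormat = new TextInputFormat();
    List<InputSplit> splits = inputFormat.getSplits(job);
    System.out.println("Map tasks: " + splits.size());
    for (InputSplit split : splits) {
      System.out.println(split + " -> hosts: " + String.join(", ", split.getLocations()));
    }
  }
}
```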