01 The challenge and importance of data wrangling

The step after Data Acquisition and before Analysis in the data management flow is Data Wrangling, also called Data Cleaning or Data Preparation.

In this step, data is cleaned, tested, and prepared to be the best possible input for the analysis process. In other words, we need to ensure that our data is of high quality, a property defined by these metrics:

Accuracy: The data was recorded correctly.
Completeness: All relevant data was recorded.
Uniqueness: Entities are recorded once.
Timeliness: The data is kept up to date (and time consistency is guaranteed).
Consistency: The data agrees with itself.
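Some of these metrics can be approximated with simple checks. The sketch below is only an illustration: the records, the field names (`id`, `email`, `signup`, `last_order`) and the consistency rule are all hypothetical. It estimates completeness, uniqueness and consistency; accuracy is left out because it cannot be checked without an external source of truth.

```python
from datetime import date

# Hypothetical customer records; None marks a missing value.
records = [
    {"id": 1, "email": "ann@example.com", "signup": date(2023, 1, 5), "last_order": date(2023, 3, 1)},
    {"id": 2, "email": None, "signup": date(2023, 2, 10), "last_order": date(2023, 2, 1)},
    {"id": 2, "email": "bob@example.com", "signup": date(2023, 2, 10), "last_order": date(2023, 4, 2)},
    {"id": 3, "email": None, "signup": date(2021, 6, 30), "last_order": None},
]

def completeness(rows, field):
    """Share of rows in which `field` is present."""
    return sum(r[field] is not None for r in rows) / len(rows)

def uniqueness(rows, key):
    """Number of distinct `key` values divided by the number of rows."""
    return len({r[key] for r in rows}) / len(rows)

def consistency(rows):
    """Share of rows whose last order, if any, is not dated before signup."""
    ok = sum(
        r["last_order"] is None or r["last_order"] >= r["signup"]
        for r in rows
    )
    return ok / len(rows)

print("completeness(email):", completeness(records, "email"))
print("uniqueness(id):", uniqueness(records, "id"))
print("consistency:", consistency(records))
```

Each function returns a ratio between 0 and 1. What counts as an acceptable value depends on the analysis the data is meant to feed.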

Unfortunately, it isn't easy to define and measure these metrics. In fact, they can be:

Unmeasurable: Accuracy and completeness are extremely difficult, perhaps impossible, to measure.
Context independent: They do not account for what is important. E.g., if you are computing aggregates, you can tolerate a lot of inaccuracy.
Incomplete: What about interpretability, accessibility, metadata, analysis, etc.?
Vague: The conventional definitions provide no guidance towards practical improvements of the data.
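The point about aggregates can be illustrated with a small sketch. The values and the recording errors below are made up, and the errors are assumed to be roughly symmetric around zero, which is not guaranteed in real data.

```python
from statistics import mean

# Hypothetical true measurements.
true_values = [100, 102, 98, 101, 99, 100, 103, 97]

# Hypothetical recording errors: large per record, but not biased in one direction.
errors = [8, -7, 6, -8, 7, -6, 5, -4]
recorded = [t + e for t, e in zip(true_values, errors)]

# Average size of the error on a single record.
per_record_error = mean(abs(e) for e in errors)

# Error on the aggregate (the mean of all records).
aggregate_error = abs(mean(recorded) - mean(true_values))

print("mean absolute error per record:", per_record_error)
print("error of the mean:", aggregate_error)
```

Here the individual records are off by several units on average, while the mean is off by a fraction of a unit, because the errors largely cancel out. A systematic bias (every record too high, for instance) would not cancel, so this tolerance holds only under the stated assumption.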

For these reasons, data wrangling is often the most crucial, difficult, and predominant task of a data scientist or data engineer.

If the data wrangling step is done poorly, the analysis may be run on bad data, which in turn can lead to bad decisions and bad business outcomes for the company.
