SMBUD - Systems and Methods for Big and Unstructured Data
Summary built from the slides of lessons and exercises, plus the professor's notes.
01-02 Big Data and data-driven decisions
Slides: https://webeep.polimi.it/pluginfile.php/228925/mod_resource/content/2/SMBUD-01-bigdata_co...
01 The Data-driven Virtuous Cycle
Collect. Widely different data sources: logs, IoT, social media, e-commerce, traditional business da...
02 What is Big Data and new ways to solve problems
Definition of the four Vs. Volume (data at scale): terabytes to exabytes of data accumulated on ch...
03 Data-driven decisions
With data, decisions no longer have to be made in the dark or based on gut instinct; they can be ...
03 ER and Relational Data Models
Lesson: https://webeep.polimi.it/mod/resource/view.php?id=30596 Exercises: https://webeep.polimi...
01 ER
Foundations An entity–relationship model (or ER model) describes interrelated things of interest ...
02 Relational Model
Keys Superkey: a set of attributes K is a superkey for a relation r if r does not contain two dis...
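The superkey condition can be checked mechanically on a relation instance. The following is a minimal sketch, assuming a relation is represented as a list of Python dictionaries; the function name `is_superkey` and the sample `rentals` data are hypothetical and only serve as illustration.

```python
from typing import Iterable, Mapping, Sequence


def is_superkey(rows: Sequence[Mapping[str, object]], attrs: Iterable[str]) -> bool:
    """Return True if no two rows agree on all the attributes in attrs."""
    attrs = tuple(attrs)
    seen = set()
    for row in rows:
        # Project the row onto the candidate attribute set K.
        projection = tuple(row[a] for a in attrs)
        if projection in seen:
            return False  # two distinct tuples share the same values on K
        seen.add(projection)
    return True


# Hypothetical relation instance (rows are assumed to be distinct tuples).
rentals = [
    {"plate": "AB123CD", "customer": "alice", "day": 1},
    {"plate": "AB123CD", "customer": "bob", "day": 2},
    {"plate": "EF456GH", "customer": "alice", "day": 2},
]

print(is_superkey(rentals, ["plate"]))         # False
print(is_superkey(rentals, ["plate", "day"]))  # True
```

Note that this only tests one instance: a set of attributes that passes here is a superkey for this particular r, which does not by itself prove it is a key of the schema.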
03 ER Exercises
Exercise 1 Design an ER Model for a car rental system that manages the customers, the cars and th...
04 Data ingestion and API
Lesson: https://webeep.polimi.it/mod/resource/view.php?id=33753 Exercises: https://webeep.polimi...
01 Introduction to API
Data ingestion Data ingestion is the first and fundamental step of any Data Analysis Pipeline. Th...
02 RESTful API
REST is a standardized, resource-based way of designing APIs. A RESTful API uses the available HTTP ve...
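To make the verb-to-operation mapping concrete, here is a minimal sketch of a resource collection that dispatches on the HTTP verb. It runs entirely in memory, with no web framework or network involved; the class `CarCollection`, its `handle` method, and the choice of status codes are assumptions made for illustration, not an API taken from the course material.

```python
from typing import Dict, Optional, Tuple


class CarCollection:
    """Hypothetical in-memory resource collection, e.g. exposed at /cars."""

    def __init__(self) -> None:
        self._cars: Dict[int, dict] = {}
        self._next_id = 1

    def handle(self, method: str, car_id: Optional[int] = None,
               body: Optional[dict] = None) -> Tuple[int, object]:
        """Dispatch on the HTTP verb and return (status_code, payload)."""
        if car_id is None:
            if method == "POST":  # create a new resource in the collection
                new_id = self._next_id
                self._next_id += 1
                self._cars[new_id] = dict(body or {})
                return 201, {"id": new_id, **self._cars[new_id]}
            if method == "GET":  # read the whole collection
                return 200, [{"id": i, **c} for i, c in self._cars.items()]
            return 405, None  # verb not allowed on the collection
        if car_id not in self._cars:
            return 404, None
        if method == "GET":  # read one resource
            return 200, {"id": car_id, **self._cars[car_id]}
        if method == "PUT":  # replace one resource
            self._cars[car_id] = dict(body or {})
            return 200, {"id": car_id, **self._cars[car_id]}
        if method == "DELETE":  # remove one resource
            del self._cars[car_id]
            return 204, None
        return 405, None


cars = CarCollection()
print(cars.handle("POST", body={"model": "Panda"}))  # (201, {'id': 1, 'model': 'Panda'})
print(cars.handle("GET", 1))                         # (200, {'id': 1, 'model': 'Panda'})
```

The point of the sketch is that the resource is identified by the path (here, the optional `car_id`), while the verb alone selects the operation.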
03 Scraping
Web crawling, data crawling, and web scraping are all names for the process of data extrac-...
05 NoSQL introduction
Slides: https://webeep.polimi.it/mod/resource/view.php?id=37378
01 NoSQL General Concepts
Differences from the traditional data model. Schema-less approach: these new types of data model re...
02 Transactional Properties in NoSQL
Transactions in SQL, ACID. In the relational world we are used to having the concept of a transacti...
03 Brief NoSQL history
MultiValue databases at TRW in 1965. DBM is released by AT&T in 1979. Lotus Domino released in 1...
04 A map of NoSQL technologies
Key-Value Store A key that refers to a payload (actual content / data). E.g. MemcacheDB, Azure T...
06 Graph Stores
Slides: https://webeep.polimi.it/mod/resource/view.php?id=39780 Exercises: https://webeep.polimi...
01 Graph Theory
Basic definitions The Graph is a data structure that was first used in 1736 to represent a city a...
02 Graph Databases
Motivations The table based structure of relational databases makes it hard to represent relation...
03 Neo4j
The most popular graph database is Neo4j, implemented in Java by Neo Technology, and it has the...
Exam Questions
2021 06 22 Q1 (8 points) A dedicated online review and social networking system tracks the activi...
07 Document Databases
Slides: https://webeep.polimi.it/mod/resource/view.php?id=42228 Exercises: https://webeep.polimi...
01 Introduction
Document databases deviate from the entity-based normalized data model of relational databases and pr...
02 MongoDB
MongoDB is a document-oriented database that stores data within: Documents: consist of key-valu...
03 MongoDB Queries
Create Create a database: use database_name Create a collection: db.createCollection(name, option...
Exam questions
2021 06 22 Q2 (6 points) A dedicated online review and social networking system tracks the activi...
08 Key-value Databases
Slides: https://webeep.polimi.it/mod/resource/view.php?id=45426
01 Introduction
In many applications performance is an essential priority, and often a small delay in response ti...
02 Redis
Redis is an advanced key-value store, where keys can contain data structures such as strings, has...
03 Memcache
Memcache is a free & open source, high-performance, distributed memory object caching system that...
Exam questions
2021 02 04 Q1 A maintenance and service management company supports public administrations in the...
09 Columnar Databases
Slides: https://webeep.polimi.it/mod/resource/view.php?id=49240 Exercises: https://webeep.polimi...
01 Introduction
In recent years there has been an ever-growing need for technologies capable of handling large...
02 Cassandra
Apache Cassandra is a highly scalable, high-performance distributed database designed to handle l...
03 HBase
HBase Table: Split it into multiple regions: replicated across servers. One Store per ColumnFa...
04 Cassandra Query Language
To query the data stored within Cassandra, a dedicated query language named Cassandra Query Langu...
10 IR Based Databases - ELK
Slides: https://webeep.polimi.it/mod/resource/view.php?id=50601 Exercises: https://webeep.polimi...
01 ELK stack
Kibana: Visualize and Manage Elasticsearch: Store, Search and Analyze Logstash + Beats: Inges...
02 Elasticsearch
Elasticsearch stores data structures in JSON documents which are distributed and can be accessed ...
03 Elasticsearch operations
Creating an index: PUT /index_name Define a mapping: PUT /my_index/_mapping { "properties": { ...
04 Logstash
Working with Beats Beats focus on data collection and shipping while Logstash focuses on processi...
11 Hadoop
Slides: https://webeep.polimi.it/mod/resource/view.php?id=52305
Introduction
Software platform to easily process vast amounts of data. The main features are: Scalable: It c...
HDFS Architecture, Security and Configuration
HDFS Architecture NameNode Manages File System Namespace Maps a file name to a set of blocks M...
HDFS Commands
Shell Commands There are two types of shell commands: User Commands hdfs dfs – runs filesyste...
Hadoop 2.0
YARN Splits up the two major functions of JobTracker Global Resource Manager - Cluster resourc...
MapReduce
Introduction MapReduce is a programming model and an associated implementation for processing an...
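The programming model can be illustrated with the classic word count. The sketch below simulates the map, shuffle, and reduce phases in a single Python process; it does not use the Hadoop API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    # In a real cluster the map calls run in parallel on different input splits.
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


print(word_count(["big data", "big graphs"]))  # {'big': 2, 'data': 1, 'graphs': 1}
```

Because each map call depends only on its own record and each reduce call only on one key's values, both phases can be distributed across machines; only the shuffle requires moving data between them.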
Hadoop key components
Input Splitter Is responsible for splitting your input into multiple chunks (default is 64MB). Th...
Scheduling
By default, Hadoop uses FIFO to schedule jobs. Alternate scheduler options: capacity and fair Cap...
12-13 Hadoop Subprojects
Slides: https://webeep.polimi.it/mod/resource/view.php?id=52306
HBase
HBase is a key-valued row/column store modeled on Google’s Bigtable providing Bigtable-like capab...
Pig
Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. The language...
Hive
Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data...
Impala
Apache Impala is an open source massively parallel processing (MPP) SQL query engine for data sto...
Storm
Apache Storm is a distributed stream processing computation framework. Storm provides realtime co...
Flume
Apache Flume is a distributed, reliable, and available software for efficiently collecting, aggre...
Sqoop
Sqoop is a command-line interface application for transferring data between relational databases ...
14 Data Wrangling
Slides: https://webeep.polimi.it/mod/resource/view.php?id=56288
Exam questions on design
2021 01 15 Q1 (10 points) Suppose you need to design the data model that supports the structure a...