01 NoSQL General Concepts
Differences from the traditional data model
Schema-Less Approach
These new types of data models require a lot of flexibility, which is usually implemented by getting rid of the traditional fixed schema of the relational model, using a so-called "schema-less" approach where data doesn't need to strictly respect a predefined structure in order to be stored in the database.
This way the definition of the schema is postponed from the moment in which the data is written to the moment when the data is read. This approach is called Schema-On-Read, as opposed to the traditional Schema-On-Write. This new philosophy has many advantages over the old one:
- Since we don't have to check that the data conforms to a schema, database writes are much faster
- I/O is reduced, since when reading data we can fetch only the properties we need for that particular task
- The same data can assume a different schema according to the needs of the analytical job that is reading it, optimizing performance (see the sketch after this list)
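A minimal Python sketch of the idea (the names here are hypothetical and not tied to any specific NoSQL product): records of different shapes are written as-is, and a schema is applied only when a job reads them, projecting just the fields it needs.

```python
import json

# Schema-less write: each record is stored as-is, no predefined structure is enforced.
raw_store = []  # stands in for a document store / data lake file

raw_store.append(json.dumps({"user": "alice", "age": 34, "city": "Milan"}))
raw_store.append(json.dumps({"user": "bob", "clicks": [1, 5, 7]}))  # different shape, still accepted

# Schema-on-read: the "schema" is applied only when an analytical job reads the data,
# fetching just the properties it needs for this particular task.
def read_with_schema(store, fields):
    for line in store:
        record = json.loads(line)
        yield {f: record.get(f) for f in fields}

# A job interested only in user names and cities projects just those two fields.
for row in read_with_schema(raw_store, ["user", "city"]):
    print(row)
```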
Data Lakes
To better define this new kind of data storage system, the concept of a Data Lake was introduced: a reservoir of data, essentially a rough container, used to support the needs of the company.
A Data Lake has incoming flows of both structured and unstructured data, and outgoing flows consumed by the analytical jobs. This way an organization has a single centralized data source for all its analytical needs, as opposed to silo-based approaches.
Building such a huge and centralized data storage system presents many challenges: it's important to organize and index all the incoming data well, or we could end up with a huge Data Swamp where it's impossible for anyone to find what they're looking for.
For this reason, in a Data Lake the incoming raw data is usually incrementally refined and enriched in order to improve accessibility, by adding metadata and indexing tags and by performing quality checks.
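As a purely illustrative Python sketch (the catalog structure and field names are assumptions, not a real data lake API), incoming raw files can be registered with metadata and indexing tags so that analysts can later discover them by tag instead of scanning the whole lake.

```python
from datetime import datetime, timezone

# Hypothetical catalog entry attached to each raw object landing in the lake.
def catalog_entry(path, source, tags):
    return {
        "path": path,                                          # where the raw object lives
        "source": source,                                      # originating system
        "ingested_at": datetime.now(timezone.utc).isoformat(), # ingestion timestamp
        "tags": tags,                                          # indexing tags for discoverability
        "quality_checked": False,                              # flipped after validation runs
    }

catalog = [
    catalog_entry("raw/sales/2024-01.csv", "erp", ["sales", "monthly"]),
    catalog_entry("raw/clickstream/day1.json", "web", ["clickstream", "events"]),
]

# Analysts can now locate data by tag instead of wading through a data swamp.
sales_files = [e["path"] for e in catalog if "sales" in e["tags"]]
print(sales_files)
```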
Scalability
As we previously mentioned, traditional data systems scale vertically: when the machine doesn't have enough RAM or disk to store all the data, it is simply replaced with a bigger, more powerful one. This approach only scales up to a certain point, beyond which it is no longer efficient. When dealing with Big Data we therefore need to scale horizontally: the computing system is composed of many commodity machines, and when more resources are needed, additional machines are added without replacing the old ones.
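A rough Python sketch of the horizontal approach, using simple hash partitioning over a hypothetical set of commodity nodes (real systems typically use consistent hashing so that adding a node moves only a fraction of the keys):

```python
import hashlib

# Hypothetical cluster of commodity machines; adding capacity means appending a node,
# not replacing the existing ones with a bigger machine.
nodes = ["node-1", "node-2", "node-3"]

def node_for(key, nodes):
    # Hash partitioning: each record is routed to one machine based on its key,
    # spreading data and load horizontally across the cluster.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return nodes[h % len(nodes)]

for user_id in ["alice", "bob", "carol", "dave"]:
    print(user_id, "->", node_for(user_id, nodes))

# Scaling out: add a fourth machine instead of buying a more powerful one.
nodes.append("node-4")
```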
Data pipeline
Ingestion
The process of importing, transferring and loading data for storage and later use.
It involves loading data from a variety of sources, and it can involve altering and modifying individual files so that they fit a format that optimizes storage.
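A small Python sketch of ingestion under these assumptions (the sources and the staging target are invented stand-ins): records are pulled from heterogeneous files and rewritten into a single format that is more convenient to store.

```python
import csv, json, io

# Hypothetical sources: a CSV export and a JSON dump, names invented for illustration.
csv_source = io.StringIO("id,amount\n1,9.5\n2,3.0\n")
json_source = io.StringIO('[{"id": 3, "amount": 7.25}]')

def ingest(target):
    # Load records from heterogeneous sources and normalize field types...
    rows = list(csv.DictReader(csv_source)) + json.load(json_source)
    # ...then write them in a single storage-friendly format (JSON Lines here).
    for row in rows:
        target.write(json.dumps({"id": int(row["id"]), "amount": float(row["amount"])}) + "\n")

staging = io.StringIO()  # stands in for a file or object in the storage layer
ingest(staging)
print(staging.getvalue())
```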
Data Wrangling
The process of cleansing "raw" data and transforming it into data that can be analysed to generate valid actionable insights.
It includes understanding, cleansing, augmenting and shaping data. The result is data in the best format (e.g., columnar) for the analysis to be performed.
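A hedged Python sketch of the idea (the records and columns are made up): malformed rows are cleansed away, and the surviving rows are reshaped into a columnar layout, one list per column, which many analytical engines prefer.

```python
# Hypothetical raw records with missing and malformed values.
raw = [
    {"user": "alice", "age": "34", "spend": "120.5"},
    {"user": "bob", "age": None, "spend": "80"},     # missing age -> dropped
    {"user": "carol", "age": "29", "spend": "abc"},  # malformed spend -> dropped
]

def wrangle(records):
    clean = []
    for r in records:
        try:
            clean.append({"user": r["user"], "age": int(r["age"]), "spend": float(r["spend"])})
        except (TypeError, ValueError):
            continue  # cleansing: discard rows that cannot be repaired
    # Shaping: pivot row-oriented records into a columnar layout for analysis.
    return {col: [row[col] for row in clean] for col in ("user", "age", "spend")}

print(wrangle(raw))
```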
Extract Transform Load
The process that extracts data from heterogeneous sources, transforms it into the schema that best fits the analysis to be performed, and loads it into the system that will perform the analysis.
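The three steps can be sketched in a few lines of Python (the source records, table name and schema are hypothetical; SQLite merely stands in for the analytical system):

```python
import sqlite3

# Hypothetical source records, e.g. exported from an operational system.
source_orders = [
    {"order_id": 1, "customer": "alice", "amount_eur": "19.99"},
    {"order_id": 2, "customer": "bob", "amount_eur": "5.00"},
]

def extract():
    # E: pull raw records from the source.
    return source_orders

def transform(rows):
    # T: cast the records into the schema expected by the analysis.
    return [(r["order_id"], r["customer"], float(r["amount_eur"])) for r in rows]

def load(rows, conn):
    # L: write the transformed records into the analytical store.
    conn.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
print(conn.execute("SELECT * FROM orders").fetchall())
```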