02 What is Big Data and new ways to solve problems
Definition of the four Vs
- Volume (data at scale): terabytes to exabytes of data accumulated on ever cheaper storage
- Variety (data in many forms): structured, unstructured, text, images, video and general multimedia
- Velocity (data in motion): streaming data analytics
- Veracity (data uncertainty): managing the reliability and predictability of inherently imprecise data types
Other Vs: Value, Volatili
"oil" metaphor for Big Data, Data Science and Data Engineering
Data is just like crude oil. It’s valuable, but if unrefined it cannot really be used. Oil has to be changed into gas, plastic, chemicals, etc. to create a valuable entity that drives profitable activity; in the same way, data must be broken down and analyzed for it to have value.
- Exploration: just like we need to find oil, we need to locate relevant data before we can extract it
- Extraction: after locating the data (oil), we need to extract it
- Transform: we then need to clean, filter and aggregate data (oil)
- Storage: data (oil) needs to be stored and this may be challenging if it is huge
- Transport: getting the data (oil) to the right person, organization or software tool (to the petrol station)
- Usage: while driving a car one consumes oil. Similarly, providing analysis results requires data
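The six stages above can be read as a small data pipeline. The following is only a minimal sketch under stated assumptions: the `SOURCES` dictionary, the temperature records, and all function names are hypothetical and chosen for illustration; a real pipeline would read from databases, files or APIs rather than an in-memory dictionary.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical raw records standing in for several "oil fields" (data sources).
SOURCES = {
    "sensor_log": [{"city": "Rome", "temp": "21.5"}, {"city": "Rome", "temp": "bad"}],
    "weather_api": [{"city": "Oslo", "temp": "4.0"}, {"city": "Rome", "temp": "22.5"}],
    "hr_records": [{"employee": "A. Smith"}],
}


def explore(sources):
    """Exploration: locate the sources that hold relevant data (a 'temp' field)."""
    return [name for name, rows in sources.items() if any("temp" in r for r in rows)]


def extract(sources, names):
    """Extraction: pull the raw records out of the located sources."""
    return [row for name in names for row in sources[name]]


def transform(rows):
    """Transform: clean (drop unparsable values), then aggregate (mean per city)."""
    temps = {}
    for row in rows:
        try:
            temps.setdefault(row["city"], []).append(float(row["temp"]))
        except (KeyError, ValueError):
            continue  # filter out records that cannot be refined
    return {city: sum(values) / len(values) for city, values in temps.items()}


def store(result, path):
    """Storage: persist the refined data (a JSON file in this sketch)."""
    path.write_text(json.dumps(result))


def transport(path):
    """Transport: deliver the stored data to the consumer (read it back here)."""
    return json.loads(path.read_text())


def use(result):
    """Usage: answer a question with the refined data (which city is warmest?)."""
    return max(result, key=result.get)


def run_pipeline(sources=SOURCES):
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "refined.json"
        names = explore(sources)
        rows = extract(sources, names)
        store(transform(rows), path)
        delivered = transport(path)
    return delivered, use(delivered)


if __name__ == "__main__":
    print(run_pipeline())
```

Each function maps to one bullet, so the order of calls in `run_pipeline` mirrors the order of the list. Keeping the stages separate is a design choice of this sketch; in practice storage and transport are often handled by dedicated systems.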
However, there are some important differences between data and oil:
- Copying data is relatively easy and cheap, whereas it is impossible to simply copy a product like oil.
- Data is specific, i.e., it relates to a specific event, object, and/or period. Different data elements are not exchangeable. Oil is very different: at a petrol station, drops of oil are not preallocated to a specific car on a specific day.
- Typically, data storage and transport are cheap (unless the data is really Big Data). In a communication network, data may travel (almost) at the speed of light and storage costs are much lower than the storage costs of oil.