12-13 Hadoop Subprojects

Slides: https://webeep.polimi.it/mod/resource/view.php?id=52306

Pages

HBase is a key-value, row/column store modeled on Google’s Bigtable that provides Bigtable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse…
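To make the idea of a sparse row/column store concrete, here is a minimal in-memory sketch. It is not HBase's API: the `SparseTable` class, its methods, and the `family:qualifier` column strings are hypothetical names chosen for illustration, and it assumes each cell keeps timestamped versions, as in Bigtable-style stores. The point is only that absent cells are never stored.

```python
from collections import defaultdict


class SparseTable:
    """Toy model of a sparse, versioned row/column store (not the HBase API)."""

    def __init__(self):
        # row key -> "family:qualifier" -> {timestamp: value}
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, value, timestamp):
        # Only cells that are explicitly written occupy any space.
        self._rows[row][column][timestamp] = value

    def get(self, row, column):
        """Return the newest value of a cell, or None if the cell is absent."""
        versions = self._rows.get(row, {}).get(column)
        if not versions:
            return None
        return versions[max(versions)]

    def cell_count(self):
        """Number of stored cell versions across the whole table."""
        return sum(
            len(versions)
            for columns in self._rows.values()
            for versions in columns.values()
        )


table = SparseTable()
table.put("user1", "info:name", "Ada", timestamp=1)
table.put("user1", "info:name", "Ada L.", timestamp=2)   # newer version
table.put("user2", "info:email", "bob@example.org", timestamp=1)

latest_name = table.get("user1", "info:name")     # newest version wins
missing_cell = table.get("user2", "info:name")    # never written, so None
```

Rows here do not share a fixed schema: `user2` has no `info:name` cell at all, which is what keeps a very wide but mostly empty table cheap to store.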

Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. The language for this platform is called Pig Latin. Pig can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache Spark. Pig Latin abstracts the…

Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis. Hive gives an SQL-like interface to query data stored in various databases and…

Apache Impala is an open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop. Integrated into Hadoop stack on the same level as…

Apache Storm is a distributed stream-processing computation framework. Storm provides real-time computation. Architecture: the Apache Storm cluster comprises the following critical components: Nodes:…

Apache Flume is a distributed, reliable, and available software for efficiently collecting, aggregating, and moving large amounts of log data. It is robust and fault tolerant with tunable…

Sqoop is a command-line interface application for transferring data between relational databases and Hadoop. Sqoop supports incremental loads of a single table or a free-form SQL query as well as…
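The idea behind an incremental load can be sketched without Sqoop itself. The example below is an illustration only: it uses Python's built-in `sqlite3` as a stand-in relational database, and the `incremental_append` function and the `orders` table are hypothetical. It assumes an append-style load in which a monotonically increasing check column marks which rows have already been transferred.

```python
import sqlite3


def incremental_append(conn, table, check_column, last_value):
    """Fetch rows whose check column exceeds last_value.

    Returns the new rows and the updated high-water mark. Table and column
    names are interpolated directly, which is acceptable only in a sketch
    with trusted inputs.
    """
    query = f"SELECT * FROM {table} WHERE {check_column} > ? ORDER BY {check_column}"
    cursor = conn.execute(query, (last_value,))
    columns = [description[0] for description in cursor.description]
    rows = cursor.fetchall()
    check_index = columns.index(check_column)
    # If nothing new arrived, keep the previous high-water mark.
    new_last = rows[-1][check_index] if rows else last_value
    return rows, new_last


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])

# First run: everything after id 0 is new.
first_batch, last = incremental_append(conn, "orders", "id", 0)

# A row arrives later; the second run transfers only that row.
conn.execute("INSERT INTO orders VALUES (4, 'd')")
second_batch, last = incremental_append(conn, "orders", "id", last)
```

Persisting the high-water mark between runs is what keeps each load from re-reading the whole table.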