Friday, July 19, 2013

Hadoop Ecosystem


In this post, I would like to explain:
  1. What is Hadoop?
  2. What are the core modules of Hadoop?
  3. What are the Hadoop-related projects at Apache?
  4. What are the Hadoop-related projects not at Apache?

Section 1: What is Hadoop?

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

      It is designed to scale out from single servers to thousands of machines, each offering local computation and storage.

      Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.


Section 2: What are the core modules of Hadoop?

There are 4 core modules of Hadoop. They are:
  1. Hadoop Common: The common utilities that support the other Hadoop modules.
  2. Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
  3. Hadoop YARN: A framework for job scheduling and cluster resource management.
  4. Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
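
To make the MapReduce programming model concrete, here is a minimal word-count sketch against the org.apache.hadoop.mapreduce API. The class names and the simple whitespace tokenization are illustrative only; a small driver class would wrap these two classes in a Job, submit it to YARN, and read its input from (and write its output to) HDFS.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper: emits (word, 1) for every word in an input line.
    public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
          word.set(tokens.nextToken());
          context.write(word, ONE);
        }
      }
    }

    // Reducer: sums the per-word counts emitted by the mappers.
    class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

      @Override
      protected void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
          sum += v.get();
        }
        context.write(key, new IntWritable(sum));
      }
    }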


Section 3: What are the Hadoop-related projects at Apache?

Hadoop-related projects at Apache include:
  1. Ambari: The Apache Ambari project is aimed at making Hadoop management simpler by developing software for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs (see the sketch after this list). The set of Hadoop components currently supported by Ambari includes HDFS, MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop.

  2. Avro: A data serialization system.

  3. Cassandra: The Apache Cassandra multi-master database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data. Cassandra's support for replicating across multiple data centers is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages.

  4. Chukwa: A data collection system for managing large distributed systems. 

  5. Flume: Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms (see the sketch after this list).

  6. HBase: Apache HBase is an open-source, distributed, versioned, column-oriented store for large tables. It is modeled after Google's Bigtable: A Distributed Storage System for Structured Data. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS. Use it when you need random, realtime read/write access to your Big Data (see the sketch after this list). This project's goal is the hosting of very large tables -- billions of rows by millions of columns -- atop clusters of commodity hardware.

  7. Hive: Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto this data and query it using a SQL-like language called HiveQL (see the sketch after this list). At the same time, this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

  8. Mahout: A scalable machine learning and data mining library.

  9. Oozie: Oozie is a workflow scheduler system to manage Apache Hadoop jobs. Its Workflow jobs are Directed Acyclic Graphs (DAGs) of actions, and its Coordinator jobs are recurrent Workflow jobs triggered by time (frequency) and data availability.

  10. Pig: Pig provides an engine for executing data flows in parallel on Hadoop. It includes a language, Pig Latin, for expressing these data flows (see the sketch after this list). Pig Latin includes operators for many of the traditional data operations (join, sort, filter, etc.), as well as the ability for users to develop their own functions for reading, processing, and writing data.

  11. Sqoop: Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.

  12. ZooKeeper: A high-performance coordination service for distributed applications. It is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services (see the sketch below).
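
To make a few of the projects above more concrete, here are some minimal sketches. First, Ambari: its management web UI sits on a REST API, and a client can list the clusters an Ambari server manages. The host name, port 8080, and the admin/admin credentials below are assumptions (common defaults), not something taken from this post.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import javax.xml.bind.DatatypeConverter;

    public class AmbariRestSketch {
      public static void main(String[] args) throws Exception {
        // List the clusters this Ambari server manages (host, port, and credentials are assumptions).
        URL url = new URL("http://ambari-host:8080/api/v1/clusters");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        String auth = DatatypeConverter.printBase64Binary("admin:admin".getBytes("UTF-8"));
        conn.setRequestProperty("Authorization", "Basic " + auth);

        // The response is JSON describing each cluster registered with Ambari.
        BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
        String line;
        while ((line = in.readLine()) != null) {
          System.out.println(line);
        }
        in.close();
      }
    }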
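
For Flume, the agent itself (source, channel, sink) is configured separately; a client process can then hand events to an agent's Avro source over RPC. The agent host name and port below are assumptions.

    import java.nio.charset.Charset;

    import org.apache.flume.Event;
    import org.apache.flume.api.RpcClient;
    import org.apache.flume.api.RpcClientFactory;
    import org.apache.flume.event.EventBuilder;

    public class FlumeClientSketch {
      public static void main(String[] args) throws Exception {
        // Connect to a Flume agent's Avro source (host and port are assumptions).
        RpcClient client = RpcClientFactory.getDefaultInstance("flume-agent-host", 41414);

        // Build one log event and append it; the agent routes it through its channel to a sink.
        Event event = EventBuilder.withBody("app started", Charset.forName("UTF-8"));
        client.append(event);

        client.close();
      }
    }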
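
For HBase, random, realtime read/write access means single-row puts and gets through the client API. This sketch uses the pre-1.0 HTable client that was current when this post was written; the table name "webtable", the column family "info", and the row key are made up for illustration.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "webtable");  // table name is an assumption

        // Random write: one cell addressed by row key, column family, and qualifier.
        Put put = new Put(Bytes.toBytes("row-001"));
        put.add(Bytes.toBytes("info"), Bytes.toBytes("title"), Bytes.toBytes("Hadoop Ecosystem"));
        table.put(put);

        // Random read of the same cell.
        Result result = table.get(new Get(Bytes.toBytes("row-001")));
        byte[] title = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("title"));
        System.out.println(Bytes.toString(title));

        table.close();
      }
    }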
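
For Hive, a HiveQL query can be sent through the HiveServer2 JDBC driver, and Hive compiles it into MapReduce jobs behind the scenes. The server address, the default database, and the page_views table below are assumptions.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQLSketch {
      public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; host, database, and table are assumptions.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection conn =
            DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "");

        Statement stmt = conn.createStatement();
        // A HiveQL aggregation; Hive turns this into one or more MapReduce jobs.
        ResultSet rs = stmt.executeQuery(
            "SELECT page, COUNT(*) AS hits FROM page_views GROUP BY page");
        while (rs.next()) {
          System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
        conn.close();
      }
    }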
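
For Pig, the same kind of aggregation can be written as a Pig Latin data flow and run through the PigServer API. The input path and field layout are assumptions.

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigLatinSketch {
      public static void main(String[] args) throws Exception {
        // Run Pig Latin statements as MapReduce jobs; paths and fields are assumptions.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);
        pig.registerQuery("views = LOAD '/data/page_views' AS (userid:chararray, page:chararray);");
        pig.registerQuery("grouped = GROUP views BY page;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group AS page, COUNT(views) AS hits;");

        // Writing the result out triggers execution of the whole data flow.
        pig.store("counts", "/data/page_view_counts");
        pig.shutdown();
      }
    }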
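
Finally, ZooKeeper: a tiny example of using a znode as a piece of shared configuration that any process in the cluster can read. The ensemble address and the znode path are assumptions.

    import java.util.concurrent.CountDownLatch;

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZooKeeperSketch {
      public static void main(String[] args) throws Exception {
        final CountDownLatch connected = new CountDownLatch(1);

        // Connect to a ZooKeeper ensemble (address is an assumption) and wait for the session.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, new Watcher() {
          public void process(WatchedEvent event) {
            if (event.getState() == Event.KeeperState.SyncConnected) {
              connected.countDown();
            }
          }
        });
        connected.await();

        // Store a small piece of shared configuration under a znode.
        String path = "/demo-config";
        if (zk.exists(path, false) == null) {
          zk.create(path, "on".getBytes(),
              ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Any process in the cluster can read the same value back.
        byte[] value = zk.getData(path, false, null);
        System.out.println(new String(value));

        zk.close();
      }
    }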

Section 4: What are the Hadoop-related projects not at Apache?

This part is Under Construction!!!

References:  
  1. Apache Hadoop official website
  2. Programming Pig (book), definition only
  3. Official websites of Ambari, Avro, Cassandra, Chukwa, Flume, HBase, Hive, Mahout, Oozie, Sqoop, and ZooKeeper