Saturday, February 28, 2015

Hadoop - Word Count Example

Hadoop version used in the example below: 1.2.1


WordCount.java


import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    public static void main(String[] args) throws IOException,
            InterruptedException, ClassNotFoundException {

        Path inputPath = new Path(args[0]);
        Path outputDir = new Path(args[1]);

        // Create configuration
        Configuration conf = new Configuration(true);

        // Create job
        Job job = new Job(conf, "WordCount");
        job.setJarByClass(WordCountMapper.class);

        // Setup MapReduce
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setNumReduceTasks(1);

        // Specify key / value
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input
        FileInputFormat.addInputPath(job, inputPath);
        job.setInputFormatClass(TextInputFormat.class);

        // Output
        FileOutputFormat.setOutputPath(job, outputDir);
        job.setOutputFormatClass(TextOutputFormat.class);

        // Delete output if exists
        FileSystem hdfs = FileSystem.get(conf);
        if (hdfs.exists(outputDir))
            hdfs.delete(outputDir, true);

        // Execute job
        int code = job.waitForCompletion(true) ? 0 : 1;
        System.exit(code);

    }

}


WordCountMapper.java


import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends
        Mapper<Object, Text, Text, IntWritable> {

    private final IntWritable ONE = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {

        String[] csv = value.toString().split(" ");
        for (String str : csv) {
            word.set(str);
            context.write(word, ONE);
        }
    }
}


WordCountReducer.java


import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends
        Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text text, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(text, new IntWritable(sum));
    }
}
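
Since the reduce function above is both associative and commutative, it can also be reused as a combiner to pre-aggregate counts on the map side and cut down shuffle traffic. A minimal, optional addition to the WordCount driver shown earlier (a sketch, not required for correctness):

        // Optional: run the reducer as a combiner for local aggregation
        // of map output before it is shuffled to the reducers.
        job.setCombinerClass(WordCountReducer.class);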

Wednesday, August 7, 2013

HDFS Architecture

HDFS design goals can be found here:
HDFS Design Goals
  1. HDFS stands for Hadoop Distributed File System
  2. HDFS is designed to run on low-cost hardware
  3. HDFS is highly fault-tolerant (as it supports block replication)
  4. HDFS was originally built as infrastructure for the Apache Nutch web search engine project
  5. HDFS is now an Apache Hadoop sub project
  6. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system's data
  7. HDFS is designed more for batch processing rather than interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access
  8. A typical file in HDFS is gigabytes to terabytes in size
  9. HDFS applications need a write-once-read-many access model for files. This assumption simplifies data coherency issues and enables high throughput data access
Architecture:


  1. HDFS is built using the Java language
  2. HDFS has master/slave architecture: NameNode/DataNode
  3. An HDFS cluster consists of a single NameNode and a number of DataNodes
  4. The NameNode and DataNode are pieces of software designed to run on commodity machines. These machines typically run a GNU/Linux operating system (OS)
  5. NameNode is a master server that manages the file system namespace and regulates access to files by clients. It executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly.
  6. DataNodes, usually one per node in the cluster, manage storage attached to the nodes that they run on. They are responsible for serving read and write requests from the file system’s clients. They also perform block creation, deletion, and replication upon instruction from the NameNode.
  7. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes
Conclusion: The existence of a single NameNode in a cluster greatly simplifies the architecture of the system. The NameNode is the arbitrator and repository for all HDFS metadata. The system is designed in such a way that user data never flows through the NameNode

Frequently Asked Questions:

Q) What kind of file organization does HDFS support?
A) Traditional hierarchical file organization

Q) Does HDFS support user quotas?
A) No

Q) Does HDFS support soft links or hard links?
A) No

Q) Define replication factor. Who will store this information?
A) The number of copies of a file is called the replication factor of that file. This information is stored by the NameNode

Q) How does HDFS store a file?
A) HDFS stores each file as a sequence of blocks; all blocks in a file except the last block are the same size

Q) What is the default block size?
A) 64MB

Q) What is the default replication factor?
A) 3

Q) What does a block report contain?
A) A Blockreport contains a list of all blocks on a DataNode

Q) Is data locality principle applicable to HDFS?
A) Yes. When a client running on a DataNode writes data into HDFS, the NameNode schedules one replica onto that local DataNode and picks other nodes for the remaining replicas (by default, the second replica is placed on a node in a different rack, and the third on a different node in that same remote rack)
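
For reference, the block size, replication factor, and block placement of an existing file can be inspected programmatically through the FileSystem API. A minimal sketch, assuming a placeholder path /user/demo/input.txt that you would replace with a real file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path used only for illustration
        FileStatus status = fs.getFileStatus(new Path("/user/demo/input.txt"));
        System.out.println("Block size:         " + status.getBlockSize());
        System.out.println("Replication factor: " + status.getReplication());

        // One entry per block, listing the DataNodes that hold its replicas
        for (BlockLocation location : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println(location);
        }
    }
}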

Note: More questions will be added soon....

Reference: Hadoop Official website

Wednesday, July 31, 2013

JobTracker

The JobTracker is the service within Hadoop that farms out MapReduce tasks to specific nodes in the cluster, ideally the nodes that have the data, or at least are in the same rack.
  1. Client applications submit jobs to the Job tracker
  2. The JobTracker talks to the NameNode to determine the location of the data 
  3. The JobTracker locates TaskTracker nodes with available slots at or near the data
  4. The JobTracker submits the work to the chosen TaskTracker nodes
  5. The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker
  6. A TaskTracker will notify the JobTracker when a task fails. The JobTracker decides what to do then: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, and it may even blacklist the TaskTracker as unreliable
  7. When the work is completed, the JobTracker updates its status
  8. Client applications can poll the JobTracker for information.
The JobTracker is a point of failure for the Hadoop MapReduce service. If it goes down, all running jobs are halted.
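
As a client-side illustration of step 8, a job handle from the new API (org.apache.hadoop.mapreduce.Job) can be submitted asynchronously and then polled for progress. A hedged sketch of a driver fragment, used in place of the blocking waitForCompletion call from the WordCount example earlier on this page:

        // Submit the job without blocking, then poll for progress.
        job.submit();
        while (!job.isComplete()) {
            System.out.printf("map %.0f%%  reduce %.0f%%%n",
                    job.mapProgress() * 100, job.reduceProgress() * 100);
            Thread.sleep(5000);
        }
        System.out.println(job.isSuccessful() ? "Job succeeded" : "Job failed");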

Reference: Hadoop Official

Friday, July 19, 2013

Hadoop Ecosystem


In this section, I would like to explain:
  1. What is Hadoop?
  2. What are the core modules of Hadoop?
  3. What are the Hadoop-related projects at Apache?
  4. What are the Hadoop-related projects not at Apache?

Section 1: What is Hadoop?

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models

      It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

      Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.


Section 2: What are the core modules of Hadoop?

There are 4 core modules of Hadoop. They are:
  1. Hadoop Common: The common utilities that support the other Hadoop modules.
  2. Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
  3. Hadoop YARN: A framework for job scheduling and cluster resource management.
  4. Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.


Section 3: What are the Hadoop-related projects at Apache?

Hadoop-related projects at Apache include:
  1. Ambari: The Apache Ambari project is aimed at making Hadoop management simpler by developing software for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs. The set of Hadoop components that are currently supported by Ambari includes: HDFS, MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, Sqoop

  2. Avro: A data serialization system.

  3. Cassandra: The Apache Cassandra multi-master database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data. Cassandra's support for replicating across multiple data centers is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages.

  4. Chukwa: A data collection system for managing large distributed systems. 

  5. Flume: Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms.

  6. HBase: Apache HBase is an open-source, distributed, versioned, column-oriented store for large tables. It is modeled after Google's Bigtable: A Distributed Storage System for Structured Data. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS. Use it when you need random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware. 

  7. Hive: Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

  8. Mahout: A scalable machine learning and data mining library.

  9. Oozie: Oozie is a workflow scheduler system to manage Apache Hadoop jobs. Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions, and Oozie Coordinator jobs are recurrent Workflow jobs triggered by time (frequency) and data availability.

  10. Pig: Pig provides an engine for executing data flows in parallel on Hadoop. It includes a language, Pig Latin, for expressing these data flows. Pig Latin includes operators for many of the traditional data operations (join, sort, filter, etc.), as well as the ability for users to develop their own functions for reading, processing, and writing data.

  11. Sqoop: Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.

  12. ZooKeeper: A high-performance coordination service for distributed applications. It is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.

Section 4: What are the Hadoop-related projects not at Apache?

This part is Under Construction!!!

References:  
  1. Hadoop Official 
  2. Programming Pig book - definitions only
  3. Official websites of Ambari, Avro, Cassandra, Chukwa, Flume, HBase, Hive, Mahout, Oozie, Sqoop and ZooKeeper

Monday, June 17, 2013

Hadoop: MapReduce: Data Flow

A MapReduce job is a unit of work that the client wants to be performed. It consists of:
  1. the input data
  2. the MapReduce program
  3. configuration information
Hadoop runs the job by dividing it into tasks, of which there are two types:
  1. map tasks
  2. reduce tasks
There are two types of nodes that control the job execution process:
  1. a jobtracker
  2. a number of tasktrackers.
The jobtracker coordinates all the jobs run on the system by scheduling tasks to run on tasktrackers. Tasktrackers run tasks and send progress reports to the jobtracker, which keeps a record of the overall progress of each job. If a task fails, the jobtracker can reschedule it on a different tasktracker.

Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits, or just splits. Hadoop creates one map task for each split, which runs the user-defined map function for each record in the split.

Having many splits means the time taken to process each split is small compared to the time to process the whole input. So if we are processing the splits in parallel, the processing is better load-balanced if the splits are small, since a faster machine will be able to process proportionally more splits over the course of the job than a slower machine. Even if the machines are identical, failed processes or other jobs running concurrently make load balancing desirable, and the quality of the load balancing increases as the splits become more fine-grained.

On the other hand, if splits are too small, then the overhead of managing the splits and of map task creation begins to dominate the total job execution time. For most jobs, a good split size tends to be the size of an HDFS block, 64 MB by default, although this can be changed for the cluster (for all newly created files), or specified when each file is created.
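
In driver code, the split size can also be bounded per job. A hedged sketch, assuming the new-API FileInputFormat helpers setMinInputSplitSize/setMaxInputSplitSize are available (the 32 MB and 128 MB values are examples only):

        // Lower bound: avoid handing a map task less than 32 MB.
        FileInputFormat.setMinInputSplitSize(job, 32 * 1024 * 1024L);
        // Upper bound: never hand a single map task more than 128 MB.
        FileInputFormat.setMaxInputSplitSize(job, 128 * 1024 * 1024L);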

Hadoop does its best to run the map task on a node where the input data resides in HDFS. This is called the data locality optimization since it doesn’t use valuable cluster bandwidth. Sometimes, however, all three nodes hosting the HDFS block replicas for a map task’s input split are running other map tasks so the job scheduler will look for a free map slot on a node in the same rack as one of the blocks. Very occasionally even this is not possible, so an off-rack node is used, which results in an inter-rack network transfer.

It should now be clear why the optimal split size is the same as the block size: it is the largest size of input that can be guaranteed to be stored on a single node. If the split spanned two blocks, it would be unlikely that any HDFS node stored both blocks, so some of the split would have to be transferred across the network to the node running the map task, which is clearly less efficient than running the whole map task using local data.

Map tasks write their output to the local disk, not to HDFS. Why is this? Map output is intermediate output: it’s processed by reduce tasks to produce the final output, and once the job is complete the map output can be thrown away. So storing it in HDFS, with replication, would be overkill. If the node running the map task fails before the map output has been consumed by the reduce task, then Hadoop will automatically rerun the map task on another node to re-create the map output.

Reduce tasks don’t have the advantage of data locality—the input to a single reduce task is normally the output from all mappers. In the present example, we have a single reduce task that is fed by all of the map tasks. Therefore, the sorted map outputs have to be transferred across the network to the node where the reduce task is running, where they are merged and then passed to the user-defined reduce function. The output of the reduce is normally stored in HDFS for reliability. For each HDFS block of the reduce output, the first replica is stored on the local node, with other replicas being stored on off-rack nodes. Thus, writing the reduce output does consume network bandwidth, but only as much as a normal HDFS write pipeline consumes.


Note: The number of reduce tasks is not governed by the size of the input, but is specified independently.

When there are multiple reducers, the map tasks partition their output, each creating one partition for each reduce task. There can be many keys (and their associated values) in each partition, but the records for any given key are all in a single partition. The partitioning can be controlled by a user-defined partitioning function, but normally the default partitioner—which buckets keys using a hash function—works very well.
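
The default behaviour described above is what HashPartitioner provides. For illustration, a custom partitioner for Text keys and IntWritable values might look like the sketch below; it would be registered in the driver with job.setPartitionerClass(FirstLetterPartitioner.class). The class name and routing rule are made up for this example:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String word = key.toString();
        if (numPartitions <= 1 || word.isEmpty()) {
            return 0;
        }
        // Route keys by their first character, so all words starting with
        // the same letter end up in the same reduce partition.
        return (word.charAt(0) & Integer.MAX_VALUE) % numPartitions;
    }
}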


Finally, it’s also possible to have zero reduce tasks. This can be appropriate when you don’t need the shuffle, since the processing can be carried out entirely in parallel (NLineInputFormat). In this case, the only off-node data transfer is when the map tasks write to HDFS.
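
In the driver this is a single setting; with zero reduce tasks there is no sort or shuffle, and map output goes straight to the configured OutputFormat:

        // Map-only job: skip the shuffle and the reduce phase entirely.
        job.setNumReduceTasks(0);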

Saturday, June 15, 2013

MapReduce: Map Phase: Class Mapper

MapReduce works by breaking the processing into two phases: the map phase and the reduce phase. Each phase has key-value pairs as input and output, the types of which may be chosen by the programmer. The programmer also specifies two functions: the map function and the reduce function.

The Hadoop Map-Reduce framework spawns one map task for each InputSplit generated by the InputFormat for the job. Mapper implementations can access the Configuration for the job via JobContext.getConfiguration().

The framework first calls setup(org.apache.hadoop.mapreduce.Mapper.Context), followed by map(Object, Object, Context) for each key/value pair in the InputSplit. Finally cleanup(Context) is called.
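
A hedged sketch of a Mapper that hooks into this setup/map/cleanup lifecycle (the class name is made up; the word-splitting logic mirrors the earlier WordCount example):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LifecycleMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void setup(Context context) {
        // Called once per task before any map() call: read configuration,
        // open side files, initialize state, etc.
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, one);
            }
        }
    }

    @Override
    protected void cleanup(Context context) {
        // Called once per task after the last map() call: release resources.
    }
}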

All intermediate values associated with a given output key are subsequently grouped by the framework, and passed to a Reducer to determine the final output. Users can control the sorting and grouping by specifying two key RawComparator classes.

The Mapper outputs are partitioned per Reducer. Users can control which keys (and hence records) go to which Reducer by implementing a custom Partitioner.

Users can optionally specify a combiner, via Job.setCombinerClass(Class), to perform local aggregation of the intermediate outputs, which helps to cut down the amount of data transferred from the Mapper to the Reducer.

Applications can specify if and how the intermediate outputs are to be compressed and which CompressionCodecs are to be used via the Configuration.

If the job has zero reduces then the output of the Mapper is directly written to the OutputFormat without sorting by keys.

Applications may override the run(Context) method to exert greater control over map processing, e.g. multi-threaded Mappers, etc.

Source: Hadoop Official API documentation: Mapper

Tuesday, June 11, 2013

MapReduce Introduction

MapReduce is a data processing model (like pipelines and message queues in UNIX). Its greatest advantage is the easy scaling of data processing over multiple computing nodes.

MapReduce programs are executed in two main phases, called mapping and reducing. Each phase is defined by a data processing function, and these functions are called mapper and reducer, respectively.

In the mapping phase, MapReduce takes the input data and feeds each data element to the mapper. In the reducing phase, the reducer processes all the outputs from the mapper and arrives at a final result. In simple terms, the mapper is meant to filter and transform the input into something that the reducer can aggregate over.

Decomposing a data processing application into mappers and reducers is sometimes nontrivial. But, once you write an application in the MapReduce form, scaling the application to run over hundreds, thousands, or even tens of thousands of machines in a cluster is merely a configuration change.

Partitioning and shuffling are common design patterns that go along with mapping and reducing. Unlike mapping and reducing, though, partitioning and shuffling are generic functionalities that are not too dependent on the particular data processing application. The MapReduce framework provides a default implementation that works in most situations.

In order for mapping, reducing, partitioning, and shuffling (and a few others we haven’t mentioned) to seamlessly work together, we need to agree on a common structure for the data being processed. It should be flexible and powerful enough to handle most of the targeted data processing applications.

MapReduce uses lists and (key/value) pairs as its main data primitives. The keys and values are often integers or strings but can also be dummy values to be ignored or complex object types.
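
To make the lists-and-pairs idea concrete, here is a plain-Java sketch (no Hadoop involved) of the word count data flow: a map step that would emit (word, 1) pairs, and a reduce step that sums the values grouped under each key. In a real job the framework performs the grouping during the shuffle; here it is collapsed into a single pass for brevity:

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordCountSketch {

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("the cat sat", "the cat ran");

        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String line : lines) {
            // "Map": each token conceptually becomes a (word, 1) pair.
            for (String word : line.split(" ")) {
                // "Reduce": sum the 1s grouped under the same key.
                Integer current = counts.get(word);
                counts.put(word, current == null ? 1 : current + 1);
            }
        }
        // e.g. {the=2, cat=2, sat=1, ran=1} (iteration order not guaranteed)
        System.out.println(counts);
    }
}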

Friday, June 7, 2013

The building blocks of Hadoop

Hadoop employs a master/slave architecture for both distributed storage and distributed computation. The distributed storage system is called the Hadoop Distributed File System (HDFS).

On a fully configured cluster, "running Hadoop" means running a set of daemons, or resident programs, on the different servers in your network. These daemons have specific roles; some exist only on one server, while others exist across multiple servers. The daemons include:
  1. NameNode
  2. DataNode
  3. Secondary NameNode
  4. JobTracker
  5. TaskTracker
NameNode: The NameNode is the master of HDFS that directs the slave DataNode daemons to perform the low-level I/O tasks. It is the bookkeeper of HDFS; it keeps track of how your files are broken down into file blocks, which nodes store those blocks and the overall health of the distributed filesystem.

The server hosting the NameNode typically doesn't store any user data or perform any computations for a MapReduce program; this keeps the workload low on a machine that is already memory and I/O intensive.

There is unfortunately a negative aspect to the importance of the NameNode - it's a single point of failure of your Hadoop cluster. For any of the other daemons, if their host fails for software or hardware reasons, the Hadoop cluster will likely continue to function smoothly, or you can quickly restart it. Not so for the NameNode.

DataNode: Each slave machine in your cluster will host a DataNode daemon to perform the grunt work of the distributed filesystem - reading and writing HDFS blocks to actual files on the local file system.

When you want to read or write an HDFS file, the file is broken into blocks and the NameNode will tell your client which DataNode each block resides in. Your client communicates directly with the DataNode daemons to process the local files corresponding to the blocks.

Furthermore, a DataNode may communicate with other DataNodes to replicate its data blocks for redundancy. This ensures that if any one DataNode crashes or becomes inaccessible over the network, you'll still be able to read the files.

DataNodes are constantly reporting to the NameNode. Upon initialization, each of the DataNodes informs the NameNode of the blocks it's currently storing. After this mapping is complete, the DataNodes continually poll the NameNode to provide information regarding local changes, as well as receive instructions to create, move, or delete blocks from the local disk.


Secondary NameNode (SNN): The SNN is an assistant daemon for monitoring the state of the cluster's HDFS. Like the NameNode, each cluster has one SNN, and it typically resides on its own machine as well. No other DataNode or TaskTracker daemons run on the same server. The SNN differs from the NameNode in that this process doesn't receive or record any real-time changes to HDFS. Instead, it communicates with the NameNode to take snapshots of the HDFS metadata at intervals defined by the cluster configuration.

As mentioned earlier, the NameNode is a single point of failure for a Hadoop cluster, and the SNN snapshots help minimize the downtime and loss of data.

There is another topic that can be covered under the SNN, namely the fsimage (filesystem image) file and the edits file:

The HDFS namespace is stored by the NameNode. The NameNode uses a transaction log called the EditLog to persistently record every change that occurs to file system metadata. For example, creating a new file in HDFS causes the NameNode to insert a record into the EditLog indicating this. Similarly, changing the replication factor of a file causes a new record to be inserted into the EditLog. The NameNode uses a file in its local host OS file system to store the EditLog. The entire file system namespace, including the mapping of blocks to files and file system properties, is stored in a file called the FsImage. The FsImage is stored as a file in the NameNode’s local file system too.



JobTracker: Once you submit your code to your cluster, the JobTracker determines the execution plan by determining which files to process, assigns nodes to different tasks, and monitors all tasks as they're running. Should a task fail, the JobTracker will automatically relaunch the task, possibly on a different node, up to a predefined limit of retries.

There is only one JobTracker daemon per Hadoop cluster. It's typically run on a server as a master node of the cluster.

TaskTracker: As with the storage daemons, the computing daemons also follow a master/slave architecture: the JobTracker is the master overseeing the overall execution of a MapReduce job, and the TaskTrackers manage the execution of individual tasks on each slave node.

Each TaskTracker is responsible for executing the individual tasks that the JobTracker assigns. Although there is a single TaskTracker per slave node, each TaskTracker can spawn multiple JVMs to handle many map or reduce tasks in parallel.

One responsibility of the TaskTracker is to constantly communicate with the JobTracker. If the JobTracker fails to receive a heartbeat from a TaskTracker within a specified amount of time, it will assume the TaskTracker has crashed and will resubmit the corresponding tasks to other nodes in the cluster.

Reference: Hadoop in Action Book

HDFS Design

Hadoop Distributed File System(HDFS) is part of the Apache Hadoop Core project. It is a fault-tolerant distributed file system designed to run on low-cost hardware. It has many similarities with existing distributed file systems. However, it relaxes a few POSIX requirements to enable streaming access to file system data.

On the other side, it provides high-throughput access (as it offers parallel processing) to application data and is suitable for applications that have large data sets.

Note: It was originally built as infrastructure for the Apache Nutch web search engine project.

Assumptions and Goals:
  1. Hardware Failure: Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system's data. The fact that there are a huge number of components and that each component has a non-trivial probability of failure means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.
  2. Streaming Data Access: Applications that run on HDFS need streaming access to their data sets. HDFS is designed more for batch processing rather than interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access.
  3. Large Data Sets:  Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance.
  4. Simple Coherency Model: HDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed. This assumption simplifies data coherency issues and enables high throughput data access.
  5. Moving Computation is Cheaper than Moving Data: A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge. This minimizes network congestion and increases the overall throughput of the system.
  6. Portability Across Heterogeneous Hardware and Software Platforms: HDFS is designed to be easily portable from one platform to another.
Reference: Hadoop Official Documentation

Wednesday, June 5, 2013

HDFS - Hadoop Distributed File System

Filesystems that manage the storage across a network of machines are called distributed filesystems.

HDFS is Hadoop's flagship filesystem, designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.

Streaming data access:
HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. A dataset is typically generated or copied from source, then various analyses are performed on that dataset over time. Each analysis will involve a large proportion, if not all, of the dataset, so the time to read the whole dataset is more important than the latency in reading the first record.

Reference: Hadoop: The Definitive Guide Book

Tuesday, June 4, 2013

Introduction to Hadoop

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures. 

Source: Hadoop Official

Fancy terms used:
  1. framework
  2. distributed processing
  3. large datasets
  4. cluster
  5. simple programs
  6. scale up to thousands of machines
  7. local computing (data locality)
  8. fault-tolerance is achieved through the framework, which is awesome!!!