Problem: Huge amounts of data are produced and accumulated daily, but large-scale processing of that data on commodity computers is difficult → Big Data is difficult
- Commodity Hardware: We have lots of resources (1000s of cheap PCs), but they are very hard to utilize
- Parallel Programming: We have clusters with over 10k cores, but it is hard to program 10k concurrent threads
- Fault Tolerance: We have 1000s of storage devices, but some break every day. Failure is the norm rather than the exception.
- Scalability: Scaling up (buying ever-bigger machines) quickly becomes expensive and hits hardware limits, so Big Data systems must scale out across many commodity machines.
- Expensive: There are many technologies available in the market for Big Data processing, but most are proprietary and costly.
Solution:
- Hadoop: Runs on commodity hardware, supports scale out, and is free, open-source software
- HDFS (Hadoop Distributed File System): Supports data replication, thereby providing high availability (see the replication sketch after this list)
- MapReduce: Supports parallel execution of tasks across the cluster (see the word-count sketch after this list)
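To illustrate the HDFS replication point, here is a minimal sketch using Hadoop's Java `FileSystem` API to inspect and change a file's replication factor. It assumes a running HDFS cluster reachable through the default configuration, and the path `/data/input.txt` is a hypothetical example file.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: read and change the replication factor of a file in HDFS.
public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // handle to the distributed file system
        Path file = new Path("/data/input.txt");    // hypothetical example file

        FileStatus status = fs.getFileStatus(file);
        System.out.println("Current replication: " + status.getReplication());

        // Store each block of the file on three different DataNodes,
        // so the data remains available if one or two nodes fail.
        fs.setReplication(file, (short) 3);

        fs.close();
    }
}
```

Because every block lives on several machines, losing a disk or a whole node does not lose data; HDFS simply re-replicates the affected blocks elsewhere.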
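To make the MapReduce bullet concrete, below is the classic word-count job (essentially the standard Hadoop tutorial example): mappers run in parallel on each input split and emit (word, 1) pairs, and reducers sum the counts per word. Input and output paths are assumed to be passed as command-line arguments.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: tokenize each line and emit (word, 1) for every token.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // emit (word, 1)
            }
        }
    }

    // Reducer: sum all counts for a given word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);     // emit (word, total count)
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation on each mapper
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The framework handles the hard parts from the problem list automatically: it schedules map and reduce tasks across thousands of cores, moves computation to where the data blocks live, and re-runs tasks whose nodes fail.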