Showing posts with label Big Data. Show all posts

Wednesday, July 24, 2013

Big Data Challenges

Problem: Huge amounts of data are produced and accumulated every day, but processing that data at scale on commodity computers is hard → Big Data is difficult
  • Commodity hardware: We have plenty of resources (thousands of cheap PCs), but they are very hard to utilize effectively
  • Parallel programming: We have clusters with more than 10k cores, but it is hard to program 10k concurrent threads
  • Fault tolerance: We have thousands of storage devices, and some of them may fail on any given day. Failure is the norm rather than the exception
  • Scalability: Scale up (a bigger machine) vs. scale out (more machines)
  • Expensive: Many technologies for Big Data processing are available on the market, but they are proprietary
Solution:
  1. Hadoop: Runs on commodity hardware, supports scale-out, and is free, open-source software
  2. HDFS (Hadoop Distributed File System): Replicates data across nodes, which provides high availability
  3. MapReduce: Supports parallel execution of tasks
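
To make the MapReduce idea concrete, here is a minimal word-count sketch in plain Python. It is not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and a local thread pool stands in for the nodes of a cluster. It only shows the shape of the model, where independent map tasks run in parallel and their outputs are grouped by key and then reduced.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(mapped_chunks):
    """Shuffle: group all emitted values by key (the word)."""
    groups = defaultdict(list)
    for chunk in mapped_chunks:
        for word, count in chunk:
            groups[word].append(count)
    return groups


def reduce_phase(groups):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(lines, workers=4):
    # Each line is mapped independently of the others, so the map
    # calls can run in parallel (threads here, cluster nodes in Hadoop).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, lines))
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    sample = ["big data is big", "data is everywhere"]
    print(word_count(sample))
```

In a real Hadoop job the framework handles the shuffle step, the scheduling of tasks, and the re-running of tasks that fail; the programmer mainly supplies the map and reduce functions.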

Tuesday, June 4, 2013

Introduction to Big Data

Big Data: is the capability to manage a huge volume of disparate data, at the right speed, and within the right time frame to allow real-time analysis and reaction.

Big data is typically broken down by four characteristics:
  1. Volume: How much data
  2. Velocity: How fast that data is processed
  3. Variety: The various types of data
  4. Veracity: How accurate is that data in predicting business value? Do the results of a big data analysis actually make sense?



Big Data types:
  1. Structured data
  2. Unstructured data
  3. Semi-structured data
The sources of data are divided into 2 categories:
  1. Computer- or machine-generated: Machine-generated data generally refers to data that is created by a machine without human intervention.
  2. Human-generated: This is data that humans, in interaction with computers, supply.
Structured Data: refers to data that has a defined length and format. Examples of structured data include numbers, dates, and strings. It is usually stored in a database.
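
As a small illustration of "defined length and format", the sketch below stores a few rows in an in-memory SQLite table using Python's standard `sqlite3` module. The table name `sales`, its columns, and the sample rows are invented for this example; the point is only that every record has the same typed fields, which is what makes the data easy to query.

```python
import sqlite3


def build_sales_db():
    """Create an in-memory database with a fixed schema (hypothetical example)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales ("
        "item TEXT NOT NULL, "         # string
        "quantity INTEGER NOT NULL, "  # number
        "sold_on TEXT NOT NULL)"       # date stored as ISO-8601 text
    )
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, ?)",
        [("pen", 3, "2013-06-04"), ("notebook", 1, "2013-06-04")],
    )
    conn.commit()
    return conn


def total_quantity(conn):
    """Because every row has the same format, aggregation is a one-line query."""
    row = conn.execute("SELECT SUM(quantity) FROM sales").fetchone()
    return row[0]


if __name__ == "__main__":
    conn = build_sales_db()
    print(total_quantity(conn))
    conn.close()
```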

Examples of structured machine-generated data include the following:
  1. Sensor data: radio-frequency identification (RFID) tags, smart meters, Global Positioning System (GPS) data
  2. Web log data: When servers, applications, networks, and so on operate, they capture all kinds of data about their activity. This can amount to huge volumes of data that can be useful, for example, to deal with service-level agreements or to predict security breaches.
  3. Point-of-sale data: When the cashier scans the bar code of a product you are purchasing, all the data associated with that product is generated. Just think of all the products across all the people who purchase them and you can understand how big this data set can be.
  4. Financial data: stock trading
Examples of structured human-generated data might include the following:
  1. Input data: Any input given by a user through html forms, etc.
  2. Click-stream data: Data is generated every time you click a link on a website. This data can be analyzed to determine customer behavior and buying patterns.
  3. Gaming-related data: Every move you make in a game can be recorded. This can be useful in understanding how end users move through a gaming portfolio.
Unstructured data: is data that does not follow a specified format. Until recently, the technology did not really support doing much with it beyond storing it or analyzing it manually.

Examples of machine-generated unstructured data: 
  1. Satellite images: weather data, Google Earth
  2. Scientific data: seismic imagery, atmospheric data, and high energy physics
  3. Photographs and video: This includes security, surveillance, and traffic video
  4. Radar or sonar data: vehicular, meteorological, and oceanographic seismic profiles 
Examples of human-generated unstructured data:
  1. Text internal to your company: text within documents, logs, survey results, and e-mails.
  2. Social media data: YouTube, Facebook, Twitter, LinkedIn, and Flickr
  3. Mobile data: This includes data such as text messages and location information
  4. Website content: This comes from any site delivering unstructured content
Semi-structured data: is a kind of data that falls between structured and unstructured data. It does not necessarily conform to a fixed schema but may be self-describing and may have label/value pairs.

Examples: EDI, SWIFT, and XML
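
The "self-describing" idea can be sketched with a short XML snippet parsed by Python's standard `xml.etree.ElementTree` module. The `<order>` document and the helper `to_label_value_pairs` are made up for this illustration. Note that no schema is declared anywhere: the tags themselves act as labels, and an optional element such as `<note>` may or may not be present.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: the tags label the values they contain.
ORDER_XML = """
<order id="1001">
  <customer>Asha</customer>
  <item>pen</item>
  <note>gift wrap</note>
</order>
"""


def to_label_value_pairs(xml_text):
    """Flatten the root's attributes and child elements into label/value pairs."""
    root = ET.fromstring(xml_text.strip())
    pairs = dict(root.attrib)  # attributes such as id="1001"
    for child in root:
        pairs[child.tag] = (child.text or "").strip()
    return pairs


if __name__ == "__main__":
    print(to_label_value_pairs(ORDER_XML))
```

A second order without a `<note>` element would parse just as well, which is the sense in which this kind of data does not have to conform to a fixed schema.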

Source of this tutorial:
  1. Big Data for Dummies Book
  2. The real-world use case of Big Data