- This is a part of a project for Data Warehousing (ULB) course
- link to the final report: report
- presentation: html or slide share
Introduction
Today the amounts of data stored in Data Warehouses are becoming more and more enormous. While traditional ways of Data Warehousing design on top of Relational Databases are still popular, they fail to curb terabytes of data efficiently, which is mostly attributed to complexity of scaling relational databases. There are new emerging approaches that try to address this problem.
One of such approaches is to use the MapReduce paradigm and Hadoop as the implementation for building large Data Warehouses over distributed network of servers that can handle huge volumes of data. Hadoop has already become a proven tool for BigData analytics and now there is a rising interest in this technology for Data Warehousing purposes .
The goal of the work is to discuss in what ways Hadoop, as a Map-Reduce framework, can be used in Data Warehouses then compare it with traditional approaches and see in which situations it should be beneficial to use Hadoop in a Data Warehousing project. Additionally we plan to see what are cases where traditional approaches should still be preferred over Hadoop.
Motivation
- Proven useful in Big Data challenges
- Interesting in Business Intelligence
- Many companies want to integrate Hadoop into existent Data Warehousing solutions
Types of Data
- Hadoop can handle Structured data (Relational), Semi-structured (XML, JSON, …)
- but also it’s very good at unstructured and machine-generated data: Data with no Data Model
- Hadoop will help to structure large amounts of unstructured data
Growth
Today have lots of documents
- Not all structured (Text, audio, video)
- Want to structure then
- E.g. Lots of recorded calls - we want to extract certain keywords from them
The number of structured documents is also growing rapidly
- Transactions
- etc
Entire Data Warehouse on Hadoop
It’s possible to build entire Data Warehouse on top of Hadoop
- Example: Cheetah by Turn.inc
- It’s a specialized Data Warehouse
- Hive is a more generic Data Warehouse solution
Design
Virtual views
- based on Star and Snowflake schemas
- central Fact Table
- connected to dimension tables via primary key and foreign key dependencies
- they are needed for joining
- they join the fact table with its dimensions
- only views are exposed to users for querying, not the tables
Operations
- Filtering, Grouping and Aggregations
- easily supported by hadoop
- joins
- recall that Reduce-Side Joins are expensive - it may lead to data re-distribution
- it’s better to denormalize dimension tables as much as possible
- i.e. store all data in fact tables
Denormalization works well
- all dimensions are either insertion only or Slowly Changing Dimensions
Queries
Special-purpose QL for querying it
Unfortunately Chetach is proprietary - wasn’t able to play with it
Hadoop and Data Warehouse
Hadoop as a Part of Data Warehouse
- they are interchangeable
- and have lots of differences
- often used side by side
- and it can help to make the costs lower
It can be
- Transitory Platform for ETL
- Active Storage
Transitory platform for ETL
Hadoop as an ETL process
- this was the initial use case of Hadoop
- goal: extract value from terabytes of information
- Hadoop is rather a component in ETL tools, not an ETL tool itself
Algorithm
- load data into Hadoop
- discover something there with it
- ’'’E’’’ parse and prepare (with MapReduce)
- ’'’T’’’ clean and transform to some structured format (with MapReduce)
- ’'’L’’’ extract data from Hadoop and load to a Data Warehouse
Input:
- not structured provisional data
- semi-structured, unstructured or machine-generated data
- provisional - meaning analyzed or used in isolation and don’t need integration
- unlike traditional operational Data Warehousing data
- i.e. this data don’t really fit into DW paradigm, but Hadoop was created to handle it
Text Processing
- most common use case of Hadoop (the survey)
-
RDBMs not good for this - no SQL functions for text processing - text stream -> structured data
Examples
- like: finding keywords, sentiment analysis
- once refined and structured the text, can put it into DW for further analysis
- say to match sentiment with customer profile
Examples
Example 1
- process raw click streams
- use Data Mining to detect patterns
- then put all the findings into a Data Warehouse
Example 2
- suppose we run a eBay-like website
- someone publishes an advertisement - it’s a dress with some description
- but the description doesn’t mention the color
-
user thinks it’s obvious - it can be seen on the picture - but we want to be able to answer queries such as “red dress” - so we process this image to retrieve features - color
- type
- other things that might be used for search
-
- not possible with SQL
- but this information can be used in out Data Warehouse as well
Repository
Hadoop can store lots of data
It’s possible to use it in 2 ways (as a storage)
- Active Storage for unstructured data
- Live Archive for “dormant” structured enterprise data
Active Storage
First use-case
- storage cost: Hadoop cluster typically costs 50-100 times less than typical DB solutions (per Tb)
- data access is fast
- can use it as a long-term storage
- not necessary to move everything to a Data Warehouse each time we need to do some analysis - can do that with Hadoop
- but still can move some portions of processed data to a DW for BI analysis
Example
- need to capture data arriving from raw streams
- want to store reliably and cost efficiently
- put it into Hadoop
Enterprise Data Live Archive
Second use case: Live archive for traditional enterprise data
motivation
- 85% of tables in DW are not used
- can transfer this “dormant” data to low cost hadoop storage
- result in extraordinary savings
Analytical Sandboxes
Last common use case:
- Our input data is not structured
- That means we do not know yet what we want to derive from it
- Need to play around to see what value we can extract
- And hadoop (along with Hive or Pig for ad-hoc queries) is good for it
How to Integrate
- step-by step recommendations
- first: use it as a big staging area
- for mining unstructured (and structured) data
- then execute some ETL transformations on it
- next: use as non-dimensional warehouse for raw data
- integrate more
Conclusion
Hadoop and Data Warehouses should work together in a single information chain
Hadoop use cases:
- front end: preprocessing raw unstructured data
- front end: analytical sandbox
- back end: long-term active storage
- in-between: supplementing existent technologies with parallel processing
Data Warehousing use cases:
- structured and integrated data for BI and OLAP
Questions
Q: Why?
- to be able to run on commodity hardware
Q: Who should adapt this?
Q: How much data a company should have?
- when data no longer fits in MySQL or PostreSQL - it’s a good point that should adopt Hadoop
See also
Sources
- Lee et al, Parallel Data Processing with MapReduce: A Survey [http://www.cs.arizona.edu/~bkmoon/papers/sigmodrec11.pdf]
- Ordonez et al, Relational versus non-relational database systems for data warehousing [http://www2.cs.uh.edu/~ordonez/w-2010-DOLAP-relnonrel.pdf]
- Paper by Cloudera and Teradata, Awadallah and Graham, Hadoop and the Data Warehouse: When to Use Which. [http://www.teradata.com/white-papers/Hadoop-and-the-Data-Warehouse-When-to-Use-Which/]
- Chen. Cheetah: A High Performance, Custom Data Warehouse on Top of MapReduce. [http://www.vldb.org/pvldb/vldb2010/papers/I08.pdf]
- Russom, Integrating Hadoop into Business Intelligence and Data Warehousing. [http://www.slideshare.net/emcacademics/tdwi-best-practices-report-hadoop-foro-bi-and-dw-april-2013]