Hadoop for Data Warehousing


Introduction

The amounts of data stored in Data Warehouses keep growing. While traditional Data Warehouse designs on top of Relational Databases are still popular, they struggle to handle terabytes of data efficiently, mostly because relational databases are hard to scale. New approaches are emerging that try to address this problem.

One such approach is to use the MapReduce paradigm, with Hadoop as its implementation, for building large Data Warehouses over a distributed network of servers that can handle huge volumes of data. Hadoop has already become a proven tool for Big Data analytics, and there is now rising interest in this technology for Data Warehousing purposes.

The goal of this work is to discuss in what ways Hadoop, as a MapReduce framework, can be used in Data Warehouses, to compare it with traditional approaches, and to see in which situations it is beneficial to use Hadoop in a Data Warehousing project. Additionally, we look at the cases where traditional approaches should still be preferred over Hadoop.


Motivation

  • Proven useful in Big Data challenges
  • Of growing interest in Business Intelligence
  • Many companies want to integrate Hadoop into existing Data Warehousing solutions


Types of Data

  • Hadoop can handle structured (relational) and semi-structured (XML, JSON, ...) data
  • but it is also very good at unstructured and machine-generated data: data with no data model
  • Hadoop helps to structure large amounts of unstructured data


Growth

Today we have lots of documents

  • Not all of them are structured (text, audio, video)
  • We want to structure them
  • E.g. lots of recorded calls, from which we want to extract certain keywords

The number of structured documents is also growing rapidly

  • Transactions
  • etc


Entire Data Warehouse on Hadoop

It is possible to build an entire Data Warehouse on top of Hadoop

  • Example: Cheetah by Turn Inc.
  • It is a specialized Data Warehouse
  • Hive is a more generic Data Warehouse solution


Design

Virtual views

  • based on Star and Snowflake schemas
    • a central Fact Table
    • connected to dimension tables via primary-key/foreign-key dependencies
    • these dependencies are needed for joining
  • a view joins the fact table with its dimensions (see the sketch below)
  • only the views are exposed to users for querying, not the tables
  • (figure: virt-views.png)
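
To make this concrete, below is a minimal in-memory sketch of what such a virtual view computes; the sales/date/product schema and the table contents are made up for illustration and are not Cheetah's actual schema.

    # virtual_view.py - what a virtual view over a star schema computes.
    # Toy in-memory sketch; all table names and contents are hypothetical.

    # dimension tables, keyed by their primary keys
    date_dim    = {1: {"day": "2011-05-01", "month": "2011-05"}}
    product_dim = {7: {"product": "dress", "category": "clothing"}}

    # fact table rows carry foreign keys into the dimension tables
    fact = [
        {"date_id": 1, "product_id": 7, "revenue": 120.0},
        {"date_id": 1, "product_id": 7, "revenue": 80.0},
    ]

    def view():
        # the "virtual view": join each fact row with its dimensions
        # on the fly; users query this, never the underlying tables
        for row in fact:
            joined = dict(row)
            joined.update(date_dim[row["date_id"]])
            joined.update(product_dim[row["product_id"]])
            yield joined

    for r in view():
        print(r["month"], r["product"], r["revenue"])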


Operations

  • Filtering, Grouping and Aggregations
    • easily supported by Hadoop (see the sketch below)
  • Joins
    • recall that reduce-side joins are expensive - they require re-distributing data across the network
    • so it is better to denormalize the dimension tables as much as possible
    • i.e. store all the data in the fact table

Denormalization works well here.
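
As an illustration of the easy case, here is a minimal Hadoop Streaming sketch of a filter-group-aggregate query over a denormalized fact table; the tab-separated (region, product, revenue) layout is a hypothetical example. Because the fact table already carries the dimension attributes, no join is needed.

    #!/usr/bin/env python3
    # mapper.py - filtering and projection happen in the map phase
    import sys

    for line in sys.stdin:
        region, product, revenue = line.rstrip("\n").split("\t")
        if region == "EU":                  # the filter
            print(f"{product}\t{revenue}")  # group key \t value

    #!/usr/bin/env python3
    # reducer.py - Hadoop sorts the mapper output by key, so all the
    # values for one product arrive contiguously
    import sys

    current, total = None, 0.0
    for line in sys.stdin:
        product, revenue = line.rstrip("\n").split("\t")
        if product != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = product, 0.0
        total += float(revenue)
    if current is not None:
        print(f"{current}\t{total}")

The pair is submitted with the standard hadoop-streaming jar (-mapper mapper.py -reducer reducer.py -input ... -output ...). With normalized dimension tables the same query would first need a reduce-side join, shuffling both tables across the network.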


Queries

Cheetah has a special-purpose query language for querying it.

Unfortunately, Cheetah is proprietary, so we were not able to experiment with it.


Hadoop and the Data Warehouse

Hadoop as a Part of the Data Warehouse

  • they are not interchangeable
  • and they have lots of differences
  • but they are often used side by side
  • which can help lower costs

Hadoop can serve as

  • a Transitory Platform for ETL
  • Active Storage

Transitory platform for ETL

Hadoop as an ETL process

  • this was the initial use case of Hadoop
  • goal: extract value from terabytes of information
  • Hadoop is a component of ETL tools rather than an ETL tool itself
    • it is just another channel in ETL designers
    • many vendors have added Hadoop components to their graphical languages
    • and it is possible to run Pig and Hive queries inside these channels
    • Informatica, for example, can do that


Algorithm

  • load the data into Hadoop
  • explore it to discover something of value
  • E: parse and prepare the data (with MapReduce)
  • T: clean it and transform it into a structured format (with MapReduce)
  • L: extract the data from Hadoop and load it into a Data Warehouse

Input:

  • non-structured, provisional data
  • semi-structured, unstructured or machine-generated data
  • provisional means it is analyzed or used in isolation and does not need integration
    • unlike traditional operational Data Warehousing data
  • i.e. this data does not really fit the DW paradigm, but Hadoop was created to handle it (see the sketch below)
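
As a sketch of the E and T steps, a streaming mapper can turn raw, semi-structured log lines into clean tab-separated records; the log layout below is invented for illustration. The L step (exporting the result from HDFS into the warehouse) is left to a bulk-loading tool.

    #!/usr/bin/env python3
    # etl_mapper.py - Extract + Transform as a Hadoop Streaming mapper
    import sys

    for line in sys.stdin:
        try:
            # hypothetical input: "2011-05-01T12:00:03 user=42 action=click item=7"
            ts, rest = line.strip().split(" ", 1)
            fields = dict(kv.split("=") for kv in rest.split())
            # emit a structured TSV row ready for loading
            print("\t".join([ts, fields["user"], fields["action"], fields["item"]]))
        except (ValueError, KeyError):
            continue  # "clean" also means dropping records we cannot repair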


Text Processing

  • the most common use case of Hadoop, according to the survey
  • RDBMSs are not good at this:
    • SQL has no functions for text processing
  • text stream -> structured data

Examples

  • e.g. finding keywords, sentiment analysis (see the mapper sketch below)
  • once the text is refined and structured, it can be put into a DW for further analysis
  • say, to match sentiment against customer profiles
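
A minimal sketch of such a refinement step, written as a Hadoop Streaming mapper; the keyword list and the (doc_id, text) input layout are stand-ins, and a real pipeline would use proper NLP instead of exact word matching.

    #!/usr/bin/env python3
    # keywords_mapper.py - text stream -> structured (doc, keyword) rows
    import sys

    KEYWORDS = {"cancel", "refund", "upgrade", "complaint"}  # hypothetical

    for line in sys.stdin:
        parts = line.rstrip("\n").split("\t", 1)
        if len(parts) != 2:
            continue
        doc_id, text = parts
        for token in text.lower().split():
            token = token.strip(".,!?")
            if token in KEYWORDS:
                # one structured row per hit - easy to load into a DW
                print(f"{doc_id}\t{token}")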


Example

  • suppose we run an eBay-like website
  • someone publishes an advertisement for a dress with some description
  • but the description does not mention the color
    • the user thinks it is obvious - it can be seen in the picture!
  • yet we want to be able to answer queries such as "red dress"
  • so we process the image to extract features
    • color
    • type
    • other attributes that might be useful for search
  • this is not possible with SQL
  • but this information can be used in our Data Warehouse as well (see the sketch below)
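
As a toy stand-in for real image feature extraction, the sketch below guesses the dominant color of a picture; it assumes the Pillow library, and the color table and averaging approach are deliberately naive, purely to show how an unstructured picture becomes a structured, searchable attribute.

    # dominant_color.py - hypothetical feature extractor (requires Pillow)
    from PIL import Image

    NAMED_COLORS = {
        "red":   (200, 30, 30),
        "green": (30, 160, 60),
        "blue":  (40, 70, 200),
        "black": (20, 20, 20),
        "white": (240, 240, 240),
    }

    def dominant_color(path):
        # downscale, then average the pixels; a real extractor would cluster
        pixels = list(Image.open(path).convert("RGB").resize((64, 64)).getdata())
        avg = tuple(sum(p[i] for p in pixels) // len(pixels) for i in range(3))
        # map the average to the nearest named color by squared distance
        return min(NAMED_COLORS, key=lambda name: sum(
            (a - b) ** 2 for a, b in zip(avg, NAMED_COLORS[name])))

    if __name__ == "__main__":
        import sys
        print(dominant_color(sys.argv[1]))  # e.g. "red" for the dress photo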


Repository

Hadoop can store lots of data cheaply.

It can be used as storage in two ways:

  • as Active Storage for unstructured data
  • as a Live Archive for "dormant" structured enterprise data


Active Storage

First use case

  • storage cost: a Hadoop cluster typically costs 50-100 times less per terabyte than typical database solutions
  • data access is fast
  • it can be used as long-term storage
    • there is no need to move everything into a Data Warehouse every time we want to run an analysis - we can run it in Hadoop
    • but we can still move portions of the processed data into a DW for BI analysis


Example

  • we need to capture data arriving from raw streams
  • we want to store it reliably and cost-efficiently
  • so we put it into Hadoop (see the sketch below)
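
A minimal sketch of such a capture step, assuming records arrive on stdin and batching them into HDFS files with the standard `hadoop fs -put` command; paths and batch size are made up, and a production setup would use a dedicated ingestion tool such as Flume.

    #!/usr/bin/env python3
    # capture.py - buffer a raw stream locally, push batches into HDFS
    import subprocess, sys

    BATCH = 100000
    buf, batch_no = [], 0

    for record in sys.stdin:   # stand-in for the real stream source
        buf.append(record)
        if len(buf) >= BATCH:
            local = f"/tmp/batch-{batch_no}.log"
            with open(local, "w") as f:
                f.writelines(buf)
            # cheap, reliable long-term storage: one file per batch
            subprocess.run(
                ["hadoop", "fs", "-put", local, f"/raw/batch-{batch_no}.log"],
                check=True)
            buf, batch_no = [], batch_no + 1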


Enterprise Data Live Archive

Second use case: a live archive for traditional enterprise data

Motivation:

  • 85% of the tables in a DW are not used
  • this "dormant" data can be moved to low-cost Hadoop storage
  • resulting in extraordinary savings


Analytical Sandboxes

The last common use case:

  • our input data is not structured
  • which means we do not yet know what we want to derive from it
  • we need to play around with it to see what value we can extract
  • and Hadoop (along with Hive or Pig for ad-hoc queries) is good for that (see the sketch below)
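
A sketch of what "playing around" can look like at the lowest level: a throwaway streaming mapper that profiles unknown tab-separated data by emitting (field position, value) pairs; summed by a reducer like the one in the Operations section, this yields per-field value distributions. In practice Hive or Pig turn such questions into one-liners.

    #!/usr/bin/env python3
    # profile_mapper.py - quick-and-dirty profiling of unknown data
    import sys

    for line in sys.stdin:
        for i, value in enumerate(line.rstrip("\n").split("\t")):
            print(f"{i}:{value}\t1")  # the reducer sums the 1s per (field, value)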


How to Integrate

  • step-by-step recommendations
  • first: use Hadoop as a big staging area
    • for mining unstructured (and structured) data
    • and for executing ETL transformations on it
  • next: use it as a non-dimensional warehouse for raw data
  • then integrate it further


Conclusion

Hadoop and Data Warehouses should work together in a single information chain

Hadoop use cases:

  • front end: preprocessing raw unstructured data
  • front end: analytical sandbox
  • back end: long-term active storage
  • in-between: supplementing existing technologies with parallel processing


Data Warehousing use cases:

  • structured and integrated data for BI and OLAP


Questions

Q: Why?

  • to be able to run on commodity hardware

Q: Who should adopt this?

Q: How much data should a company have?

  • when the data no longer fits into MySQL or PostgreSQL, that is a good point to adopt Hadoop


See also

Sources

  • Lee et al. Parallel Data Processing with MapReduce: A Survey. [1]
  • Ordonez et al. Relational versus Non-Relational Database Systems for Data Warehousing. [2]
  • Awadallah and Graham (Cloudera and Teradata). Hadoop and the Data Warehouse: When to Use Which. [3]
  • Chen. Cheetah: A High Performance, Custom Data Warehouse on Top of MapReduce. [4]
  • Russom. Integrating Hadoop into Business Intelligence and Data Warehousing. [5]