Data Warehouse

Definition

A data warehouse is a storage with the following four characteristics: it's subject-oriented, integrated, time-variant and non-volatile.

Subject-Oriented

  • A data warehouse is organized around a particular subject area.
  • For example, around "sales".

Integrated

  • Data comes from different sources and is integrated into one.
  • Integrated also implies consistency
  • For example, in source A and source B products have different identifiers, but in a data warehouse data from all sources have adhere to the same way of identifying.
  • Another example: date may be stored in different format, but it's converted to the single common format in a data warehouse

Time-Variant

  • It keeps historical data.
  • We can retrieve 3-month-old data, 6-month, one year and even older data from a data warehouse.
  • In contrast to a transactional system, where typically only the most recent data is kept.
  • For example, in OLTP system we have the most recent address of a customer, while in a data warehouse we keep all the history (see Slowly Changing Dimensions)

Non-volatile

  • Once we put data into a data warehouse, we never change it.
  • If it's there - it's there forever.

Alternative Definition

A Data Warehousing is a platform for supplying clean, standardized, dimensional, aggregated data


Goal

The main role of data warehouses is to support decision making process.

  • i.e. answer essential business questions


Features

Best for most BI deliverables

Other systems, like Hadoop, are not good at this.


Data Warehouses on Top of Relational Databases

Usually DWs are built on RBDs

  • this structure dominates in data warehousing
  • so far it's best technology to manage and analyze DWs
    • very mature here

RDBMSs are likely to remain standard in Data Warehousing worlds


Language

For Data Warehousing purposes we can use

  • standard SQL (joins, group by, etc)
  • ROLAP - SQL extensions for OLAP (group by cube, ect)
  • Queries are ad-hoc


Performance

Data Warehouses (especially expensive solutions) are very effective performance-wise.

There are a lot of techniques for speeding up query executions


Disadvantages

Nowadays there are lots of not structured data

Difficult to integrate Data Mining Algorithms

  • Data Mining typically happens outside of RDBs

It's costly

  • costly to run scalable parallel RDBs
  • need expensive specialized hardware (like IBMs Aster)

Amounts of Data

  • new data arrive to fast to process


See Also

Sources