A data warehouse is a storage with the following four characteristics: it's subject-oriented, integrated, time-variant and non-volatile.
- A data warehouse is organized around a particular subject area.
- For example, around "sales".
- Data comes from different sources and is integrated into one.
- Integrated also implies consistency
- For example, in source A and source B products have different identifiers, but in a data warehouse data from all sources have adhere to the same way of identifying.
- Another example: date may be stored in different format, but it's converted to the single common format in a data warehouse
- It keeps historical data.
- We can retrieve 3-month-old data, 6-month, one year and even older data from a data warehouse.
- In contrast to a transactional system, where typically only the most recent data is kept.
- For example, in OLTP system we have the most recent address of a customer, while in a data warehouse we keep all the history (see Slowly Changing Dimensions)
- Once we put data into a data warehouse, we never change it.
- If it's there - it's there forever.
A Data Warehousing is a platform for supplying clean, standardized, dimensional, aggregated data
The main role of data warehouses is to support decision making process.
- i.e. answer essential business questions
Best for most BI deliverables
Other systems, like Hadoop, are not good at this.
Usually DWs are built on RBDs
- this structure dominates in data warehousing
- so far it's best technology to manage and analyze DWs
RDBMSs are likely to remain standard in Data Warehousing worlds
For Data Warehousing purposes we can use
- standard SQL (joins, group by, etc)
- ROLAP - SQL extensions for OLAP (group by cube, ect)
- Queries are ad-hoc
Data Warehouses (especially expensive solutions) are very effective performance-wise.
There are a lot of techniques for speeding up query executions
Nowadays there are lots of not structured data
Difficult to integrate Data Mining Algorithms
- Data Mining typically happens outside of RDBs
- costly to run scalable parallel RDBs
- need expensive specialized hardware (like IBMs Aster)
Amounts of Data
- new data arrive to fast to process