Database System Catalog

To estimate a cost of Physical operators in a DBMS we use the following statistics:

  • $B(R)$ - # of blocks that relation $R$ holds
  • $T(R)$ - # of tuples in $R$
    • typically can be used to calculate $B(R)$ when we know how many bytes we have per block
  • $V(R, A_1, ..., A_n) = | \delta \pi_{A_1, ..., A_n} (R) |$ - # of distinct values


This statistics in DBMS is a system catalog

  • they are regularly collected (when needed, scheduled, etc)
  • and regularly revisited
  • note that this data is kept only for base relations, not for subresults of a query!


Statistics

For base relations we typically have some Histograms that show how values are distributed

Equal-Width Histogram

  • In this type of histograms the values are grouped in equal-width buckets
  • We assume that the values are distributed uniformly within there buckets

Example

range [1, 10) [11, 20) [21, 30) [31, 40) [41, 50)
# of tuples 50 2000 2000 3000 2950


See Also

Sources

Machine Learning Bookcamp: Learn machine learning by doing projects. Get 40% off with code "grigorevpc".

Share your opinion