To estimate a cost of Physical operators in a DBMS we use the following statistics:

- $B(R)$ - # of blocks that relation $R$ holds
- $T(R)$ - # of tuples in $R$
- typically can be used to calculate $B(R)$ when we know how many bytes we have per block

- $V(R, A_1, ..., A_n) = | \delta \pi_{A_1, ..., A_n} (R) |$ - # of distinct values

This statistics in DBMS is a *system catalog*

- they are regularly collected (when needed, scheduled, etc)
- and regularly revisited
- note that this data is kept only for base relations, not for subresults of a query!

For base relations we typically have some Histograms that show how values are distributed

- In this type of histograms the values are grouped in equal-width buckets
- We assume that the values are distributed uniformly within there buckets

Example

range | [1, 10) | [11, 20) | [21, 30) | [31, 40) | [41, 50) |
---|---|---|---|---|---|

# of tuples | 50 | 2000 | 2000 | 3000 | 2950 |

- Database Systems Architecture (ULB)
- Database Systems: The Complete Book (2nd edition) by H. Garcia-Molina, J. D. Ullman, and J. Widom