Query Result Size Estimation

Choosing a physical operator for a Relational Algebra operator depends on

  • a particular case and statistics kept in Database System Catalog
  • note that this data is kept only for base relations, not for sub-results
    • but we need to be able to estimate them for sub-results as well!


So the goal:

  • for every internal node $n$ estimate parameters
    • $B(n)$ - the number of blocks,
    • $T(n)$ - the number of tuples,
    • $V(n, A_1, ..., A_k)$ - the number of distinct values
  • note that we can compute $B(n)$ given (1) $T(n)$ (2) size of each tuple in $n$ and (3) size of a block


for bag-based projection $\pi_L (R)$ the general formula is

  • $T(\pi_L (R)) = T(R)$: tuples are not eliminated
  • but $B(\pi_L (R))$ can change since the size of each tuple changes


Relation $R(A, B, C)$

  • $A, B$ - 4 bytes int, $C$ - 100 bytes string
  • each tuple has header 12 bytes
  • block size: 1024 bytes, and block header is 24 bytes
  • $T(R) = 10000, B(R) = 1250$
  • how many blocks needed to store $\pi_{A,B}(R)$


  • 1024 - 24 = 1000 bytes per block
  • 12 + 4 + 4 = 20 bytes per projected record
  • 1000 / 20 = 50 tuples per block
  • $B(\pi_{A,B}(R)) = T(\pi_{A,B}(R)) / 50 = 10000 / 50 = 200$

If size of records is variable, it's harder.

  • In this case usually keep some statistics to estimate the avg size of a projected record


$\sigma_p(R)$ for some filtering predicate $p$

  • estimation is $T(\sigma_p(R)) = T(R) \times \text{sel}_p (R)$
  • where $\text{sel}_p (R)$ is selectivity of predicate $p$ on relation $R$
    • or the probability that a tuple $t \in R$ will satisfy $p$
  • calculating $\text{sel}_p$ depends on the type of predicate $p$


Selection $\sigma_{A = c}(R)$ where $c$ is a constant

  • $\text{sel}_{A = c} (R) = \cfrac{1}{V(R, A)}$
  • where $V(R, A)$ is the number of distinct values in $R$
  • in this case for simplicity we assume the uniform distribution of values in $R$


  • Given: $R(A, B, C)$, $T(R) = 10000$, $V(R, A) = 50$
  • $T(\sigma_{A = 10}(R)) = \cfrac{T(R)}{V(R, A)} = \cfrac{10000}{50} = 200$

But typically Databases collect some statistics in the Database System Catalog

range [1, 10) [11, 20) [21, 30) [31, 40) [41, 50)
# of tuples 50 2000 2000 3000 2950
  • suppose we have equal-width histogram on $A$:
  • then we can estimate $\text{sel}_{A = 10} = \underbrace{\cfrac{50}{10000}}_{50 values} \times \underbrace{\cfrac{1}{10}}_{10 possible values}$


Selection $\sigma_{A < c} (R)$ where $c$ is constant

Suppose we don't have any statistics

  • in this case we apply a simple following heuristic
    • $\text{sel}_{A < c} = \cfrac{1}{2} or \text{sel}_{A < c}(R) = \cfrac{1}{3}$
  • rationale: queries with inequalities usually retrieve a small fraction of the possible tuples, not all of them


  • Given: $R(A, B, C), T(R) = 10 000$
  • estimation: $T(\sigma_{B < 100} (R)) = T(R) / 3 = 3334$

Better estimates are possible if we have some statistics

  • Given: $R(A, B, C), T(R) = 10 000$, values of $B$ lay in range [8, 57] distributed uniformly
  • Therefore $V(R, B) \leqslant 57 - 8 + 1$ - that many values of $B$ are possible
  • Estimate $\text{sel}_{B < 10} (R)$
  • only $B = 8$ and $B = 9$ satisfy $B < 10$
  • therefore $\text{sel}_{B < 10} (R) = \cfrac{2}{50} = 0.04$
  • and $T(\sigma_{B < 10} (R)) = T(R) \times 0.04 = 400$


Selection $\sigma_{A \ne c} (R)$ where $c$ is constant

  • this is the inverse of $\sigma_{A = c} (R)$
  • $\text{sel}_{A \ne c} (R) = \cfrac{V(R, A) - 1}{V(R, A)}$
  • this is estimated probability that a tuple doesn't satisfy the predicate $A = c$

Inversion (Not)

Selection $\sigma_{\text{not}(p)} (R) $

  • same as for Inequality
  • $\text{sel}_{\text{not}(p)} (R) = 1 - \text{sel}_{p} (R)$


Selection $\sigma_{p_1 \land p_2} (R)$

  • $\sigma_{p_1 \land p_2} (R) = \sigma_{p_1} \sigma_{p_2} (R) = \sigma_{p_2} \sigma_{p_1} (R)$ (order doesn't matter)
  • in this case, $\text{sel}_{p_1 \land p_2}(R) = \text{sel}_{p_1}(R) \times \text{sel}_{p_2}(R)$
  • important assumption: $p_1$ and $p_2$ are independent
    • for example, doesn't hold for $A > 100 \land A < 200$ - because the conditions are correlated in this case


  • $T(R) = 10000, V(R, A)$ = 50
  • estimate $T(\sigma_{A = 10 \land B < 10 (R)})$:
  • $T(R) \times \text{sel}_{A = 10} (R) \times \text{sel}_{B < 10} (R) = \cfrac{T(R)}{V(R, A) \times 3} = 67$


Selection $\sigma_{p_1 \lor p_2} (R)$

  • $\text{sel}_{p_1 \lor p_2} (R) = \min (\text{sel}_{p_1} (R) + \text{sel}_{p_2} (R), 1)$
    • it cannot be greater than 1
  • assumptions
    • $p_1$ and $p_2$ are independent
    • also they select disjoint sets of tuples (otherwise we would count some tuples twice)

Another way: to use De-Morgan Rule

  • $p_1 \lor p_2 \equiv \overline{\overline{p_1} \land \overline{p_2}}$ (the line over means not)
  • $\text{sel}_{p_1 \lor p_2}(R) = 1 - (1 - \text{sel}_{p_1}(R)) \times (1 - \text{sel}_{p_2}(R))$
  • in this case we also have the same assumptions

Cartesian Product

$R \times S$

The general formula is:

  • $T(R \times S) = T(R) \times T(S)$


Simple Cases

$R \Join S, R(X, Y), S(Y, Z)$ (i.e. we join on $Y$)

  1. $R$ and $S$ have no tuples in common
    • $T(R \Join S) = 0$
  2. $Y$ is a key in $S$ and a foreign key of $R$
    • each tuple of $R$ joins exactly with one tuple in $S$
    • $T(R \Join S) = T(R)$
  3. almost all tuples of $R$ and $S$ have the same $Y$ value
    • then $T(R \Join S) \approx T(R) \times T(S)$ (degenerates to a Cartesian product)

One Join Attribute

$R \Join S, R(X, Y), S(Y, Z)$ (i.e. we join on $Y$)

  • it's same as selection with predicate $R.Y = S.Y$


For other harder cases we need the following simplifications:

Containment of Value Sets

  • if $R(V, Y) \leqslant V(S, Y)$
  • then every value of $Y \in R$ will have a joining tuple with $Y \in S$
  • that means: all matched values in $X$ will have a corresponding value in $Y$ - or vice-versa

Preservation of Value Sets

  • when joining two relations, all non-matching attributes are not lost
  • i.e. they get transfered to the results
  • (if we join two relations on $Y$, $R$ has $X$ and $S$ has $Z$, then all possible values are going to occur in the output)
  • i.e. $V(R \Join S, X) = V(R, X)$ and $V(R \Join S, Z) = V(S, Z)$

Under there simplification we will consider two cases

Case 1

$V(R, Y) \leqslant V(S, Y)$ (say one-to-many relationship)

  • every tuple if $R$ has a match is $S$ by the containment assumption
  • or each tuple in $R$ has $\approx \cfrac{T(S)}{V(S, Y)}$ tuples in $S$ (assuming uniform distribution)
  • therefore $T(R \Join S) = T(R) \times \cfrac{T(S)}{V(S, Y)}$

Case 2

$V(R, Y) \geqslant V(S, Y)$ (say many-to-one relationship)

  • each tuple in $S$ has $\approx \cfrac{T(R)}{V(R, Y)}$ tuples in $R$
  • therefore $T(R \Join S) = T(S) \times \cfrac{T(R)}{V(R, Y)}$

General Case

$T(R \Join S) = \cfrac{T(S) \times T(R)}{\min(V(R, Y), V(S, Y))}$

More Join Attributes

Now assume that we join on two attributes $Y_1, Y_2$:

  • $R(X, Y_1, Y_2) \Join S(Y_1, Y_2, Z)$
  • (under the same assumptions)
  • same as selection with predicate with AND: $R.Y_1 = S.Y_1 \land R.Y_2 = S.Y_2$

Case 1

$V(R, Y_1) \leqslant V(S, Y_1)$ and $V(R, Y_2) \leqslant V(S, Y_2)$

  • a tuple in $R$ has $\cfrac{1}{V(S, Y_1)} \times \cfrac{1}{V(S, Y_2)}$ chance of joining with a tuple in $S$
    • (again assuming uniform distribution)
  • therefore $T(R \Join S) = T(R) \times \cfrac{T(S)}{V(S, Y_1) \times V(S, Y_2)}$

Case 2

$V(S, Y_1) \leqslant V(R, Y_1)$ and $V(S, Y_2) \leqslant V(R, Y_2)$

  • symmetric to Case 1
  • chance of tuple from $S$ joining with $R$ is $\cfrac{1}{V(R, Y_1)} \times \cfrac{1}{V(R, Y_2)}$
  • therefore $T(R \Join S) = T(S) \times \cfrac{T(R)}{V(R, Y_1) \times V(R, Y_2)}$

Case 3

$V(R, Y_1) \leqslant V(S, Y_1)$ but $V(S, Y_2) \leqslant V(R, Y_2)$

  • chance of tuple from $R$ joining with $S$ is $\cfrac{1}{V(R, Y_1)} \times \cfrac{1}{V({\color{blue}{S}}, Y_2)}$
  • therefore $T(R \Join S) = T(R) \times \cfrac{T(S)}{V(R, Y_1) \times V(S, Y_2)}$

Case 4

$V(S, Y_1) \leqslant V(R, Y_1)$ but $V(R, Y_2) \leqslant V(S, Y_2)$

  • chance of tuple from $R$ joining with $S$ is $\cfrac{1}{V({\color{blue}{S}}, Y_1)} \times \cfrac{1}{V(R, Y_2)}$
  • therefore $T(R \Join S) = T(R) \times \cfrac{T(S)}{V(S, Y_1) \times V(R, Y_2)}$

General Formula

$T(R \Join S) = \cfrac{T(R) \times T(S)}{\max(V(R, Y_1), V(S, Y_1) \times \max(V(R, Y_2), V(S, Y_2)}$

This formula generalizes to more than 2 joining attributes

See also


Machine Learning Bookcamp: Learn machine learning by doing projects. Get 40% off with code "grigorevpc".

Share your opinion