Logical Query Plan Optimization

query-processing-1st.png

Translating SQL to RA expression is the first step in Query Processing Pipeline

  • Input: SQL
  • Output: Logical Query Plan - expression in Extended Relational Algebra


Example

Suppose we have this query

SELECT DISTINCT x.name, z.name
FROM Product x, Purchase y, Customer z
WHERE x.pid = y.pid AND y.cid = z.cid AND
      x.price > 100 AND z.city = 'Seattle'

We translate it to the following expression:

  • logical-query-plan-ex2.png


But there is a more optimal way to obtain the same results

  • logical-query-plan-ex2-opt.png

The process of finding a cheaper equivalent expression is called (logical) query optimization


Optimality

  • every node needs executing
  • hence, the fewer nodes we have, the faster the execution

A Relational Algebra expression $e$ is optimal if there is no other expression $e'$ s.t.

  • $e'$ is equivalent to $e$ (i.e. for every database $D:$ $e(D) = e'(D)$)
  • $e'$ is shorten (i.e. has fewer operations)


The Optimization Problem

The Optimization Problem

  • input: a RA expression $e$
  • output: the optimal expression $e'$ s.t. $e \equiv e'$

Undecidability

  • This problem in undecidable: on some expressions it may run forever.
  • However we can optimize plans of a particular form


Select-Project-Join Expression

In practice, most queries are of the form Select-Project-Join (SPJ)

Example

  • $\pi_\text{...} \sigma_{A_1 = B_1 \land ... \land A_n = B_n} (R_1 \times ... \times R_n)$
  • only selections, projections and joins,
  • for selections only equalities are used as predicates

And we can optimize this kind of queries!


Example

SELECT movieTitle FROM StarsIn S1
WHERE starName IN (
    SELECT name
    FROM MovieStar, StarsIn S2
    WHERE birthdate = 1960
    ANDS2.movieTitle = S1.movieTitle)

Note that this query is equivalent to

SELECT movieTitle FROM StarsIn
WHERE starName IN (
    SELECT name
    FROM MovieStar
    WHERE birthdate = 1960)

This one has one join less to execute (and the join is the most expensive operation!)

Why?

  • the first query may be a result of view expansion (the subquery is actually a view that is expanded for query evaluation)
  • careless programmers

It is possible to automatically translate from first kind of query to second one

  • the process is called removing redundant joins


Removing Redundant Joins

The problem:

  • given SPJ expression $e$
  • eliminate as many joins as possible
  • and return equivalent expression $e'$

To do that we could exploit one of the properties of Conjunctive Querys:

  • the containment problem (and therefore the equivalence problem) is decidable in them


The algorithm to remove redundant joins is as follows:

This will eliminate the redundant joins in an RA expression


Heuristics

To optimize an RA expression further (after eliminating redundant joins) we can use some heuristics

For relations $R(A, B), S(C, D)$ consider the following expression

  • $\pi_A \sigma_{A = 5 \land B < D} (R \times S)$


Pushing Selection

  • we want to have selections as close as possible to the actual tables
  • so this way we eliminate some tuples before joining them

$\pi_A \sigma_{B < D} \big( \sigma_{A = 5}(R) \times S \big)$


Recognizing Joins

  • we replace Cartesian product with Joins
  • to help the optimizer assign the best physical operation to this join

$\pi_A \big( \sigma_{A = 5}(R) \Join_{B < D} S \big)$


Introduce Projections

  • in this example we see that we need only attribute $D$ in the relation $S$
  • so we can introduce projections to save memory

$\pi_A \big( \sigma_{A = 5}(R) \Join_{B < D} \pi_D (S) \big)$


Integrated Exercises

See LQP Optimization Exercises (DBSA)



Sources