# ML Wiki

## Logical Query Plan Optimization

Translating SQL to RA expression is the first step in Query Processing Pipeline

• Input: SQL
• Output: Logical Query Plan - expression in Extended Relational Algebra

### Example

Suppose we have this query

SELECT DISTINCT x.name, z.name
FROM Product x, Purchase y, Customer z
WHERE x.pid = y.pid AND y.cid = z.cid AND
x.price > 100 AND z.city = 'Seattle'


We translate it to the following expression:

But there is a more optimal way to obtain the same results

The process of finding a cheaper equivalent expression is called (logical) query optimization

## Optimality

• every node needs executing
• hence, the fewer nodes we have, the faster the execution

A Relational Algebra expression $e$ is optimal if there is no other expression $e'$ s.t.

• $e'$ is equivalent to $e$ (i.e. for every database $D:$ $e(D) = e'(D)$)
• $e'$ is shorten (i.e. has fewer operations)

### The Optimization Problem

The Optimization Problem

• input: a RA expression $e$
• output: the optimal expression $e'$ s.t. $e \equiv e'$

Undecidability

• This problem in undecidable: on some expressions it may run forever.
• However we can optimize plans of a particular form

## Select-Project-Join Expression

In practice, most queries are of the form Select-Project-Join (SPJ)

Example

• $\pi_\text{...} \sigma_{A_1 = B_1 \land ... \land A_n = B_n} (R_1 \times ... \times R_n)$
• only selections, projections and joins,
• for selections only equalities are used as predicates

And we can optimize this kind of queries!

### Example

SELECT movieTitle FROM StarsIn S1
WHERE starName IN (
SELECT name
FROM MovieStar, StarsIn S2
WHERE birthdate = 1960
ANDS2.movieTitle = S1.movieTitle)


Note that this query is equivalent to

SELECT movieTitle FROM StarsIn
WHERE starName IN (
SELECT name
FROM MovieStar
WHERE birthdate = 1960)

This one has one join less to execute (and the join is the most expensive operation!)

Why?

• the first query may be a result of view expansion (the subquery is actually a view that is expanded for query evaluation)
• careless programmers

It is possible to automatically translate from first kind of query to second one

• the process is called removing redundant joins

### Removing Redundant Joins

The problem:

• given SPJ expression $e$
• eliminate as many joins as possible
• and return equivalent expression $e'$

To do that we could exploit one of the properties of Conjunctive Querys:

• the containment problem (and therefore the equivalence problem) is decidable in them

The algorithm to remove redundant joins is as follows:

This will eliminate the redundant joins in an RA expression

## Heuristics

To optimize an RA expression further (after eliminating redundant joins) we can use some heuristics

For relations $R(A, B), S(C, D)$ consider the following expression

• $\pi_A \sigma_{A = 5 \land B < D} (R \times S)$

### Pushing Selection

• we want to have selections as close as possible to the actual tables
• so this way we eliminate some tuples before joining them

$\pi_A \sigma_{B < D} \big( \sigma_{A = 5}(R) \times S \big)$

### Recognizing Joins

• we replace Cartesian product with Joins
• to help the optimizer assign the best physical operation to this join

$\pi_A \big( \sigma_{A = 5}(R) \Join_{B < D} S \big)$

### Introduce Projections

• in this example we see that we need only attribute $D$ in the relation $S$
• so we can introduce projections to save memory

$\pi_A \big( \sigma_{A = 5}(R) \Join_{B < D} \pi_D (S) \big)$