# ML Wiki

## Conjunctive Query

A Conjunctive Query (CQ) is

A CQ is an expression of the following form:

$\underbrace{Q(x_1, ..., x_n)}_{\text{head}} \leftarrow \underbrace{R(a_1, ..., a_m), ..., S(c_1, ..., c_k)}_{\text{body}}$

• it consists of two parts: head and body
• $V = (a_1, ..., a_m, ..., c_1, ..., c_k)$ - variables and constants from the body
• $X = (x_1, ..., x_n)$ is a set of variables from the head
• $X \subseteq V$
• the variables from $X$ are called distinguished variables (also free or placeholder variables)
• $Y = V - X$ - variables from the body which aren't in the head
• $Y \subseteq V$
• variables from $Y$ are called existential variables (also bound variables)
• note that the head doesn't necessarily have to contain only variables, it might as well contain constants
• these constants will be returned in the results

$R(a_1, ..., a_m)$ is called an atom

• if an atom does not contain any variables, only constants, it's a fact
• so we can view a database as a set of facts

Alternative vector form:

• $Q( \vec{x} ) \leftarrow A_1(\vec{u}_1), \ ..., \ A_k(\vec{u}_k)$

### Example

Suppose we have the following database

$R \\ \begin{array}{| c c |} \hline 1 & 2 \\ 2 & 3 \\ 2 & 5 \\ 6 & 7 \\ 7 & 5 \\ 5 & 5 \\ \hline \end{array}$$S \\ \begin{array}{| c |} \hline 2 \\ 7 \\ \hline \end{array}$

the query $Q(x, y) \leftarrow R(x, y), R(y, 5), S(y)$ wants to retrieve all pairs $(x, y)$ s.t.

• $(x, y)$ occurs in $R$
• $y$ then occurs in $R$ together with 5
• and that $y$ occurs as a value in $S$

### Substitution

def: a substitution $f$ of $Q$ into database $D$ is

a function that maps all variables from $Q$ to constants from $D$

def: a substitution $f$ of $Q$ into a database $D$ is a matching if

$f(\text{body}) \subseteq D$, i.e. if we evaluate $f$ on the body of the query $Q$, we will get a subset of $D$
so when we apply $f(\text{head})$ we get a tuple that is a part of the result of evaluating $Q$ on $D$

Then the result of a conjunctive query can be shown as:

• $Q(D) \leftarrow \{ f(\text{head}) | f \text{ is a matching of$Q$to$D$} \}$

Let's consider the previous example: $Q(x, y) \leftarrow \underbrace{R(x, y)}_{(1)}, \underbrace{R(y, 5)}_{(2)}, \underbrace{S(y)}_{(3)}$ for the same database

Example with matching

1. $(1)$ gets a tuple form $R$
• say it's (1, 2)
• so it assigns $x \mapsto 1, y \mapsto 2$
so our substitution so far is $f: x \mapsto 1, y \mapsto 2$
2. $(2)$ looks for a tuple in $R$ with $y = 2$
• because we established the matching $f$ with $y \mapsto 2$
• second value of this tuple must be a constant 5
• we find such tuple: it's (2, 5)
• nothing gets assigned at this step
3. $(3)$ looks for a value in $S$ with $y = 2$
• same reason: we have $y \mapsto 2$
• it finds a fact $S(2)$ in the database
4. so it returns a matching pair (1, 2)
• we say there is a matching (1, 2) in the database $D$

Example without matching

1. $(1)$ we try another tuple from $R$, say (7, 5)
• we assign $x \mapsto 7, y \mapsto 5$
2. $(2)$ now we try to find a tuple ($y$, 5) = (5, 5) in $R$
• no such tuple
3. therefor (7, 5) is not a matching and will not be returned by $Q$

So this way we try all values of our database table and return only matching ones

• for this particular example the query returns two tuples: (1, 2) and (6, 7)
• i.e. $Q(D) = \{ (1, 2), (6, 7) \}$

### Translation

• CQs can be translated to First Order Logic expressions
• for query $q(x_1, ..., x_n) = A_1(...), \ ..., \ A_n(...)$
• FOL expressions is $\{ \, x_1, ..., x_n \ \mid \ \exists \ y_1, ..., y_m : A_1(...) \ \land \ ... \ \land \ A_n(...) \, \}$
• $x_1, ..., x_n$ - distinguished variables, and $y_1, ..., y_m$ are existential

### Properties

CQs have an interesting property

• The query containment problem is undecidable for SQL and Relational Algebra (see Logical Query Plan Optimization), but it is decidable for Conjunctive Queries
• The decidability of containment is NP-complete problem, but usually CQs are not big, so it is acceptable

This makes CQs very suitable for Logical Query Plan Optimization, namely, for removing redundant joins

## Containment and Equivalence

### Containment

$Q_1$ is contained in $Q_2$ (denoted as $Q_1 \subseteq Q_2$) if

• for any database $D$ holds $Q_1(D) \subseteq Q_2(D)$
• i.e. the result of $Q_1$ is always a subset of the result of $Q_2$, no matter what database $D$ they are evaluated on

$Q_1$ is equivalent to $Q_2$ (denoted as $Q_1 \equiv Q_2$)

• iff $Q_1 \subseteq Q_2 \land Q_2 \subseteq Q_1$

Example

• $A(x, y) \leftarrow R(x, w), G(w, z), R(z, y)$
• $B(x, w) \leftarrow R(x, w), G(w, w), R(w, y)$
• $B \subseteq A$

To show that we use the definition

• let $D$ be an arbitrary DB and
• let $t \in B(D)$ (one arbitrary result of $B$ evaluated on $D$)
• there exists a matching $f$ of $B$ into $D$ s.t. $t = (f(x), f(y))$ (by definition of matching)
• so we need to show that $t = (f(x), f(y)) \in A(D)$
• let $h$ be a substitution:
• $h: x \mapsto f(x), y \mapsto f(y), w \mapsto f(w), z \mapsto f(w)$
• then $h$ is a matching of $A$ into $D$
• $t = (f(x), f(y)) = (h(x), h(y)) \in A$

It is also possible to show that containment is decidable.

### Homomorphism

def: a homomorphism of $Q_2$ to $Q_1$ is

• a function $h$ that maps each variable in $Q_2$ to either
• a variable from $Q_1$ or
• a constant form $Q_1$
• s.t.
• $h(\text{head}_2) = \text{head}_1$
• $h(\text{body}_2) \subseteq \text{body}_1$
• i.e. the values returned by queries are always the same, and body of one query is a subset of the other's query body

Examples

### Containment Theorem

Thm: $Q_1 \subset Q_2 \iff$ there's a homomorphism from $Q_2$ to $Q_1$

Proof of $\Leftarrow$ (if)

• let $h: Q_2 \to Q_1$ be a homomorphism
• let $D$ be a database
• if we fix an arbitrary tuple $t \in Q_1(D)$, we need to prove that $t \in Q_2(D)$
• since $t \in Q_1(D)$ we know that
• $t = f(\text{head}_1)$
• with a matching $f$ of $Q_1$ into $D$
• (by semantics of CQs)
• let's consider a composition $f \circ h$ (composition of $f$ with the homomorphism $h$)
• this is a substitution of $Q_2$ into $D$! (we fist applied the homomorphism and then the matching - and got the substitution)
• since $h$ is a homomorphism, $h(\text{body}_2) \subseteq \text{body}_2$ (by def of homomorphism)
• $\to$ $f\big(h(\text{body}_2)\big) \subseteq f(\text{body}_1) \subseteq D$
• in other words, $f \circ h$ is a matching of $Q_2$ into $D$
• hence
• $f\big(h(\text{body}_2) \big) \in Q_2(D)$
• and finally $t = f(\text{head}_1) = f\big(h(\text{head}_2) \big) \in Q_2(D)$

Proof of $\Rightarrow$ (only if)

• suppose that $Q_1 \subseteq Q_2$
• let's consider variables in $Q_1$ as constants
• we can view $\text{body}_1$ as a mini-database $D_{Q_1}$
• (this database is called a canonical database of $Q_1$)
• the identity function is a matching of $Q_1$ to $D_{Q_1}$
• hence $\text{head}_1 \in Q_1(D_{Q_1})$
• since $Q_1 \subseteq Q_2$ we know that $\text{head}_1 \in Q_2(D_{Q_1})$
• by construction of our database $D_{Q_1}$
• so there exists a matching $f$ from $Q_2$ to $D_{Q_1}$ s.t.
• $\text{head}_1 = f(\text{head}_2)$ and
• $f(\text{body}_2) \subseteq D_{Q_1} = \text{body}_1$
• $f$ is a homomorphism of $Q_2$ to $Q_1$ by the definition (by considering variables again as variables)

$\blacksquare$

From the second part of the proof we may get a way of checking for containment: the Golden Method

### The Golden Method

To decide whether $Q_1 \subseteq Q_2$

• evaluate $Q_2$ on a canonical database $D_{Q_1}$ (which is a body of $Q_1$, see #Containment Theorem)
• check if the head of $Q_1$ is in the results

QueryContainment($Q_1$, $Q_2$)

• input
• $Q_1(\vec{x}) \leftarrow g_1(\vec{x}_1), ..., g_n(\vec{x}_n)$
• $Q_2(\vec{y}) \leftarrow h_1(\vec{y}_1), ..., h_m(\vec{y}_m)$
• freeze $Q_1$: construct a canonical database $D_{Q_1} \equiv \{ g_i \big( v( \vec{x}_i ) \big) \}$
• with $v$ being a matching
• if $v(\vec{x}) \in Q_2(D_{Q_1})$ return yes, otherwise no

## Sources

Machine Learning Bookcamp: Learn machine learning by doing projects. Get 40% off with code "grigorevpc".