Conjunctive Query

A Conjunctive Query (CQ) is


A CQ is an expression of the following form:

$\underbrace{Q(x_1, ..., x_n)}_{\text{head}} \leftarrow \underbrace{R(a_1, ..., a_m), ..., S(c_1, ..., c_k)}_{\text{body}}$

  • it consists of two parts: head and body
  • $V = (a_1, ..., a_m, ..., c_1, ..., c_k)$ - variables and constants from the body
  • $X = (x_1, ..., x_n)$ is a set of variables from the head
    • $X \subseteq V$
    • the variables from $X$ are called distinguished variables (also free or placeholder variables)
  • $Y = V - X$ - variables from the body which aren't in the head
    • $Y \subseteq V$
    • variables from $Y$ are called existential variables (also bound variables)
  • note that the head doesn't necessarily have to contain only variables, it might as well contain constants
    • these constants will be returned in the results


$R(a_1, ..., a_m)$ is called an atom

  • if an atom does not contain any variables, only constants, it's a fact
  • so we can view a database as a set of facts


Alternative vector form:

  • $Q( \vec{x} ) \leftarrow A_1(\vec{u}_1), \ ..., \ A_k(\vec{u}_k)$


Example

Suppose we have the following database

[math]R \\ \begin{array}{| c c |} \hline 1 & 2 \\ 2 & 3 \\ 2 & 5 \\ 6 & 7 \\ 7 & 5 \\ 5 & 5 \\ \hline \end{array}[/math][math]S \\ \begin{array}{| c |} \hline 2 \\ 7 \\ \hline \end{array}[/math]

the query $Q(x, y) \leftarrow R(x, y), R(y, 5), S(y)$ wants to retrieve all pairs $(x, y)$ s.t.

  • $(x, y)$ occurs in $R$
  • $y$ then occurs in $R$ together with 5
  • and that $y$ occurs as a value in $S$


Substitution

def: a substitution $f$ of $Q$ into database $D$ is

a function that maps all variables from $Q$ to constants from $D$

def: a substitution $f$ of $Q$ into a database $D$ is a matching if

$f(\text{body}) \subseteq D$, i.e. if we evaluate $f$ on the body of the query $Q$, we will get a subset of $D$
so when we apply $f(\text{head})$ we get a tuple that is a part of the result of evaluating $Q$ on $D$

Then the result of a conjunctive query can be shown as:

  • $Q(D) \leftarrow \{ f(\text{head}) | f \text{ is a matching of $Q$ to $D$ } \}$


Let's consider the previous example: $Q(x, y) \leftarrow \underbrace{R(x, y)}_{(1)}, \underbrace{R(y, 5)}_{(2)}, \underbrace{S(y)}_{(3)}$ for the same database

Example with matching

  1. $(1)$ gets a tuple form $R$
    • say it's (1, 2)
    • so it assigns $x \mapsto 1, y \mapsto 2$
    so our substitution so far is $f: x \mapsto 1, y \mapsto 2$
  2. $(2)$ looks for a tuple in $R$ with $y = 2$
    • because we established the matching $f$ with $y \mapsto 2$
    • second value of this tuple must be a constant 5
    • we find such tuple: it's (2, 5)
    • nothing gets assigned at this step
  3. $(3)$ looks for a value in $S$ with $y = 2$
    • same reason: we have $y \mapsto 2$
    • it finds a fact $S(2)$ in the database
  4. so it returns a matching pair (1, 2)
    • we say there is a matching (1, 2) in the database $D$


Example without matching

  1. $(1)$ we try another tuple from $R$, say (7, 5)
    • we assign $x \mapsto 7, y \mapsto 5$
  2. $(2)$ now we try to find a tuple ($y$, 5) = (5, 5) in $R$
    • no such tuple
  3. therefor (7, 5) is not a matching and will not be returned by $Q$


So this way we try all values of our database table and return only matching ones

  • for this particular example the query returns two tuples: (1, 2) and (6, 7)
  • i.e. $Q(D) = \{ (1, 2), (6, 7) \}$


Translation

First Order Logic

  • CQs can be translated to First Order Logic expressions
  • for query $q(x_1, ..., x_n) = A_1(...), \ ..., \ A_n(...)$
  • FOL expressions is $\{ \, x_1, ..., x_n \ \mid \ \exists \ y_1, ..., y_m : A_1(...) \ \land \ ... \ \land \ A_n(...) \, \}$
    • $x_1, ..., x_n$ - distinguished variables, and $y_1, ..., y_m$ are existential


Relational Algebra


Properties

CQs have an interesting property

  • The query containment problem is undecidable for SQL and Relational Algebra (see Logical Query Plan Optimization), but it is decidable for Conjunctive Queries
  • The decidability of containment is NP-complete problem, but usually CQs are not big, so it is acceptable

This makes CQs very suitable for Logical Query Plan Optimization, namely, for removing redundant joins


Containment and Equivalence

Containment

$Q_1$ is contained in $Q_2$ (denoted as $Q_1 \subseteq Q_2$) if

  • for any database $D$ holds $Q_1(D) \subseteq Q_2(D)$
  • i.e. the result of $Q_1$ is always a subset of the result of $Q_2$, no matter what database $D$ they are evaluated on

$Q_1$ is equivalent to $Q_2$ (denoted as $Q_1 \equiv Q_2$)

  • iff $Q_1 \subseteq Q_2 \land Q_2 \subseteq Q_1$

Example

  • $A(x, y) \leftarrow R(x, w), G(w, z), R(z, y)$
  • $B(x, w) \leftarrow R(x, w), G(w, w), R(w, y)$
  • $B \subseteq A$

To show that we use the definition

  • let $D$ be an arbitrary DB and
  • let $t \in B(D)$ (one arbitrary result of $B$ evaluated on $D$)
  • there exists a matching $f$ of $B$ into $D$ s.t. $t = (f(x), f(y))$ (by definition of matching)
    • so we need to show that $t = (f(x), f(y)) \in A(D)$
  • let $h$ be a substitution:
    • $h: x \mapsto f(x), y \mapsto f(y), w \mapsto f(w), z \mapsto f(w)$
  • then $h$ is a matching of $A$ into $D$
    • $t = (f(x), f(y)) = (h(x), h(y)) \in A$


It is also possible to show that containment is decidable.


Homomorphism

def: a homomorphism of $Q_2$ to $Q_1$ is

  • a function $h$ that maps each variable in $Q_2$ to either
    • a variable from $Q_1$ or
    • a constant form $Q_1$
  • s.t.
    • $h(\text{head}_2) = \text{head}_1$
    • $h(\text{body}_2) \subseteq \text{body}_1$
    • i.e. the values returned by queries are always the same, and body of one query is a subset of the other's query body


Examples


Containment Theorem

Thm: $Q_1 \subset Q_2 \iff $ there's a homomorphism from $Q_2$ to $Q_1$

Proof of $\Leftarrow$ (if)

  • let $h: Q_2 \to Q_1$ be a homomorphism
  • let $D$ be a database
  • if we fix an arbitrary tuple $t \in Q_1(D)$, we need to prove that $t \in Q_2(D)$
  • since $t \in Q_1(D)$ we know that
    • $t = f(\text{head}_1)$
    • with a matching $f$ of $Q_1$ into $D$
    • (by semantics of CQs)
  • let's consider a composition $f \circ h$ (composition of $f$ with the homomorphism $h$)
    • this is a substitution of $Q_2$ into $D$! (we fist applied the homomorphism and then the matching - and got the substitution)
    • since $h$ is a homomorphism, $h(\text{body}_2) \subseteq \text{body}_2$ (by def of homomorphism)
    • $\to$ $f\big(h(\text{body}_2)\big) \subseteq f(\text{body}_1) \subseteq D$
    • in other words, $f \circ h$ is a matching of $Q_2$ into $D$
  • hence
    • $f\big(h(\text{body}_2) \big) \in Q_2(D)$
    • and finally $t = f(\text{head}_1) = f\big(h(\text{head}_2) \big) \in Q_2(D)$

Proof of $\Rightarrow$ (only if)

  • suppose that $Q_1 \subseteq Q_2$
  • let's consider variables in $Q_1$ as constants
    • we can view $\text{body}_1$ as a mini-database $D_{Q_1}$
    • (this database is called a canonical database of $Q_1$)
  • the identity function is a matching of $Q_1$ to $D_{Q_1}$
    • hence $\text{head}_1 \in Q_1(D_{Q_1})$
  • since $Q_1 \subseteq Q_2$ we know that $\text{head}_1 \in Q_2(D_{Q_1})$
    • by construction of our database $D_{Q_1}$
  • so there exists a matching $f$ from $Q_2$ to $D_{Q_1}$ s.t.
    • $\text{head}_1 = f(\text{head}_2)$ and
    • $f(\text{body}_2) \subseteq D_{Q_1} = \text{body}_1$
  • $f$ is a homomorphism of $Q_2$ to $Q_1$ by the definition (by considering variables again as variables)

$\blacksquare$


From the second part of the proof we may get a way of checking for containment: the Golden Method


The Golden Method

To decide whether $Q_1 \subseteq Q_2$

  • evaluate $Q_2$ on a canonical database $D_{Q_1}$ (which is a body of $Q_1$, see #Containment Theorem)
  • check if the head of $Q_1$ is in the results


QueryContainment($Q_1$, $Q_2$)

  • input
    • $Q_1(\vec{x}) \leftarrow g_1(\vec{x}_1), ..., g_n(\vec{x}_n)$
    • $Q_2(\vec{y}) \leftarrow h_1(\vec{y}_1), ..., h_m(\vec{y}_m)$
  • freeze $Q_1$: construct a canonical database $D_{Q_1} \equiv \{ g_i \big( v( \vec{x}_i ) \big) \}$
    • with $v$ being a matching
  • if $v(\vec{x}) \in Q_2(D_{Q_1})$ return yes, otherwise no


Exercise


Sources

Machine Learning Bookcamp: Learn machine learning by doing projects. Get 40% off with code "grigorevpc".

Share your opinion