|
|
Line 33: |
Line 33: |
| Suppose we have the following database | | Suppose we have the following database |
| | | |
− | $
| + | <math>R \\ |
− | R \\ | + | |
| \begin{array}{| c c |} | | \begin{array}{| c c |} |
| \hline | | \hline |
Line 44: |
Line 43: |
| 5 & 5 \\ | | 5 & 5 \\ |
| \hline | | \hline |
− | \end{array} | + | \end{array}</math><math>S \\ |
− | $
| + | |
− | $
| + | |
− | S \\ | + | |
| \begin{array}{| c |} | | \begin{array}{| c |} |
| \hline | | \hline |
Line 53: |
Line 49: |
| 7 \\ | | 7 \\ |
| \hline | | \hline |
− | \end{array} | + | \end{array}</math> |
− | $
| + | |
| | | |
| the query $Q(x, y) \leftarrow R(x, y), R(y, 5), S(y)$ wants to retrieve all pairs $(x, y)$ s.t. | | the query $Q(x, y) \leftarrow R(x, y), R(y, 5), S(y)$ wants to retrieve all pairs $(x, y)$ s.t. |
Line 110: |
Line 105: |
| * CQs can be translated to [[First Order Logic]] expressions | | * CQs can be translated to [[First Order Logic]] expressions |
| * for query $q(x_1, ..., x_n) = A_1(...), \ ..., \ A_n(...)$ | | * for query $q(x_1, ..., x_n) = A_1(...), \ ..., \ A_n(...)$ |
− | * FOL expressions is $\{ x_1, ..., x_n \ | \ \exists \ y_1, ..., y_m : A_1(...) \ \land \ ... \ \land \ A_n(...) \}$ | + | * FOL expressions is $\{ \, x_1, ..., x_n \ \mid \ \exists \ y_1, ..., y_m : A_1(...) \ \land \ ... \ \land \ A_n(...) \, \}$ |
| ** $x_1, ..., x_n$ - distinguished variables, and $y_1, ..., y_m$ are existential | | ** $x_1, ..., x_n$ - distinguished variables, and $y_1, ..., y_m$ are existential |
| | | |
Line 185: |
Line 180: |
| ** this is a substitution of $Q_2$ into $D$! (we fist applied the homomorphism and then the matching - and got the substitution) | | ** this is a substitution of $Q_2$ into $D$! (we fist applied the homomorphism and then the matching - and got the substitution) |
| ** since $h$ is a homomorphism, $h(\text{body}_2) \subseteq \text{body}_2$ (by def of homomorphism) | | ** since $h$ is a homomorphism, $h(\text{body}_2) \subseteq \text{body}_2$ (by def of homomorphism) |
− | ** $\to$ $f(h(\text{body}_2)) \subseteq f(\text{body}_1) \subseteq D$ | + | ** $\to$ $f\big(h(\text{body}_2)\big) \subseteq f(\text{body}_1) \subseteq D$ |
| ** in other words, $f \circ h$ is a matching of $Q_2$ into $D$ | | ** in other words, $f \circ h$ is a matching of $Q_2$ into $D$ |
| * hence | | * hence |
− | ** $f(h(\text{body}_2)) \in Q_2(D)$ | + | ** $f\big(h(\text{body}_2) \big) \in Q_2(D)$ |
− | ** and finally $t = f(\text{head}_1) = f(h(\text{head}_2)) \in Q_2(D)$ | + | ** and finally $t = f(\text{head}_1) = f\big(h(\text{head}_2) \big) \in Q_2(D)$ |
| | | |
| Proof of $\Rightarrow$ (only if) | | Proof of $\Rightarrow$ (only if) |
Line 234: |
Line 229: |
| * [[Database Systems Architecture (ULB)]] | | * [[Database Systems Architecture (ULB)]] |
| * Database Systems Architecture lecture notes #2 by S. Vansummeren [https://dl.dropboxusercontent.com/sh/r0zvy3zaycbevx8/U0XnqCSwGZ/lect2-notes-conjunctive.pdf] | | * Database Systems Architecture lecture notes #2 by S. Vansummeren [https://dl.dropboxusercontent.com/sh/r0zvy3zaycbevx8/U0XnqCSwGZ/lect2-notes-conjunctive.pdf] |
− | * Web Data Management book [http://webdam.inria.fr/Jorge] | + | * [[Web Data Management (book)]] |
− | | + | |
| | | |
| [[Category:Relational Databases]] | | [[Category:Relational Databases]] |
Latest revision as of 15:35, 23 November 2015
Conjunctive Query
A Conjunctive Query (CQ) is
A CQ is an expression of the following form:
$\underbrace{Q(x_1, ..., x_n)}_{\text{head}} \leftarrow
\underbrace{R(a_1, ..., a_m), ..., S(c_1, ..., c_k)}_{\text{body}}$
- it consists of two parts: head and body
- $V = (a_1, ..., a_m, ..., c_1, ..., c_k)$ - variables and constants from the body
- $X = (x_1, ..., x_n)$ is a set of variables from the head
- $X \subseteq V$
- the variables from $X$ are called distinguished variables (also free or placeholder variables)
- $Y = V - X$ - variables from the body which aren't in the head
- $Y \subseteq V$
- variables from $Y$ are called existential variables (also bound variables)
- note that the head doesn't necessarily have to contain only variables, it might as well contain constants
- these constants will be returned in the results
$R(a_1, ..., a_m)$ is called an atom
- if an atom does not contain any variables, only constants, it's a fact
- so we can view a database as a set of facts
Alternative vector form:
- $Q( \vec{x} ) \leftarrow A_1(\vec{u}_1), \ ..., \ A_k(\vec{u}_k)$
Example
Suppose we have the following database
[math]R \\
\begin{array}{| c c |}
\hline
1 & 2 \\
2 & 3 \\
2 & 5 \\
6 & 7 \\
7 & 5 \\
5 & 5 \\
\hline
\end{array}[/math][math]S \\
\begin{array}{| c |}
\hline
2 \\
7 \\
\hline
\end{array}[/math]
the query $Q(x, y) \leftarrow R(x, y), R(y, 5), S(y)$ wants to retrieve all pairs $(x, y)$ s.t.
- $(x, y)$ occurs in $R$
- $y$ then occurs in $R$ together with 5
- and that $y$ occurs as a value in $S$
Substitution
def: a substitution $f$ of $Q$ into database $D$ is
- a function that maps all variables from $Q$ to constants from $D$
def: a substitution $f$ of $Q$ into a database $D$ is a matching if
- $f(\text{body}) \subseteq D$, i.e. if we evaluate $f$ on the body of the query $Q$, we will get a subset of $D$
- so when we apply $f(\text{head})$ we get a tuple that is a part of the result of evaluating $Q$ on $D$
Then the result of a conjunctive query can be shown as:
- $Q(D) \leftarrow \{ f(\text{head}) | f \text{ is a matching of $Q$ to $D$ } \}$
Let's consider the previous example: $Q(x, y) \leftarrow \underbrace{R(x, y)}_{(1)}, \underbrace{R(y, 5)}_{(2)}, \underbrace{S(y)}_{(3)}$ for the same database
Example with matching
- $(1)$ gets a tuple form $R$
- say it's (1, 2)
- so it assigns $x \mapsto 1, y \mapsto 2$
- so our substitution so far is $f: x \mapsto 1, y \mapsto 2$
- $(2)$ looks for a tuple in $R$ with $y = 2$
- because we established the matching $f$ with $y \mapsto 2$
- second value of this tuple must be a constant 5
- we find such tuple: it's (2, 5)
- nothing gets assigned at this step
- $(3)$ looks for a value in $S$ with $y = 2$
- same reason: we have $y \mapsto 2$
- it finds a fact $S(2)$ in the database
- so it returns a matching pair (1, 2)
- we say there is a matching (1, 2) in the database $D$
Example without matching
- $(1)$ we try another tuple from $R$, say (7, 5)
- we assign $x \mapsto 7, y \mapsto 5$
- $(2)$ now we try to find a tuple ($y$, 5) = (5, 5) in $R$
- therefor (7, 5) is not a matching and will not be returned by $Q$
So this way we try all values of our database table and return only matching ones
- for this particular example the query returns two tuples: (1, 2) and (6, 7)
- i.e. $Q(D) = \{ (1, 2), (6, 7) \}$
Translation
First Order Logic
- CQs can be translated to First Order Logic expressions
- for query $q(x_1, ..., x_n) = A_1(...), \ ..., \ A_n(...)$
- FOL expressions is $\{ \, x_1, ..., x_n \ \mid \ \exists \ y_1, ..., y_m : A_1(...) \ \land \ ... \ \land \ A_n(...) \, \}$
- $x_1, ..., x_n$ - distinguished variables, and $y_1, ..., y_m$ are existential
Relational Algebra
Properties
CQs have an interesting property
- The query containment problem is undecidable for SQL and Relational Algebra (see Logical Query Plan Optimization), but it is decidable for Conjunctive Queries
- The decidability of containment is NP-complete problem, but usually CQs are not big, so it is acceptable
This makes CQs very suitable for Logical Query Plan Optimization, namely, for removing redundant joins
Containment and Equivalence
Containment
$Q_1$ is contained in $Q_2$ (denoted as $Q_1 \subseteq Q_2$) if
- for any database $D$ holds $Q_1(D) \subseteq Q_2(D)$
- i.e. the result of $Q_1$ is always a subset of the result of $Q_2$, no matter what database $D$ they are evaluated on
$Q_1$ is equivalent to $Q_2$ (denoted as $Q_1 \equiv Q_2$)
- iff $Q_1 \subseteq Q_2 \land Q_2 \subseteq Q_1$
Example
- $A(x, y) \leftarrow R(x, w), G(w, z), R(z, y)$
- $B(x, w) \leftarrow R(x, w), G(w, w), R(w, y)$
- $B \subseteq A$
To show that we use the definition
- let $D$ be an arbitrary DB and
- let $t \in B(D)$ (one arbitrary result of $B$ evaluated on $D$)
- there exists a matching $f$ of $B$ into $D$ s.t. $t = (f(x), f(y))$ (by definition of matching)
- so we need to show that $t = (f(x), f(y)) \in A(D)$
- let $h$ be a substitution:
- $h: x \mapsto f(x), y \mapsto f(y), w \mapsto f(w), z \mapsto f(w)$
- then $h$ is a matching of $A$ into $D$
- $t = (f(x), f(y)) = (h(x), h(y)) \in A$
It is also possible to show that containment is decidable.
Homomorphism
def: a homomorphism of $Q_2$ to $Q_1$ is
- a function $h$ that maps each variable in $Q_2$ to either
- a variable from $Q_1$ or
- a constant form $Q_1$
- s.t.
- $h(\text{head}_2) = \text{head}_1$
- $h(\text{body}_2) \subseteq \text{body}_1$
- i.e. the values returned by queries are always the same, and body of one query is a subset of the other's query body
Examples
Containment Theorem
Thm: $Q_1 \subset Q_2 \iff $ there's a homomorphism from $Q_2$ to $Q_1$
Proof of $\Leftarrow$ (if)
- let $h: Q_2 \to Q_1$ be a homomorphism
- let $D$ be a database
- if we fix an arbitrary tuple $t \in Q_1(D)$, we need to prove that $t \in Q_2(D)$
- since $t \in Q_1(D)$ we know that
- $t = f(\text{head}_1)$
- with a matching $f$ of $Q_1$ into $D$
- (by semantics of CQs)
- let's consider a composition $f \circ h$ (composition of $f$ with the homomorphism $h$)
- this is a substitution of $Q_2$ into $D$! (we fist applied the homomorphism and then the matching - and got the substitution)
- since $h$ is a homomorphism, $h(\text{body}_2) \subseteq \text{body}_2$ (by def of homomorphism)
- $\to$ $f\big(h(\text{body}_2)\big) \subseteq f(\text{body}_1) \subseteq D$
- in other words, $f \circ h$ is a matching of $Q_2$ into $D$
- hence
- $f\big(h(\text{body}_2) \big) \in Q_2(D)$
- and finally $t = f(\text{head}_1) = f\big(h(\text{head}_2) \big) \in Q_2(D)$
Proof of $\Rightarrow$ (only if)
- suppose that $Q_1 \subseteq Q_2$
- let's consider variables in $Q_1$ as constants
- we can view $\text{body}_1$ as a mini-database $D_{Q_1}$
- (this database is called a canonical database of $Q_1$)
- the identity function is a matching of $Q_1$ to $D_{Q_1}$
- hence $\text{head}_1 \in Q_1(D_{Q_1})$
- since $Q_1 \subseteq Q_2$ we know that $\text{head}_1 \in Q_2(D_{Q_1})$
- by construction of our database $D_{Q_1}$
- so there exists a matching $f$ from $Q_2$ to $D_{Q_1}$ s.t.
- $\text{head}_1 = f(\text{head}_2)$ and
- $f(\text{body}_2) \subseteq D_{Q_1} = \text{body}_1$
- $f$ is a homomorphism of $Q_2$ to $Q_1$ by the definition (by considering variables again as variables)
$\blacksquare$
From the second part of the proof we may get a way of checking for containment: the Golden Method
The Golden Method
To decide whether $Q_1 \subseteq Q_2$
- evaluate $Q_2$ on a canonical database $D_{Q_1}$ (which is a body of $Q_1$, see #Containment Theorem)
- check if the head of $Q_1$ is in the results
QueryContainment($Q_1$, $Q_2$)
- input
- $Q_1(\vec{x}) \leftarrow g_1(\vec{x}_1), ..., g_n(\vec{x}_n)$
- $Q_2(\vec{y}) \leftarrow h_1(\vec{y}_1), ..., h_m(\vec{y}_m)$
- freeze $Q_1$: construct a canonical database $D_{Q_1} \equiv \{ g_i \big( v( \vec{x}_i ) \big) \}$
- with $v$ being a matching
- if $v(\vec{x}) \in Q_2(D_{Q_1})$ return yes, otherwise no
Exercise
Sources