Suppose we have a relation $R(A, B, C, D)$, each tuple - 32 bytes

- $128 \times 10^6$ tuples is $R$
- Block size is $B$ = 4kb
- I.e. can store 128 tuples per block, $10^6$ tuples in total

Suppose we want to find a tuple with $C = 10$

Searching in I/O Model of Computation

- for each block $X \in R$
- load $X$
- check if there's a tuple $T \in X$ with $C = 10$
- yes - output $T$
- no - continue
- release $X$ from memory

In worst case it's $10^6$ I/Os

- suppose $10^{-3}$ seconds per I/O operation
- 16.6 minutes in total

Can we do better?

An *index* is any secondary memory data structure that

- takes a search key as input
- efficiently returns the collection of matching records

- Dense Index
- Sparse Index
- Secondary Index as a combination of both

Index can be clustered or unclustered

- When index is clustered it means the records themselves are stored in index, not pointers
- I.e. a clustered index ensures that all data is stored in some order
- Usually there is only one clustered index per relation (otherwise the data will be duplicated)

If an index (say, B-Tree) is not clustered, then instead of following each pointer other techniques can be used, such as Bitmap Heap Scan

Indexing can also be applied to unstructured data such as text

- Inverted Index builds an index from words to documents where these words are contained
- Locality Sensitive Hashing gives an approximate answer to KNN queries

- Database Systems Architecture (ULB)
- Database Systems: The Complete Book (2nd edition) by H. Garcia-Molina, J. D. Ullman, and J. Widom