External Merge Sort
Full name: External Memory Multi-Way Merge Sort
Algorithm
This is a multi-way algorithms. That is, the sorting of relation $R$ is done in several passes
- First Pass - in memory sorting
- Second and onwards - merging
Notation:
- $R$ relation we want to sort
- $M$ - max # of blocks we can use in memory
First Pass
1-st pass:
- read $M$ blocks
- sort all elements in memory with any sorting algorithm (say, Merge Sort or Quick Sort)
- write sorted results back to disk
- repeat for remaining blocks of $R$ until all are processed
After the first pass we have $\left\lceil \cfrac{B(R)}{M} \right\rceil $ sorted sub-results
Second Pass
- Since we have $M$ available buffers, we can simultaneously process at most $M$ sorted sub-results from the previous pass
- So we divide all sub-results into categories with $M$ sub-results in each
- Since we have $M$ sub-results and $M$ blocks in each, each result of this pass should have $M \times M$ blocks
- After this pass we will have $\left \lceil \cfrac{B(R) / M}{M} \right\rceil = \left \lceil \cfrac{B(R)}{M^2} \right\rceil$ sorted subresults
2-nd pass:
- merge the first $M$ sublists to a single sublist of $M^2$ blocks
- by synchronous iteration
Synchronous Iteration
This is essentially the merging phase of Merge Sort
Algorithm:
- load block of $R$ to $N_R$, block of $S$ to $N_S$
- iterate over tuples $t_R \in N_R$ and $t_S \in N_S$ synchronously
- if $t_R \geqslant t_S$
- output $t_R$
- move $t_R$ pointer to the next tuple in $R$ (load next block if needed)
- if $t_R > t_S$
- output $t_S$
- move $t_S$ pointer to the next tuple in $S$ (load next block if needed)
Third and Onwards Passes
- if needed - repeat the second pass again
- until everything is sorted
Cost
- at each pass we read and write the entire relation once
- there are $P = \lceil \log_M B(R) \rceil$ passes
- so the total cost is $\underbrace{2 \times B(R)}_\text{1 read + 1 write} \times P = 2 \times B(R) \times \lceil \log_M B(R) \rceil$
See also
Sources