$\require{color}$
Sequence Mining
Data Model
- similar to Local Pattern Discovery
- item - binary-valued attribute (either present - 1, or not present - 0)
- itemset - lexicographically sorted subset of all items
- sequence - an ordered list of itemsets (i.e. transactions)
- they may be ordered by date/time, location, price, etc
Sequence Data
Examples
Supermarket
- a supermarket: each product is an item
- A receipt - an itemset (don’t consider the quantity)
- for a certain customer, all his receipt ordered by date form a sequence of transactions
Web:
- each user session is a sequence
- each click is an itemset
- items: remote hosts, user names, date, URLs, etc
NLP:
- text - sequence with Part of Speech Tagging
- sentence - itemset
- words with POS-tags - items
Notation
- items: lowercase letters $a,b,c,…$
- itemsets: uppercase letters $I=(abc), I’=(abc), I_1=(abc)$
- sequence:
- $s = \langle I_1 \ I_2 \ … \ I_k \rangle$ or
- $s = \langle (ab)(c)(bdc) \rangle \equiv \langle (ab)c(bdc) \rangle$
- note that $\langle abc \rangle \equiv \langle (a)(b)(c) \rangle \not \equiv \langle (abc) \rangle$
Subsequences
Sequence containment:
- $s$ is contained in $s’$ ($s \sqsubseteq s’$) $\iff$
- $\forall I \in s$ (in order) $\exists I’ \in s’$ (in order) s.t. $I \subseteq I’$
-
order is important - $s’$ is a super sequence for $s$, or $s’$ “supports” $s$
Example:
- $\langle (a)(bc)(d)(c) \rangle \sqsubseteq \langle (a)(abc)(ac)(d)(cf) \rangle$
- $\langle (a)(b)(cd)(c) \rangle \not \sqsubseteq \langle (a)(abc)(ac)(d)(cf) \rangle$
Sequential Pattern Mining
Like in Local Pattern Discovery, we have the notion of Support
- the support of sequence $s$ w.r.t to dataset $D$ is the # of sequenced in $D$ that support $s$
- $\text{supp}(s, D) = \big| { s’ \in D \ : \ s \sqsubseteq s’ } \big|$ | Frequent patterns:
- a sequence $s$ is frequent if $\text{supp}(s, D) \geqslant \theta$
- where $\theta$ is the desired minimal support (parameter)
- A frequent (sub)sequence is called a ‘‘sequential pattern’’
Sequential Pattern Mining
- given a sequence database $D$, find the complete set of all frequent subsequences
Example
Given:
- Support threshold $\theta = 3$
- Database $D$
SID | Sequence | 10 | $\langle (a)(abc)(ac)(d)(af) \rangle$ | 20 | $\langle (ad)(c)(bc)(ae) \rangle$ | 30 | $\langle (ef)(ab)(df)(c)(b) \rangle$ | 40 | $\langle (ae)(af)(c)(b)(cf) \rangle$ | 50 | $\langle (abd)(af)(c)(b)(ad) \rangle$ |
Which of the following patterns are frequent?
- $\langle (a) \rangle$
- $\langle (a)(bc) \rangle$
- $\langle (a)(b)(c) \rangle$
- $\langle (a)(b)(c)(d) \rangle$
Pattern | Frequency | Table | $\langle (a) \rangle$ | 5 | $\langle (\colorbox{red}{a})(abc)(ac)(d)(af) \rangle$ $\langle (\colorbox{red}{a}d)(c)(bc)(ae) \rangle$ $\langle (ef)(\colorbox{red}{a}b)(df)(c)(b) \rangle$ $\langle (\colorbox{red}{a}e)(af)(c)(b)(cf) \rangle$ $\langle (\colorbox{red}{a}bd)(af)(c)(b)(ad) \rangle$ |
$\langle (a)(bc) \rangle$ | 2 | $\langle (\colorbox{red}{a})(\colorbox{red}{ab}c)(ac)(d)(af) \rangle$ $\langle (\colorbox{red}{a}d)(c)(\colorbox{red}{bc})(ae) \rangle$ $\langle (ef)(ab)(df)(c)(b) \rangle$ $\langle (ae)(af)(c)(b)(cf) \rangle$ $\langle (abd)(af)(c)(b)(ad) \rangle$ |
$\langle (a)(b)(c) \rangle$ | 2 | $\langle (\colorbox{red}{a})(a\colorbox{red}{b}c)(a\colorbox{red}{c})(d)(af) \rangle$ $\langle (ad)(c)(bc)(ae) \rangle$ $\langle (ef)(ab)(df)(c)(b) \rangle$ $\langle (\colorbox{red}{a}e)(af)(c)(\colorbox{red}{b})(\colorbox{red}{c}f) \rangle$ $\langle (abd)(af)(c)(b)(ad) \rangle$ |
$\langle (a)(b)(c)(d) \rangle$ | 1 | $\langle (\colorbox{red}{a})(a\colorbox{red}{b}c)(a\colorbox{red}{c})(\colorbox{red}{d})(af) \rangle$ $\langle (ad)(c)(bc)(ae) \rangle$ $\langle (ef)(ab)(df)(c)(b) \rangle$ $\langle (ae)(af)(c)(b)(cf) \rangle$ $\langle (abd)(af)(c)(b)(ad) \rangle$ |
Downwards Closure
Frequency is an anti-monotonic property of a Lattice
- it’s sometimes called “Apriori Property”
- it forms the down-wards closure
Downwards Closure property
- if a sequence is frequent, then all its subsequences are frequent
E.g.
- if $\langle (a)(bc) \rangle$ is frequent, then
- $\langle (a) \rangle$, $\langle (b) \rangle$, $\langle (c) \rangle$, $\langle (bc) \rangle$, $\langle (a)(b) \rangle$, and $\langle (a)(c) \rangle$ are also frequent
Sequential Apriori Approach
Can adapt Apriori to mining sequential patterns:
- R. Agrawal and R. Srikant. Mining sequential patterns. 1995.
Algorithm
Sequential Apriori
- at level 1
- generate length-1 sequences: ones that contain only 1 itemset of 1 item
- prune non-frequent
- at level $i$
- generate length-$i$ candidate sequences from frequent length-$(i-1)$ sequences
- prune non-frequent
Example
Given
- items $a, b, c, d, e, f, g, h$
- the following database $D$
\langle (a) \rangle, \langle (b) \rangle, \langle (c) \rangle, \langle (d) \rangle, \langle (e) \rangle, \langle (f) \rangle, \langle (g) \rangle, \langle (h) \rangle
SID | Sequence | 10 | $\langle (bd)(c)(b)(ac) \rangle$ | 20 | $\langle (bf)(ce)(b)(fg) \rangle$ | 30 | $\langle (ah)(bf)(a)(b)(f) \rangle$ | 40 | $\langle (be)(ce)(d) \rangle$ | 50 | $\langle (a)(bd)(b)(c)(b)(ade) \rangle$ |
’'’Step 1:’’’
- generate length-1 candidates
- $\langle (a) \rangle, \langle (b) \rangle, \langle (c) \rangle, \langle (d) \rangle, \langle (e) \rangle, \langle (f) \rangle, \langle (g) \rangle, \langle (h) \rangle$
- calculate frequency
- $\text{freq}\big(\langle (a) \rangle \big) = 3$
- $\text{freq}\big(\langle (b) \rangle \big) = 5$
- $\text{freq}\big(\langle (c) \rangle \big) = 4$
- $\text{freq}\big(\langle (d) \rangle \big) = 3$
- $\text{freq}\big(\langle (e) \rangle \big) = 3$
- $\text{freq}\big(\langle (f) \rangle \big) = 2$
- $\text{freq}\big(\langle (g) \rangle \big) = {\color{red}{1}}$
- $\text{freq}\big(\langle (h) \rangle \big) = {\color{red}{1}}$
- prune non-frequent: $\langle (g) \rangle$ and $\langle (h) \rangle$
’'’Step 2’’’ Now need to generate length-2 candidates
- two ways
- adding an item to a new itemset, i.e. $\langle (a) \rangle$ becomes $\langle (a)(b) \rangle$ when we add $b$
- adding to last itemset: $\langle (a) \rangle$ becomes $\langle (ab) \rangle$
So we have the following candidates:
| + Way 1 || | $\langle (a) \rangle$ | $\langle (b) \rangle$ | $\langle (c) \rangle$ | $\langle (d) \rangle$ | $\langle (e) \rangle$ | $\langle (f) \rangle$ | $\langle (a) \rangle$ | $\langle (a)(a) \rangle$ | $\langle (a)(b) \rangle$ | $\langle (a)(c) \rangle$ | $\langle (a)(d) \rangle$ | $\langle (a)(e) \rangle$ | $\langle (a)(f) \rangle$ || $\langle (b) \rangle$ | $\langle (b)(a) \rangle$ | $\langle (b)(b) \rangle$ | $\langle (b)(c) \rangle$ | $\langle (b)(d) \rangle$ | $\langle (b)(e) \rangle$ | $\langle (b)(f) \rangle$ || $\langle (c) \rangle$ | $\langle (c)(a) \rangle$ | $\langle (c)(b) \rangle$ | $\langle (c)(c) \rangle$ | $\langle (c)(d) \rangle$ | $\langle (c)(e) \rangle$ | $\langle (c)(f) \rangle$ || $\langle (d) \rangle$ | $\langle (d)(a) \rangle$ | $\langle (d)(b) \rangle$ | $\langle (d)(c) \rangle$ | $\langle (d)(d) \rangle$ | $\langle (d)(e) \rangle$ | $\langle (d)(f) \rangle$ || $\langle (e) \rangle$ | $\langle (e)(a) \rangle$ | $\langle (e)(b) \rangle$ | $\langle (e)(c) \rangle$ | $\langle (e)(d) \rangle$ | $\langle (e)(e) \rangle$ | $\langle (e)(f) \rangle$ || $\langle (f) \rangle$ | $\langle (f)(a) \rangle$ | $\langle (f)(b) \rangle$ | $\langle (f)(c) \rangle$ | $\langle (f)(d) \rangle$ | $\langle (f)(e) \rangle$ | $\langle (f)(f) \rangle$ | | | + Way 2 || | $\langle (a) \rangle$ | $\langle (b) \rangle$ | $\langle (c) \rangle$ | $\langle (d) \rangle$ | $\langle (e) \rangle$ | $\langle (f) \rangle$ | $\langle (a) \rangle$ | | $\langle (ab) \rangle$ | $\langle (ac) \rangle$ | $\langle (ad) \rangle$ | $\langle (ae) \rangle$ | $\langle (af) \rangle$ || $\langle (b) \rangle$ | | | $\langle (bc) \rangle$ | $\langle (bd) \rangle$ | $\langle (be) \rangle$ | $\langle (bf) \rangle$ || $\langle (c) \rangle$ | | | | $\langle (cd) \rangle$ | $\langle (ce) \rangle$ | $\langle (cf) \rangle$ || $\langle (d) \rangle$ | | | | | $\langle (de) \rangle$ | $\langle (df) \rangle$ || $\langle (e) \rangle$ | | | | | | $\langle (ef) \rangle$ || $\langle (f) \rangle$ | | | | | | | |
Now we prune infrequent
And so on…
Drawbacks
- Exponential growth in # of combinations
- Computationally expensive
PrefixSpan
PrefixSpan: A prefix-projected pattern growth method
- Pei et al. PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth, 2001.
- doesn’t generate non-existing patterns (like Apriori)
Idea:
- use database $D$ prefix-projected on some sequence $s$ (it’s called prefix-projected database)
- grow frequent subsequences from them - without generating candidates
- examine only prefix subsequences