Document Clustering
The goal of text clustering is
- to assign documents to different topics or topic hierarchies
- i.e. when the topics/hierarchies are not known in advance
- as opposed to Document Classification, where the labels are known
- It’s a Cluster Analysis task: Unsupervised Learning applied to textual data
Objects to be clustered are
- documents, paragraphs, sentences
- or terms (Term Clustering)
Applications:
- Cluster Analysis is also useful in Text Mining
- E.g. organizing documents for better Information Retrieval
- Organizing documents into hierarchical clusters Cutting1992
- see Anick1997, Cutting1993 (Scatter/Gather)
- Corpus Summarization
- Improving Document Classification - see Baker1998 and Bekkerman2001
Preprocessing
Document Representation
The most commonly used document representation is the Vector Space Model:
- extract a list of unique terms and weight them
- weighted with TF or TF-IDF
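A minimal sketch of this representation, assuming scikit-learn and a made-up toy corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# toy corpus; in practice each element would be a full document
docs = [
    "clustering assigns documents to topics",
    "classification assigns documents to known labels",
    "topic models are generative models for documents",
]

# each row of X is a TF-IDF-weighted vector over the unique terms of the corpus
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)   # sparse matrix of shape (n_docs, n_terms)
print(X.shape)
print(vectorizer.get_feature_names_out())
```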
Alternative representation:
- terms as a Probability Distribution: Language Models
- then we can measure (dis)similarity with a symmetric variation of KL Divergence
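A sketch of such a measure: the plain symmetrized sum $KL(p \| q) + KL(q \| p)$ over smoothed unigram distributions (the add-one smoothing is an assumption to avoid zero probabilities, not something these notes prescribe):

```python
import numpy as np

def symmetric_kl(p_counts, q_counts, alpha=1.0):
    """Symmetrized KL divergence between two smoothed term distributions."""
    p = (p_counts + alpha) / (p_counts + alpha).sum()
    q = (q_counts + alpha) / (q_counts + alpha).sum()
    return np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p))

# term-count vectors of two documents over the same vocabulary
d1 = np.array([3.0, 0.0, 1.0, 2.0])
d2 = np.array([1.0, 2.0, 0.0, 3.0])
print(symmetric_kl(d1, d2))   # 0 iff the two distributions are identical
```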
Feature Selection
The concepts of distance and similarity may not be meaningful in high-dimensional spaces
- so may need to reduce dimensionality
In Text Mining this is usually referred to as “Term Selection”:
- Remove stop words
- use document frequency to cut away very infrequent and very frequent words: such words usually don’t contribute much (or anything) to similarity computation (see the sketch after this list)
- Subset Selection and Feature Filtering won’t work because we don’t have labels
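A sketch of the frequency-based cutoff with scikit-learn; the thresholds here are illustrative, not recommended values:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat", "the dog sat", "the cat and the dog", "stock markets fell"]

# keep only terms that occur in at least 2 documents and in at most 80% of them;
# stop words are removed with the built-in English list
vectorizer = TfidfVectorizer(min_df=2, max_df=0.8, stop_words="english")
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())   # the surviving terms
```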
Dimensionality Reduction
- Term Clustering: find clusters of terms and replace the terms by their centroids (see the sketch after this list)
- PCA gives the basis for Latent Semantic Analysis
- Non-Negative Matrix Factorization
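A sketch of the term-clustering variant: cluster the columns (terms) of the document-term matrix and replace each term cluster by the mean of its columns; using K-Means on a dense random stand-in matrix is an assumption for brevity:

```python
import numpy as np
from sklearn.cluster import KMeans

def term_cluster_features(X, n_term_clusters, random_state=0):
    """Cluster terms (columns of X) and average each cluster's columns."""
    labels = KMeans(n_clusters=n_term_clusters, n_init=10,
                    random_state=random_state).fit_predict(X.T)
    return np.column_stack([X[:, labels == c].mean(axis=1)
                            for c in range(n_term_clusters)])

X = np.random.rand(10, 100)                # 10 docs, 100 terms (random stand-in)
print(term_cluster_features(X, 20).shape)  # (10, 20): one feature per term cluster
```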
Clustering
- Hierarchical Clustering: good for Document Clustering because it creates a tree structure (see the sketch after this list)
- Partitioning Clustering Algorithms
- Parametric Modeling Methods like Expectation Maximization
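A sketch of the hierarchical option with cosine distance, assuming scikit-learn >= 1.2 (where the distance parameter is called metric) and a dense document matrix:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.random.rand(20, 50)   # 20 documents, 50 terms (random stand-in)

# average-linkage agglomerative clustering builds the merge tree bottom-up;
# here the hierarchy is cut into 3 flat clusters
agg = AgglomerativeClustering(n_clusters=3, metric="cosine", linkage="average")
print(agg.fit_predict(X))
```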
Distances and Similarity
Popular choice:
- Euclidean Distance is not very good for high-dimensional data
- Jaccard Coefficient or Cosine Similarity are better
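A sketch of both measures on toy weighted term vectors (the Jaccard Coefficient here is computed on the sets of terms present in each document):

```python
import numpy as np

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def jaccard(a, b):
    """Jaccard coefficient on the sets of terms present in each document."""
    sa, sb = set(np.nonzero(a)[0]), set(np.nonzero(b)[0])
    return len(sa & sb) / len(sa | sb)

d1 = np.array([0.5, 0.0, 0.3, 0.8])
d2 = np.array([0.4, 0.2, 0.0, 0.9])
print(cosine_sim(d1, d2), jaccard(d1, d2))
```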
If not using the Vector Space Model:
- Language Models: a symmetric variant of KL Divergence
- Keep documents as strings: Edit Distance (but it’ll most likely be extremely slow)
Papers:
- Strehl2000: Survey on distances for documents
- Sahami2006: When text segments are too short (e.g. tweets or sentences)
Direct similarity measures are not always reliable for high-dimensional clustering (see Guha1999)
- high dimensional data is sparse and therefore on average similarity is low
- also see Curse of Dimensionality
- SNN Distance addresses this: Shared Nearest Neighbors Distance, the number of k nearest neighbors two documents share (as used in SNN Clustering)
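A minimal sketch of the SNN similarity, assuming scikit-learn for the k-NN search; k and the random stand-in data are arbitrary:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def snn_similarity(X, k=5):
    """Number of shared k nearest neighbors (cosine) for every document pair."""
    nn = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(X)
    # drop the first neighbor: each document is its own nearest neighbor
    neighbors = nn.kneighbors(X, return_distance=False)[:, 1:]
    sets = [set(row) for row in neighbors]
    n = len(sets)
    S = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            S[i, j] = S[j, i] = len(sets[i] & sets[j])
    return S

X = np.random.rand(20, 50)   # 20 documents, 50 terms (random stand-in)
print(snn_similarity(X))
```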
Algorithms
K-Means
- centroids = weighted average of all docs in the cluster
- to compare a document with a cluster, calculate the cosine similarity between the document and the cluster centroid
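A sketch of K-Means over TF-IDF vectors; scikit-learn's KMeans is Euclidean, but since TfidfVectorizer L2-normalizes rows by default, Euclidean distance on these vectors is monotonically related to cosine similarity (the cluster count is arbitrary):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "kmeans assigns each document to the nearest centroid",
    "centroids are recomputed as the mean of the assigned documents",
    "stock markets fell sharply today",
    "investors sold shares on the stock market",
]

X = TfidfVectorizer().fit_transform(docs)   # rows are L2-normalized
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)   # cluster id per document
```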
A variation of K-Means:
- Bisecting K-Means: gives good performance for document clustering (see the sketch after this list)
- K-Medoids for non-Euclidean distances, using a medoid ($\approx$ median) instead of the mean as the cluster center
- Scatter/Gather:
- smart seed selection
- centroid = concatenation of all docs in the cluster
- Split and Join refinement operations
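A sketch of Bisecting K-Means: repeatedly split a cluster with 2-means until the desired number of clusters is reached; picking the largest cluster to split is one common criterion (alternatives use within-cluster similarity):

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, n_clusters, random_state=0):
    """Grow a flat clustering by repeatedly bisecting the largest cluster."""
    clusters = [np.arange(X.shape[0])]
    while len(clusters) < n_clusters:
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        members = clusters.pop(idx)
        labels = KMeans(n_clusters=2, n_init=10,
                        random_state=random_state).fit_predict(X[members])
        clusters.append(members[labels == 0])
        clusters.append(members[labels == 1])
    assignment = np.empty(X.shape[0], dtype=int)
    for c, members in enumerate(clusters):
        assignment[members] = c
    return assignment

X = np.random.rand(30, 40)   # 30 documents, 40 terms (random stand-in)
print(bisecting_kmeans(X, 4))
```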
Two-Phase Document Clustering
Main idea:
- use Mutual Information to find the best term clustering
- then use Mutual Information to find the best document clustering
Co-Clustering
Clustering terms and documents at the same time
- clustering of terms and clustering of documents are dual problems
- take advantage of that
- can also use Non-Negative Matrix Factorization $A \approx UV^T$, where the rows of $U$ give clusters of docs and the rows of $V$ give clusters of terms
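A sketch of the NMF route on a random stand-in matrix; taking the argmax of each row of $U$ and $V$ is one simple way to harden the soft memberships into clusters:

```python
import numpy as np
from sklearn.decomposition import NMF

A = np.random.rand(10, 30)          # docs x terms, non-negative (random stand-in)
nmf = NMF(n_components=3, random_state=0, max_iter=500)
U = nmf.fit_transform(A)            # docs x k
V = nmf.components_.T               # terms x k

doc_clusters = U.argmax(axis=1)     # hard assignment from soft memberships
term_clusters = V.argmax(axis=1)
print(doc_clusters, term_clusters)
```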
Latent Semantic Analysis
Use PCA to define new features from terms
- it creates a new semantic space where problems like synonymy or polysemy are alleviated
- term-document matrix is decomposed using SVD
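A sketch of LSA with truncated SVD, which works directly on the sparse TF-IDF matrix (the toy corpus and the number of components are illustrative):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "car engine repair",
    "automobile motor maintenance",
    "python programming language",
    "software development in python",
]

X = TfidfVectorizer().fit_transform(docs)
lsa = TruncatedSVD(n_components=2, random_state=0)
Z = lsa.fit_transform(X)   # documents embedded in the latent semantic space
print(Z.round(2))
```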
SVD is not the only option:
- can also use Non-Negative Matrix Factorization techniques
- this way the factors are easy to interpret and the clusters can be fuzzy
Topic Models
- define some probabilistic generative models for text documents
- in some way it’s similar to LSA, but it’s probabilistic
- see Probabilistic LSA or Latent Dirichlet Allocation
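A sketch with LDA from scikit-learn; note that LDA expects raw term counts rather than TF-IDF weights (the corpus and topic count are toy choices):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets fell sharply today",
    "investors sold shares on the market",
]

counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)   # per-document topic distribution
print(doc_topics.round(2))
```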
Semi-Supervised Clustering
Use prior knowledge to help clustering
- e.g. if you know some of the labels, do better seed selection for K-Means
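A minimal sketch of such seeding, assuming scikit-learn and a dense matrix: the initial centroid of each cluster is the mean of the documents whose labels are known:

```python
import numpy as np
from sklearn.cluster import KMeans

def seeded_kmeans(X, known_idx, known_labels, k, random_state=0):
    """K-Means whose initial centroids come from the labeled documents."""
    known_idx = np.asarray(known_idx)
    known_labels = np.asarray(known_labels)
    init = np.vstack([X[known_idx[known_labels == c]].mean(axis=0)
                      for c in range(k)])
    km = KMeans(n_clusters=k, init=init, n_init=1, random_state=random_state)
    return km.fit_predict(X)

X = np.random.rand(50, 20)   # 50 documents, 20 terms (random stand-in)
print(seeded_kmeans(X, known_idx=[0, 1, 2, 3], known_labels=[0, 0, 1, 1], k=2))
```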
Performance
Issues
- Text data usually has very high dimensionality
- this is especially important for large corpora: clustering will be very slow, especially for hierarchical algorithms
Inverted Index
Idea:
- usually a document contains only a small portion of terms
- so document vectors are very sparse
- the typical distance is Cosine Similarity, which ignores zeros: for the cosine to be non-zero, two docs need to share at least one term
- the transpose $D^T$ of the document-term matrix $D$ acts as an inverted index
Thus, to find docs similar to $d$:
- for each $w_i \in d$
- let $D_i = \text{index}[w_i] \setminus \{ d \}$ be the set of documents other than $d$ that also contain $w_i$
- then take the union of all $D_i$
- calculate similarity only with documents from this union
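A sketch of this candidate generation with plain dictionaries (documents are represented as lists of terms):

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, terms in enumerate(docs):
        for term in set(terms):
            index[term].add(doc_id)
    return index

def candidates(index, doc_id, terms):
    """Union of the posting lists of doc_id's terms, minus doc_id itself."""
    result = set()
    for term in set(terms):
        result |= index[term]
    result.discard(doc_id)
    return result

docs = [["text", "clustering"], ["text", "mining"], ["image", "search"]]
index = build_index(docs)
print(candidates(index, 0, docs[0]))   # {1}: only doc 1 shares a term with doc 0
```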
Term Selection
Idea:
- only the top-weighted terms contribute substantially to the norm of a document vector
- so keep only those weights that contribute 90% of the norm
- and set the rest to 0
This can reduce the number of documents to consider without losing much information
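A sketch of the truncation; reading “contribute 90% of the norm” as the squared L2 norm is my assumption:

```python
import numpy as np

def truncate_vector(v, fraction=0.9):
    """Zero out the smallest weights, keeping `fraction` of the squared norm."""
    order = np.argsort(v ** 2)[::-1]   # term indices by decreasing weight
    cum = np.cumsum(v[order] ** 2)
    keep = order[:np.searchsorted(cum, fraction * cum[-1]) + 1]
    out = np.zeros_like(v)
    out[keep] = v[keep]
    return out

v = np.array([0.7, 0.1, 0.05, 0.6, 0.02])
print(truncate_vector(v))   # only the two largest weights survive here
```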
References
- Anick, Peter G., and Shivakumar Vaithyanathan. “Exploiting clustering and phrases for context-based information retrieval.” 1997.
- Cutting, Douglass R., et al. “Scatter/gather: A cluster-based approach to browsing large document collections.” 1992. [http://courses.washington.edu/info320/au11/readings/Week4.Cutting.et.al.1992.Scatter-Gather.pdf]
- Cutting, Douglass R., David R. Karger, and Jan O. Pedersen. “Constant interaction-time scatter/gather browsing of very large document collections.” 1993.
- Baker, L. Douglas, and Andrew Kachites McCallum. “Distributional clustering of words for text classification.” 1998. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.42.7488&rep=rep1&type=pdf]
- Bekkerman, Ron, et al. “On feature distributional clustering for text categorization.” 2001. [http://scholar.google.com/scholar?cluster=1445350076343942781&hl=ru&as_sdt=0,5]
- Sahami, Mehran, and Timothy D. Heilman. “A web-based kernel function for measuring the similarity of short text snippets.” 2006. [http://research.google.com/intl/en/pubs/archive/32.pdf]
- Strehl, Alexander, Joydeep Ghosh, and Raymond Mooney. “Impact of similarity measures on web-page clustering.” 2000. [http://strehl.com/download/strehl-aaai00.pdf]
- Guha, Sudipto, Rajeev Rastogi, and Kyuseok Shim. “ROCK: A robust clustering algorithm for categorical attributes.” 1999. [http://www.cacs.louisiana.edu/~jyoon/grad/adb/References/clustering/ROCK-clus99icde.pdf]
Sources
- Steinbach, Michael, George Karypis, and Vipin Kumar. “A comparison of document clustering techniques.” 2000. [https://wwws.cs.umn.edu/tech_reports_upload/tr2000/00-034.ps]
- Larsen, Bjornar, and Chinatsu Aone. “Fast and effective text mining using linear-time document clustering.” 1999. [http://comminfo.rutgers.edu/~muresan/IR/Docs/Articles/sigkddLarsen1999.pdf]
- Aggarwal, Charu C., and ChengXiang Zhai. “A survey of text clustering algorithms.” Mining Text Data. Springer US, 2012. [http://ir.nmu.org.ua/bitstream/handle/123456789/144935/d1784ebed3eab2708026b202b2b65309.pdf?sequence=1#page=90]
- Jing, Liping. “Survey of text clustering.” (2008). [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.112.3476&rep=rep1&type=pdf]
- Ertöz, Levent, Michael Steinbach, and Vipin Kumar. “Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data.” 2003. [http://static.msi.umn.edu/rreports/2003/73.pdf]