Document Classification
Document/Text Classification/Categorization is an NLP/Text Mining task of labeling unseen documents with categories from some predefined set
Formulation
Document Classification, most general formulation:
- a task of assigning a boolean value to each pair $\langle d_j, c_i \rangle \in D \times C$
- $D$ - domain of documents
- $C = \{ c_1, \ ... \ , c_K \}$ - predefined categories
- we assign TRUE to $\langle d_j, c_i \rangle$ if document $d_j$ belongs to category $c_i$
- the task is to approximate the unknown target function $\Phi: D \times C \to \{ \text{T}, \text{F} \}$
Types
- non overlapping categories, i.e. can assign only one label to each document.
- the unknown function becomes $\Phi: D \to C$
- overlapping - can assign several labels
Hard categorization vs Ranking
- instead of assigning true/false we can rank each category
Features
How do we come up with features?
- text cannot be fed to the classifier directly
- so need to do some indexing to map a document $d_j$ to some representation
- for example, to Vector Space Model: represent document as a vector of term weights - "features"
- $d_j = \langle w_{1j}, \ ... \ , w_{nj} \rangle$, with $w_{ij}$ telling how much term $t_i$ contributes to the semantics of document $d_j$
Weights are often term frequencies or TF-IDF:
- $\text{tf-idf}(t_k, d_j) = \text{tf}(t_k, d_j) \cdot \log \cfrac{N}{\text{df}(t_k)}$
- $\text{tf}(t_k, d_j)$ term frequency: how many times term $t_k$ appeared in document $d_j$
- $\text{df}(t_k)$ document frequency: how many documents in the training set contain term $t_k$
- $N$ number of documents in the training set
Why TF-IDF is good for classification:
- the more often a term occurs in a document, the more representative it is of this document
- the more documents contain a term, the lest discriminating it becomes
- Given a vocabulary/term set $V$ of size $| V |$
- the goal is to find $V'$ s.t. $| V' | \ll | V |$ ($V'$ is called "reduced vocabulary" or "reduced term set")
- DR techniques tend to reduce Overfitting:
- if dimensionality of data is $|V'|$ and there are $N$ examples in the training set
- then it's good to have $|V'| \approx N$ to avoid overfitting
We can divide dimensionality reduction techniques by locality:
- local dimensionality reduction
- applied to each category $c_i$
- choose a reduced set $| V'_i | \ll | V_i |$ for each category
- global: choose $| V' |$ using all categories
There are two (very different) types of dimensionality reduction
- by term selection (Feature Selection): select $V' \subset V$
- by term extraction: terms in $V'$ are not necessarily the same as in $V$
Usual IR and indexing techniques for reducing dimensionality are
Stop Words Removal
- Before indexing some function words are sometimes removed
- for example Stop Words - topic neutral words such as articles, prepositions, conjunctions
- cases when stop words are not removed: author identification ("the little words give authors away")
Stemming
- Stemming is grouping words that share the same morphological root
- it's controversial whether it's helpful for document classification or not
- usually it's used: it reduces the dimensionality
- sometimes Lemmatization is applied instead, but it's more involved
Note that DR techniques sometimes may remove important information when removing terms
How to select terms?
- Subset Selection: usually not used because $| V |$ is too large
- Feature Filtering: rank terms according to their "usefulness" and keep only some of them
- Document Frequency: Keep only terms that occur in higher number of documents
- e.g. remove words that occur only in 3 documents or less
Term Extraction techniques:
- these techniques create "artificial" terms that aren't really terms - they are generated, and not the ones that actually occurred in the text
- The original terms don't have the optimal dimensionality for document content representation
- because of the problems of polysemy, homonymy and synonymy
- so we want to find better representation that doesn't suffer from these issues
- methods:
- Term Clustering cluster terms and use centroids instead of words
- Latent Semantic Analysis apply SVD to Term-Document matrix
Good classifiers for text:
Evaluation
Precision and Recall metrics can be extended to Evaluation of Multiclass Classifiers
- similar to the One-vs-All Classification technique
- ways of averaging the results:
- micro: first calculate TP, FP, FN, FN for each category separately, and then use usual formulas for precision and recall
- and macro averaging: calculate precision and recall for each category separately, and then average
F Measure
- usually is a way of combining precision and recall
- depending on how P and R were calculated, there are $F_\beta$-micro and $F_\beta$-macro measures
Sources
- Sebastiani, Fabrizio. "Machine learning in automated text categorization." (2002). [1]