# ML Wiki

## Document Classification

Document/Text Classification/Categorization is an NLP/Text Mining task of labeling unseen documents with categories from some predefined set

### Formulation

Document Classification, most general formulation:

• a task of assigning a boolean value to each pair $\langle d_j, c_i \rangle \in D \times C$
• $D$ - domain of documents
• $C = \{ c_1, \ ... \ , c_K \}$ - predefined categories
• we assign TRUE to $\langle d_j, c_i \rangle$ if document $d_j$ belongs to category $c_i$
• the task is to approximate the unknown target function $\Phi: D \times C \to \{ \text{T}, \text{F} \}$

Types

• non overlapping categories, i.e. can assign only one label to each document.
• the unknown function becomes $\Phi: D \to C$
• overlapping - can assign several labels

Hard categorization vs Ranking

• instead of assigning true/false we can rank each category

## Features

### Vector Space Model

How do we come up with features?

• text cannot be fed to the classifier directly
• so need to do some indexing to map a document $d_j$ to some representation
• for example, to Vector Space Model: represent document as a vector of term weights - "features"
• $d_j = \langle w_{1j}, \ ... \ , w_{nj} \rangle$, with $w_{ij}$ telling how much term $t_i$ contributes to the semantics of document $d_j$

### TF-IDF

Weights are often term frequencies or TF-IDF:

• $\text{tf-idf}(t_k, d_j) = \text{tf}(t_k, d_j) \cdot \log \cfrac{N}{\text{df}(t_k)}$
• $\text{tf}(t_k, d_j)$ term frequency: how many times term $t_k$ appeared in document $d_j$
• $\text{df}(t_k)$ document frequency: how many documents in the training set contain term $t_k$
• $N$ number of documents in the training set

Why TF-IDF is good for classification:

• the more often a term occurs in a document, the more representative it is of this document
• the more documents contain a term, the lest discriminating it becomes

### Dimensionality Reduction

• Given a vocabulary/term set $V$ of size $| V |$
• the goal is to find $V'$ s.t. $| V' | \ll | V |$ ($V'$ is called "reduced vocabulary" or "reduced term set")
• DR techniques tend to reduce Overfitting:
• if dimensionality of data is $|V'|$ and there are $N$ examples in the training set
• then it's good to have $|V'| \approx N$ to avoid overfitting

We can divide dimensionality reduction techniques by locality:

• local dimensionality reduction
• applied to each category $c_i$
• choose a reduced set $| V'_i | \ll | V_i |$ for each category
• global: choose $| V' |$ using all categories

There are two (very different) types of dimensionality reduction

• by term selection (Feature Selection): select $V' \subset V$
• by term extraction: terms in $V'$ are not necessarily the same as in $V$

Usual IR and indexing techniques for reducing dimensionality are

Stop Words Removal

• Before indexing some function words are sometimes removed
• for example Stop Words - topic neutral words such as articles, prepositions, conjunctions
• cases when stop words are not removed: author identification ("the little words give authors away")
• Stemming is grouping words that share the same morphological root
• it's controversial whether it's helpful for document classification or not
• usually it's used: it reduces the dimensionality
• sometimes Lemmatization is applied instead, but it's more involved

Note that DR techniques sometimes may remove important information when removing terms

How to select terms?

• Subset Selection: usually not used because $| V |$ is too large
• Feature Filtering: rank terms according to their "usefulness" and keep only some of them
• Document Frequency: Keep only terms that occur in higher number of documents
• e.g. remove words that occur only in 3 documents or less

Term Extraction techniques:

• these techniques create "artificial" terms that aren't really terms - they are generated, and not the ones that actually occurred in the text
• The original terms don't have the optimal dimensionality for document content representation
• because of the problems of polysemy, homonymy and synonymy
• so we want to find better representation that doesn't suffer from these issues
• methods:
• Term Clustering cluster terms and use centroids instead of words
• Latent Semantic Analysis apply SVD to Term-Document matrix

## Classification

Good classifiers for text:

## Evaluation

Precision and Recall metrics can be extended to Evaluation of Multiclass Classifiers

• similar to the One-vs-All Classification technique
• ways of averaging the results:
• micro: first calculate TP, FP, FN, FN for each category separately, and then use usual formulas for precision and recall
• and macro averaging: calculate precision and recall for each category separately, and then average
• usually is a way of combining precision and recall
• depending on how P and R were calculated, there are $F_\beta$-micro and $F_\beta$-macro measures

## Sources

• Sebastiani, Fabrizio. "Machine learning in automated text categorization." (2002). [1]