Document Classification

Document/Text Classification (Categorization) is an NLP/Text Mining task of labeling unseen documents with categories from a predefined set


Formulation

Document Classification, most general formulation:

  • a task of assigning a boolean value to each pair $\langle d_j, c_i \rangle \in D \times C$
  • $D$ - domain of documents
  • $C = \{ c_1, \ ... \ , c_K \}$ - predefined categories
  • we assign TRUE to $\langle d_j, c_i \rangle$ if document $d_j$ belongs to category $c_i$
  • the task is to approximate the unknown target function $\Phi: D \times C \to \{ \text{T}, \text{F} \}$
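
A minimal sketch of this formulation, with made-up documents, categories and assignments:

```python
# Minimal sketch of the formulation: the target function Phi assigns
# True/False to every <document, category> pair.
# Documents, categories and assignments below are made-up examples.
D = ["d1: oil prices rise", "d2: new football season starts"]
C = ["economy", "sports"]

phi = {  # ground-truth Phi: D x C -> {True, False}
    ("d1: oil prices rise", "economy"): True,
    ("d1: oil prices rise", "sports"): False,
    ("d2: new football season starts", "economy"): False,
    ("d2: new football season starts", "sports"): True,
}

# a classifier learns an approximation phi_hat(d, c) -> bool from labeled data
```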


Types

  • non-overlapping categories: we can assign only one label to each document
    • the unknown function becomes $\Phi: D \to C$
  • overlapping categories: we can assign several labels to each document


Hard categorization vs Ranking

  • instead of assigning a hard true/false decision to each pair, we can rank the categories by how well they fit the document


Features

Vector Space Model

How do we come up with features?

  • text cannot be fed to the classifier directly
  • so we need some indexing procedure that maps a document $d_j$ to a compact representation
  • for example, the Vector Space Model: represent a document as a vector of term weights - "features"
  • $d_j = \langle w_{1j}, \ ... \ , w_{nj} \rangle$, with $w_{ij}$ telling how much term $t_i$ contributes to the semantics of document $d_j$ (see the sketch after this list)
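
A minimal indexing sketch, using raw term frequencies as the weights $w_{ij}$ (toy corpus; `to_vector` is an illustrative helper):

```python
from collections import Counter

# toy corpus; the vocabulary is built from the training documents
docs = ["cats chase mice", "dogs chase cats", "mice eat cheese"]
vocab = sorted({t for d in docs for t in d.split()})

def to_vector(doc):
    """Map a document d_j to a vector <w_1j, ..., w_nj> of raw term frequencies."""
    counts = Counter(doc.split())
    return [counts[t] for t in vocab]

vectors = [to_vector(d) for d in docs]  # one weight vector per document
```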


TF-IDF

Weights are often term frequencies or TF-IDF:

  • $\text{tf-idf}(t_k, d_j) = \text{tf}(t_k, d_j) \cdot \log \cfrac{N}{\text{df}(t_k)}$
  • $\text{tf}(t_k, d_j)$ term frequency: how many times term $t_k$ appeared in document $d_j$
  • $\text{df}(t_k)$ document frequency: how many documents in the training set contain term $t_k$
  • $N$ number of documents in the training set
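
A small sketch computing this weight directly from the definition above (toy training set; `tf_idf` is an illustrative helper):

```python
import math
from collections import Counter

docs = ["cats chase mice", "dogs chase cats", "mice eat cheese"]  # toy training set
N = len(docs)

# df(t_k): number of training documents containing term t_k
df = Counter(t for d in docs for t in set(d.split()))

def tf_idf(term, doc):
    """tf-idf(t_k, d_j) = tf(t_k, d_j) * log(N / df(t_k))"""
    tf = doc.split().count(term)
    return tf * math.log(N / df[term]) if df[term] else 0.0

print(tf_idf("cheese", "mice eat cheese"))  # rare term -> larger weight
print(tf_idf("chase", "cats chase mice"))   # frequent term -> smaller weight
```

In practice, libraries such as scikit-learn's TfidfVectorizer use smoothed variants of this formula, so their numbers differ slightly.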


Why TF-IDF is good for classification:

  • the more often a term occurs in a document, the more representative it is of this document
  • the more documents contain a term, the less discriminating it becomes


Dimensionality Reduction

  • Given a vocabulary/term set $V$ of size $| V |$
  • the goal is to find $V'$ s.t. $| V' | \ll | V |$ ($V'$ is called "reduced vocabulary" or "reduced term set")
  • DR techniques tend to reduce Overfitting:
    • if dimensionality of data is $|V'|$ and there are $N$ examples in the training set
    • then it's good to have $|V'| \approx N$ to avoid overfitting


We can divide dimensionality reduction techniques by locality:

  • local dimensionality reduction
    • applied to each category $c_i$
    • choose a reduced set $V'_i$ with $| V'_i | \ll | V |$ for each category
  • global: choose a single reduced set $V'$ using all categories


There are two (very different) types of dimensionality reduction

  • by term selection (Feature Selection): select $V' \subset V$
  • by term extraction: terms in $V'$ are not necessarily the same as in $V$


The usual IR and indexing techniques for reducing dimensionality are the following:


Stop Words Removal

  • Before indexing, some function words are sometimes removed
  • for example Stop Words - topic-neutral words such as articles, prepositions and conjunctions
  • cases when stop words are not removed: author identification ("the little words give authors away")
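
A minimal stop-word removal sketch; the stop-word list here is a tiny illustrative subset (real systems use larger curated lists):

```python
# tiny illustrative stop-word list
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "or", "to", "is"}

def remove_stop_words(tokens):
    """Drop topic-neutral function words before indexing."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words("the cat sat on the mat".split()))
# ['cat', 'sat', 'mat']
```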


Stemming

  • Stemming groups words that share the same morphological root
  • it's controversial whether it actually helps document classification
  • it's usually applied anyway, since it reduces the dimensionality
  • sometimes Lemmatization is applied instead, but it's more involved
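
A small stemming sketch using NLTK's Porter stemmer (assuming the nltk package is available):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["studies", "studying", "studied"]
print([stemmer.stem(w) for w in words])
# all three share the stem 'studi', so they collapse into a single feature
```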


Note that DR techniques may remove important information when discarding terms

How to select terms?

  • Subset Selection: usually not used because $| V |$ is too large to search over subsets of terms
  • Feature Filtering: rank terms according to their "usefulness" and keep only the top ones
  • Document Frequency: keep only terms that occur in a sufficiently large number of documents (see the sketch after this list)
    • e.g. remove words that occur in 3 documents or fewer
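
A sketch of document-frequency filtering over a toy tokenized corpus; MIN_DF is an illustrative threshold:

```python
from collections import Counter

# toy tokenized training corpus
docs = [d.split() for d in
        ["cats chase mice", "dogs chase cats", "mice eat cheese", "cats eat mice"]]

# document frequency of each term
df = Counter(t for d in docs for t in set(d))

MIN_DF = 2  # keep only terms occurring in at least 2 documents
reduced_vocab = {t for t, count in df.items() if count >= MIN_DF}
print(reduced_vocab)  # {'cats', 'chase', 'mice', 'eat'}
```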


Term Extraction techniques:

  • these techniques create "artificial" terms: they are generated rather than taken directly from the text
  • the original terms don't have the optimal dimensionality for document content representation
    • because of the problems of polysemy, homonymy and synonymy
  • so we want to find a better representation that doesn't suffer from these issues
  • methods:
    • Term Clustering: cluster terms and use the cluster centroids instead of individual words
    • Latent Semantic Analysis: apply SVD to the Term-Document matrix (see the sketch below)
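
A sketch of LSA-style term extraction: apply SVD to a made-up term-document matrix and re-represent the documents in a lower-dimensional latent space:

```python
import numpy as np

# toy term-document matrix A (rows = terms, columns = documents),
# e.g. filled with tf-idf weights; the values here are made up
A = np.array([
    [1.2, 0.0, 0.9],
    [0.0, 1.5, 0.0],
    [0.8, 0.1, 1.1],
    [0.0, 1.3, 0.2],
])

# SVD: A = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # keep the k largest singular values ("latent topics")
# each document (column of A) becomes a k-dimensional vector in the latent space
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T
print(doc_vectors.shape)  # (3 documents, 2 latent dimensions)
```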


Classification

Good classifiers for text (high-dimensional, sparse feature vectors) include linear models such as SVMs, as well as $k$-NN and Naive Bayes; a minimal training sketch follows.
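
A minimal end-to-end sketch with scikit-learn (assuming it is installed): tf-idf features plus a linear SVM, trained on made-up toy documents and labels:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# made-up training documents and labels
train_docs = ["oil prices rise again", "stocks fall sharply",
              "team wins the cup final", "star striker scores twice"]
train_labels = ["economy", "economy", "sports", "sports"]

# pipeline: index documents into tf-idf vectors, then fit a linear SVM
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(train_docs, train_labels)

print(clf.predict(["the market crashed", "a great football match"]))
```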


Evaluation

Precision and Recall metrics can be extended to Evaluation of Multiclass Classifiers

  • similar to the One-vs-All Classification technique
  • ways of averaging the results (see the sketch after this list):
    • micro averaging: pool TP, FP, FN across all categories, then apply the usual precision and recall formulas to the pooled counts
    • macro averaging: calculate precision and recall for each category separately, then average them
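
A small sketch contrasting micro and macro averaging of precision, given made-up per-category counts:

```python
counts = {  # category -> (TP, FP, FN); made-up numbers
    "economy":  (50, 10, 5),
    "sports":   (30,  5, 10),
    "politics": ( 5, 20,  2),
}

# micro: pool TP/FP over all categories, then apply the usual formula
tp = sum(c[0] for c in counts.values())
fp = sum(c[1] for c in counts.values())
micro_precision = tp / (tp + fp)

# macro: compute precision per category, then average
macro_precision = sum(c[0] / (c[0] + c[1]) for c in counts.values()) / len(counts)

print(micro_precision, macro_precision)
```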


F Measure

  • the $F_\beta$ measure combines precision and recall into a single score: $F_\beta = \cfrac{(1 + \beta^2) \cdot P \cdot R}{\beta^2 \cdot P + R}$ (with $\beta = 1$ it's the harmonic mean, $F_1$)
  • depending on how P and R were calculated, there are $F_\beta$-micro and $F_\beta$-macro measures
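
A tiny helper computing $F_\beta$ from precision and recall (`f_beta` and the values below are illustrative):

```python
def f_beta(precision, recall, beta=1.0):
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)."""
    if precision == 0 and recall == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

print(f_beta(0.708, 0.7))          # F1: harmonic mean of P and R
print(f_beta(0.708, 0.7, beta=2))  # F2 weights recall more heavily
```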


Sources

  • Sebastiani, Fabrizio. "Machine learning in automated text categorization." ACM Computing Surveys 34.1 (2002): 1-47.