Vector space model is a statistical model for representing text information for Information Retrieval, NLP, Text Mining

- Representing documents in VSM is called "vectorizing text"
- contains the following information: how many documents contain a term, and what are important terms each document has

How do we represent a free text in terms of queries?

- to do we need some preprocessing steps, often called "NLP Pipeline"
- the pipeline may include the following:
- Tokenization - most important step, extracts individual words - "tokens"
- Stop Words Removal - removes functional words
- Stemming or Lemmatization - reduces words to some common form
- or other Text Normalization techniques
- building a VSM model is usually one of the lasts steps of the pipeline
- for IR we also usually build an Inverted Index to speed up querying

The main assumptions we're making about text data:

- word order is not important, only word counts
- Bag of Words = unordered list of terms
- good enough for topic similarity

We treat all words as independent

- i.e. the basis vector of the term space (see below) is orthogonal

Document-Term Matrix - representation of a document for text analysis

- each row of the matrix - is a
*document vector* - each component of the document vectors is a concept, a key word, or a term, but usually it's terms
- documents don't contain many distinct words, so the matrix is sparse

For each documents words are weighed

Weights can be:

- binary: 1 if term is present and 0 if not
- term frequency (TF): frequency of each word in a document
- sublinear TF: $\log \text{TF}$: sometimes a word is used too often so we want to reduce its influence compared to other less frequently used words
- document frequency: words that are used more in the collections have more weight
- inverse document frequency (IDF): words that are rare across the document collections may be more relevant than frequent words.
- TF-IDF: combination of sublinear TF and IDF

So texts are represented by feature vectors with frequency of each term in a lexicon

Notation:

- let $\mathcal D = \{d_1, \ ... \ , d_m \}$ be a set of $m$ documents
- and let $V = \{w_1, \ ... \ , w_n \}$ be a set of $n$ words (the vocabulary)
- let $D$ be the matrix

**Term-Document Matrix**

- if $D$ is $n \times m$ matrix
- rows of $D$ are indexed by terms $w_i$ and columns are indexed by documents $d_j$
- then $D$ is a Term-Document Matrix and $D_{ij}$ is a weight of $w_i$ in document $d_j$

**Document-Term Matrix**

- if $D$ is $m \times n$ matrix
- rows of $D$ are indexed by terms $w_j$ and columns are indexed by documents $d_i$
- then $D$ is a Document-Term Matrix and $D_{ji}$ is a weight of $w_j$ in document $d_i$

Notes:

- if $D$ is a Term-Document matrix, then $D^T$ is a Document-Term
- in IR literature Term-Document matrices are used more commonly
- in some applications it's more convenient to use document-term matrix,
- Document-Term matrices are more often used in software packages, e.g. in scikit-learn
- Document-Term are especially nice especially for large scale processing it's more convenient to operate rows that are documents

Properties of textual data:

- dimensionality is very large, but vectors are very sparse
- e.g. vocabulary size $| V | = 10^5$, but documents may contain only 500 distinct words
- or even less - when we consider sentences or tweets

- lexicon of document may be large, but words are typically correlated with each other
- so number of concepts ("principal components" (see PCA)) is $\ll | V |$
- may want to account for that

- number of words across different documents may wary a lot
- so need to normalize

Since data is high dimensional, we may need special pre-processing and representation

Usually done via Feature Selection (called "Term Selection")

- discard infrequent and very frequent terms

not all terms have the same descriptiveness power w.r.t. domain/topic e.g frequent words like "a", "the"

- we often eliminate such words from the representation of the text

many infrequent words are eliminated as well

- they are called "hapax logomena": "said only once" and they are usually spelling errors or neologisms that have not yet been lexicalized (i.e. not a part of the vocabulary)

A geometrical way to express BoW features is the Vector Space Model

- let $D$ be a document-term matrix

**TextVSM**: Document Vector Space Model - it's the Row Space of $D$

- each document $d_i$ is $i$th row of the matrix
- dimensions are words and vectors are documents
- source: Semantic Domains in Computational Linguistics (book), Fig 3.1
- similarity between two documents: dot product or cosine

**TermVSM**: Term Vector Space model - it's the Column Space of $D$

- Can do the same for the terms:
- dimensions are documents and vector are terms
- source: Semantic Domains in Computational Linguistics (book), Fig 3.1
- terms are expressed by documents in which they occur
- similarity between two terms: dot product or cosine

TextVSM and TermVSM appear to be very similar, but in reality they are a bit different

- e.g. Words in the vocabulary of a corpus follow the Zip's Law: the size of the vocabulary becomes stable when corpus size increases
- it means that the dimensionality of Text VSM is bounded to the number of terms in the language
- not true for the Term VSM: the number of documents can grow forever
- also, the Curse of Dimensionality: paradox: the larger the corpus size is, the worse the similarity estimation in this space becomes

another difference:

- can do feature selection in the Text VSM easily
- but can't really discard dimensions in the Term VSM: can't remove one documents and keep others

finally these spaces are disjoint: they don't hare any common dimensions

- so we can't measure similarity between a text and a document

Suppose we have two document vectors $d_1, d_2$. Then we can define the following measures of similarity:

- Inner Product: $d_1^T d_2$
- Cosine Similarity: $\cfrac{d_1^T d_2}{\| d_1 \| \cdot \| d_2 \|}$
- Dice Coefficient: $\cfrac{d_1^T d_2}{\|d_1\|^2 + \| d_2\|^2}$
- Jaccard Coefficient: $\cfrac{d_1^T d_2}{\|d_1\|^2 + \| d_2\|^2 - d_1^T d_2}$
- most efficient: normalize $d_1$ and $d_2$ and compute the dot product to get cosine

With that matrix you can compute the similarity of two documents

- multiply the matrix with its own transpose
- $S = D \cdot D^T$
- The result is a square document - document matrix where each cell represents similarity
- and you have (unnormalized) measure of similarity
- if $D$ is row-normalized, then $S$ contains
*cosine scores*between each document - Cosine Similarity - is a measure of the angle between the two document vectors, normalized by magnitude
- computing $D \, D^T$ may be the first step for Document Clustering

In some cases term-term similarity can be useful

- e.g. for Term Clustering
- compute $D^T D$ to get pair-wise term similarity
- to get cosine, you need to unit-normalize columns of $D$
- note that if rows of $D$ should not be normalized

In IR a query is also represented in TextVSM

- in such a case the query is called "pseudo-document" $q$
- so ranking is done by computing cosine between the query and all the documents
- it can be done by matrix multiplication: if $q$ and $D$'s rows are unit normalized
- then Matrix-Vector Multiplication $D \, q$ is the cosine score between each doc and the query

- computing $D \, q$ may be computationally expensive when there are many documents in the corpus
- so usually the documents are indexed and we compute the similarity only to whose documents that share at least one word with the query

VSM is the most popular model in IR

- Query language is simple
- Ranking - reduces to a dot product

problems of VSMs:

- Text VSM can't deal with lexical ambiguity and variability
- e.g.: "he's affected by AIDS" and "HIV is a virus" don't have any words in common
- so in the TextVSM the similarity is 0: these vectors are orthogonal even though the concepts are related
- on the other hand, similarity between "the laptop has a virus" and "HIV is a virus" is not 0: due to the ambiguity of "virus"

Term VSM:

- feature sparseness
- if we want to model domain relations, we're mostly interested in domain-specific words
- such words are quite infrequent compared to non-domain words, so vectors for these words are very sparse, esp in large corpus
- so similarity between domain words would tend to 0
- and the results overall will not be very meaningful and interesting

Solutions:

- Generalized Vector Space Models: Relaxes the Term Independence assumption and uses term co-occurrence information
- Distributed Clusters (Bekkerman et al. Distributional word clusters vs. words for text categorization. 2002)
- Concept-Based Representation (Gonzalo et al. Indexing with WordNet synsets can improve text retrieval. 1998)
- Latent Semantic Analysis
- Domain Spaces

We can decompose the matrix $D$

- using SVD and we'll get Latent Semantic Analysis
- LSA can also be done with Non-Negative Matrix Factorization

A VMS is more a (retrieval) framework

VSM has several components:

- term space for representing documents and queries
- document space for representing terms
- similarity/distance measure

Framework:

- The exact vector representation and similarity is not specified
- therefore it's up to the user to define them
- this flexibility allows to incorporate many advanced indexing techniques and solve problems of VSMs (mentioned earlier) while staying within the framework

For example, consider LSA:

- if we apply SVD to the document-term matrix, we'll have LSA
- this way we'll reduce the dimensionality of data and capture some semantic closeness between terms
- by doing this we changed the way documents and terms are represented
- also the similarity is changed slightly
- but overall the framework stays the same

Don't assume that words are independent

Reference:

- Wong, SK Michael, Wojciech Ziarko, and Patrick CN Wong. "Generalized vector spaces model in information retrieval." 1985. [1])

- Salton, Gerard, Anita Wong, and Chung-Shu Yang. "A vector space model for automatic indexing." (1975). [2]
- Jing, Liping. "Survey of text clustering." 2008. [3]
- Semantic Domains in Computational Linguistics (book)
- Information Retrieval (UFRT)
- Zhai, ChengXiang. "Statistical language models for information retrieval." 2008.