# ML Wiki

## Latent Semantic Analysis

Latent Semantic Analysis (LSA) is an NLP method:

• a mathematical/statistical method for modeling the meaning of words and passages by analyzing text: it extracts and infers relations of expected contextual usage of words
• idea: words that are used in the same contexts tend to have the same meaning

### Problems with Text

Issues with text data:

• synonymy: many ways to refer to the same object
• synonymy tends to decrease recall
• polysemy: many words have more than one distinct meaning (e.g. "chip", "trunk")
• polysemy tends to decrease precision

Overcoming Synonymy:

• term extraction, thesauri construction

Overcoming Polysemy:

### WEIRD

WEIRD (Koll1979) was the first IR system that dealt with these problems automatically, rather than with a controlled vocabulary

• the goal of WEIRD: to go from term matching to concept matching
• it can use statistical analysis to empirically find relations among terms
• so it analyzed a term-to-term co-occurrence matrix
• Factor Analysis can be used to identify the right basis for terms s.t. there's little or no loss of information
• in WEIRD only 7 dimensions were used, based on 7 completely non-overlapping documents found in the collection

The space built by WEIRD acts like an implicit thesaurus

• synonyms will map to the same concept

### LSA

LSA/LSI solves these problems as well

• it goes further than WEIRD: it uses all documents to build the space
• it does that by applying SVD as Dimensionality Reduction, which reveals latent structure and "denoises" the data
• similarity estimates derived by LSA are not just frequencies or co-occurrence counts: LSA can infer deeper relations, hence "Latent" and "Semantic"
• so LSA learns the latent semantic structure of the vocabulary

### LSI

Latent Semantic Analysis (LSA) $\approx$ Latent Semantic Indexing (LSI)

• LSI is the alias of LSA for Information Retrieval
• indexing and retrieval method that uses SVD to identify patterns in relations between terms and concepts
• instead of a literal match between the query and documents (e.g. using cosine in the traditional vector space models), convert both into the Semantic Space and calculate the cosine there

## LSA Steps

3 major steps (following Evangelopoulos2012):

### Representation: Vector Space Models

Construct a matrix $D$

• $D$ is a Term-Document Matrix if the rows of $D$ are terms and the columns of $D$ are documents/passages
• $D$ is a Document-Term Matrix if the rows of $D$ are documents/passages and the columns of $D$ are terms
• each cell typically contains the frequency with which a term occurs in a document
• also apply weighting: TF or TF-IDF (see the sketch below)
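As a rough sketch (assuming a recent scikit-learn is available; the documents and variable names here are purely illustrative), a weighted Document-Term matrix can be built like this:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["human machine interface for ABC computer applications",
        "a survey of user opinion of computer system response time",
        "the EPS user interface management system"]

vectorizer = TfidfVectorizer()              # TF-IDF weighting
D = vectorizer.fit_transform(docs)          # Document-Term matrix: (documents, terms)
terms = vectorizer.get_feature_names_out()  # column labels (terms)
```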

### SVD and Dimensionality Reduction

Let $D$ be a $t \times p$ Term-Passage matrix

• $t$ rows are terms, $p$ columns are passages, $\text{rank } D = r$
• then SVD decomposition is $D = T \cdot \Sigma \cdot P^T$
• $T$ is $t \times r$ Orthogonal Matrix, contains left singular vectors, corresponds to term vectors
• $\Sigma$ is an $r \times r$ diagonal matrix of singular values
• $P$ is $r \times p$ Orthogonal Matrix, contains right singular vectors, corresponds to passage vectors
• then $T \sqrt{\Sigma}$ gives the loadings for terms and $P \sqrt{\Sigma}$ the loadings for passages

Now reduce the dimensionality:

• we want to combine the surface text information into some deeper abstraction
• finding the optimal dimensionality for the final representation in the Semantic Space is important for properly capturing the mutual usage of words
• the "True Semantic Space" should address the Text Problems above

So, apply the reduced-rank SVD (a small numpy sketch follows this list):

• $D \approx T_k \cdot \Sigma_k \cdot P^T_k$
• keep only $k$ largest singular values
• the result: best $k$-dim approximation of the original matrix $D$
• for NLP, $k = 300 \pm 50$ usually works best
• but it should be tuned because it heavily depends on the domain
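A minimal numpy sketch of the reduced-rank SVD (the matrix below is a random stand-in for a real term-passage matrix, and all names are illustrative):

```python
import numpy as np
import numpy.linalg as la

D = np.random.rand(12, 9)                  # toy term-passage matrix: t=12, p=9
T, S, Pt = la.svd(D, full_matrices=False)  # Pt is P^T

k = 2                                      # keep only the k largest singular values
D_k = T[:, :k].dot(np.diag(S[:k])).dot(Pt[:k, :])  # best rank-k approximation of D
```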

## Semantic Space

LSA constructs a semantic space via SVD:

• $T$ is $t \times r$ Orthogonal Matrix, contains left singular vectors, corresponds to term vectors
• $\Sigma$ is an $r \times r$ diagonal matrix of singular values
• $P$ is $r \times p$ Orthogonal Matrix, contains right singular vectors, corresponds to passage vectors
• then $T \sqrt{\Sigma}$ gives the loadings for terms and $P \sqrt{\Sigma}$ the loadings for passages

Language-theoretic interpretation:

• LSA vectors approximate:
• the meaning of a word as the average of the meanings of the passages in which it occurs
• the meaning of a passage as the average of the meanings of its words

After doing the SVD, we get the reduced space - this is the semantic space

• the effect of reducing the dimensionality:
• it removes the noise effects of synonymy and polysemy

### Comparisons in the Semantic Space

So we approximated $D$ as $D \approx \hat D = T_k \Sigma_k P_k^T$

• let's omit the index $k$: below, $T$ stands for $T_k$ (and similarly for $\Sigma$ and $P$)

Term comparisons:

• How similar are terms $\mathbf t_i$ and $\mathbf t_j$?
• In $D$ we would compare rows of $D$. How to compare them in the semantic space?
• $\hat D \hat D^T$ gives a term-term Gram Matrix
• $\hat D \hat D^T = T \Sigma \Sigma^T T^T = T \Sigma \, (T \Sigma)^T$
• thus $\big[\hat D \hat D^T\big]_{ij}$ is the dot product between $i$th and $j$th rows of $T \Sigma$
• rows of $T \Sigma$ are coordinates for terms in the semantic space

Document comparisons:

• how similar are documents $\mathbf p_i$ and $\mathbf p_j$ in the semantic space?
• $\hat D^T \hat D$ gives a document-document Gram Matrix
• $\hat D^T \hat D = P \Sigma \Sigma^T P^T = P \Sigma \, (P \Sigma)^T$
• so to compare documents $i$ and $j$, compute the dot product between the $i$th and $j$th rows of $P \Sigma$ (see the sketch below)
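A small numpy sketch of both comparisons (toy random matrix; `term_coords`, `doc_coords`, and the other names are illustrative, not part of any fixed API):

```python
import numpy as np
import numpy.linalg as la

D = np.random.rand(12, 9)                  # toy term-passage matrix: t=12, p=9
T, S, Pt = la.svd(D, full_matrices=False)
k = 2
T_k, S_k, Pt_k = T[:, :k], S[:k], Pt[:k, :]

term_coords = T_k * S_k                    # rows of T Sigma: one row per term
doc_coords = Pt_k.T * S_k                  # rows of P Sigma: one row per passage

term_sim = term_coords[0].dot(term_coords[1])  # similarity of terms 0 and 1
doc_sim = doc_coords[0].dot(doc_coords[1])     # similarity of passages 0 and 1
```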

### Generalization to Unseen Documents

What about objects that didn't originally appear in the training set?

• e.g. a query $\mathbf q$
• how do we represent $\mathbf q$ in the semantic space?
• first, let's see how original documents $\mathbf p_i$ are represented in this space

$\hat D = T \Sigma P^T$

• multiply by $(T \Sigma)^{-1}$ on the left
• $(T \Sigma)^{-1} \hat D = P^T$
• $\Sigma^{-1} T^T \hat D = P^T$
• $P = \hat D^T T \Sigma^{-1}$
• if $\mathbf d_i$ is some document in the original space (a column of $\hat D$) and $\mathbf p_i$ is the corresponding representation of $\mathbf d_i$ in the document basis, then
• $\mathbf p_i = \mathbf d_i^T T \Sigma^{-1}$

Thus, we can represent $\mathbf q$ the same way:

• $\hat{\mathbf q} = \mathbf q^T T \Sigma^{-1}$
• where $\hat{\mathbf q}$ is the representation of $\mathbf q$ in the document basis
• to compare $\hat{\mathbf q}$ with documents, all we need to do is scale it by $\Sigma$ and then compute the dot products with the rows of $P \Sigma$ (see the sketch below)
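A sketch of folding a query into the semantic space (toy setup; the query vector and all names are illustrative):

```python
import numpy as np
import numpy.linalg as la

D = np.random.rand(12, 9)                  # toy term-passage matrix: t=12, p=9
T, S, Pt = la.svd(D, full_matrices=False)
k = 2
T_k, S_k = T[:, :k], S[:k]
doc_coords = Pt[:k, :].T * S_k             # rows of P Sigma

q = np.zeros(D.shape[0])                   # query as a raw term-count vector
q[[0, 3]] = 1                              # pretend the query contains terms 0 and 3

q_hat = q.dot(T_k) / S_k                   # q^T T Sigma^{-1}: query in the document basis
q_coords = q_hat * S_k                     # scale by Sigma before comparing
scores = doc_coords.dot(q_coords)          # dot product with every passage
```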

## Example

### Article Titles Example

Let's consider titles of some articles (from Deerwester90):

• $c_1$: "Human machine interface for ABC computer applications"
• $c_2$: "A survey of user opinion of computer system response time"
• $c_3$: "The EPS user interface management system"
• $c_4$: "System and human system engineering testing of EPS"
• $c_5$: "Relation of user perceived response time to error measurement"
• $m_1$: "The generation of random, binary, ordered trees"
• $m_2$: "The intersection graph of paths in trees"
• $m_3$: "Graph minors IV: Widths of trees and well-quasi-ordering"
• $m_4$: "Graph minors: A survey"

Matrix:

$D = \left[\begin{array}{c|cccccccc} & c_1 & c_2 & c_3 & c_4 & c_5 & m_1 & m_2 & m_3 & m_4 \\ \hline \text{human} & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\ \text{interface} & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ \text{computer} & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ \text{user} & 0 & 1 & 1 & 0 & 1 & 0 & 0 & 0 & 0 \\ \text{system} & 0 & 1 & 1 & 2 & 0 & 0 & 0 & 0 & 0 \\ \text{response} & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ \text{time} & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ \text{EPS} & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 \\ \text{survey} & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\ \text{trees} & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 0 \\ \text{graph} & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 \\ \text{minors} & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 \\ \end{array}\right]$

Note:

• the row vectors for "human" and "user" are orthogonal: their dot product is zero, yet the terms are similar, so the similarity should be positive
• also, "human" and "minors" are orthogonal, yet the terms are not similar, so the similarity should be negative

Let's apply SVD:

• $D = T \Sigma P^T$
• 2-dim approximation: $D_2 = T_2 \Sigma_2 P_2^T$

$D_2 = \left[\begin{array}{c|cccccccc} & c_1 & c_2 & c_3 & c_4 & c_5 & m_1 & m_2 & m_3 & m_4 \\ \hline \text{human} & 0.16 & 0.4 & 0.38 & 0.47 & 0.18 & -0.05 & -0.12 & -0.16 & -0.09 \\ \text{interface} & 0.14 & 0.37 & 0.33 & 0.4 & 0.16 & -0.03 & -0.07 & -0.1 & -0.04 \\ \text{computer} & 0.15 & 0.51 & 0.36 & 0.41 & 0.24 & 0.02 & 0.06 & 0.09 & 0.12 \\ \text{user} & 0.26 & 0.84 & 0.61 & 0.7 & 0.39 & 0.03 & 0.08 & 0.12 & 0.19 \\ \text{system} & 0.45 & 1.23 & 1.05 & 1.27 & 0.56 & -0.07 & -0.15 & -0.21 & -0.05 \\ \text{response} & 0.16 & 0.58 & 0.38 & 0.42 & 0.28 & 0.06 & 0.13 & 0.19 & 0.22 \\ \text{time} & 0.16 & 0.58 & 0.38 & 0.42 & 0.28 & 0.06 & 0.13 & 0.19 & 0.22 \\ \text{EPS} & 0.22 & 0.55 & 0.51 & 0.63 & 0.24 & -0.07 & -0.14 & -0.2 & -0.11 \\ \text{survey} & 0.1 & 0.53 & 0.23 & 0.21 & 0.27 & 0.14 & 0.31 & 0.44 & 0.42 \\ \text{trees} &-0.06 & 0.23 & -0.14 & -0.27 & 0.14 & 0.24 & 0.55 & 0.77 & 0.66 \\ \text{graph} &-0.06 & 0.34 & -0.15 & -0.3 & 0.2 & 0.31 & 0.69 & 0.98 & 0.85 \\ \text{minors} &-0.04 & 0.25 & -0.1 & -0.21 & 0.15 & 0.22 & 0.5 & 0.71 & 0.62 \\ \end{array}\right]$

What's the effect of dimensionality reduction here?

• word counts become larger or smaller than in the original matrix
• consider two cells: ("survey", $m_4$) and ("trees", $m_4$)
• original document: 1 and 0
• reduced document: 0.42 and 0.66
• because $m_4$ contains "graph" and "minors", the 0 for "trees" was replaced by 0.66: they are related terms
• so it can be seen as an estimate of how many times the word "trees" would occur in other passages that contain "graph" and "minors"
• the count for "survey" went down: it is less expected in this context

So in the reconstructed space:

• dot product between "user" and "human" is positive
• dot product between "human" and "minors" is negative
• this tells us much better whether terms are similar or not, even when they never co-occur

Taking 2 principal components is the same as keeping only 2 abstract concepts

• each word in the vocabulary has some amount of these 2 concepts (we see how much by looking at the 1st and 2nd columns of $T$)

The idea:

• we don't want to reconstruct the underlying data perfectly; instead we hope to find the correlations and the abstract concepts

### Python code

```python
import numpy as np
import numpy.linalg as la

D = [[1, 0, 0, 1, 0, 0, 0, 0, 0],
     [1, 0, 1, 0, 0, 0, 0, 0, 0],
     [1, 1, 0, 0, 0, 0, 0, 0, 0],
     [0, 1, 1, 0, 1, 0, 0, 0, 0],
     [0, 1, 1, 2, 0, 0, 0, 0, 0],
     [0, 1, 0, 0, 1, 0, 0, 0, 0],
     [0, 1, 0, 0, 1, 0, 0, 0, 0],
     [0, 0, 1, 1, 0, 0, 0, 0, 0],
     [0, 1, 0, 0, 0, 0, 0, 0, 1],
     [0, 0, 0, 0, 0, 1, 1, 1, 0],
     [0, 0, 0, 0, 0, 0, 1, 1, 1],
     [0, 0, 0, 0, 0, 0, 0, 1, 1]]
D = np.array(D)

rows = ['human', 'interface', 'computer', 'user', 'system',
        'response', 'time', 'EPS', 'survey', 'trees', 'graph', 'minors']
idx = {n: i for (i, n) in enumerate(rows)}

# in the original space these terms look completely unrelated
D[idx['human']].dot(D[idx['user']])    # 0
D[idx['human']].dot(D[idx['minors']])  # 0

T, S, P = la.svd(D)  # T: term vectors, S: singular values, P: already transposed (P^T)

np.set_printoptions(precision=2, suppress=True)
print(T[:, 0:2], S[0:2], P[0:2, :])

# rank-2 reconstruction
D_hat = T[:, 0:2].dot(np.diag(S[0:2])).dot(P[0:2, :])

D_hat[idx['human']].dot(D_hat[idx['user']])    # 0.955
D_hat[idx['human']].dot(D_hat[idx['minors']])  # -0.251
```


Can do the same without building $\hat D$:

```python
T = T[:, 0:2]
S = np.diag(S[0:2])
P = P[0:2, :].T        # passage coordinates basis (not used below)

human = T.dot(S)[idx['human']]
user = T.dot(S)[idx['user']]
human.dot(user)        # same result: 0.955
```


Finally, let's calculate the cosine between "human" and "user":

```python
human.dot(user) / (la.norm(human) * la.norm(user))
# 0.88784582874340123
```


## Practical Notes

### Limitations

• makes no use of word order or punctuation
• if the original terms are already descriptive enough (e.g. for Document Classification), this information may be lost during the transformation

### When Not Good

• sometimes Semantic Spaces alone do not perform well
• but we can mix the original vector space and the semantic space together

### Mean Centering

LSA and Principal Component Analysis (PCA) are related via SVD: PCA applies SVD to a mean-centered matrix, while LSA applies SVD to the raw (weighted) term-document matrix.
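A small sketch of that connection (toy random data; everything here beyond the centering idea is illustrative):

```python
import numpy as np
import numpy.linalg as la

X = np.random.rand(100, 20)                # toy data: 100 documents, 20 terms

# LSA-style: SVD directly on the raw (weighted) matrix
_, S_lsa, Vt_lsa = la.svd(X, full_matrices=False)

# PCA-style: subtract the column (term) means first, then apply SVD
X_centered = X - X.mean(axis=0)
_, S_pca, Vt_pca = la.svd(X_centered, full_matrices=False)
# the rows of Vt_pca are the principal axes of the centered data
```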