Only papers I read and used as sources (or small books that don't deserve a separate wiki page)

- ordered by first author
- ABCDEFGHIJKLMNOPQRSTUVWXYZ

- Aggarwal, Charu C., and ChengXiang Zhai. "A survey of text clustering algorithms." Mining Text Data. 2012. Document Clustering, K-Means, K-Medoids, Co-Clustering, Two-Phase Document Clustering, Non-Negative Matrix Factorization, Semi-Supervised Clustering, Topic Models, Probabilistic LSA, Term Strength, Term Contribution, Stop Words

- Cristianini, Nello, John Shawe-Taylor, and Huma Lodhi. "Latent semantic kernels." 2002. [1] Kernel Methods, Latent Semantic Kernels
- Cutting, et al. "Scatter/gather: A cluster-based approach to browsing large document collections." 1992. [2] Scatter/Gather

- Datar, Mayur, et al. "Locality-sensitive hashing scheme based on p-stable distributions." 2004. [3] Locality Sensitive Hashing, Euclidean LSH
- De Kok, D., Brouwer, H. "Natural language processing for the working programmer." 2011. Collocation Extraction
- De Smet, Yves. "An introduction to multicriteria decision aid: The PROMETHEE and GAIA methods." PROMETHEE
- Deerwester, Scott C., et al. "Indexing by latent semantic analysis." 1990. [4] Latent Semantic Analysis
- Domingos, Pedro. "A few useful things to know about machine learning." 2012. [5] Overfitting

- Elsayed, Tamer, Jimmy Lin, and Douglas W. Oard. "Pairwise document similarity in large collections with MapReduce." 2008. [6] Inverted Index
- Ertöz, Levent et al. "Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data." 2003. [7] Document Clustering, DBSCAN, SNN Clustering, Euclidean Distance, Curse of Dimensionality, Chameleon Clustering, CURE Clustering, ROCK Clustering

- Gionis, Aristides, Piotr Indyk, and Rajeev Motwani. "Similarity search in high dimensions via hashing." 1999. [8] Locality Sensitive Hashing, Bit Sampling LSH

- Hopcroft, John, and Ravindran Kannan. "Foundations of Data Science." 2014. Power Iteration

- Jauregui, Jeff. "Principal component analysis with linear algebra." 2012. [9] SVD, Principal Component Analysis
- Jing, Liping. "Survey of text clustering." 2008. [10] Vector Space Model, Document Clustering, Cluster Analysis, Subspace Clustering, Semi-Supervised Clustering

- Kalman, Dan. "A singularly valuable decomposition: the SVD of a matrix." 1996. [11] SVD
- Koll, Matthew B. "WEIRD: An approach to concept-based information retrieval." 1979. Latent Semantic Analysis
- Korenius, Tuomo, Jorma Laurikkala, and Martti Juhola. "On principal component analysis, cosine and Euclidean measures in information retrieval." 2007. [12] Principal Component Analysis, Latent Semantic Analysis, Distance Functions, Cosine Similarity, Euclidean Distance
- Kristianto, et al. "Extracting definitions of mathematical expressions in scientific papers." 2012. [13] Mathematical Definition Extraction, Math-Aware POS Tagging
- Kristianto, et al. "Extracting Textual Descriptions of Mathematical Expressions in Scientific Papers." 2014. [14] Mathematical Definition Extraction

- Landauer, T. et al. "An introduction to latent semantic analysis." 1998. [15] Latent Semantic Analysis
- Larsen, Bjornar et al. "Fast and effective text mining using linear-time document clustering." 1999. [16] Document Clustering
- Lee, K., Lee, Y. et al. "Parallel data processing with MapReduce: a survey." 2012. [17] Hadoop, MapReduce, Hadoop MapReduce
- Li, Yong H., et al. "Classification of text documents." 1998. [18] Term Clustering
- Liu, Tao, et al. "An evaluation on feature selection for text clustering." 2003. [19] Term Contribution

- Manning, C., Schütze, H. "Foundations of statistical natural language processing." 1999. Collocation Extraction

- Oikonomakou, N., Vazirgiannis, M. "A review of web document clustering approaches." Data mining and knowledge discovery handbook. 2010. [20] Cluster Analysis, Agglomerative Clustering, K-Means
- Osinski, S. "Improving quality of search results clustering with approximate matrix factorisations." 2006. [21] Non-Negative Matrix Factorization
- Ordonez, C., et al. "Relational versus non-relational database systems for data warehousing." 2010. [22] Hadoop, Hadoop MapReduce

- Pagel, R., Schubotz, M. "Mathematical Language Processing Project." 2014. [23] Mathematical Definition Extraction, Math-Aware POS Tagging
- Paulevé, L. et al. "Locality sensitive hashing: A comparison of hash function types and querying mechanisms." 2010. [24] Locality Sensitive Hashing, K-Means LSH
- Petrović, S. et al. "Comparison of collocation extraction measures for document indexing." 2006. [25]

- Salton, et al. "A vector space model for automatic indexing." 1975. [26] Vector Space Model
- Salton, Buckley. "Term-weighting approaches in automatic text retrieval." 1988. [27] TF-IDF
- Schelter, Sebastian, et al. "Efficient Sample Generation for Scalable Meta Learning." 2014. [28] Meta Learning
- Schöneberg et al. "POS Tagging and its Applications for Mathematics." 2014. Math-Aware POS Tagging
- Sculley, David. "Web-scale k-means clustering." 2010. [29] K-Means
- Sebastiani, Fabrizio. "Machine learning in automated text categorization." 2002. [30] Document Classification, Term Clustering
- Slaney, Malcolm, and Michael Casey. "Locality-sensitive hashing for finding nearest neighbors [lecture notes]." 2008. [31] Locality Sensitive Hashing, Euclidean LSH
- Steinbach, Michael, et al. "A comparison of document clustering techniques." 2000. Document Clustering, K-Means
- Strang, Gilbert. "The fundamental theorem of linear algebra." 1993. [32] SVD

- Wilbur, W. John. "The automatic identification of stop words." 1992. [33] Stop Words, Term Strength

- Xu, Wei, Xin Liu, and Yihong Gong. "Document clustering based on non-negative matrix factorization." 2003. [34] Cluster Analysis, Non-Negative Matrix Factorization

- Zhai, ChengXiang. "Statistical language models for information retrieval." (Book) 2008. Information Retrieval, Statistical Language Models, Multinomial Distribution, Smoothing for Language Models, TF-IDF, Probabilistic Retrieval Model
- Zhukov, Leonid, and David Gleich. "Topic identification in soft clustering using PCA and ICA." 2004. [35] Latent Semantic Analysis