Lessons Learned from 20 Years of Implementing LSI Applications

Roger Bradford
Maxim Analytics, Great Falls, VA 22066 USA

Abstract
This paper summarizes lessons learned, over a period of 20 years, from implementing information systems employing the technique of latent semantic indexing (LSI). The data presented is drawn from 63 projects undertaken over the period 1999 through 2019. Over that period the projects increased in scale from collections of hundreds of thousands of documents to ones involving hundreds of millions of documents. They also increased in sophistication, from simple search and retrieval systems to ones focused on information discovery and automated alerting. This paper summarizes some of the key developments in technology and techniques that enabled those advances in the size and sophistication of the applications. The objective of this paper is to share insights gained from these past two decades of system implementation experience.

Keywords
Latent Semantic Indexing, LSI, LSI applications, LSA, lessons learned

DESIRES 2021 - 2nd International Conference on Design of Experimental Search & Information REtrieval Systems, September 15-18, 2021, Padua, Italy
EMAIL: rbradford@cox.net
ORCID: 0000-0003-1750-3125
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

1. Latent Semantic Indexing

The technique of latent semantic indexing (LSI) was invented at Bellcore in the late 1980s [1]. The original intent was to provide improved capabilities for retrieval of text. The technique has, however, proven to be useful in analysis of a wide variety of information types [2, 3]. As applied to a collection of documents, the LSI algorithm consists of the following primary steps [1, 4]:
1. A term-document matrix is formed, and (typically) local and global weights are applied to the elements of this matrix.
2. Singular value decomposition (SVD) is used to reduce this matrix to a product of three matrices, one of which is diagonal in the singular values of the original matrix.
3. Dimensionality is reduced by deleting all but the k largest singular values, together with the corresponding columns of the other two matrices.
4. This truncation process provides a basis for generating a k-dimensional vector space. Both terms and documents are represented by k-dimensional vectors in this vector space.
5. New queries, terms, and documents can be represented in the space by a process known as folding-in, which extrapolates from known vectors.
6. The semantic similarity of any two objects represented in the space is reflected by the proximity of their representation vectors, generally using a cosine measure.

Experience from a broad range of academic, industrial, and governmental testing has shown that proximity in an LSI space is a remarkably good proxy for semantic relatedness as judged by humans [5].

Early commercial applications of LSI included identification of people with specific expertise [6], detection of spam in e-mails [3], and essay scoring [7]. Over time, the technique found wide application in areas such as patent search and analysis [8], résumé matching [9], customer survey analysis [10], and fraud detection [11]. It became the dominant paradigm in electronic document discovery [12]. More recently it has been used in bioinformatics discovery [13], recommender systems [14], and social media analysis [15].
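The six-step procedure outlined in section 1 can be sketched in a few lines of Python. This is a toy illustration (tiny corpus, a simple log local weight, k = 2), not the configuration of any of the production systems described in this paper:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy three-document collection; real collections here run to millions of docs.
docs = ["shipment of gold damaged in a fire",
        "delivery of silver arrived in a silver truck",
        "shipment of gold arrived in a truck"]
vocab = sorted({w for d in docs for w in d.split()})
idx = {w: i for i, w in enumerate(vocab)}

# Step 1: term-document matrix with a simple log local weight.
A = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        A[idx[w], j] += 1.0
A = np.log1p(A)

# Steps 2-4: SVD, truncated to the k largest singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
term_vecs = U[:, :k] * s[:k]   # one k-dimensional vector per term
doc_vecs = Vt[:k, :].T         # one k-dimensional vector per document

# Step 5: fold a new query into the space (q^T U_k diag(1/s_k)).
def fold_in(text):
    q = np.zeros(len(vocab))
    for w in text.split():
        if w in idx:
            q[idx[w]] += 1.0
    return np.log1p(q) @ U[:, :k] / s[:k]

# Step 6: rank documents by cosine proximity to the folded-in query.
query = fold_in("gold silver truck")
ranking = sorted(range(len(docs)), key=lambda j: -cosine(query, doc_vecs[j]))
```

At realistic scale the dense decomposition above would be replaced by a sparse, truncated SVD solver; only the first few hundred singular triplets are ever needed.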
2. Structure of the Paper

This paper summarizes lessons learned from 63 information system implementation projects that the author took part in over the period 1999 through 2019. Each of these projects employed LSI as a key technical component. The systems addressed a wide range of applications for both commercial and government customers. Over this 20-year period, the systems increased significantly in both size and sophistication.^2 The earliest systems utilized the conceptual search, clustering, and categorization functionality of LSI to implement relatively simple capabilities for such tasks as customer survey analysis and matching of résumés with job openings. Extrapolating from experience gained, these fundamental capabilities subsequently were applied in a more abstract fashion to higher-level considerations in applications such as fraud detection and patent prior art analysis. Successive refinement of tools and techniques eventually enabled advanced applications incorporating features such as novel information detection and secure information sharing.

^2 Other projects undertaken in this time frame applied LSI to data other than text. However, only systems that focused on text are addressed here.

Section 3 of this paper provides a brief overview of the principal improvements in technologies and techniques that enabled solution of progressively larger and more complex problems using LSI. Section 4 describes several implementation principles that have proven useful in building LSI-based information systems. Section 5 summarizes some particularly interesting results and surprises that were encountered in the course of building these systems. Section 6 concludes with brief comments on capabilities being incorporated into more recent LSI applications.

3. Enabling Advances in Technology and Technique

3.1. Scaling

Computers in the early days of LSI were not well-suited for SVD computation and large-scale matrix manipulation, which limited the scale of LSI applications. However, hardware improvements over the past twenty years have completely changed this situation [16]. Figure 1 shows the dramatic reduction in the observed index creation times in nine comparable projects over the period 2002-2016. The times shown are those required to build an LSI index for one million documents (averaging several kilobytes in size) at 300 dimensions, using computers typically employed in applications in the given years.

Figure 1. Decline in time required to create an LSI space for a 1 million document collection

The early points on the curve correspond to index creation times from projects using clusters of processors. Subsequent points are from projects primarily employing mid-range servers. Of note, however, the last point shown is for a laptop computer.

The dramatic decline in time required to create an LSI index progressively enabled a wider variety of applications. At the present time, LSI applications involving collections of tens of millions of documents are routine, and multiple applications have been implemented that encompass full LSI indexing of hundreds of millions of documents.

Advances in technology enabled improvements not only in scale, but also in the fidelity of the generated LSI spaces in representing real-world semantic associations. For LSI, as collection size increases, the larger number of occurrences of individual terms diminishes the effects of idiosyncratic occurrences of those terms in specific documents. This improves overall representational fidelity, as shown in Figure 2. The graph displays the variation in mean reciprocal rank (MRR) of 250 pairs of terms having known real-world semantic association,^3 as a function of the size of the collection. As indicated by the trend line, the increase in representational fidelity with collection size is approximately logarithmic. Of note is the fact that over 80% of all published literature on LSI deals with collection sizes smaller than the initial point shown (17 thousand documents) and 97% deals with collection sizes less than the second point shown (93 thousand documents).
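The MRR figure of merit used in Figures 2 and 3 can be computed directly from ranked neighbor lists. A minimal sketch follows; the term pairs and neighbor lists are invented purely for illustration:

```python
def mean_reciprocal_rank(pairs, neighbors):
    """Average the reciprocal rank at which each known associate
    appears in its probe term's ranked-neighbor list (0 if absent)."""
    total = 0.0
    for probe, associate in pairs:
        ranked = neighbors.get(probe, [])
        if associate in ranked:
            total += 1.0 / (ranked.index(associate) + 1)
    return total / len(pairs)

# Example: 'treaty' is the 2nd neighbor of 'nato', 'optics' the 1st of 'laser'.
pairs = [("nato", "treaty"), ("laser", "optics")]
neighbors = {"nato": ["alliance", "treaty", "summit"],
             "laser": ["optics", "beam"]}
mrr = mean_reciprocal_rank(pairs, neighbors)   # (1/2 + 1/1) / 2 = 0.75
```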
Figure 2. Increase in semantic representation fidelity with collection size

Several of the developed applications included the identification of specific patterns of activities and relationships as one of the system objectives. In many cases, the distinctions between patterns of interest and normal patterns were quite subtle. In general, the larger the data collection, the more effective LSI was in providing indicators of the existence of patterns of interest. Over time, the continuing growth in the size of collections that could be addressed facilitated implementation of increasingly sophisticated analytic operations.

3.2. Parameter Optimization

In the construction of an LSI space, there are a number of parameter choices that must be made; for example: number and identity of stopwords, required number of occurrences for a term to be included in the processing, and the number of dimensions for the LSI space. The choices that are made can have a significant impact on overall performance in a specific application [17, 18]. As an example, Figure 3 shows the variation in mean reciprocal rank of 250 pairs of semantically-related terms as a function of the number of dimensions chosen, for a collection^4 of five million documents [19].

^3 Over the years, such simple and direct metrics for quality of an LSI space proved to be very useful in tuning implemented systems. Over a wide range of applications, evaluations using such metrics correlated well with both performance on application-specific tests and with human judgment.
^4 These are not the same documents as those in the collection referenced in Figure 2.

Figure 3. Variation in term similarity ranking as a function of chosen dimensionality

The performance of the systems discussed here benefited greatly from the cumulative experience gained over 20 years regarding choice of effective parameters. In nearly all of the systems developed, at least some testing of parameter choices was carried out that was designed to optimize application performance. This is addressed further in section 4.

3.3. Indexing of Named Entities

In most text applications, named entities constitute items of particular significance. For example, names of people are of fundamental importance in fraud detection. One of the most important factors contributing to the success of the programs described here was the fact that nearly all of them employed entity extraction and markup as a preprocessing step prior to creating the LSI spaces involved. Typically, names of persons, locations, and organizations were extracted, but in some cases more entity types were treated. In the LSI preprocessing, occurrences of a name such as John Kennedy were marked up as p_john_kennedy_p, and similarly for other entity types. (This markup was stripped out prior to presenting results to users.)

With classical LSI, users can create queries of the form: What terms are most closely associated with a given term? In contrast, with entity markup prior to creating the space, an interface can be implemented that allows users to enter queries such as: What people are most closely associated with a given entity or activity? Such queries are much more natural in most applications. The implementation of capabilities to effectively execute those types of queries was a major factor contributing to both operational efficiency and user satisfaction for the systems described here.

Even in the limited number of cases where entities themselves were not of prime importance for users, entity markup prior to creating the LSI space was of great importance for improving the representational fidelity of the space.
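The markup step described above can be performed with a small preprocessing pass over the text. The sketch below follows the p_john_kennedy_p convention shown for persons; the prefixes used for location and organization types are assumptions for illustration, as the paper only shows the person form:

```python
import re

def mark_entities(text, entities):
    """Replace each extracted entity mention with a single token,
    e.g. 'John Kennedy' -> 'p_john_kennedy_p', so the entity is
    treated as one textual unit when the LSI space is built.
    `entities` maps surface form -> type prefix ('p', 'l', 'o' are
    assumed labels for person/location/organization)."""
    # Replace longer mentions first so 'John Kennedy' wins over 'John'.
    for surface in sorted(entities, key=len, reverse=True):
        tag = entities[surface]
        token = f"{tag}_{surface.lower().replace(' ', '_')}_{tag}"
        text = re.sub(re.escape(surface), token, text)
    return text
```

The entity extractor itself (producing the surface-form dictionary) is assumed to be a separate, upstream component.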
In most text collections, failure to treat named entities as textual units will create vast numbers of spurious associations. For example, the common English given name John may be a component of hundreds of distinct person names. Classical LSI will conflate all of the occurrences of John, generating erroneous correlations in the LSI space produced. In many current LSI applications, the text collections being addressed contain millions to tens of millions of named entities. Failing to treat these entities as textual units when building an LSI space for such applications would yield millions of distortions of relations in the space.

3.4. Dealing with Phrases

Many LSI applications involve retrieval of information of interest based on queries formed by users. It is well-known that, in many instances, the use of phrases in queries can significantly aid in expressing a user's information needs. Historically, one frequently-cited criticism of classical LSI was that it did not provide a viable mechanism for dealing with phrases in queries. It was felt that use of phrases required identification of all phrases of interest prior to creating the LSI space, so that those phrases could be treated as terms in the indexing process. This is a problem in that there are a very large number of phrases in a text collection of any significant size. Most of the candidate phrases will never be employed by users. Moreover, indexing of most phrases will not significantly improve the representational fidelity of the LSI space.

We eventually found a two-part solution to this problem. In order to incorporate phrases that would improve the representational fidelity of the space, we employed the following procedure:
• Using a highly productive phrase generation technique, such as RAKE [20], generate a large set of candidate phrases for the collection of interest.
• Create an initial LSI index for the collection, with no attempt to extract phrases.
• For each candidate phrase, create an approximate LSI vector by taking a weighted average of the representation vectors for the documents that contain that phrase. (This is the folding-in technique of classical LSI applied to terms [1].)
• Compare the approximate vector for the phrase with a vector created by simply combining the terms of the candidate phrase as an LSI query.
• Create a final LSI space, treating as textual units the candidate phrases which have the greatest distance (smallest cosine) between the approximation vector and the query vector.

In order to ensure that users could employ arbitrary phrases in searches, we developed a technique that allowed use of phrases in LSI queries even for LSI spaces where phrases have not been indexed. The technique is described in detail in [21]. Table 1 shows the results from applying this technique in searching a collection of 1.6 million news articles using the query rare earth element.

The column labeled NONE shows the ranked query results (closest terms) when no phrase processing is applied. In this case, since the terms rare and element occur in diverse contexts, the term earth has the most significant effect on the results. The results are completely dominated by celestial references; clearly not what a user would desire.

The column labeled PRE-PROCESSED shows the results for the same collection when rare earth element was marked up as a phrase and treated as a textual unit in creating the LSI space. The results are as expected: primarily names of rare earth elements and those of people and organizations associated with processing of rare earth elements.

Table 1. Comparison of pre-indexed and ad hoc phrase processing

The column labeled AD HOC shows the results when the term folding approach of [21] is applied to the LSI space where there was no initial phrase processing. The results are quite close to those obtained for the case where the phrase was indexed (60% agreement for the top ten terms in a collection comprising 1.5 million terms). The adoption of this ad hoc phrase query process in systems described here resulted in a major improvement in user satisfaction.
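The phrase-selection part of the procedure in section 3.4 can be sketched as follows. For simplicity the weighted average is reduced to a plain mean, and all data structures and names are illustrative rather than the deployed implementation:

```python
import numpy as np

def cosine(a, b):
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return float(a @ b / (na * nb)) if na > 0 and nb > 0 else 0.0

def select_phrases(candidates, doc_vecs, term_vecs, containing, keep):
    """For each candidate phrase, compare (a) an approximate vector
    averaged from the vectors of documents containing the phrase with
    (b) a vector formed by combining its constituent term vectors as
    an LSI query. Phrases where the two agree least (smallest cosine)
    are the least compositional, so indexing them as textual units
    adds the most representational fidelity."""
    scored = []
    for phrase in candidates:
        approx = np.mean([doc_vecs[j] for j in containing[phrase]], axis=0)
        query = np.sum([term_vecs[t] for t in phrase.split()], axis=0)
        scored.append((cosine(approx, query), phrase))
    scored.sort()                     # ascending cosine: keep the lowest
    return [p for _, p in scored[:keep]]
```

Phrases whose meaning is already well predicted by their constituent terms score a high cosine and are skipped, matching the observation that indexing most phrases does not improve the space.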
3.5. User Aids

Over the years, with the dramatic growth in the size of the text collections being addressed, it became increasingly important to provide aids for users in areas such as creating queries, identifying topics, interpreting results, and automating repetitive tasks. The semantic comparison capabilities of LSI allowed a wide variety of such aids to be implemented. Some aids were very simple to implement, but still yielded significant gains in operational efficiency and user satisfaction. For example, a popup display of the most closely associated terms when a user moused over a given term was of great help in determining the meaning of newly-encountered terms such as acronyms and technical terminology. Most of the users of the systems were knowledge workers, but typically did not have technical backgrounds. Providing them with immediate contextual information regarding technical terms greatly aided them in understanding the material that they were working with.

Over time, such aids became more complex. One that proved very popular was novelty detection. Within some systems, tracking capabilities were implemented to provide an indication of what information a given user already was aware of. This included, for example, monitoring what documents (or other text objects) the user had previously displayed, saved, printed, or incorporated into work products. Then, in response to a query from that user, the results could be displayed not just in relevance order, but in the order of those results that were relevant but at the same time were least similar to those previously seen. In many applications there is significant redundancy in the content of items collected. In applications with high information redundancy, the novelty detection feature greatly improved both efficiency of operations and user satisfaction.

Other user aids that proved to enhance both operational efficiency and user satisfaction included:
• Generation of document summaries tailored to users' interests.
• Automated generation of graphs showing relationships among entities.
• Automated tracking of topic threads in long documents and sets of documents.
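The novelty-detection ordering described in section 3.5 amounts to a simple re-ranking of relevant results by dissimilarity to previously seen material. A minimal sketch, in which the relevance floor and data structures are assumptions for illustration:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def novelty_order(results, seen_vecs, min_relevance=0.3):
    """Keep results whose query relevance clears a floor, then order
    them so items LEAST similar to anything the user has already seen
    come first. `results` holds (doc_id, relevance, vector) triples;
    `seen_vecs` holds LSI vectors of items the user has displayed,
    saved, printed, or incorporated into work products."""
    relevant = [r for r in results if r[1] >= min_relevance]
    def max_seen_similarity(vec):
        return max((cosine(vec, s) for s in seen_vecs), default=0.0)
    return sorted(relevant, key=lambda r: max_seen_similarity(r[2]))
```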
3.6. Secure Information Sharing

The representation for a given term in an LSI space is a single point in a vector space that is derived from what may be hundreds of occurrences, even for a relatively rare term. Similarly, the representation for a given document is derived from large numbers of occurrences of multiple terms. Even in classical LSI spaces, it is impossible to work backwards to reconstruct the actual wording of documents corresponding to extant document vectors. With slight modifications to the index creation process, it can be made impossible to determine even which words occurred in which documents. These characteristics enable the use of information in a secure background mode.

In many applications there is relevant data available that cannot be directly shared with users for proprietary, legal, or privacy reasons. In such cases, these sensitive documents can be processed so that the results of operations in the LSI space for the application can be enhanced by the contextual implications of the sensitive data, without risk of disclosure of specific sensitive data items themselves.

Experience in using LSI in a secure background mode has shown that even a small number of documents used in this manner can have great leverage. In some representative cases, data treated in background mode has constituted less than 1% of the total data being examined. Nevertheless, significant gains in application efficiency still have been achieved [22].

4. Beneficial Implementation Practices

Over the years, a number of LSI implementation practices evolved that significantly improved the quality and efficiency of the systems developed.

Perhaps the most significant implementation approach adopted was to use analyses in the LSI spaces themselves to select effective values for all of the key processing parameters for an application. Typically, we used the following approach:
1. Build an initial LSI space from application-relevant data, using standard parameter values and processing choices.
2. Using a small, representative test set, carry out analyses in this initial space to determine the most effective values for the parameters and choices.
3. Re-build the LSI spaces using those parameters and processing choices.

For example, using a test set representative of an application being addressed, it is possible to make an effective choice of the number of dimensions to employ in creating the LSI space for that application. As long as the initial space employs a number of dimensions higher than optimal, the requisite tests can be carried out, and an optimal value found, with vectors from a single initial LSI space.

For other parameter choices, a new LSI space must be created to test each value. For example, in many applications, terms are only included in the LSI processing if they occur at least M times in the collection and/or in at least N different documents. Pruning the term set in this manner often can significantly improve the representational fidelity of the space. A separate LSI space must be generated in order to test each prospective pruning value. However, only a limited range of values must be tried. In most of the applications here, values of M and N in the range of two to five turned out to be optimal. It should be noted that pruning typically was not applied to named entities. In many applications, the occurrence of a name may be of significance even if it occurs only once.

The dramatic reduction in the time required to create LSI spaces made it increasingly feasible to create trial LSI spaces for optimization testing, even for parameters that required multiple such spaces to be created. For very large collections, optimization analyses typically can be carried out sufficiently effectively using LSI spaces built from a randomly selected subset of the overall collection.

We also employed iterative refinement of LSI spaces to mitigate the effects of errors in training data for categorization applications. This approach led to significant improvement in categorization accuracy. The technique has broad applicability for noise mitigation in LSI applications [23].

The computer employed to carry out analytic operations in an LSI space does not have to be the same computer on which the LSI space is created. It often proved useful to create LSI spaces on a large server and then distribute the vector spaces created there to smaller devices for use. We also found that distribution of shared LSI spaces can be a powerful enabler for collaborative work.

Sometimes a conceptual search will retrieve results that do not appear to be appropriate. Users may find this disconcerting. However, these often can be the most important results: ones that indicate a gap in user understanding of some aspect of the problem at hand. In multiple systems we found it useful to highlight terms and passages in retrieved documents based on semantic similarity to the user's query. Users found this useful in trying to determine why a surprising result was obtained.

Other implementation principles that proved effective included:
• Duplicate and near-duplicate documents in a collection artificially magnify associated term relationships. LSI comparisons between documents of a collection can be used very effectively to eliminate redundant documents.
• For some applications, removal of "boilerplate" text can greatly enhance performance. For example, many legal documents contain formulaic blocks of text that appear on many documents. Appearance of such repeated text creates undesired associations (i.e., ones that are not related to the content of the documents).
• In many instances it is useful to use LSI similarity comparisons to decompose long documents into conceptually cohesive segments, which are then indexed as individual items. This makes it much easier to identify information on subsidiary topics.
• For large applications, parallel processing approaches such as MapReduce and more recent techniques can be employed very effectively for text preprocessing tasks.
• In analyses involving the LSI vectors of large collections, use of GPUs for the cosine comparisons can provide a dramatic speedup compared to using typical CPUs.
• In many applications, entity-driven analytic processes can be far more efficient than document-driven ones.
• Monitoring of user actions often can provide training data that can be employed to refine the LSI spaces employed and to yield improved accuracy of analytic operations. One particularly effective use of this technique was in continuously refining textual representations of user interests.
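The single-initial-space dimensionality test described in section 4 works because SVD solutions are nested: the best rank-k approximation uses the first k singular triplets of any higher-dimensional decomposition, so candidate dimensionalities can be scored by simply truncating one set of vectors. A sketch, in which the scoring function is an application-specific placeholder:

```python
import numpy as np

def vectors_at_k(U, s, k):
    """Term vectors at dimensionality k, obtained by truncating one
    higher-dimensional SVD rather than recomputing the decomposition."""
    return U[:, :k] * s[:k]

def pick_dimensionality(U, s, candidate_ks, score_fn):
    """Score each candidate k with an application-specific test (e.g.
    MRR over known-related term pairs) and return the best-scoring k."""
    best_k, best_score = None, -np.inf
    for k in candidate_ks:
        score = score_fn(vectors_at_k(U, s, k))
        if score > best_score:
            best_k, best_score = k, score
    return best_k

# Usage with a synthetic matrix and a placeholder scorer that happens
# to prefer k = 3; a real scorer would run retrieval-quality tests.
rng = np.random.default_rng(0)
U, s, Vt = np.linalg.svd(rng.standard_normal((8, 6)), full_matrices=False)
best = pick_dimensionality(U, s, [2, 3, 4], lambda tv: -abs(tv.shape[1] - 3))
```

Parameters such as the M/N pruning thresholds cannot be tested this way and require a separate trial space per value, as noted above.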
5. Interesting Results and Surprises

Over the past 20 years there were a number of aspects of LSI that either came as a surprise or were unexpectedly useful.

When the work described here began, it was generally believed that LSI did not scale well. Academic papers of the time estimated that the time required to build an LSI space grew as at least the square of the number of documents addressed.^5 We were pleasantly surprised that actual measurements showed that the growth was close to linear [16].

^5 Early estimates tended to overlook one or more of three key factors. First, LSI requires calculation of only the first few hundred singular values and associated vectors, not a complete SVD of the entire term-document matrix. Second, term-document matrices are extremely sparse. For large collections, often only one in ten thousand to one in one hundred thousand entries is non-zero. Finally, the time required to read and preprocess the text being indexed generally is greater than the time required to carry out the SVD.

Indications of semantic similarity as provided by LSI turned out to be a remarkably good proxy for similarity judgments generated by people. In 2007 a review of 30 studies compared LSI and human judgment in 16 real-world text processing tasks ranging from synonym matching to psychological assessment. LSI performed as well as, or better than, humans in 51% of the cases [5]. In more recent work, covering over 100 studies and 37 applications, LSI performed as well as, or better than, humans in 56% of the cases [24]. Of significance is the fact that all of these studies employed straightforward implementations of LSI. None of the advanced techniques described in this paper were used in any of the analyzed studies. Moreover, the number of documents used to create the spaces was very small, having a median value of only 1700. With larger collections, LSI performance in the reviewed studies likely would have been significantly higher. In the 63 information systems considered here, in the few cases where human and LSI performance could be directly compared, LSI results typically were as good as, or in some cases somewhat better than, average human performance.

One surprise was the huge effect that treating named entities as textual units produced. For collections of text such as news articles, the representational fidelity of the spaces produced was dramatically improved. Having the entities available also set us on a path of implementing ever more sophisticated entity-driven analysis capabilities. In most applications, entity-driven processes turned out to be far more efficient than document-driven ones.

Many of the applications addressed were complicated by the fact that the text items of interest contained multiple variants of names of individuals. These differences came from misspellings, phonetic renderings, transliteration differences, and other sources. Because of these variations, many relationships of interest were suppressed. One of the early features that we implemented was a name variant analyzer. For any given name it combined eight methods for generating candidate variants and then used comparisons in the LSI space to select the most relevant ones. This capability turned out to be significantly more effective than the best competing commercial product. Recall was two to three times greater, and confidence ratings for candidate equivalent names turned out to be much more reliable than anticipated [25].

We were surprised by how easy it was to implement ad hoc phrase processing in LSI spaces. (We also were embarrassed by how long it took for us to realize how to do it.)

It was interesting to observe how easily and effectively word senses could be disambiguated using clustering techniques in the LSI spaces [26]. This allowed markup of occurrences of polysemous words in much the same way as was done for named entities, as was described in section 3.3. The disambiguation can be carried out in a trial space and then the marked-up senses of polysemous words treated as separate textual units in creating the final space to be employed. Typically, a point of diminishing returns will be reached after disambiguating only a few thousand to tens of thousands of words. For some applications, word sense disambiguation of general terms did not result in major performance increases. Where disambiguation was of great value, however, was in dealing with person names. In many applications there may be hundreds of people with the same name, and disambiguation is essential. As with phrases, this name resolution feature can be incorporated into an application either in bulk during preprocessing or in an ad hoc fashion at query time.

Applications involving the secure background mode of dealing with sensitive data often involved very small amounts of such data (sometimes less than .01% of the total amount of data). In a number of cases, the extent to which such very small amounts of auxiliary data could improve results was quite remarkable.

Some of the early applications involved text that was produced by optical character recognition (OCR) equipment. LSI turned out to be surprisingly effective in dealing with the many errors produced by OCR devices of that era. In one categorization application, performance degradation only began to be detectable when the OCR error rate reached a level where two out of every three words were corrupted [27].

In cross-lingual applications, it turned out that many languages can be represented in a single LSI space without serious performance degradation. In one case, transitioning from two languages represented in one LSI space to 13 languages resulted in a decline in cross-lingual similarity comparisons of only a few percent [28].

LSI turned out to provide an elegant solution for combining results from diverse information systems when employing federated queries [29].

Combining text with other data types (especially relational, geographic, and image data) often generated unique analytic insights. The combination of such data also supported implementation of highly effective visual analytic interfaces.

6. Recent Developments

Although much has been accomplished over the past twenty years, there are still exciting activities underway involving LSI. Many of these involve implementation of ideas that were originally suggested in a basic form some years ago, but are just now being incorporated into real-world applications. Key examples include:
• Combined analysis of text and relational data [29].
• Implementation of semantic vector space equivalents of Boolean operators [33] and negation [34].
• Enhancement of machine translation capabilities, especially for technical and other specialized subject matter [38].
• Functionality based on analysis of individual LSI vector components^6 [30, 31, 32].
• Use of randomized SVD to dramatically reduce the computational load when addressing very large collections [35, 36, 37].
• Extensive use of LSI in discovery applications, particularly in the area of bioinformatics [12].
• Facilitation of human-robot interaction [39, 40].
• Various AI-related efforts [41, 42].

7. Acknowledgements

I would like to thank the engineers, scientists, software developers, and others from SAIC, Content Analyst Company, and Agilex Technologies Inc. who participated in building the systems reviewed here over the past twenty years. Those individuals took the techniques described here for improving LSI and transformed them from nascent concepts into working code and deployed systems.
^6 The components of a typical LSI vector comprise hundreds of indications of derived relationships. In general, the basis vectors of an LSI space closely relate to concepts, or mixtures of such, within the collection of text being addressed. An LSI vector is thus a complex and information-rich object. To compare two such objects using a single number (such as a cosine) thus ignores a large amount of potentially useful information.

8. References

[1] George W. Furnas, et al., Information retrieval using a singular value decomposition model of latent semantic structure, in: Proceedings of the 11th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 88), May 1988, Grenoble, France, pp. 465-480.
[2] Susan T. Dumais, Latent Semantic Analysis, Annual Review of Information Science and Technology, 38(1), 2004, pp. 188-230. doi: 10.1002/aris.1440380105.
[3] Jerome R. Bellegarda, Latent Semantic Mapping: Principles & Applications, Morgan & Claypool, 2007. doi: https://doi.org/10.2200/S00048ED1V01Y200609SAP003.
[4] Thomas K. Landauer, Danielle S. McNamara, Simon Dennis, and Walter Kintsch, eds., Handbook of Latent Semantic Analysis, Lawrence Erlbaum Associates, 2007.
[5] Roger Bradford, Comparability of LSI and human judgment in text analysis tasks, in: Proceedings, Applied Computing Conference, September 28-30, 2009, Athens, Greece, pp. 359-366.
[6] Susan T. Dumais, George W. Furnas, Thomas K. Landauer, Scott Deerwester, and Richard Harshman, Using latent semantic analysis to improve access to textual information, in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, May 1988, Washington, DC, pp. 281-285.
[7] Tristan Miller, Essay assessment with latent semantic analysis, Journal of Educational Computing Research, 29(4), (2003) 495-512.
[8] Lexis-Nexis, The Evolution of Semantic Search on the Web, 2009. URL: https://www.lexisnexis.co.uk/pdf/brochures/totalpatent-whitepaper.pdf.
[9] Jean Isson, Unstructured Data Analytics: How to Improve Customer Acquisition, Customer Retention, and Fraud Detection and Prevention, John Wiley & Sons, 2018.
[10] Seraina Anagnostopoulou, et al., The impact of online reputation on hotel profitability, International Journal of Contemporary Hospitality Management, September 20, 2019. doi: 10.1108/IJCHM-03-2019-0247.
[11] Wei Dong, et al., The detection of fraudulent financial statements: an integrated language model, in: Proceedings of the 19th Pacific-Asia Conference on Information Systems (PACIS 2014), Article 383.
[12] Roger Bradford, An overview of information discovery using latent semantic indexing, in: Proceedings, International Conference on Computer Science, Applied Mathematics and Applications (ICCSAMA 2017), June 30-July 1, 2017, Berlin, Germany, pp. 153-164.
[13] Hongyu Chen, et al., Effective use of latent semantic indexing and computational linguistics in biological and biomedical applications, Frontiers in Physiology, 4: 8 (2013). doi: 10.3389/fphys.2013.00008.
[14] … Information Networking and Applications, March 22-25, 2011, Biopolis, Singapore, pp. 602-609.
[15] T. Hashimoto, T. Kuboyama and B. Chakraborty, Temporal awareness of changes in afflicted people's needs after East Japan Great Earthquake, in: Proceedings, IEEE International Conference of IEEE Region 10 (TENCON 2013), 1-6. doi: 10.1109/TENCON.2013.6719012.
[16] Roger Bradford, Implementation techniques for large-scale latent semantic indexing applications, in: Proceedings of the 20th ACM International Conference on Information and Knowledge Management (CIKM), October 2011, Glasgow, Scotland, pp. 339-344.
[17] Zhiqiang Cai et al., Impact of corpus size and dimensionality of LSA spaces from Wikipedia articles on AutoTutor answer evaluation, in: Proceedings, 11th International Conference on Educational Data Mining (EDM), Jul 16-20, 2018, Raleigh, NC, pp. 127-136.
[18] Thomas K. Landauer and Susan T. Dumais, A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge, Psychological Review, 104(2), (1997) 211-240.
[19] Roger Bradford, An empirical study of required dimensionality for large-scale latent semantic indexing applications, in: Proceedings of the 17th ACM Conference on Information and Knowledge Management (CIKM 2008), October 19-23, 2008, Napa Valley, CA, pp. 153-162. doi: https://doi.org/10.1145/1458082.1458105.
[20] Stuart Rose, Dave Engel, Nick Cramer, and Wendy Cowley, Automatic keyword extraction from individual documents, Text Mining: Applications and Theory 1 (2010): 1-20.
[21] Roger Bradford, Incorporating ad hoc phrases in LSI queries, in: Proceedings, 6th International Conference on Knowledge Discovery and Information Retrieval, October 21-24, 2014, Rome, Italy, pp. 61-70.
[22] Roger Bradford, Exploiting sensitive …
information in background mode using latent [14] Nguyen Ngoc Chan, Walid Gaaloul, and semantic indexing, in: Proceedings of the Samir Tata, A web service recommender Sixth Workshop on Link Analysis, system using vector space model and latent Counterterrorism and Security, SIAM Data semantic indexing, in: Proceedings, 2011 Mining Conference, April 24-26 2008, IEEE International Conference on Advanced Atlanta, Georgia. [23] Roger Bradford and John Pozniak, A Language and Information (ESSLLI), systematic approach to design of a text August 6-18, Birmingham, UK, pp. 156-166. categorizer, in; Proceedings, 2016 IEEE [34] Dominic Widdows, Orthogonal negation in International Conference on Systems, Man, vector spaces for modelling word-meanings and Cybernetics (SMC), October 9-12, 2016, and document retrieval, in: Proceedings of Budapest, Hungary, pp. 509-514. the 41st Annual Meeting of the Association [24] Roger Bradford, Comparability of LSI and for Computational Linguistics, July 7, 2003, human judgment in text analysis tasks – an Volume 1, pp. 136-143. update. Draft, Mar 2, 2021. [35] Nathan Halko, Per-Gunnar Martinsson, and [25] Roger Bradford, Use of latent semantic Joel Tropp, Finding structure with indexing to identify name variants in large randomness: probabilistic algorithms for data collections, in: Proceedings, 2013 IEEE constructing approximate matrix International Conference on Intelligence and decompositions, SIAM Review, 2(3) (2011): Security Informatics (ISI 2013), Seattle, 217-288. WA, 27-32. doi: 10.1109/ISI.2013.6578781. [36] Ming Gu, Subspace iteration randomization [26] Roger Bradford, Word sense and singular value problems, SIAM Journal disambiguation, 2008. Patent No. 7,415,462, on Scientific Computing, 37(3) (2015): Filed January 20, 2006, Issued August 19, 1139-1173. 2009. 
[37] Per-Gunnar Martinsson, Randomized [27] Anthony Zukas and Robert Price, Document methods for matrix computations, in: The categorization using latent semantic Mathematics of Data, IAS/Park City indexing, in: Proceedings, Fifth Annual Mathematics Series, 25(4) 2018, pp. 187- Symposium on Document Image 231. Understanding Technology (SDIUT), April, [38] Roger Bradford, Machine translation using 2003, Greenbelt, MD, pp. 87-91. vector space representations, 2010. Patent [28] Roger Bradford and John Pozniak, No. 7,765,098, Filed April 24, 2006, Issued Combining modern machine translation July 27, 2010. software with LSI for cross-lingual [39] Phoebe Liu, Dylan Glas, Takayuki Kanda, information processing, in: Proceedings, and Hiroshi Ishiguro, Data-driven HRI: 11th International Conference on Learning social behaviors by example from Information Technology: New Generations human–human interaction, IEEE (ITNG), April 7-9, 2014, Las Vegas, NV, 65- Transactions on Robotics, 32, no. 4, (2016): 72. doi: 10.1109/ITNG.2014.52. 988-1008. [29] Roger Bradford, Federated queries and [40] Francesco Agostaro, et al., A conversational combined text and relational data, 2006. agent based on a conceptual interpretation of Patent Application No. 11434749, Filed May a data driven semantic space, in: 17, 2006. Proceedings, Congress of the Italian [30] Weizhong Zhu, and Chaomei Chen, Association for Artificial Intelligence, Storylines: Visual exploration and analysis in September 21–23, 2005, Milan, Italy, pp. latent semantic spaces. Computers & 381-392. Graphics, 31(3) (2007): 338-349. [41] Robert Speer, Catherine Havasi, and Henry [31] Ricardo Olmos, et al, Transforming selected Lieberman, AnalogySpace: Reducing the concepts into dimensions in latent semantic dimensionality of common-sense analysis, Discourse Processes, 51(5-6) knowledge, in: Proceedings of the Twenty- (2004): 494-510. 
Third AAAI Conference on Artificial [32] Anna Sidorova, Nicholas Evangelopoulos, Intelligence, Jul 13, 2008, vol. 8, pp. 548- Joseph S. Valacich, and Thiagarajan 553. Ramakrishnan, Uncovering the intellectual [42] Trevor Cohen, Brett Blatter, and Vimla Patel, core of the information systems discipline. Simulating expert clinical comprehension: MIS Quarterly (2008): 467-482. Adapting latent semantic analysis to [33] Preslav Nakov, Getting better results with accurately extract clinical concepts from latent semantic indexing, in: Proceedings of psychiatric narrative, Journal of Biomedical the Students Presentations at the 12th Informatics 41, no. 6 (2008): 1070-1087. European Summer School in Logic,