Lessons Learned from 20 Years of Implementing LSI Applications

Roger Bradford
Maxim Analytics, Great Falls, VA 22066 USA

Abstract
This paper summarizes lessons learned, over a period of 20 years, from implementing information systems employing the technique of latent semantic indexing (LSI). The data presented is drawn from 63 projects undertaken over the period 1999 through 2019. Over that period the projects increased in scale from collections of hundreds of thousands of documents to ones involving hundreds of millions of documents. They also increased in sophistication, from simple search and retrieval systems to ones focused on information discovery and automated alerting. This paper summarizes some of the key developments in technology and techniques that enabled those advances in the size and sophistication of the applications. The objective of this paper is to share insights gained from these past two decades of system implementation experience.

Keywords
Latent Semantic Indexing, LSI, LSI applications, LSA, lessons learned

DESIRES 2021 - 2nd International Conference on Design of Experimental Search & Information REtrieval Systems, September 15-18, 2021, Padua, Italy
EMAIL: rbradford@cox.net
ORCID: 0000-0003-1750-3125
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

1. Latent Semantic Indexing

The technique of latent semantic indexing (LSI) was invented at Bellcore in the late 1980s [1]. The original intent was to provide improved capabilities for retrieval of text. The technique has, however, proven to be useful in analysis of a wide variety of information types [2, 3]. As applied to a collection of documents, the LSI algorithm consists of the following primary steps [1, 4]:
1. A term-document matrix is formed, and (typically) local and global weights are applied to the elements of this matrix.
2. Singular value decomposition (SVD) is used to reduce this matrix to a product of three matrices, one of which is diagonal in the singular values of the original matrix.
3. Dimensionality is reduced by deleting all but the k largest singular values, together with the corresponding columns of the other two matrices.
4. This truncation process provides a basis for generating a k-dimensional vector space. Both terms and documents are represented by k-dimensional vectors in this vector space.
5. New queries, terms, and documents can be represented in the space by a process known as folding-in, which extrapolates from known vectors.
6. The semantic similarity of any two objects represented in the space is reflected by the proximity of their representation vectors, generally using a cosine measure.

Experience from a broad range of academic, industrial, and governmental testing has shown that proximity in an LSI space is a remarkably good proxy for semantic relatedness as judged by humans [5].

Early commercial applications of LSI included identification of people with specific expertise [6], detection of spam in e-mails [3], and essay scoring [7]. Over time, the technique found wide application in areas such as patent search and analysis [8], résumé matching [9], customer survey analysis [10], and fraud detection [11]. It became the dominant paradigm in electronic document discovery [12]. More recently it has been used in bioinformatics discovery [13], recommender systems [14], and social media analysis [15].
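The six-step procedure outlined in section 1 can be sketched in a few lines of Python. This is a toy illustration (tiny corpus, a simple log local weight, k = 2), not the configuration of any of the production systems described in this paper:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy three-document collection; real collections here run to millions of docs.
docs = ["shipment of gold damaged in a fire",
        "delivery of silver arrived in a silver truck",
        "shipment of gold arrived in a truck"]
vocab = sorted({w for d in docs for w in d.split()})
idx = {w: i for i, w in enumerate(vocab)}

# Step 1: term-document matrix with a simple log local weight.
A = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        A[idx[w], j] += 1.0
A = np.log1p(A)

# Steps 2-4: SVD, truncated to the k largest singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
term_vecs = U[:, :k] * s[:k]   # one k-dimensional vector per term
doc_vecs = Vt[:k, :].T         # one k-dimensional vector per document

# Step 5: fold a new query into the space (q^T U_k diag(1/s_k)).
def fold_in(text):
    q = np.zeros(len(vocab))
    for w in text.split():
        if w in idx:
            q[idx[w]] += 1.0
    return np.log1p(q) @ U[:, :k] / s[:k]

# Step 6: rank documents by cosine proximity to the folded-in query.
query = fold_in("gold silver truck")
ranking = sorted(range(len(docs)), key=lambda j: -cosine(query, doc_vecs[j]))
```

At realistic scale the dense decomposition above would be replaced by a sparse, truncated SVD solver; only the first few hundred singular triplets are ever needed.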
2. Structure of the Paper

This paper summarizes lessons learned from 63 information system implementation projects that the author took part in over the period 1999 through 2019. Each of these projects employed LSI as a key technical component. The systems addressed a wide range of applications for both commercial and government customers. Over this 20-year period, the systems increased significantly in both size and sophistication.^2 The earliest systems utilized the conceptual search, clustering, and categorization functionality of LSI to implement relatively simple capabilities for such tasks as customer survey analysis and matching of résumés with job openings. Extrapolating from experience gained, these fundamental capabilities subsequently were applied in a more abstract fashion to higher-level considerations in applications such as fraud detection and patent prior art analysis. Successive refinement of tools and techniques eventually enabled advanced applications incorporating features such as novel information detection and secure information sharing.

^2 Other projects undertaken in this time frame applied LSI to data other than text. However, only systems that focused on text are addressed here.

Section 3 of this paper provides a brief overview of the principal improvements in technologies and techniques that enabled solution of progressively larger and more complex problems using LSI. Section 4 describes several implementation principles that have proven useful in building LSI-based information systems. Section 5 summarizes some particularly interesting results and surprises that were encountered in the course of building these systems. Section 6 concludes with brief comments on capabilities being incorporated into more recent LSI applications.

3. Enabling Advances in Technology and Technique

3.1. Scaling

Computers in the early days of LSI were not well-suited for SVD computation and large-scale matrix manipulation, which limited the scale of LSI applications. However, hardware improvements over the past twenty years have completely changed this situation [16]. Figure 1 shows the dramatic reduction in the observed index creation times in nine comparable projects over the period 2002-2016. The times shown are those required to build an LSI index for one million documents (averaging several kilobytes in size) at 300 dimensions, using computers typically employed in applications in the given years.

Figure 1. Decline in time required to create an LSI space for a 1 million document collection

The early points on the curve correspond to index creation times from projects using clusters of processors. Subsequent points are from projects primarily employing mid-range servers. Of note, however, the last point shown is for a laptop computer.

The dramatic decline in time required to create an LSI index progressively enabled a wider variety of applications. At the present time, LSI applications involving collections of tens of millions of documents are routine, and multiple applications have been implemented that encompass full LSI indexing of hundreds of millions of documents.

Advances in technology enabled improvements not only in scale, but also in the fidelity of the generated LSI spaces in representing real-world semantic associations. For LSI, as collection size increases, the larger number of occurrences of individual terms diminishes the effects of idiosyncratic occurrences of those terms in specific documents. This improves overall representational fidelity, as shown in Figure 2. The graph displays the variation in mean reciprocal rank (MRR) of 250 pairs of terms having known real-world semantic association,^3 as a function of the size of the collection. As indicated by the trend line, the increase in representational fidelity with collection size is approximately logarithmic. Of note is the fact that over 80% of all published literature on LSI deals with collection sizes smaller than the initial point shown (17 thousand documents) and 97% deals with collection sizes less than the second point shown (93 thousand documents).
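The MRR figure of merit used in Figures 2 and 3 can be computed directly from ranked neighbor lists. A minimal sketch follows; the term pairs and neighbor lists are invented purely for illustration:

```python
def mean_reciprocal_rank(pairs, neighbors):
    """Average the reciprocal rank at which each known associate
    appears in its probe term's ranked-neighbor list (0 if absent)."""
    total = 0.0
    for probe, associate in pairs:
        ranked = neighbors.get(probe, [])
        if associate in ranked:
            total += 1.0 / (ranked.index(associate) + 1)
    return total / len(pairs)

# Example: 'treaty' is the 2nd neighbor of 'nato', 'optics' the 1st of 'laser'.
pairs = [("nato", "treaty"), ("laser", "optics")]
neighbors = {"nato": ["alliance", "treaty", "summit"],
             "laser": ["optics", "beam"]}
mrr = mean_reciprocal_rank(pairs, neighbors)   # (1/2 + 1/1) / 2 = 0.75
```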
Figure 2. Increase in semantic representation fidelity with collection size

Several of the developed applications included the identification of specific patterns of activities and relationships as one of the system objectives. In many cases, the distinctions between patterns of interest and normal patterns were quite subtle. In general, the larger the data collection, the more effective LSI was in providing indicators of the existence of patterns of interest. Over time, the continuing growth in the size of collections that could be addressed facilitated implementation of increasingly sophisticated analytic operations.

3.2. Parameter Optimization

In the construction of an LSI space, there are a number of parameter choices that must be made; for example: number and identity of stopwords, required number of occurrences for a term to be included in the processing, and the number of dimensions for the LSI space. The choices that are made can have a significant impact on overall performance in a specific application [17, 18]. As an example, Figure 3 shows the variation in mean reciprocal rank of 250 pairs of semantically-related terms as a function of the number of dimensions chosen, for a collection^4 of five million documents [19].

^3 Over the years, such simple and direct metrics for quality of an LSI space proved to be very useful in tuning implemented systems. Over a wide range of applications, evaluations using such metrics correlated well with both performance on application-specific tests and with human judgment.
^4 These are not the same documents as those in the collection referenced in Figure 2.

Figure 3. Variation in term similarity ranking as a function of chosen dimensionality

The performance of the systems discussed here benefited greatly from the cumulative experience gained over 20 years regarding choice of effective parameters. In nearly all of the systems developed, at least some testing of parameter choices was carried out that was designed to optimize application performance. This is addressed further in section 4.

3.3. Indexing of Named Entities

In most text applications, named entities constitute items of particular significance. For example, names of people are of fundamental importance in fraud detection. One of the most important factors contributing to the success of the programs described here was the fact that nearly all of them employed entity extraction and markup as a preprocessing step prior to creating the LSI spaces involved. Typically, names of persons, locations, and organizations were extracted, but in some cases more entity types were treated. In the LSI preprocessing, occurrences of a name such as John Kennedy were marked up as p_john_kennedy_p, and similarly for other entity types. (This markup was stripped out prior to presenting results to users.)

With classical LSI, users can create queries of the form: What terms are most closely associated with a given term? In contrast, with entity markup prior to creating the space, an interface can be implemented that allows users to enter queries such as: What people are most closely associated with a given entity or activity? Such queries are much more natural in most applications. The implementation of capabilities to effectively execute those types of queries was a major factor contributing to both operational efficiency and user satisfaction for the systems described here.

Even in the limited number of cases where entities themselves were not of prime importance for users, entity markup prior to creating the LSI space was of great importance for improving the representational fidelity of the space.
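The markup step described above can be performed with a small preprocessing pass over the text. The sketch below follows the p_john_kennedy_p convention shown for persons; the prefixes used for location and organization types are assumptions for illustration, as the paper only shows the person form:

```python
import re

def mark_entities(text, entities):
    """Replace each extracted entity mention with a single token,
    e.g. 'John Kennedy' -> 'p_john_kennedy_p', so the entity is
    treated as one textual unit when the LSI space is built.
    `entities` maps surface form -> type prefix ('p', 'l', 'o' are
    assumed labels for person/location/organization)."""
    # Replace longer mentions first so 'John Kennedy' wins over 'John'.
    for surface in sorted(entities, key=len, reverse=True):
        tag = entities[surface]
        token = f"{tag}_{surface.lower().replace(' ', '_')}_{tag}"
        text = re.sub(re.escape(surface), token, text)
    return text
```

The entity extractor itself (producing the surface-form dictionary) is assumed to be a separate, upstream component.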
In most text collections, failure to treat named entities as textual units will create vast numbers of spurious associations. For example, the common English given name John may be a component of hundreds of distinct person names. Classical LSI will conflate all of the occurrences of John, generating erroneous correlations in the LSI space produced. In many current LSI applications, the text collections being addressed contain millions to tens of millions of named entities. Failing to treat these entities as textual units when building an LSI space for such applications would yield millions of distortions of relations in the space.

3.4. Dealing with Phrases

Many LSI applications involve retrieval of information of interest based on queries formed by users. It is well-known that, in many instances, the use of phrases in queries can significantly aid in expressing a user's information needs. Historically, one frequently-cited criticism of classical LSI was that it did not provide a viable mechanism for dealing with phrases in queries. It was felt that use of phrases required identification of all phrases of interest prior to creating the LSI space, so that those phrases could be treated as terms in the indexing process. This is a problem in that there are a very large number of phrases in a text collection of any significant size. Most of the candidate phrases will never be employed by users. Moreover, indexing of most phrases will not significantly improve the representational fidelity of the LSI space.

We eventually found a two-part solution to this problem. In order to incorporate phrases that would improve the representational fidelity of the space, we employed the following procedure:
• Using a highly productive phrase generation technique, such as RAKE [20], generate a large set of candidate phrases for the collection of interest.
• Create an initial LSI index for the collection, with no attempt to extract phrases.
• For each candidate phrase, create an approximate LSI vector by taking a weighted average of the representation vectors for the documents that contain that phrase. (This is the folding-in technique of classical LSI applied to terms [1].)
• Compare the approximate vector for the phrase with a vector created by simply combining the terms of the candidate phrase as an LSI query.
• Create a final LSI space, treating as textual units the candidate phrases which have the greatest distance (smallest cosine) between the approximation vector and the query vector.

In order to ensure that users could employ arbitrary phrases in searches, we developed a technique that allowed use of phrases in LSI queries even for LSI spaces where phrases have not been indexed. The technique is described in detail in [21]. Table 1 shows the results from applying this technique in searching a collection of 1.6 million news articles using the query rare earth element.

The column labeled NONE shows the ranked query results (closest terms) when no phrase processing is applied. In this case, since the terms rare and element occur in diverse contexts, the term earth has the most significant effect on the results. The results are completely dominated by celestial references; clearly not what a user would desire.

The column labeled PRE-PROCESSED shows the results for the same collection when rare earth element was marked up as a phrase and treated as a textual unit in creating the LSI space. The results are as expected: primarily names of rare earth elements and those of people and organizations associated with processing of rare earth elements.

Table 1. Comparison of pre-indexed and ad hoc phrase processing

The column labeled AD HOC shows the results when the term folding approach of [21] is applied to the LSI space where there was no initial phrase processing. The results are quite close to those obtained for the case where the phrase was indexed (60% agreement for the top ten terms in a collection comprising 1.5 million terms). The adoption of this ad hoc phrase query process in systems described here resulted in a major improvement in user satisfaction.
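The phrase-selection part of the procedure in section 3.4 can be sketched as follows. For simplicity the weighted average is reduced to a plain mean, and all data structures and names are illustrative rather than the deployed implementation:

```python
import numpy as np

def cosine(a, b):
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return float(a @ b / (na * nb)) if na > 0 and nb > 0 else 0.0

def select_phrases(candidates, doc_vecs, term_vecs, containing, keep):
    """For each candidate phrase, compare (a) an approximate vector
    averaged from the vectors of documents containing the phrase with
    (b) a vector formed by combining its constituent term vectors as
    an LSI query. Phrases where the two agree least (smallest cosine)
    are the least compositional, so indexing them as textual units
    adds the most representational fidelity."""
    scored = []
    for phrase in candidates:
        approx = np.mean([doc_vecs[j] for j in containing[phrase]], axis=0)
        query = np.sum([term_vecs[t] for t in phrase.split()], axis=0)
        scored.append((cosine(approx, query), phrase))
    scored.sort()                     # ascending cosine: keep the lowest
    return [p for _, p in scored[:keep]]
```

Phrases whose meaning is already well predicted by their constituent terms score a high cosine and are skipped, matching the observation that indexing most phrases does not improve the space.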
3.5. User Aids

Over the years, with the dramatic growth in the size of the text collections being addressed, it became increasingly important to provide aids for users in areas such as creating queries, identifying topics, interpreting results, and automating repetitive tasks. The semantic comparison capabilities of LSI allowed a wide variety of such aids to be implemented. Some aids were very simple to implement, but still yielded significant gains in operational efficiency and user satisfaction. For example, a popup display of the most closely associated terms when a user moused over a given term was of great help in determining the meaning of newly-encountered terms such as acronyms and technical terminology. Most of the users of the systems were knowledge workers, but typically did not have technical backgrounds. Providing them with immediate contextual information regarding technical terms greatly aided them in understanding the material that they were working with.

Over time, such aids became more complex. One that proved very popular was novelty detection. Within some systems, tracking capabilities were implemented to provide an indication of what information a given user already was aware of. This included, for example, monitoring what documents (or other text objects) the user had previously displayed, saved, printed, or incorporated into work products. Then, in response to a query from that user, the results could be displayed not just in relevance order, but in the order of those results that were relevant but at the same time were least similar to those previously seen. In many applications there is significant redundancy in the content of items collected. In applications with high information redundancy, the novelty detection feature greatly improved both efficiency of operations and user satisfaction.

Other user aids that proved to enhance both operational efficiency and user satisfaction included:
• Generation of document summaries tailored to users' interests.
• Automated generation of graphs showing relationships among entities.
• Automated tracking of topic threads in long documents and sets of documents.
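The novelty-detection ordering described in section 3.5 amounts to a simple re-ranking of relevant results by dissimilarity to previously seen material. A minimal sketch, in which the relevance floor and data structures are assumptions for illustration:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def novelty_order(results, seen_vecs, min_relevance=0.3):
    """Keep results whose query relevance clears a floor, then order
    them so items LEAST similar to anything the user has already seen
    come first. `results` holds (doc_id, relevance, vector) triples;
    `seen_vecs` holds LSI vectors of items the user has displayed,
    saved, printed, or incorporated into work products."""
    relevant = [r for r in results if r[1] >= min_relevance]
    def max_seen_similarity(vec):
        return max((cosine(vec, s) for s in seen_vecs), default=0.0)
    return sorted(relevant, key=lambda r: max_seen_similarity(r[2]))
```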
3.6. Secure Information Sharing

The representation for a given term in an LSI space is a single point in a vector space that is derived from what may be hundreds of occurrences, even for a relatively rare term. Similarly, the representation for a given document is derived from large numbers of occurrences of multiple terms. Even in classical LSI spaces, it is impossible to work backwards to reconstruct the actual wording of documents corresponding to extant document vectors. With slight modifications to the index creation process, it can be made impossible to determine even which words occurred in which documents. These characteristics enable the use of information in a secure background mode.

In many applications there is relevant data available that cannot be directly shared with users for proprietary, legal, or privacy reasons. In such cases, these sensitive documents can be processed so that the results of operations in the LSI space for the application can be enhanced by the contextual implications of the sensitive data, without risk of disclosure of specific sensitive data items themselves.

Experience in using LSI in a secure background mode has shown that even a small number of documents used in this manner can have great leverage. In some representative cases, data treated in background mode has constituted less than 1% of the total data being examined. Nevertheless, significant gains in application efficiency still have been achieved [22].

4. Beneficial Implementation Practices

Over the years, a number of LSI implementation practices evolved that significantly improved the quality and efficiency of the systems developed.

Perhaps the most significant implementation approach adopted was to use analyses in the LSI spaces themselves to select effective values for all of the key processing parameters for an application. Typically, we used the following approach:
1. Build an initial LSI space from application-relevant data, using standard parameter values and processing choices.
2. Using a small, representative test set, carry out analyses in this initial space to determine the most effective values for the parameters and choices.
3. Re-build the LSI spaces using those parameters and processing choices.

For example, using a test set representative of an application being addressed, it is possible to make an effective choice of the number of dimensions to employ in creating the LSI space for that application. As long as the initial space employs a number of dimensions higher than optimal, the requisite tests can be carried out, and an optimal value found, with vectors from a single initial LSI space.

For other parameter choices, a new LSI space must be created to test each value. For example, in many applications, terms are only included in the LSI processing if they occur at least M times in the collection and/or in at least N different documents. Pruning the term set in this manner often can significantly improve the representational fidelity of the space. A separate LSI space must be generated in order to test each prospective pruning value. However, only a limited range of values must be tried. In most of the applications here, values of M and N in the range of two to five turned out to be optimal. It should be noted that pruning typically was not applied to named entities. In many applications, the occurrence of a name may be of significance even if it occurs only once.

The dramatic reduction in the time required to create LSI spaces made it increasingly feasible to create trial LSI spaces for optimization testing, even for parameters that required multiple such spaces to be created. For very large collections, optimization analyses typically can be carried out sufficiently effectively using LSI spaces built from a randomly selected subset of the overall collection.

We also employed iterative refinement of LSI spaces to mitigate the effects of errors in training data for categorization applications. This approach led to significant improvement in categorization accuracy. The technique has broad applicability for noise mitigation in LSI applications [23].

The computer employed to carry out analytic operations in an LSI space does not have to be the same computer on which the LSI space is created. It often proved useful to create LSI spaces on a large server and then distribute the vector spaces created there to smaller devices for use. We also found that distribution of shared LSI spaces can be a powerful enabler for collaborative work.

Sometimes a conceptual search will retrieve results that do not appear to be appropriate. Users may find this disconcerting. However, these often can be the most important results: ones that indicate a gap in user understanding of some aspect of the problem at hand. In multiple systems we found it useful to highlight terms and passages in retrieved documents based on semantic similarity to the user's query. Users found this useful in trying to determine why a surprising result was obtained.

Other implementation principles that proved effective included:
• Duplicate and near-duplicate documents in a collection artificially magnify associated term relationships. LSI comparisons between documents of a collection can be used very effectively to eliminate redundant documents.
• For some applications, removal of "boilerplate" text can greatly enhance performance. For example, many legal documents contain formulaic blocks of text that appear on many documents. Appearance of such repeated text creates undesired associations (i.e., ones that are not related to the content of the documents).
• In many instances it is useful to use LSI similarity comparisons to decompose long documents into conceptually cohesive segments, which are then indexed as individual items. This makes it much easier to identify information on subsidiary topics.
• For large applications, parallel processing approaches such as MapReduce and more recent techniques can be employed very effectively for text preprocessing tasks.
• In analyses involving the LSI vectors of large collections, use of GPUs for the cosine comparisons can provide a dramatic speedup compared to using typical CPUs.
• In many applications, entity-driven analytic processes can be far more efficient than document-driven ones.
• Monitoring of user actions often can provide training data that can be employed to refine the LSI spaces employed and to yield improved accuracy of analytic operations. One particularly effective use of this technique was in continuously refining textual representations of user interests.
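The single-initial-space dimensionality test described in section 4 works because SVD solutions are nested: the best rank-k approximation uses the first k singular triplets of any higher-dimensional decomposition, so candidate dimensionalities can be scored by simply truncating one set of vectors. A sketch, in which the scoring function is an application-specific placeholder:

```python
import numpy as np

def vectors_at_k(U, s, k):
    """Term vectors at dimensionality k, obtained by truncating one
    higher-dimensional SVD rather than recomputing the decomposition."""
    return U[:, :k] * s[:k]

def pick_dimensionality(U, s, candidate_ks, score_fn):
    """Score each candidate k with an application-specific test (e.g.
    MRR over known-related term pairs) and return the best-scoring k."""
    best_k, best_score = None, -np.inf
    for k in candidate_ks:
        score = score_fn(vectors_at_k(U, s, k))
        if score > best_score:
            best_k, best_score = k, score
    return best_k

# Usage with a synthetic matrix and a placeholder scorer that happens
# to prefer k = 3; a real scorer would run retrieval-quality tests.
rng = np.random.default_rng(0)
U, s, Vt = np.linalg.svd(rng.standard_normal((8, 6)), full_matrices=False)
best = pick_dimensionality(U, s, [2, 3, 4], lambda tv: -abs(tv.shape[1] - 3))
```

Parameters such as the M/N pruning thresholds cannot be tested this way and require a separate trial space per value, as noted above.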
5. Interesting Results and Surprises

Over the past 20 years there were a number of aspects of LSI that either came as a surprise or were unexpectedly useful.

When the work described here began, it was generally believed that LSI did not scale well. Academic papers of the time estimated that the time required to build an LSI space grew as at least the square of the number of documents addressed.^5 We were pleasantly surprised that actual measurements showed that the growth was close to linear [16].

^5 Early estimates tended to overlook one or more of three key factors. First, LSI requires calculation of only the first few hundred singular values and associated vectors, not a complete SVD of the entire term-document matrix. Second, term-document matrices are extremely sparse. For large collections, often only one in ten thousand to one in one hundred thousand entries is non-zero. Finally, the time required to read and preprocess the text being indexed generally is greater than the time required to carry out the SVD.

Indications of semantic similarity as provided by LSI turned out to be a remarkably good proxy for similarity judgments generated by people. In 2007 a review of 30 studies compared LSI and human judgment in 16 real-world text processing tasks ranging from synonym matching to psychological assessment. LSI performed as well as, or better than, humans in 51% of the cases [5]. In more recent work, covering over 100 studies and 37 applications, LSI performed as well as, or better than, humans in 56% of the cases [24]. Of significance is the fact that all of these studies employed straightforward implementations of LSI. None of the advanced techniques described in this paper were used in any of the analyzed studies. Moreover, the number of documents used to create the spaces was very small, having a median value of only 1700. With larger collections, LSI performance in the reviewed studies likely would have been significantly higher. In the 63 information systems considered here, in the few cases where human and LSI performance could be directly compared, LSI results typically were as good as, or in some cases somewhat better than, average human performance.

One surprise was the huge effect that treating named entities as textual units produced. For collections of text such as news articles, the representational fidelity of the spaces produced was dramatically improved. Having the entities available also set us on a path of implementing ever more sophisticated entity-driven analysis capabilities. In most applications, entity-driven processes turned out to be far more efficient than document-driven ones.

Many of the applications addressed were complicated by the fact that the text items of interest contained multiple variants of names of individuals. These differences came from misspellings, phonetic renderings, transliteration differences, and other sources. Because of these variations, many relationships of interest were suppressed. One of the early features that we implemented was a name variant analyzer. For any given name it combined eight methods for generating candidate variants and then used comparisons in the LSI space to select the most relevant ones. This capability turned out to be significantly more effective than the best competing commercial product. Recall was two to three times greater, and confidence ratings for candidate equivalent names turned out to be much more reliable than anticipated [25].

We were surprised by how easy it was to implement ad hoc phrase processing in LSI spaces. (We also were embarrassed by how long it took for us to realize how to do it.)

It was interesting to observe how easily and effectively word senses could be disambiguated using clustering techniques in the LSI spaces [26]. This allowed markup of occurrences of polysemous words in much the same way as was done for named entities, as was described in section 3.3. The disambiguation can be carried out in a trial space and then the marked-up senses of polysemous words treated as separate textual units in creating the final space to be employed. Typically, a point of diminishing returns will be reached after disambiguating only a few thousand to tens of thousands of words. For some applications, word sense disambiguation of general terms did not result in major performance increases. Where disambiguation was of great value, however, was in dealing with person names. In many applications there may be hundreds of people with the same name, and disambiguation is essential. As with phrases, this name resolution feature can be incorporated into an application either in bulk during preprocessing or in an ad hoc fashion at query time.

Applications involving the secure background mode of dealing with sensitive data often involved very small amounts of such data (sometimes less than .01% of the total amount of data). In a number of cases, the extent to which such very small amounts of auxiliary data could improve results was quite remarkable.

Some of the early applications involved text that was produced by optical character recognition (OCR) equipment. LSI turned out to be surprisingly effective in dealing with the many errors produced by OCR devices of that era. In one categorization application, performance degradation only began to be detectable when the OCR error rate reached a level where two out of every three words were corrupted [27].

In cross-lingual applications, it turned out that many languages can be represented in a single LSI space without serious performance degradation. In one case, transitioning from two languages represented in one LSI space to 13 languages resulted in a decline in cross-lingual similarity comparisons of only a few percent [28].

LSI turned out to provide an elegant solution for combining results from diverse information systems when employing federated queries [29].

Combining text with other data types (especially relational, geographic, and image data) often generated unique analytic insights. The combination of such data also supported implementation of highly effective visual analytic interfaces.

6. Recent Developments

Although much has been accomplished over the past twenty years, there are still exciting activities underway involving LSI. Many of these involve implementation of ideas that were originally suggested in a basic form some years ago, but are just now being incorporated into real-world applications. Key examples include:
• Combined analysis of text and relational data [29].
• Implementation of semantic vector space equivalents of Boolean operators [33] and negation [34].
• Enhancement of machine translation capabilities, especially for technical and other specialized subject matter [38].
• Functionality based on analysis of individual LSI vector components^6 [30, 31, 32].
• Use of randomized SVD to dramatically reduce the computational load when addressing very large collections [35, 36, 37].
• Extensive use of LSI in discovery applications, particularly in the area of bioinformatics [12].
• Facilitation of human-robot interaction [39, 40].
• Various AI-related efforts [41, 42].

7. Acknowledgements

I would like to thank the engineers, scientists, software developers, and others from SAIC, Content Analyst Company, and Agilex Technologies Inc. who participated in building the systems reviewed here over the past twenty years. Those individuals took the techniques described here for improving LSI and transformed them from nascent concepts into working code and deployed systems.
^6 The components of a typical LSI vector comprise hundreds of indications of derived relationships. In general, the basis vectors of an LSI space closely relate to concepts, or mixtures of such, within the collection of text being addressed. An LSI vector is thus a complex and information-rich object. To compare two such objects using a single number (such as a cosine) thus ignores a large amount of potentially useful information.

8. References

[1] George W. Furnas, et al., Information retrieval using a singular value decomposition model of latent semantic structure, in: Proceedings of the 11th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 88), May 1988, Grenoble, France, pp. 465-480.
[2] Susan T. Dumais, Latent Semantic Analysis, Annual Review of Information Science and Technology, 38(1), 2004, pp. 188-230. doi: 10.1002/aris.1440380105.
[3] Jerome R. Bellegarda, Latent Semantic Mapping: Principles & Applications, Morgan & Claypool, 2007. doi: https://doi.org/10.2200/S00048ED1V01Y200609SAP003.
[4] Thomas K. Landauer, Danielle S. McNamara, Simon Dennis, and Walter Kintsch, eds., Handbook of Latent Semantic Analysis, Lawrence Erlbaum Associates, 2007.
[5] Roger Bradford, Comparability of LSI and human judgment in text analysis tasks, in: Proceedings, Applied Computing Conference, September 28-30, 2009, Athens, Greece, pp. 359-366.
[6] Susan T. Dumais, George W. Furnas, Thomas K. Landauer, Scott Deerwester, and Richard Harshman, Using latent semantic analysis to improve access to textual information, in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, May 1988, Washington, DC, pp. 281-285.
[7] Tristan Miller, Essay assessment with latent semantic analysis, Journal of Educational Computing Research, 29(4), (2003) 495-512.
[8] Lexis-Nexis, The Evolution of Semantic Search on the Web, 2009. URL: https://www.lexisnexis.co.uk/pdf/brochures/totalpatent-whitepaper.pdf.
[9] Jean Isson, Unstructured Data Analytics: How to Improve Customer Acquisition, Customer Retention, and Fraud Detection and Prevention, John Wiley & Sons, 2018.
[10] Seraina Anagnostopoulou, et al., The impact of online reputation on hotel profitability, International Journal of Contemporary Hospitality Management, September 20, 2019. doi: 10.1108/IJCHM-03-2019-0247.
[11] Wei Dong, et al., The detection of fraudulent financial statements: an integrated language model, in: Proceedings of the 19th Pacific-Asia Conference on Information Systems (PACIS 2014), Article 383.
[12] Roger Bradford, An overview of information discovery using latent semantic indexing, in: Proceedings, International Conference on Computer Science, Applied Mathematics and Applications (ICCSAMA 2017), June 30-July 1, 2017, Berlin, Germany, pp. 153-164.
[13] Hongyu Chen, et al., Effective use of latent semantic indexing and computational linguistics in biological and biomedical applications, Frontiers in Physiology, 4: 8 (2013). doi: 10.3389/fphys.2013.00008.
[14] … Information Networking and Applications, March 22-25, 2011, Biopolis, Singapore, pp. 602-609.
[15] T. Hashimoto, T. Kuboyama and B. Chakraborty, Temporal awareness of changes in afflicted people's needs after East Japan Great Earthquake, in: Proceedings, IEEE International Conference of IEEE Region 10 (TENCON 2013), 1-6. doi: 10.1109/TENCON.2013.6719012.
[16] Roger Bradford, Implementation techniques for large-scale latent semantic indexing applications, in: Proceedings of the 20th ACM International Conference on Information and Knowledge Management (CIKM), October 2011, Glasgow, Scotland, pp. 339-344.
[17] Zhiqiang Cai et al., Impact of corpus size and dimensionality of LSA spaces from Wikipedia articles on AutoTutor answer evaluation, in: Proceedings, 11th International Conference on Educational Data Mining (EDM), Jul 16-20, 2018, Raleigh, NC, pp. 127-136.
[18] Thomas K. Landauer and Susan T. Dumais, A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge, Psychological Review, 104(2), (1997) 211-240.
[19] Roger Bradford, An empirical study of required dimensionality for large-scale latent semantic indexing applications, in: Proceedings of the 17th ACM Conference on Information and Knowledge Management (CIKM 2008), October 19-23, 2008, Napa Valley, CA, pp. 153-162. doi: https://doi.org/10.1145/1458082.1458105.
[20] Stuart Rose, Dave Engel, Nick Cramer, and Wendy Cowley, Automatic keyword extraction from individual documents, Text Mining: Applications and Theory 1 (2010): 1-20.
[21] Roger Bradford, Incorporating ad hoc phrases in LSI queries, in: Proceedings, 6th International Conference on Knowledge Discovery and Information Retrieval, October 21-24, 2014, Rome, Italy, pp. 61-70.
[22] Roger Bradford, Exploiting sensitive …
information in background mode using latent [14] Nguyen Ngoc Chan, Walid Gaaloul, and semantic indexing, in: Proceedings of the Samir Tata, A web service recommender Sixth Workshop on Link Analysis, system using vector space model and latent Counterterrorism and Security, SIAM Data semantic indexing, in: Proceedings, 2011 Mining Conference, April 24-26 2008, IEEE International Conference on Advanced Atlanta, Georgia. [23] Roger Bradford and John Pozniak, A Language and Information (ESSLLI), systematic approach to design of a text August 6-18, Birmingham, UK, pp. 156-166. categorizer, in; Proceedings, 2016 IEEE [34] Dominic Widdows, Orthogonal negation in International Conference on Systems, Man, vector spaces for modelling word-meanings and Cybernetics (SMC), October 9-12, 2016, and document retrieval, in: Proceedings of Budapest, Hungary, pp. 509-514. the 41st Annual Meeting of the Association [24] Roger Bradford, Comparability of LSI and for Computational Linguistics, July 7, 2003, human judgment in text analysis tasks – an Volume 1, pp. 136-143. update. Draft, Mar 2, 2021. [35] Nathan Halko, Per-Gunnar Martinsson, and [25] Roger Bradford, Use of latent semantic Joel Tropp, Finding structure with indexing to identify name variants in large randomness: probabilistic algorithms for data collections, in: Proceedings, 2013 IEEE constructing approximate matrix International Conference on Intelligence and decompositions, SIAM Review, 2(3) (2011): Security Informatics (ISI 2013), Seattle, 217-288. WA, 27-32. doi: 10.1109/ISI.2013.6578781. [36] Ming Gu, Subspace iteration randomization [26] Roger Bradford, Word sense and singular value problems, SIAM Journal disambiguation, 2008. Patent No. 7,415,462, on Scientific Computing, 37(3) (2015): Filed January 20, 2006, Issued August 19, 1139-1173. 2009. 
[37] Per-Gunnar Martinsson, Randomized [27] Anthony Zukas and Robert Price, Document methods for matrix computations, in: The categorization using latent semantic Mathematics of Data, IAS/Park City indexing, in: Proceedings, Fifth Annual Mathematics Series, 25(4) 2018, pp. 187- Symposium on Document Image 231. Understanding Technology (SDIUT), April, [38] Roger Bradford, Machine translation using 2003, Greenbelt, MD, pp. 87-91. vector space representations, 2010. Patent [28] Roger Bradford and John Pozniak, No. 7,765,098, Filed April 24, 2006, Issued Combining modern machine translation July 27, 2010. software with LSI for cross-lingual [39] Phoebe Liu, Dylan Glas, Takayuki Kanda, information processing, in: Proceedings, and Hiroshi Ishiguro, Data-driven HRI: 11th International Conference on Learning social behaviors by example from Information Technology: New Generations human–human interaction, IEEE (ITNG), April 7-9, 2014, Las Vegas, NV, 65- Transactions on Robotics, 32, no. 4, (2016): 72. doi: 10.1109/ITNG.2014.52. 988-1008. [29] Roger Bradford, Federated queries and [40] Francesco Agostaro, et al., A conversational combined text and relational data, 2006. agent based on a conceptual interpretation of Patent Application No. 11434749, Filed May a data driven semantic space, in: 17, 2006. Proceedings, Congress of the Italian [30] Weizhong Zhu, and Chaomei Chen, Association for Artificial Intelligence, Storylines: Visual exploration and analysis in September 21–23, 2005, Milan, Italy, pp. latent semantic spaces. Computers & 381-392. Graphics, 31(3) (2007): 338-349. [41] Robert Speer, Catherine Havasi, and Henry [31] Ricardo Olmos, et al, Transforming selected Lieberman, AnalogySpace: Reducing the concepts into dimensions in latent semantic dimensionality of common-sense analysis, Discourse Processes, 51(5-6) knowledge, in: Proceedings of the Twenty- (2004): 494-510. 
Third AAAI Conference on Artificial [32] Anna Sidorova, Nicholas Evangelopoulos, Intelligence, Jul 13, 2008, vol. 8, pp. 548- Joseph S. Valacich, and Thiagarajan 553. Ramakrishnan, Uncovering the intellectual [42] Trevor Cohen, Brett Blatter, and Vimla Patel, core of the information systems discipline. Simulating expert clinical comprehension: MIS Quarterly (2008): 467-482. Adapting latent semantic analysis to [33] Preslav Nakov, Getting better results with accurately extract clinical concepts from latent semantic indexing, in: Proceedings of psychiatric narrative, Journal of Biomedical the Students Presentations at the 12th Informatics 41, no. 6 (2008): 1070-1087. European Summer School in Logic,