<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Lessons Learned from 20 Years of Implementing LSI Applications</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Roger Bradford</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Maxim Analytics</institution>
          ,
          <addr-line>Great Falls, VA 22066</addr-line>
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper summarizes lessons learned, over a period of 20 years, from implementing information systems employing the technique of latent semantic indexing (LSI). The data presented is drawn from 63 projects undertaken over the period 1999 through 2019. Over that period the projects increased in scale from collections of hundreds of thousands of documents to ones involving hundreds of millions of documents. They also increased in sophistication, from simple search and retrieval systems to ones focused on information discovery and automated alerting. This paper summarizes some of the key developments in technology and techniques that enabled those advances in the size and sophistication of the applications. The objective of this paper is to share insights gained from these past two decades of system implementation experience.</p>
      </abstract>
      <kwd-group>
        <kwd>Latent Semantic Indexing</kwd>
        <kwd>LSI</kwd>
        <kwd>LSI applications</kwd>
        <kwd>LSA</kwd>
        <kwd>lessons learned</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Latent Semantic Indexing</title>
      <p>
        The technique of latent semantic indexing
(LSI) was invented at Bellcore in the late 1980s
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The original intent was to provide improved
capabilities for retrieval of text. The technique
has, however, proven to be useful in analysis of a
wide variety of information types [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ].
      </p>
      <p>
        As applied to a collection of documents, the
LSI algorithm consists of the following primary
steps [
        <xref ref-type="bibr" rid="ref1 ref4">1, 4</xref>
        ]:
1. A term-document matrix is formed, and
(typically) local and global weights are applied
to the elements of this matrix.
2. Singular value decomposition (SVD) is
used to reduce this matrix to a product of three
matrices, one of which is diagonal in the
singular values of the original matrix.
3. Dimensionality is reduced by deleting all
but the k largest singular values, together with
the corresponding columns of the other two
matrices.
4. This truncation process provides a basis
for generating a k-dimensional vector space.
Both terms and documents are represented by
k-dimensional vectors in this vector space.
5. New queries, terms, and documents can
be represented in the space by a process known
as folding-in, which extrapolates from known
vectors.
6. The semantic similarity of any two
objects represented in the space is reflected by
the proximity of their representation vectors,
generally using a cosine measure.
      </p>
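      <p>As an illustration, the steps above can be sketched in a few lines of Python (a toy example, not the systems' implementation; the corpus, the simple log local weighting, and k = 2 are assumptions, and global weights are omitted for brevity):</p>
      <preformat>
```python
import numpy as np

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs are animals"]

# 1. Term-document matrix with a simple log local weighting
#    (global entropy weights omitted for brevity).
vocab = sorted({t for d in docs for t in d.split()})
A = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for t in d.split():
        A[vocab.index(t), j] += 1.0
A = np.log1p(A)

# 2-3. SVD, truncated to the k largest singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]

# 4. Term vectors and document vectors in the k-dimensional space.
term_vecs = Uk * sk          # rows: one vector per term
doc_vecs = Vtk.T * sk        # rows: one vector per document

# 5. Fold in a new query: q_hat = q.T @ Uk @ inv(diag(sk)).
def fold_in(text):
    q = np.zeros(len(vocab))
    for t in text.split():
        if t in vocab:
            q[vocab.index(t)] += 1.0
    return np.log1p(q) @ Uk / sk

# 6. Cosine similarity reflects semantic proximity.
def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

qv = fold_in("cat on mat")
sims = [cosine(qv, dv) for dv in doc_vecs]
```
      </preformat>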
      <p>
        Experience from a broad range of academic,
industrial, and governmental testing has shown
that proximity in an LSI space is a remarkably
good proxy for semantic relatedness as judged by
humans [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        Early commercial applications of LSI included
identification of people with specific expertise
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], detection of spam in e-mails [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], and essay
scoring [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Over time, the technique found wide
application in areas such as patent search and
analysis [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], résumé matching [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], customer
survey analysis [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], and fraud detection [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. It
became the dominant paradigm in electronic
document discovery [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. More recently it has
been used in bioinformatics discovery [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ],
recommender systems [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], and social media
analysis [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
    </sec>
    <sec id="sec-1a">
      <title>2. Structure of the Paper</title>
      <p>This paper summarizes lessons learned from
63 information system implementation projects
that the author took part in over the period 1999
through 2019. Each of these projects employed
LSI as a key technical component. The systems
addressed a wide range of applications for both
commercial and government customers.</p>
      <p>Over this 20-year period, the systems
increased significantly in both size and
sophistication. (Other projects undertaken in this
time frame applied LSI to data other than text;
however, only systems that focused on text are
addressed here.) The earliest systems utilized the
conceptual search, clustering, and categorization
functionality of LSI to implement relatively
simple capabilities for such tasks as customer
survey analysis and matching of résumés with job
openings. Extrapolating from experience gained,
these fundamental capabilities subsequently were
applied in a more abstract fashion to higher-level
considerations in applications such as fraud
detection and patent prior art analysis. Successive
refinement of tools and techniques eventually
enabled advanced applications incorporating
features such as novel information detection and
secure information sharing.</p>
      <p>Section 3 of this paper provides a brief
overview of the principal improvements in
technologies and techniques that enabled solution
of progressively larger and more complex
problems using LSI. Section 4 describes several
implementation principles that have proven useful
in building LSI-based information systems.
Section 5 summarizes some particularly
interesting results and surprises that were
encountered in the course of building these
systems. Section 6 concludes with brief
comments on capabilities being incorporated into
more recent LSI applications.</p>
    </sec>
    <sec id="sec-2">
      <title>3. Enabling Advances in Technology and Technique</title>
    </sec>
    <sec id="sec-3">
      <title>3.1. Scaling</title>
      <p>
        Computers in the early days of LSI were not
well-suited for SVD computation and large-scale
matrix manipulation, which limited the scale of
LSI applications. However, hardware
improvements over the past twenty years have
completely changed this situation [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. Figure 1
shows the dramatic reduction in the observed
index creation times in nine comparable projects
over the period 2002-2016. The times shown are
those required to build an LSI index for one
million documents (averaging several kilobytes in
size) at 300 dimensions, using computers
typically employed in applications in the given
years.
      </p>
      <p>The early points on the curve correspond to
index creation times from projects using clusters
of processors. Subsequent points are from
projects primarily employing mid-range servers.
Of note, however, the last point shown is for a
laptop computer.</p>
      <p>The dramatic decline in time required to create
an LSI index progressively enabled a wider
variety of applications. At the present time, LSI
applications involving collections of tens of
millions of documents are routine and multiple
applications have been implemented that
encompass full LSI indexing of hundreds of
millions of documents.</p>
      <p>Advances in technology enabled
improvements not only in scale, but also in the
fidelity of the generated LSI spaces in
representing real-world semantic associations.
For LSI, as collection size increases, the larger
number of occurrences of individual terms
diminishes the effects of idiosyncratic
occurrences of those terms in specific documents.
This improves overall representational fidelity, as
shown in Figure 2. The graph displays the
variation in mean reciprocal rank (MRR) of 250
pairs of terms having known real-world semantic
association, as a function of the size of the
collection. (Over the years, such simple and direct
metrics for the quality of an LSI space proved to
be very useful in tuning implemented systems;
over a wide range of applications, evaluations
using such metrics correlated well with both
performance on application-specific tests and with
human judgment.) As indicated by the trend line,
the increase in representational fidelity with
collection size is approximately logarithmic. Of
note is the fact that over 80% of all published
literature on LSI deals with collection sizes
smaller than the initial point shown (17 thousand
documents) and 97% deals with collection sizes
less than the second point shown (93 thousand
documents).</p>
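      <p>The MRR metric used for these evaluations can be computed directly from the term vectors of a space. A minimal sketch, assuming a vocabulary list and a matrix of term vectors from an LSI build (not the authors' code):</p>
      <preformat>
```python
import numpy as np

def mrr(pairs, vocab, term_vecs):
    """Mean reciprocal rank: for each (probe, target) pair of terms with
    a known real-world association, rank all terms by cosine similarity
    to the probe and record the reciprocal rank of the target."""
    index = {t: i for i, t in enumerate(vocab)}
    norms = np.linalg.norm(term_vecs, axis=1, keepdims=True) + 1e-12
    unit = term_vecs / norms
    rr = []
    for probe, target in pairs:
        sims = unit @ unit[index[probe]]
        sims[index[probe]] = -np.inf      # exclude the probe itself
        order = np.argsort(-sims)         # descending similarity
        rank = int(np.where(order == index[target])[0][0]) + 1
        rr.append(1.0 / rank)
    return float(np.mean(rr))
```
      </preformat>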
      <p>Several of the developed applications included
the identification of specific patterns of activities
and relationships as one of the system objectives.
In many cases, the distinctions between patterns
of interest and normal patterns were quite subtle.
In general, the larger the data collection, the more
effective LSI was in providing indicators of the
existence of patterns of interest. Over time, the
continuing growth in the size of collections that
could be addressed facilitated implementation of
increasingly sophisticated analytic operations.</p>
    </sec>
    <sec id="sec-4">
      <title>3.2. Parameter Optimization</title>
      <p>
        In the construction of an LSI space, there are a
number of parameter choices that must be made;
for example: number and identity of stopwords,
required number of occurrences for a term to be
included in the processing, and the number of
dimensions for the LSI space. The choices that
are made can have a significant impact on overall
performance in a specific application [
        <xref ref-type="bibr" rid="ref17 ref18">17, 18</xref>
        ]. As
an example, Figure 3 shows the variation in mean
reciprocal rank of 250 pairs of semantically
related terms as a function of the number of
dimensions chosen, for a collection of five
million documents (not the same documents as
those in the collection referenced in Figure 2) [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
      </p>
      <p>The performance of the systems discussed here
benefited greatly from the cumulative experience
gained over 20 years regarding choice of effective
parameters. In nearly all of the systems
developed, at least some testing of parameter
choices was carried out that was designed to
optimize application performance. This is
addressed further in section 4.</p>
    </sec>
    <sec id="sec-5">
      <title>3.3. Indexing of Named Entities</title>
      <p>In most text applications, named entities
constitute items of particular significance. For
example, names of people are of fundamental
importance in fraud detection. One of the most
important factors contributing to the success of
the programs described here was the fact that
nearly all of them employed entity extraction and
markup as a preprocessing step prior to creating
the LSI spaces involved. Typically, names of
persons, locations, and organizations were
extracted, but in some cases more entity types
were treated. In the LSI preprocessing,
occurrences of a name such as John Kennedy were
marked up as p_john_kennedy_p, and similarly
for other entity types. (This markup was stripped
out prior to presenting results to users.)</p>
      <p>With classical LSI, users can create queries of
the form: What terms are most closely associated
with a given term? In contrast, with entity
markup prior to creating the space, an interface
can be implemented that allows users to enter
queries such as: What people are most closely
associated with a given entity or activity? Such
queries are much more natural in most
applications. The implementation of capabilities
to effectively execute those types of queries was a
major factor contributing to both operational
efficiency and user satisfaction for the systems
described here.</p>
      <p>Even in the limited number of cases where
entities themselves were not of prime importance
for users, entity markup prior to creating the LSI
space was of great importance for improving the
representational fidelity of the space. In most text
collections, failure to treat named entities as
textual units will create vast numbers of spurious
associations. For example, the common English
given name John may be a component of
hundreds of distinct person names. Classical LSI
will conflate all of the occurrences of John,
generating erroneous correlations in the LSI space
produced. In many current LSI applications, the
text collections being addressed contain millions
to tens of millions of named entities. Failing to
treat these entities as textual units when building
an LSI space for such applications would yield
millions of distortions of relations in the space.</p>
    </sec>
    <sec id="sec-6">
      <title>3.4. Dealing with Phrases</title>
      <p>Many LSI applications involve retrieval of
information of interest based on queries formed
by users. It is well-known that, in many instances,
the use of phrases in queries can significantly aid
in expressing a user’s information needs.
Historically, one frequently-cited criticism of
classical LSI was that it did not provide a viable
mechanism for dealing with phrases in queries. It
was felt that use of phrases required identification
of all phrases of interest prior to creating the LSI
space, so that those phrases could be treated as
terms in the indexing process. This is a problem
in that there are a very large number of phrases in
a text collection of any significant size. Most of
the candidate phrases will never be employed by
users. Moreover, indexing of most phrases will
not significantly improve the representational
fidelity of the LSI space.</p>
      <p>
        We eventually found a two-part solution to this
problem. In order to incorporate phrases that
would improve the representational fidelity of the
space, we employed the following procedure:
• Using a highly productive phrase
generation technique, such as RAKE [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ],
generate a large set of candidate phrases for the
collection of interest.
• Create an initial LSI index for the
collection, with no attempt to extract phrases.
• For each candidate phrase, create an
approximate LSI vector by taking a weighted
average of the representation vectors for the
documents that contain that phrase. (This is the
folding-in technique of classical LSI, applied to
terms [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].)
• Compare the approximate vector for the
phrase with a vector created by simply
combining the terms of the candidate phrase as
an LSI query.
• Create a final LSI space, treating as
textual units the candidate phrases which have
the greatest distance (smallest cosine) between
the approximation vector and the query vector.
      </p>
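      <p>The selection procedure above can be sketched as follows (the unweighted document average and the number of phrases kept are assumptions made for illustration):</p>
      <preformat>
```python
import numpy as np

def unit(v):
    return v / (np.linalg.norm(v) + 1e-12)

def phrase_disagreement(phrase_terms, containing_doc_vecs, term_vec_lookup):
    """Cosine between (a) an approximate phrase vector, the average of
    the vectors of documents containing the phrase, and (b) the vector
    formed by combining the phrase's term vectors as a query. A small
    cosine means the phrase behaves unlike its constituent terms, so
    indexing it as a textual unit adds information."""
    approx = unit(np.mean(np.asarray(containing_doc_vecs), axis=0))
    query = unit(sum(term_vec_lookup[t] for t in phrase_terms))
    return float(approx @ query)

def select_phrases(candidates, disagreements, keep=1000):
    """Keep the candidates with the smallest cosine (greatest distance)."""
    order = np.argsort(disagreements)
    return [candidates[i] for i in order[:keep]]
```
      </preformat>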
      <p>
        In order to ensure that users could employ
arbitrary phrases in searches, we developed a
technique that allowed use of phrases in LSI
queries even for LSI spaces where phrases have
not been indexed. The technique is described in
detail in [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. Table 1 shows the results from
applying this technique in searching a collection
of 1.6 million news articles using the query rare
earth element.
      </p>
      <p>The column labeled NONE shows the ranked
query results (closest terms) when no phrase
processing is applied. In this case, since the terms
rare and element occur in diverse contexts, the
term earth has the most significant effect on the
results. The results are completely dominated by
celestial references; clearly not what a user would
desire.</p>
      <p>The column labeled PRE-PROCESSED shows
the results for the same collection when rare earth
element was marked up as a phrase and treated as
a textual unit in creating the LSI space. The
results are as expected: primarily names of rare
earth elements and those of people and
organizations associated with processing of rare
earth elements.</p>
      <p>
        The column labeled AD HOC shows the
results when the term folding approach of [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] is
applied to the LSI space where there was no initial
phrase processing. The results are quite close to
those obtained for the case where the phrase was
indexed (60% agreement for the top ten terms in a
collection comprising 1.5 million terms). The
adoption of this ad hoc phrase query process in
systems described here resulted in a major
improvement in user satisfaction.
      </p>
    </sec>
    <sec id="sec-7">
      <title>3.5. User Aids</title>
      <p>Over the years, with the dramatic growth in the
size of the text collections being addressed, it
became increasingly important to provide aids for
users in areas such as creating queries, identifying
topics, interpreting results, and automating
repetitive tasks. The semantic comparison
capabilities of LSI allowed a wide variety of such
aids to be implemented. Some aids were very
simple to implement, but still yielded significant
gains in operational efficiency and user
satisfaction. For example, a popup display of the
most closely associated terms when a user moused
over a given term was of great help in determining
the meaning of newly-encountered terms such as
acronyms and technical terminology. Most of the
users of the systems were knowledge workers, but
typically did not have technical backgrounds.
Providing them with immediate contextual
information regarding technical terms greatly
aided them in understanding the material that they
were working with.</p>
      <p>Over time, such aids became more complex.
One that proved very popular was novelty
detection. Within some systems, tracking
capabilities were implemented to provide an
indication of what information a given user
already was aware of. This included, for example,
monitoring what documents (or other text objects)
that the user had previously displayed, saved,
printed, or incorporated into work products.
Then, in response to a query from that user, the
results could be displayed not just in relevance
order, but in the order of those results that were
relevant but at the same time were least similar to
those previously seen. In many applications there
is significant redundancy in the content of items
collected. In applications with high information
redundancy, the novelty detection feature greatly
improved both efficiency of operations and user
satisfaction.</p>
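      <p>One way to realize such novelty-ordered results is to penalize similarity to previously seen items, in the spirit of maximal marginal relevance; a sketch, with the trade-off weight as an assumption:</p>
      <preformat>
```python
import numpy as np

def unit_rows(M):
    return M / (np.linalg.norm(M, axis=1, keepdims=True) + 1e-12)

def novelty_rank(query_vec, result_vecs, seen_vecs, alpha=0.7):
    """Score = alpha * relevance - (1 - alpha) * max similarity to any
    previously seen item; returns result indices, most novel-relevant
    first."""
    R = unit_rows(np.asarray(result_vecs))
    q = query_vec / (np.linalg.norm(query_vec) + 1e-12)
    relevance = R @ q
    if len(seen_vecs):
        S = unit_rows(np.asarray(seen_vecs))
        redundancy = (R @ S.T).max(axis=1)
    else:
        redundancy = np.zeros(len(R))
    score = alpha * relevance - (1.0 - alpha) * redundancy
    return list(np.argsort(-score))
```
      </preformat>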
      <p>Other user aids that proved to enhance both
operational efficiency and user satisfaction
included:
• Generation of document summaries
tailored to users’ interests.
• Automated generation of graphs showing
relationships among entities.
• Automated tracking of topic threads in
long documents and sets of documents.</p>
    </sec>
    <sec id="sec-8">
      <title>3.6. Secure Information Sharing</title>
      <p>The representation for a given term in an LSI
space is a single point in a vector space that is
derived from what may be hundreds of
occurrences, even for a relatively rare term.
Similarly, the representation for a given document
is derived from large numbers of occurrences of
multiple terms. Even in classical LSI spaces, it is
impossible to work backwards to reconstruct the
actual wording of documents corresponding to
extant document vectors. With slight
modifications to the index creation process, it can
be made impossible to determine even which
words occurred in which documents. These
characteristics enable the use of information in a
secure background mode.</p>
      <p>In many applications there is relevant data
available that cannot be directly shared with users
for proprietary, legal, or privacy reasons. In such
cases, these sensitive documents can be processed
so that the results of operations in the LSI space
for the application can be enhanced by the
contextual implications of the sensitive data,
without risk of disclosure of specific sensitive
data items themselves.</p>
      <p>
        Experience in using LSI in a secure
background mode has shown that even a small
number of documents used in this manner can
have great leverage. In some representative cases,
data treated in background mode has constituted
less than 1% of the total data being examined.
Nevertheless, significant gains in application
efficiency still have been achieved [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ].
      </p>
    </sec>
    <sec id="sec-9">
      <title>4. Beneficial Implementation Practices</title>
      <p>Over the years, a number of LSI
implementation practices evolved that
significantly improved the quality and efficiency
of the systems developed.</p>
      <p>Perhaps the most significant implementation
approach adopted was to use analyses in the LSI
spaces themselves to select effective values for all
of the key processing parameters for an
application. Typically, we used the following
approach:
1. Build an initial LSI space from
application-relevant data, using standard
parameter values and processing choices.
2. Using a small, representative test set, carry
out analyses in this initial space to determine
the most effective values for the parameters
and choices.
3. Re-build the LSI spaces using those
parameters and processing choices.</p>
      <p>For example, using a test set representative of
an application being addressed, it is possible to
make an effective choice of the number of
dimensions to employ in creating the LSI space
for that application. As long as the initial space
employs a number of dimensions higher than
optimal, the requisite tests can be carried out, and
an optimal value found, with vectors from a single
initial LSI space.</p>
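      <p>Because the singular vectors are nested, the vectors for any smaller number of dimensions are simply prefixes of the vectors from the initial space, so candidate values of k can be compared without rebuilding. A sketch, with the evaluation function assumed (e.g., MRR on a test set of known-related term pairs):</p>
      <preformat>
```python
import numpy as np

def best_k(term_vecs_kmax, candidate_ks, evaluate):
    """term_vecs_kmax: term vectors from a single space built with the
    largest candidate dimensionality. evaluate(vecs) returns a quality
    score; higher is better. Truncating columns reproduces the
    lower-dimensional spaces exactly."""
    scores = {k: evaluate(np.asarray(term_vecs_kmax)[:, :k])
              for k in candidate_ks}
    return max(scores, key=scores.get), scores
```
      </preformat>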
      <p>For other parameter choices, a new LSI space
must be created to test each value. For example,
in many applications, terms are only included in
the LSI processing if they occur at least M times
in the collection and/or in at least N different
documents. Pruning the term set in this manner
often can significantly improve the
representational fidelity of the space. A separate
LSI space must be generated in order to test each
prospective pruning value. However, only a
limited range of values must be tried. In most of
the applications here, values of M and N in the
range of two to five turned out to be optimal. It
should be noted that pruning typically was not
applied to named entities. In many applications,
the occurrence of a name may be of significance
even if it occurs only once.</p>
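      <p>A minimal sketch of such pruning, with illustrative values of M and N and entities exempted via their markup prefix:</p>
      <preformat>
```python
from collections import Counter

def prune_terms(docs, M=3, N=2, is_entity=lambda t: t.startswith("p_")):
    """Keep terms occurring at least M times overall and in at least N
    distinct documents; marked-up named entities are always kept."""
    total, doc_freq = Counter(), Counter()
    for d in docs:
        toks = d.split()
        total.update(toks)
        doc_freq.update(set(toks))
    return {t for t in total
            if is_entity(t) or (total[t] >= M and doc_freq[t] >= N)}
```
      </preformat>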
      <p>The dramatic reduction in the time required to
create LSI spaces made it increasingly feasible to
create trial LSI spaces for optimization testing,
even for parameters that required multiple such
spaces to be created. For very large collections,
optimization analyses typically can be carried out
sufficiently effectively using LSI spaces built
from a randomly selected subset of the overall
collection.</p>
      <p>
        We also employed iterative refinement of LSI
spaces to mitigate the effects of errors in training
data for categorization applications. This
approach led to significant improvement in
categorization accuracy. The technique has broad
applicability for noise mitigation in LSI
applications [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ].
      </p>
      <p>The computer employed to carry out analytic
operations in an LSI space does not have to be the
same computer on which the LSI space is created.
It often proved useful to create LSI spaces on a
large server and then distribute the vector spaces
created there to smaller devices for use. We also
found that distribution of shared LSI spaces can
be a powerful enabler for collaborative work.</p>
      <p>Sometimes a conceptual search will retrieve
results that do not appear to be appropriate. Users
may find this disconcerting. However, these often
can be the most important results – ones that
indicate a gap in user understanding of some
aspect of the problem at hand. In multiple systems
we found it useful to highlight terms and passages
in retrieved documents based on semantic
similarity to the user’s query. Users found this
useful in trying to determine why a surprising
result was obtained.</p>
      <p>Other implementation principles that proved
effective included:
• Duplicate and near-duplicate documents
in a collection artificially magnify associated
term relationships. LSI comparisons between
documents of a collection can be used very
effectively to eliminate redundant documents.
• For some applications, removal of
“boilerplate” text can greatly enhance
performance. For example, many legal
documents contain formulaic blocks of text
that appear on many documents. Appearance
of such repeated text creates undesired
associations (i.e., ones that are not related to
the content of the documents).
• In many instances it is useful to use LSI
similarity comparisons to decompose long
documents into conceptually cohesive
segments, which are then indexed as individual
items. This makes it much easier to identify
information on subsidiary topics.
• For large applications, parallel processing
approaches such as MapReduce and more
recent techniques can be employed very
effectively for text preprocessing tasks.
• In analyses involving the LSI vectors of
large collections, use of GPUs for the cosine
comparisons can provide a dramatic speedup
compared to using typical CPUs.
• In many applications, entity-driven
analytic processes can be far more efficient
than document-driven ones.
• Monitoring of user actions often can
provide training data that can be employed to
refine the LSI spaces employed and to yield
improved accuracy of analytic operations.
One particularly effective use of this
technique was in continuously refining
textual representations of user interests.</p>
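      <p>As one example from the list above, duplicate and near-duplicate elimination can be done directly with cosine comparisons of LSI document vectors; a sketch, with the threshold as an assumption:</p>
      <preformat>
```python
import numpy as np

def dedup(doc_vecs, threshold=0.98):
    """Greedily keep each document unless the cosine between its LSI
    vector and an already-kept document's vector reaches the
    threshold."""
    norms = np.linalg.norm(doc_vecs, axis=1, keepdims=True) + 1e-12
    U = doc_vecs / norms
    kept = []
    for i in range(len(U)):
        is_dup = bool(kept) and float((U[kept] @ U[i]).max()) >= threshold
        if not is_dup:
            kept.append(i)
    return kept
```
      </preformat>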
    </sec>
    <sec id="sec-12">
      <title>5. Interesting Results and Surprises</title>
      <p>Over the past 20 years there were a number of
aspects of LSI that either came as a surprise or
were unexpectedly useful.</p>
      <p>
        When the work described here began, it was
generally believed that LSI did not scale well.
Academic papers of the time estimated that the
time required to build an LSI space grew as at
least the square of the number of documents
addressed. We were pleasantly surprised that
actual measurements showed that the growth was
close to linear [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. (Early estimates tended to overlook one or
more of three key factors. First, LSI requires
calculation of only the first few hundred singular
values and associated vectors, not a complete
SVD of the entire term-document matrix. Second,
term-document matrices are extremely sparse; for
large collections, often only one in ten thousand
to one in one hundred thousand entries is
non-zero. Finally, the time required to read and
preprocess the text being indexed generally is
greater than the time required to carry out the
SVD.)
      </p>
      <p>
        Indications of semantic similarity as provided
by LSI turned out to be a remarkably good proxy
for similarity judgments generated by people. In
2007 a review of 30 studies compared LSI and
human judgment in 16 real-world text processing
tasks ranging from synonym matching to
psychological assessment. LSI performed as well
as, or better than, humans in 51% of the cases [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
In more recent work, covering over 100 studies
and 37 applications, LSI performed as well as, or
better than, humans in 56% of the cases [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. Of
significance is the fact that all of these studies
employed straightforward implementations of
LSI. None of the advanced techniques described
in this paper were used in any of the analyzed
studies. Moreover, the number of documents used
to create the spaces was very small, having a
median value of only 1,700. With larger
collections, LSI performance in the reviewed
studies likely would have been significantly
higher. In the 63 information systems considered
here, in the few cases where human and LSI
performance could be directly compared, LSI
results typically were as good as, or in some cases
somewhat better than, average human
performance.
      </p>
      <p>One surprise was the huge effect that treating
named entities as textual units produced. For
collections of text such as news articles, the
representational fidelity of the spaces produced
was dramatically improved. Having the entities
available also set us on a path of implementing
ever more sophisticated entity-driven analysis
capabilities. In most applications, entity-driven
processes turned out to be far more efficient than
document-driven ones.</p>
      <p>
        Many of the applications addressed were
complicated by the fact that the text items of
interest contained multiple variants of names of
individuals. These differences came from
misspellings, phonetic renderings, transliteration
differences, and other sources. Because of these
variations, many relationships of interest were
suppressed. One of the early features that we
implemented was a name variant analyzer. For
any given name it combined eight methods for
generating candidate variants and then used
comparisons in the LSI space to select the most
relevant ones. This capability turned out to be
significantly more effective than the best
competing commercial product. Recall was two
to three times greater and confidence ratings for
candidate equivalent names turned out to be much
more reliable than anticipated [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ].
      </p>
      <p>We were surprised by how easy it was to
implement ad hoc phrase processing in LSI
spaces. (We also were embarrassed by how long
it took for us to realize how to do it).</p>
      <p>
        It was interesting to observe how easily and
effectively word senses could be disambiguated
using clustering techniques in the LSI spaces [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ].
This allowed markup of occurrences of
polysemous words in much the same way as was
done for named entities, as was described in
section 3.3. The disambiguation can be carried
out in a trial space and then the marked-up senses
of polysemous words treated as separate textual
units in creating the final space to be employed.
Typically, a point of diminishing returns will be
reached after disambiguating only a few
thousands to tens of thousands of words. For
some applications, word sense disambiguation of
general terms did not result in major performance
increases. Where disambiguation was of great
value, however, was in dealing with person
names. In many applications there may be
hundreds of people with the same name and
disambiguation is essential. As with phrases, this
name resolution feature can be incorporated into
an application either in bulk during preprocessing
or in an ad hoc fashion at query time.
      </p>
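      <p>The cluster-then-relabel workflow described above can be sketched in a few lines. This is a minimal illustration only: scikit-learn's KMeans stands in for the clustering step, and the toy context vectors, the word "bank", and the choice of two senses are all invented for the example.</p>
      <preformat>
```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Toy stand-in for the LSI context vectors of six occurrences of a
# polysemous word in a trial space: two well-separated sense regions.
contexts = np.vstack([
    rng.normal(loc=+1.0, scale=0.1, size=(3, 4)),  # sense-A-like contexts
    rng.normal(loc=-1.0, scale=0.1, size=(3, 4)),  # sense-B-like contexts
])

# Cluster the occurrence contexts; each cluster approximates one sense.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(contexts)

# Re-tag each occurrence as a distinct textual unit ("bank_0", "bank_1")
# for use when building the final LSI space.
tagged = [f"bank_{label}" for label in km.labels_]
print(tagged)
```
      </preformat>
      <p>In a real application the context vectors would come from the trial LSI space and the number of clusters per word would be chosen per term rather than fixed at two.</p>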
      <p>Applications involving the secure background
mode of dealing with sensitive data often involved
very small amounts of such data (sometimes less
than 0.01% of the total amount of data). In a
number of cases, the extent to which such very
small amounts of auxiliary data could improve
results was quite remarkable.</p>
      <p>
        Some of the early applications involved text
that was produced by optical character
recognition (OCR) equipment. LSI turned out to
be surprisingly effective in dealing with the many
errors produced by OCR devices of that era. In
one categorization application, performance
degradation only began to be detectable when the
OCR error rate reached a level where two out of
every three words were corrupted [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ].
      </p>
      <p>
        In cross-lingual applications, it turned out that
many languages can be represented in a single LSI
space without serious performance degradation.
In one case, transitioning from two languages
represented in one LSI space to 13 languages
resulted in a decline in cross-lingual similarity-
comparison accuracy of only a few percent [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ].
      </p>
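      <p>One standard way such a shared multilingual space is built, per the cross-lingual LSI literature cited, is to index parallel documents concatenated across languages, so that terms from every language land in one semantic space. A toy sketch, in which the tiny corpus, the vectorizer settings, and the two-dimensional space are all illustrative assumptions:</p>
      <preformat>
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Each training "document" is an English text concatenated with its
# French translation, so both vocabularies co-occur in one matrix.
parallel_docs = [
    "the cat sat on the mat le chat est sur le tapis",
    "the dog chased the cat le chien a poursuivi le chat",
    "stock markets fell sharply les marches boursiers ont chute",
]
vec = CountVectorizer()
X = vec.fit_transform(parallel_docs)
svd = TruncatedSVD(n_components=2, random_state=0).fit(X)

# Monolingual texts fold into the shared space and compare directly.
en = svd.transform(vec.transform(["the cat sat on the mat"]))
fr = svd.transform(vec.transform(["le chat est sur le tapis"]))
other = svd.transform(vec.transform(["stock markets fell sharply"]))

sim_pair = cosine_similarity(en, fr)[0, 0]    # translation pair
sim_cross = cosine_similarity(en, other)[0, 0]  # unrelated topic
```
      </preformat>
      <p>The English sentence ends up closer to its French translation than to the unrelated financial text, despite sharing no vocabulary with the former.</p>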
      <p>
        LSI turned out to provide an elegant solution
for combining results from diverse information
systems when employing federated queries [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ].
      </p>
      <p>Combining text with other data types
(especially relational, geographic, and image
data) often generated unique analytic insights.
The combination of such data also supported
implementation of highly effective visual analytic
interfaces.</p>
    </sec>
    <sec id="sec-13">
      <title>6. Recent Developments</title>
      <p>
        Although much has been accomplished over
the past twenty years, there are still exciting
activities underway involving LSI. Many of these
involve implementation of ideas that were
originally suggested in a basic form some years
ago, but are only now being incorporated into
real-world applications. Key examples include:
• Combined analysis of text and relational
data [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ].
• Implementation of semantic vector space
equivalents of Boolean operators [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ] and
negation [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ].
      </p>
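      <p>The orthogonal-negation operator of [34] has a one-line linear-algebra core: "a NOT b" removes from a its projection onto b, leaving a vector orthogonal to b. A sketch, where the vectors are illustrative stand-ins for LSI term vectors:</p>
      <preformat>
```python
import numpy as np

def negate(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Model 'a NOT b' by projecting a onto the subspace orthogonal to b."""
    return a - (a @ b) / (b @ b) * b

a = np.array([2.0, 1.0, 0.0])
b = np.array([1.0, 0.0, 0.0])
result = negate(a, b)
print(result)        # → [0. 1. 0.]  (component along b removed)
print(result @ b)    # → 0.0        (orthogonal to b)
```
      </preformat>
      <p>Documents retrieved with the negated query vector then score low on the unwanted sense or topic by construction.</p>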
      <p>
        • Enhancement of machine translation
capabilities, especially for technical and other
specialized subject matter [
        <xref ref-type="bibr" rid="ref38">38</xref>
        ].
• Functionality based on analysis of
individual LSI vector components [
        <xref ref-type="bibr" rid="ref30 ref31 ref32">30, 31,
32</xref>
        ]. (Footnote 6: The components of a typical LSI vector comprise
hundreds of indications of derived relationships. In general, the basis
vectors of an LSI space closely relate to concepts, or mixtures of such,
within the collection of text being addressed. An LSI vector is thus a
complex and information-rich object. To compare two such objects
using a single number (such as a cosine) thus ignores a large amount
of potentially useful information.)
• Use of randomized SVD to dramatically
reduce the computational load when
addressing very large collections [
        <xref ref-type="bibr" rid="ref35 ref36 ref37">35, 36, 37</xref>
        ].
• Extensive use of LSI in discovery
applications, particularly in the area of
bioinformatics [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
• Facilitation of human-robot interaction
[
        <xref ref-type="bibr" rid="ref39 ref40">39, 40</xref>
        ].
      </p>
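      <p>The randomized-SVD bullet above connects directly with the cost observations made earlier: term-document matrices are extremely sparse, and only the leading k singular triplets are needed. scikit-learn's TruncatedSVD uses the randomized algorithm of Halko et al. [35] and accepts sparse input directly; the matrix size, density, and k below are illustrative assumptions, not figures from the projects described:</p>
      <preformat>
```python
import scipy.sparse as sp
from sklearn.decomposition import TruncatedSVD

# 2,000 terms x 500 documents at ~0.5% density - far denser than the
# 1-in-10,000 figure cited for large collections, but the same idea.
X = sp.random(2000, 500, density=0.005, random_state=0, format="csr")

# Compute only the leading 50 singular triplets via randomized SVD;
# the full decomposition is never formed.
svd = TruncatedSVD(n_components=50, algorithm="randomized", random_state=0)
doc_vectors = svd.fit_transform(X.T)  # one k-dimensional vector per document
print(doc_vectors.shape)              # → (500, 50)
```
      </preformat>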
      <p>
        • Various AI-related efforts [
        <xref ref-type="bibr" rid="ref41 ref42">41, 42</xref>
        ].
      </p>
    </sec>
    <sec id="sec-14">
      <title>7. Acknowledgements</title>
      <p>I would like to thank the engineers, scientists,
software developers, and others from SAIC,
Content Analyst Company, and Agilex
Technologies Inc. who participated in building the
systems reviewed here over the past twenty years.
Those individuals took the techniques described
here for improving LSI and transformed them
from nascent concepts into working code and
deployed systems.</p>
    </sec>
    <sec id="sec-15">
      <title>8. References</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>George W.</given-names>
            <surname>Furnas</surname>
          </string-name>
          , et al.
          <article-title>Information retrieval using a singular value decomposition model of latent semantic structure</article-title>
          ,
          <source>in: Proceedings of the 11th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 88)</source>
          ,
          <source>May</source>
          <year>1988</year>
          , Grenoble, France, pp.
          <fpage>465</fpage>
          -
          <lpage>480</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Susan T.</given-names>
            <surname>Dumais</surname>
          </string-name>
          ,
          <article-title>Latent Semantic Analysis</article-title>
          .
          <source>Annual Review of Information Science and Technology</source>
          ,
          <volume>38</volume>
          (
          <issue>1</issue>
          ),
          <year>2004</year>
          , pp.
          <fpage>188</fpage>
          -
          <lpage>230</lpage>
          , doi: 10.1002/aris.1440380105.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Jerome R.</given-names>
            <surname>Bellegarda</surname>
          </string-name>
          ,
          <source>Latent Semantic Mapping: Principles &amp; Applications</source>
          , Morgan &amp; Claypool,
          <year>2007</year>
          . doi: https://doi.org/10.2200/S00048ED1V01Y200609SAP003.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Thomas K.</given-names>
            <surname>Landauer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Danielle S.</given-names>
            <surname>McNamara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Simon</given-names>
            <surname>Dennis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Walter</given-names>
            <surname>Kintsch</surname>
          </string-name>
          , eds.,
          <source>Handbook of Latent Semantic Analysis</source>
          , Lawrence Erlbaum Associates,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Roger</given-names>
            <surname>Bradford</surname>
          </string-name>
          ,
          <article-title>Comparability of LSI and human judgment in text analysis tasks</article-title>
          ,
          <source>in: Proceedings, Applied Computing Conference, September 28-30</source>
          ,
          <year>2009</year>
          , Athens, Greece, pp.
          <fpage>359</fpage>
          -
          <lpage>366</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Susan T.</given-names>
            <surname>Dumais</surname>
          </string-name>
          ,
          <string-name>
            <given-names>George W.</given-names>
            <surname>Furnas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Thomas K.</given-names>
            <surname>Landauer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Scott</given-names>
            <surname>Deerwester</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Richard</given-names>
            <surname>Harshman</surname>
          </string-name>
          ,
          <article-title>Using latent semantic analysis to improve access to textual information</article-title>
          ,
          <source>in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, May</source>
          <year>1988</year>
          , Washington, DC, pp.
          <fpage>281</fpage>
          -
          <lpage>285</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Tristan</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <article-title>Essay assessment with latent semantic analysis</article-title>
          ,
          <source>Journal of Educational Computing Research</source>
          ,
          <volume>29</volume>
          (
          <issue>4</issue>
          ), (
          <year>2003</year>
          )
          <fpage>495</fpage>
          -
          <lpage>512</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          Lexis-Nexis,
          <source>The Evolution of Semantic Search on the Web</source>
          ,
          <year>2009</year>
          . URL: https://www.lexisnexis.co.uk/pdf/brochures/totalpatent-whitepaper.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Jean</given-names>
            <surname>Isson</surname>
          </string-name>
          ,
          <article-title>Unstructured Data Analytics: How to Improve Customer Acquisition, Customer Retention, and Fraud Detection and Prevention</article-title>
          , John Wiley &amp; Sons,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Seraina</given-names>
            <surname>Anagnostopoulou</surname>
          </string-name>
          , et al.,
          <article-title>The impact of online reputation on hotel profitability</article-title>
          ,
          <source>International Journal of Contemporary Hospitality Management</source>
          , September 20,
          <year>2019</year>
          . doi: 10.1108/IJCHM-03-2019-0247.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Wei</given-names>
            <surname>Dong</surname>
          </string-name>
          , et al.,
          <article-title>The detection of fraudulent financial statements: an integrated language model</article-title>
          ,
          <source>in: Proceedings of the 19th Pacific-Asia Conference on Information Systems (PACIS</source>
          <year>2014</year>
          ), Article 383.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Roger</given-names>
            <surname>Bradford</surname>
          </string-name>
          ,
          <article-title>An overview of information discovery using latent semantic indexing</article-title>
          , in: Proceedings, International Conference on Computer Science,
          <source>Applied Mathematics and Applications (ICCSAMA</source>
          <year>2017</year>
          ), June 30 -July 1,
          <year>2017</year>
          , Berlin, Germany, pp.
          <fpage>153</fpage>
          -
          <lpage>164</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Hongyu</given-names>
            <surname>Chen</surname>
          </string-name>
          , et al.,
          <article-title>Effective use of latent semantic indexing and computational linguistics in biological and biomedical applications</article-title>
          ,
          <source>Frontiers in Physiology</source>
          ,
          <volume>4</volume>
          :
          <fpage>8</fpage>
          (
          <year>2013</year>
          ). doi: 10.3389/fphys.2013.00008.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>Nguyen Ngoc Chan</string-name>
          ,
          <string-name>
            <given-names>Walid</given-names>
            <surname>Gaaloul</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Samir</given-names>
            <surname>Tata</surname>
          </string-name>
          ,
          <article-title>A web service recommender system using vector space model and latent semantic indexing</article-title>
          ,
          <source>in: Proceedings, 2011 IEEE International Conference on Advanced Information Networking and Applications, March 22-25</source>
          ,
          <year>2011</year>
          , Biopolis, Singapore, pp.
          <fpage>602</fpage>
          -
          <lpage>609</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>T.</given-names>
            <surname>Hashimoto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kuboyama</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <article-title>Temporal awareness of changes in afflicted people's needs after East Japan Great Earthquake</article-title>
          ,
          <source>in: Proceedings, IEEE International Conference of IEEE Region 10 (TENCON</source>
          <year>2013</year>
          ),
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          . doi: 10.1109/TENCON.2013.6719012.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Roger</surname>
            <given-names>Bradford</given-names>
          </string-name>
          ,
          <article-title>Implementation techniques for large-scale latent semantic indexing applications</article-title>
          ,
          <source>in: Proceedings of the 20th ACM International Conference on Information and Knowledge Management (CIKM)</source>
          ,
          <year>October 2011</year>
          , Glasgow, Scotland, pp.
          <fpage>339</fpage>
          -
          <lpage>344</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Zhiqiang</given-names>
            <surname>Cai</surname>
          </string-name>
          et al.,
          <article-title>Impact of corpus size and dimensionality of LSA spaces from Wikipedia articles on AutoTutor answer evaluation</article-title>
          ,
          <source>in: Proceedings, 11th International Conference on Educational Data Mining (EDM)</source>
          ,
          <source>Jul 16-20</source>
          ,
          <year>2018</year>
          , Raleigh, NC, pp.
          <fpage>127</fpage>
          -
          <lpage>136</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Thomas K.</given-names>
            <surname>Landauer</surname>
          </string-name>
          and
          <string-name>
            <given-names>Susan T.</given-names>
            <surname>Dumais</surname>
          </string-name>
          ,
          <article-title>A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge</article-title>
          .
          <source>Psychological Review</source>
          ,
          <volume>104</volume>
          (
          <issue>2</issue>
          ), (
          <year>1997</year>
          )
          <fpage>211</fpage>
          -
          <lpage>240</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Roger</given-names>
            <surname>Bradford</surname>
          </string-name>
          ,
          <article-title>An empirical study of required dimensionality for large-scale latent semantic indexing applications</article-title>
          ,
          <source>in: Proceedings of the 17th ACM conference on Information and knowledge management (CIKM</source>
          <year>2008</year>
          ),
          <source>October 19-23</source>
          ,
          <year>2008</year>
          , Napa Valley, CA, pp.
          <fpage>153</fpage>
          -
          <lpage>162</lpage>
          . doi: https://doi.org/10.1145/ 1458082.1458105.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Stuart</given-names>
            <surname>Rose</surname>
          </string-name>
          , Dave Engel, Nick Cramer, and
          <string-name>
            <given-names>Wendy</given-names>
            <surname>Cowley</surname>
          </string-name>
          ,
          <article-title>Automatic keyword extraction from individual documents</article-title>
          .
          <source>Text Mining: Applications and Theory</source>
          <volume>1</volume>
          (
          <year>2010</year>
          ):
          <fpage>1</fpage>
          -
          <lpage>20</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Roger</given-names>
            <surname>Bradford</surname>
          </string-name>
          ,
          <article-title>Incorporating ad hoc phrases in LSI queries</article-title>
          ,
          <source>in: Proceedings, 6th International Conference on Knowledge Discovery and Information Retrieval, October 21-24</source>
          ,
          <year>2014</year>
          , Rome, Italy, pp.
          <fpage>61</fpage>
          -
          <lpage>70</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Roger</given-names>
            <surname>Bradford</surname>
          </string-name>
          ,
          <article-title>Exploiting sensitive information in background mode using latent semantic indexing</article-title>
          ,
          <source>in: Proceedings of the Sixth Workshop on Link Analysis, Counterterrorism and Security, SIAM Data Mining Conference, April 24-26</source>
          <year>2008</year>
          , Atlanta, Georgia.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Roger</given-names>
            <surname>Bradford</surname>
          </string-name>
          and
          <string-name>
            <given-names>John</given-names>
            <surname>Pozniak</surname>
          </string-name>
          ,
          <article-title>A systematic approach to design of a text categorizer</article-title>
          ,
          <source>in: Proceedings, 2016 IEEE International Conference on Systems, Man, and Cybernetics</source>
          (SMC),
          <source>October 9-12</source>
          ,
          <year>2016</year>
          , Budapest, Hungary, pp.
          <fpage>509</fpage>
          -
          <lpage>514</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Roger</given-names>
            <surname>Bradford</surname>
          </string-name>
          ,
          <article-title>Comparability of LSI and human judgment in text analysis tasks - an update</article-title>
          .
          <source>Draft, Mar</source>
          <volume>2</volume>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Roger</given-names>
            <surname>Bradford</surname>
          </string-name>
          ,
          <article-title>Use of latent semantic indexing to identify name variants in large data collections</article-title>
          ,
          <source>in: Proceedings, 2013 IEEE International Conference on Intelligence and Security Informatics (ISI</source>
          <year>2013</year>
          ), Seattle, WA,
          <fpage>27</fpage>
          -
          <lpage>32</lpage>
          . doi: 10.1109/ISI.2013.6578781.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>Roger</given-names>
            <surname>Bradford</surname>
          </string-name>
          , Word sense disambiguation,
          <year>2008</year>
          . Patent No. 7,415,462, Filed January 20,
          <year>2006</year>
          , Issued August 19,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>Anthony</given-names>
            <surname>Zukas</surname>
          </string-name>
          and
          <string-name>
            <given-names>Robert</given-names>
            <surname>Price</surname>
          </string-name>
          ,
          <article-title>Document categorization using latent semantic indexing</article-title>
          ,
          <source>in: Proceedings, Fifth Annual Symposium on Document Image Understanding Technology (SDIUT)</source>
          ,
          <year>April</year>
          ,
          <year>2003</year>
          , Greenbelt, MD, pp.
          <fpage>87</fpage>
          -
          <lpage>91</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>Roger</given-names>
            <surname>Bradford</surname>
          </string-name>
          and John Pozniak,
          <article-title>Combining modern machine translation software with LSI for cross-lingual information processing</article-title>
          ,
          <source>in: Proceedings, 11th International Conference on Information Technology: New Generations (ITNG), April 7-9</source>
          ,
          <year>2014</year>
          , Las Vegas, NV,
          <fpage>65</fpage>
          -
          <lpage>72</lpage>
          . doi: 10.1109/ITNG.2014.52.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>Roger</given-names>
            <surname>Bradford</surname>
          </string-name>
          ,
          <article-title>Federated queries and combined text and relational data</article-title>
          ,
          <year>2006</year>
          . Patent Application No. 11/434,749
          , Filed May 17,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>Weizhong</given-names>
            <surname>Zhu</surname>
          </string-name>
          , and Chaomei Chen,
          <article-title>Storylines: Visual exploration and analysis in latent semantic spaces</article-title>
          .
          <source>Computers &amp; Graphics</source>
          ,
          <volume>31</volume>
          (
          <issue>3</issue>
          ) (
          <year>2007</year>
          ):
          <fpage>338</fpage>
          -
          <lpage>349</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>Ricardo</given-names>
            <surname>Olmos</surname>
          </string-name>
          , et al.,
          <article-title>Transforming selected concepts into dimensions in latent semantic analysis</article-title>
          ,
          <source>Discourse Processes</source>
          ,
          <volume>51</volume>
          (
          <issue>5-6</issue>
          ) (
          <year>2014</year>
          ):
          <fpage>494</fpage>
          -
          <lpage>510</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>Anna</given-names>
            <surname>Sidorova</surname>
          </string-name>
          , Nicholas Evangelopoulos, Joseph S. Valacich, and Thiagarajan Ramakrishnan,
          <article-title>Uncovering the intellectual core of the information systems discipline</article-title>
          .
          <source>MIS Quarterly</source>
          (
          <year>2008</year>
          ):
          <fpage>467</fpage>
          -
          <lpage>482</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>Preslav</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <article-title>Getting better results with latent semantic indexing</article-title>
          ,
          <source>in: Proceedings of the Students Presentations at the 12th European Summer School in Logic, Language and Information (ESSLLI)</source>
          , August 6-18, Birmingham, UK, pp.
          <fpage>156</fpage>
          -
          <lpage>166</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>Dominic</given-names>
            <surname>Widdows</surname>
          </string-name>
          ,
          <article-title>Orthogonal negation in vector spaces for modelling word-meanings and document retrieval</article-title>
          ,
          <source>in: Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, July 7</source>
          ,
          <year>2003</year>
          , Volume 1, pp.
          <fpage>136</fpage>
          -
          <lpage>143</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>Nathan</given-names>
            <surname>Halko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Per-Gunnar</given-names>
            <surname>Martinsson</surname>
          </string-name>
          , and Joel Tropp,
          <article-title>Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions</article-title>
          ,
          <source>SIAM Review</source>
          ,
          <volume>53</volume>
          (
          <issue>2</issue>
          ) (
          <year>2011</year>
          ):
          <fpage>217</fpage>
          -
          <lpage>288</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>Ming</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <article-title>Subspace iteration randomization and singular value problems</article-title>
          ,
          <source>SIAM Journal on Scientific Computing</source>
          ,
          <volume>37</volume>
          (
          <issue>3</issue>
          ) (
          <year>2015</year>
          ):
          <fpage>1139</fpage>
          -
          <lpage>1173</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>Per-Gunnar</given-names>
            <surname>Martinsson</surname>
          </string-name>
          ,
          <article-title>Randomized methods for matrix computations</article-title>
          ,
          <source>in: The Mathematics of Data</source>
          , IAS/Park City Mathematics Series,
          <volume>25</volume>
          (
          <issue>4</issue>
          )
          <year>2018</year>
          , pp.
          <fpage>187</fpage>
          -
          <lpage>231</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>Roger</given-names>
            <surname>Bradford</surname>
          </string-name>
          ,
          <article-title>Machine translation using vector space representations</article-title>
          ,
          <year>2010</year>
          . Patent No. 7,765,098, Filed April 24,
          <year>2006</year>
          , Issued July 27,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>Phoebe</given-names>
            <surname>Liu</surname>
          </string-name>
          , Dylan Glas, Takayuki Kanda, and Hiroshi Ishiguro,
          <article-title>Data-driven HRI: Learning social behaviors by example from human-human interaction</article-title>
          ,
          <source>IEEE Transactions on Robotics</source>
          ,
          <volume>32</volume>
          , no.
          <issue>4</issue>
          (
          <year>2016</year>
          ):
          <fpage>988</fpage>
          -
          <lpage>1008</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>Francesco</given-names>
            <surname>Agostaro</surname>
          </string-name>
          , et al.,
          <article-title>A conversational agent based on a conceptual interpretation of a data driven semantic space</article-title>
          ,
          <source>in: Proceedings, Congress of the Italian Association for Artificial Intelligence, September 21-23</source>
          ,
          <year>2005</year>
          , Milan, Italy, pp.
          <fpage>381</fpage>
          -
          <lpage>392</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>Robert</given-names>
            <surname>Speer</surname>
          </string-name>
          , Catherine Havasi, and Henry Lieberman,
          <article-title>AnalogySpace: Reducing the dimensionality of common-sense knowledge</article-title>
          ,
          <source>in: Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence, Jul 13</source>
          ,
          <year>2008</year>
          , vol.
          <volume>8</volume>
          , pp.
          <fpage>548</fpage>
          -
          <lpage>553</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>Trevor</given-names>
            <surname>Cohen</surname>
          </string-name>
          , Brett Blatter, and Vimla Patel,
          <article-title>Simulating expert clinical comprehension: Adapting latent semantic analysis to accurately extract clinical concepts from psychiatric narrative</article-title>
          ,
          <source>Journal of Biomedical Informatics</source>
          <volume>41</volume>
          , no.
          <issue>6</issue>
          (
          <year>2008</year>
          ):
          <fpage>1070</fpage>
          -
          <lpage>1087</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>