<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Flexible Similarity Search of Semantic Vectors Using Fulltext Search Engines</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michal Růžička</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vít Novotný</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Petr Sojka</string-name>
          <email>sojka@fi.muni.cz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jan Pomikálek</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Radim Řehůřek</string-name>
          <email>radim@rare-technologies.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Informatics, Masaryk University</institution>
          ,
          <addr-line>Botanická 68a, 602 00 Brno, Czechia</addr-line>
          (ORCID: 0000-0001-5547-8720, 0000-0002-3303-4130, 0000-0002-5768-4007)
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>RaRe Technologies</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Vector representations and vector space modeling (VSM) play a central role in modern machine learning. In our recent research we proposed a novel approach to 'vector similarity searching' over dense semantic vector representations. This approach can be deployed on top of traditional inverted-index-based fulltext engines, taking advantage of their robustness, stability, scalability and ubiquity. In this paper we validate our method using varied datasets ranging from text representations and embeddings (LSA, doc2vec, GloVe) to SIFT descriptors of image data. We show how our approach handles the indexing and querying in these domains, building a fast and scalable vector database with a tunable trade-off between vector search performance and quality, backed by a standard fulltext engine such as Elasticsearch.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The most important challenge in document similarity searches is the retrieval
of documents that are relevant to the query [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Responding to this challenge
involves a full understanding and representation of the document’s semantics
and searching based on them. Decades of research and development have led
to mature high-dimensional vector space models [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] for the representation of
various data using distributional semantics and deep learning [
        <xref ref-type="bibr" rid="ref1 ref11 ref2">2,1,11</xref>
        ].
      </p>
      <p>
        In our recent paper [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], we introduced a novel method of encoding semantic
vectors to so-called ‘feature tokens’. These tokens can then be indexed and
searched using standard inverted-index-based information retrieval (IR) systems
such as Sphinx, Lucene, Elasticsearch, or Solr, all of which are freely available.
This allows us to exploit the years of development behind modern fulltext engines,
meeting the scalability and robustness demands of modern IR applications.
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], we evaluated our method using Latent Semantic Analysis (LSA) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
on English Wikipedia, which comprises over 4 million articles. Cosine similarity
was used to rank and evaluate retrieval results. In this paper, we evaluate our
method on datasets with different representations using new similarity measures.
      </p>
    </sec>
    <sec id="sec-1a">
      <title>Semantic Vector Encoding for Inverted-Index-Based IR Systems</title>
      <p>
        The standard representation of documents in the Vector Space Model (VSM) [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]
uses term feature vectors of extremely high dimensionality. To map the feature
space onto a smaller and denser latent semantic subspace, one may use a body
of techniques, including Latent Semantic Analysis (LSA) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], Latent Dirichlet
Allocation (LDA) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] or the many variants of Locality-sensitive hashing (LSH) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
Dimensionality reduction techniques proposed in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] allow a consistent speedup,
showing that not all features are equally discriminative: due to their density
distributions, features differ in their impact on efficiency.
      </p>
      <p>
        We decided to combine this knowledge with robust and optimized ready-to-use
IR systems such as Elasticsearch [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. To that end, we encode the vector features
of document vectors into string tokens – so-called ‘feature tokens’ – which are
indexed by standard IR systems such as Elasticsearch.
      </p>
      <p>At query time, we extract feature tokens from the query document and use
them to quickly retrieve similar vectors using standard fulltext searches. This
small set of candidate vectors is then reranked by calculating a slow but exact
similarity metric (such as cosine similarity) between every candidate vector and
the query vector, which gives us an ordered list of matching vectors (documents).</p>
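      <p>To illustrate the reranking step, here is a minimal sketch (our own illustration,
not the ScaleText implementation) that scores a candidate set with exact cosine
similarity using NumPy:</p>
      <preformat>
import numpy as np

def rerank_candidates(query_vec, candidate_vecs, top_k=10):
    # Rerank the small candidate set returned by the fulltext search
    # using exact cosine similarity; only these few vectors are scored.
    q = query_vec / np.linalg.norm(query_vec)
    c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    sims = c @ q                       # cosine similarity of each candidate
    order = np.argsort(-sims)[:top_k]  # indices of the best candidates first
    return order, sims[order]
      </preformat>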
      <p>Feature tokens are constructed from the vector features by encoding the
feature position and the feature value at a selected precision, effectively making
some of the previously distinct feature values the same for the search engine.</p>
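      <p>As an illustration, a token constructor might look as follows; the token shape
is our own simplification of the encoding described in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], assuming rounding to one decimal place:</p>
      <preformat>
def encode_features(vector, precision=1):
    # Encode a dense vector into string 'feature tokens': each token
    # records the feature position and its value rounded to `precision`
    # decimal places, so nearby feature values collapse into one token.
    # The token shape is illustrative, not the exact ScaleText encoding.
    return [f"f{pos}_{round(val, precision):+.{precision}f}"
            for pos, val in enumerate(vector)]

# encode_features([0.123, -0.456]) returns ['f0_+0.1', 'f1_-0.5']
      </preformat>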
      <p>To further speed up the search process, we use two ‘high-pass filtering’
techniques on query feature tokens: trimming discards features whose values are
below a fixed threshold, and best keeps only a fixed number of features with the
greatest absolute values.</p>
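      <p>A sketch of both filters, under our reading of the description above (with
trimming applied to feature magnitudes):</p>
      <preformat>
import numpy as np

def high_pass_filter(vector, threshold=0.0, best=None):
    # Zero out weak query features before feature tokens are extracted.
    # Trimming: discard features whose magnitude is below `threshold`.
    # Best: keep only the `best` features with the largest magnitudes.
    v = np.asarray(vector, dtype=float).copy()
    strong = np.abs(v) >= threshold
    v[~strong] = 0.0
    if best is not None and v.size > best:
        keep = np.argsort(np.abs(v))[-best:]  # positions of the best features
        mask = np.zeros(v.size, dtype=bool)
        mask[keep] = True
        v[~mask] = 0.0
    return v
      </preformat>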
      <p>The intuition behind the encoding scheme is a trade-off between increasing
feature sparsity and retaining search quality. We show that some types of
sparsification only lead to a negligible loss in quality, allowing an efficient use of
inverted-index IR engines.</p>
      <p>
        A detailed description of our method is available in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. In the rest of
this paper, we will evaluate our method on datasets with diverse vector-space
representations and using various measures to rerank the candidate vectors. The
results show that our method works reasonably well in all the tested cases.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Tested Datasets and Experimental Setup</title>
      <p>
        To evaluate our method of document similarity search, we used ScaleText [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]
based on Elasticsearch [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] as our IR system. Elasticsearch was used in its ‘vanilla’
setup without any of its advanced features, such as custom scoring,
tokenization, or n-gram analyzers. Our method requires only basic text retrieval
functionality, so using a fulltext engine other than Elasticsearch is quite
straightforward.
      </p>
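      <p>For illustration, the basic indexing and retrieval calls could look as follows
with the official elasticsearch Python client (version 8.x assumed); the index and
field names are our own and the actual ScaleText schema may differ:</p>
      <preformat>
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumes a local instance

# Index a document whose vector was encoded into whitespace-separated
# feature tokens; 'vectors' and 'feature_tokens' are illustrative names.
es.index(index="vectors", id="doc-1",
         document={"feature_tokens": "f0_+0.1 f1_-0.5 f2_+0.3"})

# Retrieve candidate documents with a plain match query; `size`
# corresponds to the page parameter used in our experiments.
hits = es.search(index="vectors",
                 query={"match": {"feature_tokens": "f0_+0.1 f2_+0.3"}},
                 size=100)["hits"]["hits"]
      </preformat>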
      <p>In our experiments, we used the following datasets:</p>
      <p>en-wiki: The English Wikipedia dataset, which consists of 4,181,352 articles.</p>
      <p>
        We converted all the documents into vectors using (1) LSA [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] with 400
dimensions (the same dataset as was used in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]), and (2) doc2vec [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] with 400 dimensions.
      </p>
      <p>
        wiki-2014+gigaword-5: A dataset of pre-trained word vectors using the GloVe
representation on the export of Wikipedia from early 2014 and on the English
Gigaword Fifth Edition (https://catalog.ldc.upenn.edu/LDC2011T07) data, as
provided by the GloVe project [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
The vectors were trained on a corpus of 6 billion tokens (uncased) with
400 thousand vocabulary entries. The dataset is provided in four variants
with 50, 100, 200, and 300 dimensions per vector.
      </p>
      <p>
        common-crawl: A dataset of pre-trained word vectors using the GloVe
representation on the Common Crawl project (https://commoncrawl.org/) data,
as provided by the GloVe project.
      </p>
      <p>The dataset was provided in two variants, each with a different training
corpus: (1) 42 billion tokens (uncased) with 1.9 million vocabulary entries,
and (2) 840 billion tokens (cased) with 2.2 million vocabulary entries. In both
cases, 300 dimensions per vector were used.</p>
      <p>twitter: A dataset of pre-trained word vectors using the GloVe representation on
the export of 2 billion short tweet messages from the Twitter social network, as
provided by the GloVe project.</p>
      <p>
        The vectors were trained on a corpus of 27 billion tokens (uncased) with
1.2 million vocabulary entries. The dataset is provided in four variants with
25, 50, 100, and 200 dimensions per vector.
      </p>
      <p>
        texmex: A dataset of vectors for the SIFT descriptors of image data [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] as
provided by the TEXMEX project (http://corpus-texmex.irisa.fr/).
      </p>
      <p>We have used two datasets provided by the project: (1) ANN_SIFT10K dataset
consisting of 10 thousand vectors, and (2) ANN_SIFT1M dataset consisting of
1 million vectors. In both cases, 128 vector dimensions were used.</p>
      <p>Please note that, except for the en-wiki-lsa and en-wiki-doc2vec datasets,
we did not generate the semantic vectors ourselves. Rather, we used the
precomputed vectors as provided. The unmodified vectors were indexed using the
ScaleText IR system.
      </p>
      <sec id="sec-2-1">
        <title>Evaluation Method</title>
        <p>
The aim of our evaluation was to investigate how well our approximate nearest
neighbor search performs in comparison with an exact brute-force search on the
above datasets. We decided to use cosine similarity as the similarity metric for
the reranking of candidate vectors.</p>
      <p>
        From each dataset, we randomly selected 1,000 document vectors to act as
our query vectors. By doing an exact brute-force search over all vectors in the
dataset, we identified the 10 most similar document vectors for each query vector.
This became our ‘gold standard’.
      </p>
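      <p>A sketch of how such a gold standard can be computed (our own illustration;
for millions of vectors the similarity matrix would be computed in batches):</p>
      <preformat>
import numpy as np

def gold_standard(queries, corpus, top_k=10):
    # Exact brute-force top-k neighbours by cosine similarity.
    # queries: (n_queries, dim), corpus: (n_docs, dim); returns an
    # (n_queries, top_k) array of document indices, most similar first.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    sims = q @ c.T            # all query-document cosine similarities
    return np.argsort(-sims, axis=1)[:, :top_k]
      </preformat>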
      <p>For each of the 1,000 query vectors, we performed a fulltext search. We
measured the relevance of the retrieved documents using three metrics: Precision
at 10 (Precision@10), which measures the overlap between the retrieved documents
and the gold standard, the Normalized Discounted Cumulative Gain at 10
(NDCG@10) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        , Section 8.4], where the relevance value of a retrieved document
vector was taken to be its cosine similarity to the query vector, and the mean
cumulative loss at 10 (Avg. dif.@10), which compares cosine similarities between
the top 10 retrieved document vectors and the query vector with cosine similarities
between the 10 most similar document vectors and the query vector.</p>
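      <p>The three metrics can be sketched as follows; this is our reading of the
definitions above, with NDCG following the standard formulation of [
        <xref ref-type="bibr" rid="ref10">10</xref>
        , Section 8.4]:</p>
      <preformat>
import numpy as np

def precision_at_k(retrieved_ids, gold_ids, k=10):
    # Overlap between the top-k retrieved documents and the gold standard.
    return len(set(retrieved_ids[:k]).intersection(gold_ids[:k])) / k

def ndcg_at_k(retrieved_sims, gold_sims, k=10):
    # NDCG@k with graded relevance taken to be the cosine similarity of
    # each document to the query; both lists are ordered best first.
    discounts = 1.0 / np.log2(np.arange(2, k + 2))
    dcg = np.sum(np.asarray(retrieved_sims[:k]) * discounts)
    idcg = np.sum(np.asarray(gold_sims[:k]) * discounts)
    return dcg / idcg

def avg_diff_at_k(retrieved_sims, gold_sims, k=10):
    # Mean cumulative loss: average similarity gap between the top-k
    # retrieved documents and the true top-k nearest neighbours.
    return float(np.mean(np.asarray(gold_sims[:k]) -
                         np.asarray(retrieved_sims[:k])))
      </preformat>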
      <p>During the evaluation, we experimented with various configurations of the
high-pass filtering (the trimming and best parameters described in Section 2)
and with the number of vectors retrieved from Elasticsearch (the page parameter)
to see their impact on the quality of results.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>In this section we present multiple graphs showing the impact of the number
of best features selected (with no trimming) and the page size (the number
of search results retrieved from Elasticsearch) on Precision@10, NDCG@10 and
Avg. dif.@10 for numerous datasets (see Section 3).
      </p>
      <p>All of the graphs show that the documents retrieved with no high-pass filtering
of query features closely approximate the gold standard, whereas reducing the
number of features decreases the relevance of the retrieved documents by all
metrics.</p>
      <p>Comparing the results for the wiki-2014+gigaword-5-glove and twitter-glove
datasets, where multiple variants with different dimensionalities are available
(see Figures 3, 4, 5, 6, and Figures 9, 10, 11, 12), it seems that the dimensionality
of the semantic vectors should not be significantly lower than 200 to achieve the
best results. High-pass filtering using only the 90 best features results in only a small
decrease in the relevance of retrieved documents while speeding up the search.
This observation holds for datasets with a higher (≥ 200) original dimensionality
(see Figures 1, 2, 3, 7, or 8). Even with these datasets, there is no clear benefit
to using more than 90 best features for querying.</p>
      <p>Each of the figures referenced in this section plots (a) Precision@10, (b) NDCG@10,
and (c) Avg. dif.@10 against the page size (100 to 600) for different numbers of best
features used (all, 90, 40, 17, and 6). The graphs themselves are not reproduced here;
their captions follow.</p>
      <p>Fig. 2. The impact of the configurations for the en-wiki-doc2vec-400d dataset. The
markedly low Precision@10 compared to other datasets may indicate a suboptimal
setting of the doc2vec algorithm configuration parameters. Compare this to the results
for LSA in Figure 1.</p>
      <p>Fig. 3. The impact of the configurations for the wiki-2014+gigaword-5-glove-300d
dataset. Except for the slightly lower Precision@10, our search method achieves
results for the GloVe representation comparable to those for the LSA representation
of the similar dataset en-wiki-lsa-400d in Figure 1.</p>
      <p>Fig. 4. The impact of the configurations for the wiki-2014+gigaword-5-glove-200d
dataset. Lowering the dimensionality of the dataset from 300 to 200 (cf. Figure 3) has
virtually no impact on the results.</p>
      <p>Fig. 5. The impact of the configurations for the wiki-2014+gigaword-5-glove-100d
dataset. Lowering the dimensionality of the dataset from 200 to 100 (cf. Figure 4)
results in a small decrease in the relevance of the retrieved documents.</p>
      <p>Fig. 6. The impact of the configurations for the wiki-2014+gigaword-5-glove-50d
dataset. Lowering the dimensionality of the dataset from 100 to 50 (cf. Figure 5) results
in a further decrease in the relevance of the retrieved documents. A comparison of the
curves reveals that when more dimensions are used (cf. Figures 3, 4, and 5), the documents
retrieved with a smaller page size are more affected by the decrease in the dimensionality
than the documents retrieved with a larger page size.</p>
      <p>Fig. 7. The impact of the configurations for the common-crawl-glove-840B-300d dataset.
Using the GloVe representation in 300 dimensions on the Common Crawl project data
gives us results that are almost identical to the results for the GloVe representation of
the wiki-2014+gigaword-5-glove dataset with the same dimensionality (cf. Figure 3).
This showcases the stability of our search method.</p>
      <p>Fig. 8. The impact of the configurations for the common-crawl-glove-42B-300d dataset.
Compared to the results for the common-crawl-glove-840B-300d dataset with 840 billion
cased tokens in Figure 7, a slightly lower Precision@10 is reached with the 42 billion
uncased tokens used in this dataset.</p>
      <p>Fig. 9. The impact of the configurations for the twitter-glove-200d dataset. Compared
to the results for the wiki-2014+gigaword-5-glove-200d dataset with the same
dimensionality in Figure 4, the relevance of the retrieved documents is slightly lower for
this dataset. The documents retrieved with a lower page size are more affected.</p>
      <p>Fig. 10. The impact of the configurations for the twitter-glove-100d dataset. The impact
of the decrease in the dimensionality from 200 to 100 (cf. Figure 9) is comparable to
the same decrease for the wiki-2014+gigaword-5-glove dataset (cf. Figures 4 and 5).</p>
      <p>Fig. 11. The impact of the configurations for the twitter-glove-50d dataset. Since the
vectors in the dataset consist of only 50 features, selecting the 90 best features leads to
the same results as using all features. Selecting the 40 best features then leads to only a
small decrease in the relevance of the retrieved documents.</p>
      <p>Fig. 12. The impact of the configurations for the twitter-glove-25d dataset. Using only
25 features results in a slight decrease in Precision@10 compared to using 50 features
(cf. Figure 11).</p>
      <p>Fig. 13. The impact of the configurations for the texmex-sift-1M-128d dataset.
Despite using SIFT descriptors of image data rather than vectors derived from text
documents, the curves are shaped similarly to the curves for the text datasets of
Wikipedia+Gigaword and Twitter with comparable dimensionality (cf. Figures 5 and 10).
However, the relevance of the retrieved documents is significantly higher for this dataset.</p>
      <p>The absolute quality of the results improves all the way up to and including
the maximum tested page size. This is expected, as increasing the number of
candidate vectors preselected by Elasticsearch increases the chance of retrieving
relevant documents. It is expected that a further increase in the page size would
lead to further improvements in the relevance of the retrieved documents at the
cost of performance.</p>
      <p>The absolute values of Precision@10 show a significant drop in the relevance of
the retrieved documents for datasets with low (≤ 100) dimensionalities. However,
NDCG@10 and Avg. dif.@10 are almost unaffected. We expect that this is the
result of the higher mutual similarity of the low-dimensional vectors, which
makes it difficult to retrieve precisely the gold standard documents. Precision@10
penalizes the retrieval of all documents with the exception of those in the gold
standard, whereas NDCG@10 and Avg. dif.@10 will rate any document vectors
similar to those in the gold standard as relevant.</p>
      <p>We have concluded that the unusually low Precision@10 for the 400-dimensional
en-wiki-doc2vec dataset (see Figure 2) has a similar cause. Investigating the document
vectors in the dataset, we discovered that the feature values lack any ‘peaks’ that
determine the main topics of the documents in other datasets such as en-wiki-lsa
(see Figures 15a and 15b). This gives all the en-wiki-doc2vec document vectors a
strong mutual similarity. Tuning the doc2vec parameters to better differentiate
between the documents could be expected to solve this issue.</p>
      <p>
        In general, the results on the other datasets confirm the behaviour of our
method tested on the single en-wiki-lsa dataset in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], which is what we set out
to verify in this paper.
      </p>
      <p>Since the texmex dataset is provided together with sets of query vectors and the
corresponding ground truth for approximate nearest neighbor search evaluation,
we wanted to take advantage of these independent relevance judgements to further
validate our method. To do so, we needed to modify our evaluation method to
be compatible with the provided query vectors and the ground truth.</p>
      <p>The texmex vectors consist only of integers, so we divided the vectors by
the Euclidean norm of the largest vector in the dataset prior to indexing them
in ScaleText. This was to ensure that the ScaleText string representation of
vectors (feature tokens), which expects all feature values to have a fractional
part that can be rounded to a given precision, works properly. This precondition
is typically ensured by the normalization of the indexed vectors by ScaleText,
which we disabled here to retain Euclidean distances between the vectors. The
only exception is trimming, which still uses a normalized version of the vectors,
so that vectors close to the origin are not completely ‘trimmed away’ in this step.</p>
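      <p>A sketch of this preprocessing step (our own illustration):</p>
      <preformat>
import numpy as np

def scale_by_max_norm(vectors):
    # Divide every vector by the Euclidean norm of the largest vector,
    # bringing the integer SIFT features into [0, 1] so that rounding
    # during feature-token construction is meaningful, while preserving
    # the ratios of the Euclidean distances between the vectors.
    vectors = np.asarray(vectors, dtype=float)
    return vectors / np.linalg.norm(vectors, axis=1).max()
      </preformat>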
      <p>
        Instead of the cosine similarity, x1 · x2 / (‖x1‖ ‖x2‖), the candidate vectors are
reranked using the generalized Radial Basis Function (RBF) kernel with γ = 1,
exp(−γ ‖x1 − x2‖²), which is a form of Euclidean similarity [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The generalized
RBF kernel is also used for the computation of NDCG@10 and Avg. dif.@10
instead of cosine similarity. Therefore, the absolute values of NDCG@10 and Avg.
dif.@10 for the cosine and Euclidean similarities cannot be directly compared.
Precision@10 does not use similarities, so the results for the cosine and Euclidean
similarities can be compared.
      </p>
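      <p>For reference, a minimal sketch of the similarity used here:</p>
      <preformat>
import numpy as np

def rbf_similarity(x, y, gamma=1.0):
    # Generalized RBF kernel exp(-gamma * ||x - y||^2), a form of
    # Euclidean similarity [6]; gamma = 1 in our texmex experiments.
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))
      </preformat>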
      <p>Results are shown in Figures 16 and 17. The shapes and the relative positions
(in the case of NDCG@10 and Avg. dif.@10) and even the absolute positions (in
the case of Precision@10) of the curves are very similar to those for the cosine
similarity (cf. Figures 13 and 14).</p>
      <p>For the large (1M) SIFT dataset, using Euclidean similarity leads to slightly
lower Precision@10 compared to cosine similarity. For the small (10k) SIFT
dataset, there is no clear difference.</p>
      <p>
In this paper, we have demonstrated that our novel method of semantic vector
encoding for inverted-index IR systems proposed in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] works well not only on
text corpora using traditional latent semantic techniques such as LSA with cosine
similarity, but also on datasets from other domains, such as SIFT descriptors of
image data, and using different representations such as doc2vec and GloVe with
new similarity measures. Our method of vector encoding and filtering speeds up
the search process, achieving an excellent approximation of the gold standard
for all tested datasets, making our method suitable for diverse data sources and
application domains.
      </p>
      <p>Acknowledgments. Funding by TA ČR Omega grant TD03000295 is gratefully
acknowledged.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Blei</surname>
            ,
            <given-names>D.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>A.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jordan</surname>
            ,
            <given-names>M.I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lafferty</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Latent Dirichlet Allocation</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>3</volume>
          ,
          <fpage>993</fpage>
          -
          <lpage>1022</lpage>
          (
          <year>Mar 2003</year>
          ), http://dl.acm.org/citation.cfm?id=944919.944937
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Deerwester</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dumais</surname>
            ,
            <given-names>S.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Furnas</surname>
            ,
            <given-names>G.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Landauer</surname>
            ,
            <given-names>T.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harshman</surname>
          </string-name>
          , R.:
          <article-title>Indexing by Latent Semantic Analysis</article-title>
          .
          <source>Journal of the American Society for Information Science</source>
          <volume>41</volume>
          (
          <issue>6</issue>
          ),
          <fpage>391</fpage>
          -
          <lpage>407</lpage>
          (
          <year>1990</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Digout</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nascimento</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Coman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Similarity search and dimensionality reduction: Not all dimensions are equally useful</article-title>
          . In: Lee, Y., Li, J., Whang, K.Y., Lee, D. (eds.)
          <source>Proc. of Database Systems for Advanced Applications: 9th Int. Conf., DASFAA 2004, Jeju Island, Korea, March 17-19, 2004</source>
          . pp.
          <fpage>831</fpage>
          -
          <lpage>842</lpage>
          . Springer (
          <year>2004</year>
          ), http://dx.doi.org/10.1007/978-3-540-24571-1_73
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Gionis</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Indyk</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Motwani</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Similarity search in high dimensions via hashing</article-title>
          .
          <source>In: VLDB '99, Proceedings of 25th International Conference on Very Large Data Bases, September 7-10, 1999, Edinburgh, Scotland, UK</source>
          . pp.
          <fpage>518</fpage>
          -
          <lpage>529</lpage>
          (
          <year>1999</year>
          ), http://www.vldb.org/conf/1999/P49.pdf
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Gormley</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tong</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <source>Elasticsearch: The Definitive Guide</source>
          . O'Reilly Media, Inc., 1st edn. (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Haasdonk</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bahlmann</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Learning with Distance Substitution Kernels</article-title>
          , pp.
          <fpage>220</fpage>
          -
          <lpage>227</lpage>
          . Springer Berlin Heidelberg, Berlin, Heidelberg (
          <year>2004</year>
          ), https://doi.org/10.1007/978-3-540-28649-3_27
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Jegou</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Douze</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmid</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Product quantization for nearest neighbor search</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>33</volume>
          (
          <issue>1</issue>
          ),
          <fpage>117</fpage>
          -
          <lpage>128</lpage>
          (
          <year>Jan 2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>Q.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Distributed representations of sentences and documents</article-title>
          .
          <source>CoRR abs/1405</source>
          .4053 (
          <year>2014</year>
          ), http://arxiv.org/abs/1405.4053
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , et al.:
          <article-title>Semantic matching in search</article-title>
          .
          <source>Foundations and Trends® in Information Retrieval</source>
          <volume>7</volume>
          (
          <issue>5</issue>
          ),
          <fpage>343</fpage>
          -
          <lpage>469</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raghavan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schütze</surname>
          </string-name>
          , H.: Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          .
          <source>In: Advances in Neural Information Processing Systems</source>
          . pp.
          <fpage>3111</fpage>
          -
          <lpage>3119</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Pennington</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Socher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
          </string-name>
          , C.D.: Glove:
          <article-title>Global vectors for word representation</article-title>
          .
          <source>In: Empirical Methods in Natural Language Processing (EMNLP)</source>
          . pp.
          <fpage>1532</fpage>
          -
          <lpage>1543</lpage>
          (
          <year>2014</year>
          ), http://www.aclweb.org/anthology/D14-1162
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Rygl</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pomikálek</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Řehůřek</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Růžička</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Novotný</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sojka</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Semantic vector encoding and similarity search using fulltext search engines</article-title>
          (
          <year>2017</year>
          ), https://arxiv.org/abs/1706.00957, preprint of the paper accepted to the
          <source>ACL 2017 workshop RepL4NLP</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Rygl</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sojka</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Růžička</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Řehůřek</surname>
          </string-name>
          , R.:
          <article-title>ScaleText: The Design of a Scalable, Adaptable and User-Friendly Document System for Similarity Searches: Digging for Nuggets of Wisdom in Text</article-title>
          . In: Horák, A., Rychlý, P., Rambousek, A. (eds.)
          <source>Proceedings of the Tenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2016</source>
          . pp.
          <fpage>79</fpage>
          -
          <lpage>87</lpage>
          . Tribun EU, Brno (
          <year>2016</year>
          ), https://nlp.fi.muni.cz/raslan/2016/paper08-Rygl_Sojka_etal.pdf
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Salton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Buckley</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Term-weighting approaches in automatic text retrieval</article-title>
          .
          <source>Information Processing and Management</source>
          <volume>24</volume>
          ,
          <fpage>513</fpage>
          -
          <lpage>523</lpage>
          (
          <year>1988</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Salton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wong</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>C.S.</given-names>
          </string-name>
          :
          <article-title>A vector space model for automatic indexing</article-title>
          .
          <source>Communications of the ACM</source>
          <volume>18</volume>
          (
          <issue>11</issue>
          ),
          <fpage>613</fpage>
          -
          <lpage>620</lpage>
          (
          <year>1975</year>
          ), https://doi.org/10.1145/361219.361220
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Tange</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>GNU Parallel - The Command-Line Power Tool</article-title>
          .
          <source>The USENIX Magazine</source>
          <volume>36</volume>
          (
          <issue>1</issue>
          ),
          <fpage>42</fpage>
          -
          <lpage>47</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>