<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Selection of Optimal Parameters in the Fast K-Word Proximity Search Based on Multi-component Key Indexes</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Ural Federal University</institution>
          ,
          <addr-line>620002 Mira street, Yekaterinburg</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <fpage>336</fpage>
      <lpage>350</lpage>
      <abstract>
<p>Proximity full-text search is commonly implemented in contemporary full-text search systems. Let us assume that the search query is a list of words. It is natural to consider a document as relevant if the queried words are near each other in the document. The proximity factor is even more significant for the case where the query consists of frequently occurring words. Proximity full-text search requires the storage of information for every occurrence in documents of every word that the user can search. For every occurrence of every word in a document, we employ additional indexes to store information about nearby words, that is, the words that occur in the document at distances from the given word of less than or equal to the MaxDistance parameter. We showed in previous works that these indexes can be used to improve the average query execution time by up to 130 times for queries that consist of high-frequency words. In this paper, we consider how both the search performance and the search quality depend on the value of MaxDistance and other parameters. The well-known GOV2 text collection is used in the experiments for reproducibility of the results. We propose a new index schema after the analysis of the results of the experiments.</p>
      </abstract>
      <kwd-group>
        <kwd>Full-text search</kwd>
        <kwd>Inverted indexes</kwd>
        <kwd>Proximity search</kwd>
        <kwd>Term proximity</kwd>
        <kwd>Query processing</kwd>
        <kwd>DAAT</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
      <p>
        In full-text search, a query is a list of words. The result of the search is a list
of documents containing these words. Consider a query that consists of words
occurring with high-frequency. In [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], we improved the average query processing
time by a factor of 130 for these queries relative to traditional inverted indexes.
A methodology for high-performance proximity full-text searches that covers
different types of queries was discussed in [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
      <p>
        The factor of proximity or nearness between the queried words in the indexed
document plays an important role in modern information retrieval [
        <xref ref-type="bibr" rid="ref12 ref19 ref3">19, 12, 3</xref>
        ].
We assume that a document should contain query words near each other to be
relevant for the user in the context of the search query. Taking this factor into
account is essential if the query consists of frequently occurring words.
      </p>
      <p>
Some words occur in texts significantly more frequently than others. We can
illustrate this [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] by referring to Zipf's law [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. An example of a typical word
occurrence distribution is presented in Fig. 1. The horizontal axis represents
different words in decreasing order of their occurrence in texts. On the vertical
axis, we plot the number of occurrences of each word. This peculiarity of language
has a strong influence on the search performance.
      </p>
      <p>
        The inverted index [
        <xref ref-type="bibr" rid="ref2 ref21">21, 2</xref>
        ] is a commonly used data structure for full-text
searches. The traditional inverted index contains records (ID, P), where ID is
the identifier of the document and P is the position of the word in the document,
for example, the ordinal number of the word in the document. Each such record
corresponds to an occurrence of a word in a document. These (ID, P) records are
called "postings". The inverted index enables us to obtain the list of postings that
corresponds to a given word. These lists of postings are used for the search.
      </p>
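<p>As a minimal illustrative sketch (not the implementation used in this work), a word-level inverted index can be built as a mapping from each word to its list of (ID, P) postings; the function name and the toy document set below are ours:</p>

```python
from collections import defaultdict

def build_word_level_index(docs):
    """Map each word to its list of (ID, P) postings, where ID is the
    document number and P is the ordinal position of the word."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for pos, word in enumerate(text.lower().split()):
            index[word].append((doc_id, pos))
    return dict(index)

docs = ["to be or not to be", "the question remains"]
index = build_word_level_index(docs)
print(index["be"])  # postings for "be": [(0, 1), (0, 5)]
```

<p>Searching then amounts to intersecting and merging the posting lists of the queried words.</p>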
      <p>
For proximity full-text searches, we need to store an (ID, P) record for every
occurrence of every word in the indexed document [
        <xref ref-type="bibr" rid="ref15 ref8 ref9">15, 8, 9</xref>
]. In other words, for
proximity searches, we require a word-level inverted index instead of a
document-level index [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Consequently, if a word occurs frequently in the texts, then its
list of postings is long [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. The query search time is proportional to the number
of occurrences of the queried words in the indexed documents. To process a
search query that contains high-frequency words, a search system
requires much more time, as shown on the left side of Fig. 1, than a query that
contains only ordinary words, as shown on the right side of Fig. 1.
      </p>
      <p>
        According to [
        <xref ref-type="bibr" rid="ref13">13</xref>
], we can consider a full-text query as a "simple inquiry"
[
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. In this case, we may require that the search results be provided within two
seconds, as stated in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], to prevent the interruption of the thought continuity
of the user. To enhance the performance, the following approaches can be used.
      </p>
      <p>
        Early-termination approaches can be employed for full-text searches [
        <xref ref-type="bibr" rid="ref1 ref11">1, 11</xref>
        ].
However, these methods are not effective in the case of proximity full-text
searches [
        <xref ref-type="bibr" rid="ref16">16</xref>
]. Usually, early-termination approaches are applied to
document-level indexes. It is difficult to combine the early-termination approach with the
incorporation of term-proximity data into relevance models.
      </p>
      <p>
        Additional indexes can improve the search performance. In [
        <xref ref-type="bibr" rid="ref14 ref18">14, 18</xref>
        ],
additional indexes were used to improve phrase searches. However, the approaches
reported in [
        <xref ref-type="bibr" rid="ref14 ref18">14, 18</xref>
] cannot be used for proximity full-text searches. Their area
of application is limited to phrase searches. We have overcome this limitation.
      </p>
      <p>
With our additional indexes, an arbitrary query can be executed very
quickly [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
      <p>
        In an example given in [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], we indexed a subset of the Project Gutenberg
web site using Apache Lucene and Apache Tika and performed some searches. A
query that contained words occurring with high-frequency was evaluated within
21 sec. On the other hand, another query that contained ordinary words was
evaluated within 172 milliseconds. This di erence is considerable. However, our
algorithms reported in [
        <xref ref-type="bibr" rid="ref16 ref17">16, 17</xref>
        ] can help address this issue.
      </p>
<p>The goals and questions of this paper are as follows:
1) We examine how search performance improves for different values of
MaxDistance and other parameters.
2) We investigate search performance on a commonly used text
collection.
3) Does the search performance depend on the document size?
4) What other factors can affect the performance?
5) We evaluate the performance with respect to both short and long queries.</p>
      <p>
        Key points of our research are the following. We use the word-level index.
We use the DAAT approach [
        <xref ref-type="bibr" rid="ref10 ref6">6, 10</xref>
]. We include in the indexes information about
all lemmas. Our indexes support incremental updates. We can store one posting
list in several data streams. We read entire posting lists when searching; no
early-termination techniques are used.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Lemmatization and Lemma Type</title>
<p>For this paper, we use an English dictionary with approximately 92 thousand
lemmas. This dictionary is used by a morphology analyzer. The analyzer
produces a list of numbers of lemmas, that is, basic or canonical forms, for every
word from the dictionary. Usually, a word has one lemma, but some words have
several lemmas. For example, the word "mine" has two lemmas, namely, "mine"
and "my".</p>
      <p>
Consider an array of all lemmas. Let us sort all lemmas in decreasing order
of their occurrence frequency in the texts. We call the result of this sorting the
FL-list [
        <xref ref-type="bibr" rid="ref17">17</xref>
]. The number of a lemma w in the FL-list is called its FL-number [
        <xref ref-type="bibr" rid="ref17">17</xref>
]
and is denoted by FL(w). Let us say that lemma "earth" &gt; "day" because
FL(earth) = 309, FL(day) = 199, and 309 &gt; 199. We use the FL-numbers to
define the order of the lemmas in the collection of all lemmas.
      </p>
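<p>A small sketch of how an FL-list could be derived from raw frequency counts. For illustration, it treats each word as its own lemma, which is a simplifying assumption; the real system uses the morphology analyzer:</p>

```python
from collections import Counter

def build_fl_list(texts):
    """Sort lemmas in decreasing order of occurrence frequency;
    a lemma's position in this ordering is its FL-number."""
    counts = Counter(w for t in texts for w in t.lower().split())
    # most_common() returns lemmas in decreasing frequency order
    fl_list = [w for w, _ in counts.most_common()]
    return {w: i + 1 for i, w in enumerate(fl_list)}  # FL-numbers from 1

fl = build_fl_list(["the day the earth stood still", "the long day"])
print(fl["the"])  # the most frequent lemma gets FL-number 1
```

<p>Lower FL-numbers thus correspond to more frequently occurring lemmas.</p>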
      <p>
        In our search methodology [
        <xref ref-type="bibr" rid="ref16">16</xref>
], we defined three types of lemmas, namely,
stop lemmas, frequently used lemmas, and ordinary lemmas.
      </p>
<p>The first SWCount most frequently occurring lemmas are stop lemmas.
Examples include "earth", "yes", "who", "day", "war", "time", "man", and "be".</p>
<p>The second FUCount most frequently occurring lemmas are frequently used
lemmas. Examples include "red", "beautiful", and "mountain".</p>
<p>All other lemmas are ordinary lemmas, e.g., "fiber" and "undersea".</p>
<p>SWCount and FUCount are parameters. We use SWCount = 500 and
FUCount = 1050 in the experiments described here. Let MaxDistance be a
parameter that can take a value of 5, 7, 9, or even greater.</p>
      <p>
The value of SWCount is very close to the value of 421 from [
        <xref ref-type="bibr" rid="ref7">7</xref>
]. However, as we include
information about all lemmas in the indexes, we can use very different values of
the parameters. If an ordinary lemma, q, occurs in the text so rarely that FL(q)
is irrelevant, then we can say that FL(q) = ∞. Here, "∞" denotes a large number.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Can Stop Words be Skipped?</title>
      <p>
Let us discuss the stop-word approach, in which some high-frequency words
or their lemmas are excluded from the search. We do not agree with
this approach. A word cannot be excluded from the search because even a
high-frequency word can have a specific meaning in the context of a
specific query [
        <xref ref-type="bibr" rid="ref16 ref18">16, 18</xref>
        ]. Therefore, excluding some words from the search can lead
to search quality degradation or unpredictable effects [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. Additionally, stop
words are often employed in higher-order term proximity feature models [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
Consider the query "who are you who" [
        <xref ref-type="bibr" rid="ref17">17</xref>
]. The Who are an English rock
band, and "Who Are You" is one of their works. The word "Who" has a specific
meaning in the context of this query. Therefore, in our approach, we include
information about all of the words in the indexes. Moreover, we can easily see
that modern search systems, such as Google, do not skip stop words in a search.
      </p>
    </sec>
    <sec id="sec-4">
      <title>The Types of Additional Indexes</title>
      <p>
The three-component key (f, s, t) index [
        <xref ref-type="bibr" rid="ref17">17</xref>
] is the list of the occurrences of the
lemma f for which lemmas s and t both occur in the text at distances that
are less than or equal to MaxDistance from f. Each posting includes the
distance between f and s in the text and the distance between f and t in the
text. An (f, s, t) index is created only for the case in which f ≥ s ≥ t. Here, f,
s, and t are all stop lemmas.
      </p>
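<p>The following sketch illustrates the idea of (f, s, t) postings on a plain token list. It is a simplified, uncompressed model for illustration only, not the index format of this work; the function name and posting representation are ours:</p>

```python
def fst_postings(tokens, f, s, t, max_distance):
    """For each occurrence of lemma f, record the (signed) distances to
    occurrences of s and t lying within max_distance of f."""
    postings = []
    for pf, wf in enumerate(tokens):
        if wf != f:
            continue
        for ps, ws in enumerate(tokens):
            if ws != s or ps == pf or abs(ps - pf) > max_distance:
                continue
            for pt, wt in enumerate(tokens):
                if wt != t or pt == pf or abs(pt - pf) > max_distance:
                    continue
                # posting: position of f plus distances to s and to t
                postings.append((pf, ps - pf, pt - pf))
    return postings

tokens = "to be or not to be".split()
print(fst_postings(tokens, "to", "be", "or", 5))
```

<p>The quadratic scan is acceptable for a sketch; a real indexer would build these postings in a single pass with a sliding window.</p>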
      <p>
        In [
        <xref ref-type="bibr" rid="ref16">16</xref>
], we considered queries consisting of high-frequency words. In
addition, we showed that the average query execution time can be improved with
three-component key indexes by up to 15.6 times relative to the time necessary
using two-component key indexes only. Therefore, (f, s, t) indexes are required
if we need to evaluate queries that contain high-frequency words.
      </p>
      <p>
The two-component key (w, v) index [
        <xref ref-type="bibr" rid="ref17">17</xref>
] is the list of occurrences of the
lemma w for which lemma v occurs in the text at a distance that is less than or
equal to MaxDistance from w. Each posting includes the distance between
w and v in the text. Here, w denotes a frequently used lemma, and v denotes a
frequently used or ordinary lemma.
      </p>
      <p>
Let us consider the traditional index with near stop word (NSW) records. For
each occurrence of each ordinary or frequently used lemma in each document, we
include a posting record (ID, P, NSW) in the index. ID can be the ordinal
number of the specific document, and P is the position of the word in the document,
e.g., the ordinal number of the word in the document. The NSW record contains
information about all high-frequency lemmas, that is, stop lemmas, occurring
near position P in the document (at a distance that is less than or equal to
MaxDistance from P). Examples of the indexes are given in [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
      <p>
        The posting list of a key is stored in several data streams [
        <xref ref-type="bibr" rid="ref17">17</xref>
]. For the
traditional index with NSW records, we can use up to three streams for one key: one
stream for (ID), one for (P), and one for (NSW). On the other hand, for this
index, we can use two streams: the first is (ID, P), and the second is (NSW).
The actual choice depends on the length of the posting list. For short lists, we
use two streams; for long lists, we use three streams. This architecture allows
us to skip NSW records when they are not required. For the (w, v) and (f, s, t)
indexes, we use one or two data streams for every key.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Experiment</title>
      <sec id="sec-5-1">
        <title>Indexes, Collection and Environment</title>
        <sec id="sec-5-1-1">
          <title>We create the following indexes:</title>
          <p>Idx0: the traditional inverted index without any enhancements, such as NSW
records. The total size is 143 GB. This value includes the total size of indexed
texts in compressed form, which is 57.3 GB.</p>
<p>Idx5: our indexes, including the traditional inverted index with the NSW
records and the (w, v) and (f, s, t) indexes, where MaxDistance = 5. The total
size is 1.29 TB, the total size of the (w, v) index is 104 GB, the total size of the
(f, s, t) index is 727 GB, and the total size of the traditional index with NSW
records is 192 GB.</p>
<p>Idx7: our indexes, where MaxDistance = 7. The total size is 2.16 TB, the
total size of the (w, v) index is 148 GB, the total size of the (f, s, t) index is 1.422
TB, and the total size of the traditional index with NSW records is 239 GB.</p>
<p>Idx9: our indexes, where MaxDistance = 9. The total size is 3.27 TB, the
total size of the (w, v) index is 191 GB, the total size of the (f, s, t) index is 2.349
TB, and the total size of the traditional index with NSW records is 283 GB.</p>
          <p>
For the experiment, the GOV2 [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ] text collection and the following queries are
used: title queries from TREC Robust Task 2004 (with 250 queries in total),
title queries from TREC Terabyte Task from 2004 to 2006 (with 150 queries in
total), title queries from TREC Web Task from 2009 to 2014 (with 300 queries
in total), queries from TREC 2007 Million Query Track (10000 queries in total).
          </p>
<p>After duplicate removal, the query set contains 10 665 queries. The GOV2
text collection contains 25 million documents. The total size of the collection is
approximately 426 GB, and after HTML tag removal, there is approximately
167 GB of plain text. The average document text size is approximately 7 KB.</p>
          <p>We used the following computational resources: CPU: Intel(R) Core(TM) i7
CPU 920 @ 2.67 GHz. HDD: 7200 RPM. RAM: 24 GB. OS: Microsoft Windows
2008 R2 Enterprise.</p>
          <p>
The query set can be divided into the following subsets depending on the lemmas
in a concrete query [
            <xref ref-type="bibr" rid="ref17">17</xref>
            ]. All queries are evaluated within one program thread.
          </p>
<p>If we have a query with a length greater than MaxDistance, then we should
divide it into several parts. For example, when the value of MaxDistance is 5,
the query "to be or not to be that is the question" should be rewritten
as "(to be or not to) (be that is the question)", and these two parts should be
evaluated independently; then, the results should be combined.</p>
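<p>The splitting step above can be sketched as follows; this is an illustration of the idea, not the exact algorithm of the search system, and the function name is ours:</p>

```python
def split_query(query, max_distance):
    """Split a query longer than max_distance into consecutive parts
    of at most max_distance words; each part is evaluated independently."""
    words = query.split()
    return [" ".join(words[i:i + max_distance])
            for i in range(0, len(words), max_distance)]

print(split_query("to be or not to be that is the question", 5))
# two parts: "to be or not to" and "be that is the question"
```

<p>The results of evaluating the parts are then combined into the final result set.</p>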
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>Q1. Only Stop Lemmas</title>
        <p>Every query in the subset contains only stop lemmas. There are 119 queries
in this subset. Examples include the following: "to be or not to be that is the
question" and "kids earth day activities". With FL-numbers, we have the following
queries: [to: 9] [be: 7] [or: 38] [not: 64] [to: 9] [be: 7] [that: 40] [be: 7] [the: 1]
[question: 305] and [kid: 447] [earth: 309] [day: 199] [activity: 247].</p>
        <p>
For these queries, the (f, s, t) indexes are used [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ].
        </p>
        <p>Average query processing times:
Idx0: 51.4 s, Idx5: 0.82 s, Idx7: 0.86 s, Idx9: 1.05 s (see Fig. 2).
Average data read sizes per query:
Idx0: 1.3 GB, Idx5: 11.1 MB, Idx7: 15.6 MB, Idx9: 20.1 MB.</p>
        <p>Average numbers of postings per query:
Idx0: 317.8 million, Idx5: 0.88 million, Idx7: 1.15 million, Idx9: 1.5 million.</p>
        <p>We improved the query time by a factor of 62.7 with Idx5, by a factor of
59.4 with Idx7, and by a factor of 48.7 with Idx9 in comparison with Idx0.</p>
      </sec>
      <sec id="sec-5-3">
        <title>Q2. Stop and Frequently Used and/or Ordinary Lemmas</title>
        <p>Every query in the subset contains one or several stop lemmas. The query also
contains some other lemmas that may be frequently used or ordinary. There are
7 244 queries in the subset. An example is the following: "History of Physicians
in America". With FL-numbers, we have the following query:
[history: 598] [of: 4] [physician: 1760] [in: 14] [America: 1391]</p>
        <p>
          For these queries, we need to read one posting list with NSW records from
the traditional index for some query lemma. This lemma is designated as the
\main" lemma of the query. If there is a frequently used lemma in the query,
then we can use two-component key indexes, like (physician, history), for other
lemmas. If the query consists of only stop and ordinary lemmas, then we use
posting lists from the traditional index for the remaining ordinary lemmas but
without reading the NSW records. The details are described in [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ].
        </p>
        <p>Average query times: Idx0: 50 s, Idx5: 1.9 s, Idx7: 2 s, Idx9: 2.14 s.
Average data read sizes per query:
Idx0: 1.3 GB, Idx5: 64.3 MB, Idx7: 68.7 MB, Idx9: 76 MB.</p>
        <p>Average numbers of postings per query:
Idx0: 324.7 million, Idx5: 3.7 million, Idx7: 3.4 million, Idx9: 3.3 million.</p>
        <p>The number of postings for Idx7 is less than that for Idx5 because fewer
queries were divided into parts. Additionally, the sizes of the NSW records are
different for Idx5, Idx7, and Idx9. We improved the query processing time by a
factor of 25.6 with Idx5, by a factor of 24.7 with Idx7, and by a factor of 23.3
with Idx9 in comparison with Idx0.</p>
      </sec>
      <sec id="sec-5-4">
        <title>Q3. Only Frequently Used Lemmas</title>
        <p>Every query in the subset contains only frequently used lemmas. There are 79
queries in this subset. An example is the following: "california mountain pass".</p>
<p>With FL-numbers, we have the following query:
[california: 518] [mountain: 704] [pass: 528]</p>
<p>Two-component (w, v) key indexes are used. For example, we can use the
(california, pass) and (pass, mountain) two-component key indexes.</p>
        <p>Average query times: Idx0: 2.5 s, Idx5: 0.32 s, Idx7: 0.33 s, Idx9: 0.34 s.
Average data read sizes per query:
Idx0: 53.3 MB, Idx5: 4.69 MB, Idx7: 4.79 MB, Idx9: 4.93 MB.
Average numbers of postings per query:
Idx0: 9.4 million, Idx5: 0.59 million, Idx7: 0.6 million, Idx9: 0.61 million.</p>
        <p>We improved the query time by a factor of 7.83 with Idx5, by a factor of
7.58 with Idx7, and by a factor of 7.44 with Idx9 in comparison with Idx0.</p>
      </sec>
      <sec id="sec-5-5">
        <title>Q4. Frequently Used Lemmas and Ordinary Lemmas</title>
        <p>Every query in the subset contains frequently used lemmas and ordinary lemmas.
There are 1 388 queries in this subset.</p>
<p>An example is the following: "Scalable Vector Graphics".</p>
<p>With FL-numbers, we have the following query:
[scalable: ∞] [vector: 2953] [graphics: 921]</p>
        <p>Each query contains a frequently used lemma; therefore, two-component key
indexes can be used. For example, we can use the (graphics, scalable) and
(graphics, vector) two-component key indexes.</p>
        <p>Average query times: Idx0: 2.2 s, Idx5: 0.36 s, Idx7: 0.34 s, Idx9: 0.32 s.
Average data read sizes per query:
Idx0: 41.7 MB, Idx5: 1.5 MB, Idx7: 1.6 MB, Idx9: 1.8 MB.</p>
        <p>Average numbers of postings per query:
Idx0: 8.1 million, Idx5: 0.17 million, Idx7: 0.17 million, Idx9: 0.17 million.</p>
        <p>We improved the query time by a factor of 6 with Idx5, by a factor of 6.4
with Idx7, and by a factor of 6.8 with Idx9 in comparison with Idx0.</p>
      </sec>
      <sec id="sec-5-6">
        <title>Q5. Only Ordinary Lemmas</title>
        <p>Every query in the subset contains only ordinary lemmas. There are 1 835 queries
in this subset. An example is the following: "Undersea Fiber Optic Cable".</p>
<p>With FL-numbers, we have the following query:
[undersea: 15873] [fiber: 3127] [optic: 2986] [cable: 2771]</p>
<p>We use the traditional index only for these queries. We do not need to read
the NSW records because they are stored in separate data streams.</p>
        <p>Average query times: Idx0: 0.789 s, Idx5: 0.819 s, Idx7: 0.8 s, Idx9: 0.81 s.
Average data read sizes per query:
Idx0: 14.2 MB, Idx5: 15.6 MB, Idx7: 15.7 MB, Idx9: 15.8 MB.
Average numbers of postings per query:
Idx0: 2.89 million, Idx5: 2.89 million, Idx7: 2.89 million, Idx9: 2.89 million.</p>
<p>The query processing time for these queries does not need any improvement.</p>
      </sec>
      <sec id="sec-5-7">
        <title>Results for Entire Query Set</title>
        <p>Average query times: Idx0: 34.9 s, Idx5: 1.51 s, Idx7: 1.57 s, Idx9: 1.66 s.</p>
        <p>Average data read sizes per query:
Idx0: 0.93 GB, Idx5: 46.4 MB, Idx7: 49.5 MB, Idx9: 54.5 MB.
Average numbers of postings per query:
Idx0: 225.7 million, Idx5: 3.08 million, Idx7: 2.85 million, Idx9: 2.76 million.</p>
        <p>We improved the query time by a factor of 23.1 with Idx5, by a factor of
22.4 with Idx7, and by a factor of 21 with Idx9 in comparison with Idx0.</p>
      </sec>
      <sec id="sec-5-8">
        <title>Analysis of the Results</title>
<p>The average search time for Idx5, Idx7, and Idx9 is
approximately 1-2 seconds for every query type.</p>
<p>For Idx0, the search takes 2-2.5 seconds if the query consists only of
ordinary or frequently used lemmas. For the queries that contain any stop lemma
(Q1, Q2), Idx0 requires approximately 50 seconds per search on average.</p>
<p>This means that the search in Idx5, Idx7, and Idx9 is stable from the
performance point of view. However, for Idx0, the search system has a performance
problem if the query contains any high-frequency lemma. The difference
in performance between Idx0 and the multi-component key indexes Idx5, Idx7,
and Idx9 is larger in cases when more high-frequency lemmas occur
in the queries.</p>
        <p>
          In [
          <xref ref-type="bibr" rid="ref16">16</xref>
], we improved the average query execution time by up to 130 times for
queries that consist of high-frequency words. For TREC GOV2,
we improved the average query execution time by up to 62.7 times for these
queries (i.e., the query set Q1). Perhaps the difference in the "improvement
factor" depends on the average document text size, which is approximately
7 KB for TREC GOV2 and 384 KB for the text collection that was used in [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ].
The other factor is the length of the query. However, an improvement of 62.7
times in the average query execution time also seems good.
        </p>
        <p>Let us analyze this in more detail. Let us consider the following example. We
have two documents, D0 and D1. Let us consider a word w. For w, we have the
following posting list in the traditional index: (0, 1), (0, 5), (0, 7), (1, 2), (1, 5).</p>
<p>We use two very common encoding schemes. The idea of the first scheme is
as follows. We group the records that are related to a specific document. We
convert the original posting list as follows: (0, (1, 5, 7)), (1, (2, 5)). For other kinds
of indexes, we do the same.</p>
<p>Therefore, we store in the index the ID of the document and then the list of
the word's positions. The second scheme is a delta-encoding scheme. Consequently,
we have the following: (0, (1, 4, 2)), (1, (2, 3)). For example, consider (1, 5, 7).
Instead of 5, we store 4 = 5 - 1, where 1 is the previous value in the list.</p>
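<p>Both encoding steps can be illustrated with a short sketch. This is a simplified model for illustration; real indexes apply further compression on top of the deltas, and the function name is ours:</p>

```python
from itertools import groupby

def encode(postings):
    """Group (ID, P) postings by document, then delta-encode positions."""
    result = []
    for doc_id, group in groupby(postings, key=lambda r: r[0]):
        positions = [p for _, p in group]
        # first position as-is, then differences to the previous position
        deltas = [positions[0]] + [b - a for a, b in zip(positions, positions[1:])]
        result.append((doc_id, deltas))
    return result

postings = [(0, 1), (0, 5), (0, 7), (1, 2), (1, 5)]
print(encode(postings))  # [(0, [1, 4, 2]), (1, [2, 3])]
```

<p>The small deltas then compress well with variable-length integer codes.</p>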
<p>The first scheme is much more effective for text collections that consist of
large documents. This explains the difference in the "improvement factor" in the
experiments with the two aforementioned collections. However, with our method, we
can use any other encoding scheme and any other inverted index organization.</p>
        <p>
In addition, the "improvement factor" can depend on the structure of the
query set. To analyze this question, we performed an additional experiment. We
formed another query set using the method from [
          <xref ref-type="bibr" rid="ref16">16</xref>
]. Consequently, we
selected a document from the TREC GOV2 collection. We used the content of
the document to produce a set of 3500 queries. We performed our experiments
again using this query set. The "improvement factor" was similar to the
foregoing results that we obtained for TREC GOV2.
        </p>
        <p>
The next question is, how can we select the values of our parameters? The
value of MaxDistance affects relevance and should be determined according
to the selected relevance function [
          <xref ref-type="bibr" rid="ref17">17</xref>
]. The results of the experiments allow us to
predict how a change of MaxDistance affects the search time and index
size. Let us now discuss how the values of SWCount and FUCount affect these
metrics.
        </p>
        <p>
The value of SWCount is more important than the value of FUCount
because it affects Q1 and Q2 queries, which are the most complex from the performance
point of view. Let us now consider SWCount and the Q2 query subset.
We use Zipf's law [
          <xref ref-type="bibr" rid="ref20">20</xref>
] as a representation of our word occurrence distribution.
According to this law, the second most frequently occurring lemma occurs half as
often as the first, the third most frequently occurring lemma occurs 1/3 as
often as the first, and so on. Let V be the number of occurrences of the most
frequently occurring lemma. Then our lemmas have the following numbers
of occurrences: V, V/2, V/3, ...
        </p>
<p>Let us consider a query q = (q1, q2, ..., qn) from Q2. Here, qi is the number
of the corresponding lemma in the FL-list. The query contains one or several stop
lemmas and some other lemmas that may be frequently used or ordinary. To
evaluate the query using the traditional index, we need to read the following
number of postings from the index: V/q1 + V/q2 + ... + V/qn. Without loss of
generality, let q1 and q2 be stop lemmas and let qn be the main lemma of the
query. With the use of NSW records, we need to read the following number of
postings: V/q3 + ... + V/qn + (V/qn) * (NSWFactor - 1). That means we do not
need to read the posting lists for q1 and q2. However, for qn, we need to read its
posting list with NSW records.</p>
<p>The size of the posting (ID, P, NSW) in bytes is up to NSWFactor = 4.5
times larger than the size of (ID, P); the experiments with MaxDistance = 5
show this. Therefore, we can calculate the "planned performance gain" as follows:
PPG(q) = (1/q1 + 1/q2 + ... + 1/qn) / (1/q3 + ... + 1/qn + (1/qn) * (NSWFactor - 1)).</p>
<p>If we have a query set Q, then we can calculate the average planned
performance gain, APPG(Q) = (1/|Q|) * sum of PPG(q) over all q in Q. Let us now
use only Q2 queries for the estimations. If a query q is not a Q2 query, then let
PPG(q) be 1. Let APPG(Q, SWCount) be the average planned performance gain
that is calculated for a specific value of SWCount.</p>
<p>For the query set that we use in this paper, i.e., 10 665 queries, we have the
following: APPG(Q, 100) = 43, APPG(Q, 500) = 105, APPG(Q, 1000) = 153.
However, this model does not take into account the (w, v) and (f, s, t) indexes. In
the future, we plan to develop more precise models. However, from this model, we
can predict that SWCount should not be increased.</p>
<p>The foregoing results need a more detailed examination. Let us consider a
search query. The query consists of some set of lemmas. Let the Min-FL-number be
the minimum FL-number among all lemmas of the query. A lower FL-number
corresponds to a more frequently occurring lemma. If the Min-FL-number of
a query is a small number, then the query can induce performance problems
because the query contains some high-frequency lemma.</p>
        <p>Then, we divide the entire query set into subsets based on the Min-FL-number
of the queries. We select 100 as the division step. In the first subset, we
include all queries with Min-FL-numbers from 0 to 99; in the second subset,
we include all queries with Min-FL-numbers from 100 to 199; and so on. In the
following diagrams, we consider the first 21 subsets.</p>
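<p>The division into subsets can be sketched as follows (a simple illustration under the assumption that each query is represented by the FL-numbers of its lemmas):</p>

```python
from collections import defaultdict

STEP = 100  # the division step from the text

def bucket_queries(queries):
    """Group queries by Min-FL-number: bucket 0 holds Min-FL-numbers 0-99,
    bucket 1 holds 100-199, and so on."""
    buckets = defaultdict(list)
    for query in queries:
        buckets[min(query) // STEP].append(query)
    return buckets
```

<p>For instance, a query with FL-numbers 5 and 800 has a Min-FL-number of 5 and falls into the first subset (bucket 0), while a query with FL-numbers 150 and 300 falls into the second (bucket 1).</p>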
        <p>In Fig. 3, the average query execution time for Idx0, in seconds, is displayed
for every subset. The first bar is significantly larger than the other bars, showing
that the first subset induces some performance problems. The first 8 subsets
have an average query execution time of more than two seconds.</p>
        <p>In Fig. 4, the average query execution time for Idx5, in seconds, is displayed
for every subset. For the first subset, the average query execution time is about
two seconds. For each following subset, the average query execution time is less
than two seconds.</p>
        <p>In Fig. 5, we show the improvement factor for Idx5 in comparison with Idx0.
The first bar value is 33, which means that the average query execution time for
the first subset is improved by a factor of 33. We also see how the performance
changes when the Min-FL-number of a query crosses the SWCount value of 500.</p>
        <p>Based on this diagram, we propose that the value of SWCount can be lowered
to 100. Consequently, we created another index, Idx5/SW100, with SWCount =
100 and FUCount = 1450.</p>
        <p>In Fig. 6, the average query execution time for Idx5/SW100 in seconds is
displayed for every subset. For each subset, the average query execution time is
less than two seconds. The results look more promising than those for Idx5.</p>
        <p>However, let us consider the Q1 queries for SWCount = 500, i.e.,
the queries that consist only of lemmas with FL-number &lt; 500. There are 119
queries in this subset, as discussed above. It was established that for this subset,
the average query search time with Idx5 is 0.8 sec., but with Idx5/SW100, it is
6.7 sec., which is significantly larger. When this subset is evaluated with Idx5,
three-component key indexes are used.</p>
        <p>Let us divide the query set Q1 into five subsets based on the Min-FL-number
of each query. In Fig. 7, the average query execution time for Idx0,
Idx5 and Idx5/SW100, in seconds, is displayed for every subset. We see that the
search with Idx5/SW100 works significantly more slowly than that with Idx5.</p>
        <p>Therefore, we come to the following conclusions for different kinds of queries,
and consequently we propose a new index schema.</p>
        <p>T1 A query consists of extreme high-frequency occurring lemmas and
high-frequency occurring lemmas. This means that all lemmas of the query are
extreme high-frequency occurring or high-frequency occurring. E.g., for
every lemma of the query, FL-number &lt; 500.</p>
        <p>The (f, s, t) indexes work better for these queries than do other types of
indexes, as shown in Fig. 7. The search in Idx5/SW100 is significantly slower
than that in Idx5. When the search is performed using Idx5, the (f, s, t)
indexes are used for every query in T1. When the search is performed using
Idx5/SW100, the (f, s, t) indexes are used only if FL(f) &lt; 100, FL(s) &lt;
100, and FL(t) &lt; 100.</p>
        <p>T2 A query contains a low-frequency occurring lemma (ordinary or frequently
used). Additionally, the query may contain high-frequency occurring lemmas,
but for every query lemma, FL-number ≥ 100.</p>
        <p>The (w, v) indexes allow one to achieve a better performance improvement,
as shown in Fig. 6. For subsets starting from FL-number = 100, no (f, s, t)
indexes or NSW records are used when we perform the search using Idx5/SW100.</p>
        <p>Here, Idx5/SW100 is better than Idx5.</p>
        <p>T3 A query contains a low-frequency occurring lemma (ordinary or frequently
used) and some extreme high-frequency occurring lemmas with FL-number
&lt; 100.</p>
        <p>
          The (w, v) indexes and NSW records are required. The first bar in Figures
3, 4 and 6 supports this, and our previous experiments in [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] confirm it.
        </p>
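<p>The T1-T3 classification above can be expressed as a simple check on the FL-numbers of a query's lemmas. This is a hypothetical sketch using the example thresholds from the text (EHFCount = 100 for extreme high-frequency lemmas, and 500 as the boundary below which a lemma is at least high-frequency); the function name is ours.</p>

```python
# Classify a query (given as a list of FL-numbers) into the types T1, T2, T3.
EHF_COUNT = 100    # extreme high-frequency boundary (example value)
HF_BOUNDARY = 500  # boundary of high-frequency lemmas (example value)

def query_type(fl_numbers):
    has_low_freq = any(fl >= HF_BOUNDARY for fl in fl_numbers)  # ordinary/frequently used
    has_extreme = any(fl < EHF_COUNT for fl in fl_numbers)
    if not has_low_freq:
        return "T1"  # only (extreme) high-frequency lemmas
    return "T3" if has_extreme else "T2"
```

<p>For example, a query with FL-numbers 10 and 50 is T1, one with 150 and 900 is T2, and one with 10 and 900 is T3.</p>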
      </sec>
      <sec id="sec-5-9">
        <title>The New Index Schema</title>
        <sec id="sec-5-9-1">
          <title>The original schema can be represented by the following rules:</title>
          <p>1) (f, s, t) indexes, where FL(f), FL(s), FL(t) &lt; SWCount.
2) (w, v) indexes, where</p>
          <p>SWCount ≤ FL(w) &lt; SWCount + FUCount, and SWCount ≤ FL(v).
3) Traditional indexes (x) with NSW records, where SWCount ≤ FL(x); the NSW
records contain information about all lemmas y with the condition FL(y) &lt;
SWCount that occur near lemma x in the text.</p>
          <p>For the new schema, we use the following parameters with example values.
EHFCount = 100 – for extreme high-frequency occurring lemmas.
HFCount = 400 – for high-frequency occurring lemmas.</p>
          <p>FUCount = 1050 – for frequently used lemmas.</p>
          <p>We propose using the following indexes.</p>
        </sec>
        <sec id="sec-5-9-2">
          <title>1) (f, s, t) indexes that can be used for T1 queries, where</title>
          <p>F L(f ); F L(s); F L(t) &lt; EHF Count + HF Count = 500;
2) (w; v) indexes that can be used for T2 and T3 queries, where
100 = EHF Count F L(w) &lt; EHF Count + HF Count + F U Count = 1450;
100 = EHF Count F L(v):
3) Traditional indexes (x) with NSW records, 100 = EHF Count F L(x);
the NSW records contain information about all lemmas y with the condition
F L(y) &lt; EHF Count = 100 that occur near lemma x in the text. These
indexes can be used for T3 queries.</p>
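<p>The three rules of the new schema can be sketched as eligibility checks (a minimal illustration with the example parameter values; the function names are ours, not from the schema itself):</p>

```python
# Eligibility checks for the new index schema (example parameter values).
EHF_COUNT, HF_COUNT, FU_COUNT = 100, 400, 1050

def fst_key_allowed(fl_f, fl_s, fl_t):
    """Rule 1: (f, s, t) keys, all components below EHFCount + HFCount = 500."""
    limit = EHF_COUNT + HF_COUNT
    return fl_f < limit and fl_s < limit and fl_t < limit

def wv_key_allowed(fl_w, fl_v):
    """Rule 2: (w, v) keys, 100 <= FL(w) < 1450 and 100 <= FL(v)."""
    upper = EHF_COUNT + HF_COUNT + FU_COUNT
    return EHF_COUNT <= fl_w < upper and EHF_COUNT <= fl_v

def nsw_record_allowed(fl_x, fl_y):
    """Rule 3: lemma x (FL(x) >= 100) stores NSW data about a nearby
    lemma y only when FL(y) < EHFCount = 100."""
    return fl_x >= EHF_COUNT and fl_y < EHF_COUNT
```

<p>Compared with the original schema, the (f, s, t) boundary rises from SWCount to EHFCount + HFCount, while the (w, v) indexes and NSW records start at EHFCount instead of SWCount.</p>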
          <p>The concrete values of the parameters are provided only as examples and can
be different for different languages and text collections.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusions and Future Work</title>
      <p>In this paper, we investigated how multi-component key indexes help to improve
search performance. We used the well-known GOV2 text collection. We proposed
a method of analyzing the search performance by considering different types
of queries. By following this method, we found that the performance can be
improved further and proposed a new index schema.</p>
      <p>We analyzed how the value of MaxDistance affects the search performance.
With an increase in the value of MaxDistance from 5 to 9, the average search
time using multi-component key indexes increased from 1.51 sec. to 1.66
sec. Therefore, the value of MaxDistance can be increased even further; the
main limitations here are the disk space and the indexing time. Our
multi-component indexes are relatively large for large values of MaxDistance.
However, large hard disk drives are now available. In many cases, it would be
preferable to spend several TB of disk space to increase the search speed by
a factor of 20 or more.</p>
      <p>We found that multi-component key indexes work significantly better on text
collections with large documents (e.g., documents with sizes of approximately
several hundred KB or more) than on text collections that consist of small
documents; thus, an improvement factor of 20 can be considered a minimum
improvement factor. In the future, it is important to consider how different
compression technologies can reduce the total index size, which will increase the
search speed even more.</p>
      <p>The proposed indexes with multi-component keys have one limitation. If we
have a document that contains the queried words but the distance between these
words is greater than MaxDistance, then this document can be absent from the
search results. This is usually not a problem if the average document size in
the text collection is relatively large, e.g., several hundred kilobytes. In this
case, after the proximity search with multi-component key indexes, we can run
a search without distance restrictions. While the former requires the word-level
index, the latter needs only the document-level index and works significantly faster.</p>
      <p>
        Second, in the majority of modern relevance models, it is defined that the
weight of the document is inversely proportional to the square of the distance
between the queried words in the document [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. With a relatively large value of
MaxDistance, we can be sure that all relevant documents will occur in the search
results. While the first consideration is questionable for the GOV2 collection,
because its documents are small, the second is still valid.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Anh</surname>
            ,
            <given-names>V.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>de Kretser</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moffat</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Vector-space ranking with effective early termination</article-title>
          .
          <source>In: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          . pp.
          <volume>35</volume>
          –
          <fpage>42</fpage>
          . SIGIR '01,
          <string-name>
            <surname>ACM</surname>
          </string-name>
          , New York, NY, USA (
          <year>2001</year>
          ). https://doi.org/10.1145/383952.383957
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Borodin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mirvoda</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Porshnev</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ponomareva</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Improving generalized inverted index lock wait times</article-title>
          .
          <source>Journal of Physics: Conference Series</source>
          <volume>944</volume>
          (
          <issue>1</issue>
          ),
          <volume>012022</volume>
          (jan
          <year>2018</year>
          ). https://doi.org/10.1088/1742-6596/944/1/012022
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Broschart</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schenkel</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>High-performance processing of text queries with tunable pruned term and term pair indexes</article-title>
          .
          <source>ACM Trans. Inf. Syst</source>
          .
          <volume>30</volume>
          (
          <issue>1</issue>
          ) (
          <year>Mar 2012</year>
          ). https://doi.org/10.1145/2094072.2094077
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. Buttcher,
          <string-name>
            <given-names>S.</given-names>
            ,
            <surname>Clarke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Soboroff</surname>
          </string-name>
          ,
          <string-name>
            <surname>I.</surname>
          </string-name>
          :
          <article-title>The TREC 2006 terabyte track</article-title>
          .
          <source>In: Proceedings of the Fifteenth Text REtrieval Conference</source>
          ,
          <string-name>
            <surname>TREC</surname>
          </string-name>
          <year>2006</year>
          . pp.
          <volume>128</volume>
          –
          <issue>141</issue>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Crane</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Culpepper</surname>
            ,
            <given-names>J.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mackenzie</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Trotman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>A comparison of document-at-a-time and score-at-a-time query evaluation</article-title>
          .
          <source>In: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining</source>
          . pp.
          <volume>201</volume>
          –
          <fpage>210</fpage>
          . WSDM '17,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA (
          <year>2017</year>
          ). https://doi.org/10.1145/3018661.3018726
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Daoud</surname>
            ,
            <given-names>C.M.</given-names>
          </string-name>
          , Silva de Moura, E.,
          <string-name>
            <surname>Carvalho</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , Soares da Silva,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Fernandes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            ,
            <surname>Rossi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            :
            <surname>Fast</surname>
          </string-name>
          top
          <article-title>-k preserving query processing using two-tier indexes</article-title>
          .
          <source>Inf. Process. Manage</source>
          .
          <volume>52</volume>
          (
          <issue>5</issue>
          ),
          <volume>855</volume>
          –872 (Sep
          <year>2016</year>
          ). https://doi.org/10.1016/j.ipm.2016.03.005
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Fox</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>A stop list for general text</article-title>
          .
          <source>SIGIR Forum</source>
          <volume>24</volume>
          (
          <issue>1-2</issue>
          ),
          <volume>19</volume>
          –21 (Sep
          <year>1989</year>
          ). https://doi.org/10.1145/378881.378888
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Gall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brost</surname>
          </string-name>
          , G.:
          <article-title>K-word proximity search on encrypted data</article-title>
          .
          <source>In: 30th International Conference on Advanced Information Networking and Applications Workshops (WAINA)</source>
          . pp.
          <volume>365</volume>
          –
          <fpage>372</fpage>
          .
          <string-name>
            <surname>IEEE</surname>
          </string-name>
          (
          <year>2016</year>
          ). https://doi.org/10.1109/WAINA.2016.104
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9. Imran Rafique, M.,
          <string-name>
            <surname>Hassan</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Utilizing distinct terms for proximity and phrases in the document for better information retrieval</article-title>
          .
          <source>In: 2014 International Conference on Emerging Technologies (ICET)</source>
          . pp.
          <volume>100</volume>
          –
          <fpage>105</fpage>
          .
          <string-name>
            <surname>IEEE</surname>
          </string-name>
          (
          <year>2014</year>
          ). https://doi.org/10.1109/ICET.2014.7021024
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Jiang</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leung</surname>
            ,
            <given-names>K.W.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Teii: Topic enhanced inverted index for top-k document retrieval</article-title>
          .
          <source>Know.-Based Syst. 89(C)</source>
          ,
          <volume>346</volume>
          –358 (Nov
          <year>2015</year>
          ). https://doi.org/10.1016/j.knosys.2015.07.014
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Trotman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Anytime ranking for impact-ordered indexes</article-title>
          .
          <source>In: Proceedings of the 2015 International Conference on The Theory of Information Retrieval</source>
          . pp.
          <volume>301</volume>
          –
          <fpage>304</fpage>
          . ICTIR '15,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA (
          <year>2015</year>
          ). https://doi.org/10.1145/2808194.2809477
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moffat</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Culpepper</surname>
            ,
            <given-names>J.S.:</given-names>
          </string-name>
          <article-title>Efficient and effective higher order proximity modeling</article-title>
          .
          <source>In: Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval</source>
          . pp.
          <volume>21</volume>
          –
          <fpage>30</fpage>
          . ICTIR '16,
          <string-name>
            <surname>ACM</surname>
          </string-name>
          , New York, NY, USA (
          <year>2016</year>
          ). https://doi.org/10.1145/2970398.2970404
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Miller</surname>
            ,
            <given-names>R.B.</given-names>
          </string-name>
          :
          <article-title>Response time in man-computer conversational transactions</article-title>
          .
          <source>In: Proceedings of the December 9-11</source>
          ,
          <year>1968</year>
          , Fall Joint Computer Conference, Part I. pp.
          <volume>267</volume>
          –
          <fpage>277</fpage>
          .
          <string-name>
            <surname>AFIPS</surname>
          </string-name>
          <article-title>'68 (Fall, part I), ACM</article-title>
          , USA (
          <year>1968</year>
          ). https://doi.org/10.1145/1476589.1476628
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Petri</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moffat</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>On the cost of phrase-based ranking</article-title>
          .
          <source>In: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          . pp.
          <fpage>931–934</fpage>
          . SIGIR '15,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA (
          <year>2015</year>
          ). https://doi.org/10.1145/2766462.2767769
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Sadakane</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Imai</surname>
          </string-name>
          , H.:
          <article-title>Fast algorithms for k-word proximity search</article-title>
          .
          <source>In: IEICE Transactions on Fundamentals of Electronics Communications and Computer Sciences</source>
          . vol.
          <volume>84</volume>
          , pp.
          <volume>2311</volume>
          –
          <issue>2318</issue>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Veretennikov</surname>
            ,
            <given-names>A.B.</given-names>
          </string-name>
          :
          <article-title>Proximity full-text search by means of additional indexes with multi-component keys: In pursuit of optimal performance</article-title>
          . In: Manolopoulos Y.,
          <string-name>
            <surname>Stupnikov</surname>
            <given-names>S</given-names>
          </string-name>
          . (eds)
          <article-title>Data Analytics and Management in Data Intensive Domains</article-title>
          . DAMDID/RCDL 2018.
          <article-title>Communications in Computer and Information Science</article-title>
          . vol.
          <volume>1003</volume>
          , pp.
          <volume>111</volume>
          –
          <fpage>130</fpage>
          . Springer. https://doi.org/10.1007/978-3-030-23584-0_7
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Veretennikov</surname>
            ,
            <given-names>A.B.</given-names>
          </string-name>
          :
          <article-title>Proximity full-text search with a response time guarantee by means of additional indexes</article-title>
          . In:
          <string-name>
            <surname>Arai</surname>
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kapoor</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bhatia</surname>
            <given-names>R</given-names>
          </string-name>
          . (eds)
          <article-title>Intelligent Systems and Applications</article-title>
          .
          <source>IntelliSys 2018. Advances in Intelligent Systems and Computing</source>
          . vol.
          <volume>868</volume>
          , pp.
          <volume>936</volume>
          –
          <fpage>954</fpage>
          . Springer, Cham (
          <year>2019</year>
          ). https://doi.org/10.1007/978-3-030-01054-6_66
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Williams</surname>
            ,
            <given-names>H.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zobel</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bahle</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Fast phrase querying with combined indexes</article-title>
          .
          <source>ACM Trans. Inf. Syst</source>
          .
          <volume>22</volume>
          (
          <issue>4</issue>
          ),
          <volume>573</volume>
          –594 (Oct
          <year>2004</year>
          ). https://doi.org/10.1145/1028099.1028102
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Yan</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suel</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wen</surname>
            ,
            <given-names>J.R.:</given-names>
          </string-name>
          <article-title>Efficient term proximity search with term-pair indexes</article-title>
          .
          <source>In: Proceedings of the 19th ACM International Conference on Information and Knowledge Management</source>
          . pp.
          <volume>1229</volume>
          –
          <fpage>1238</fpage>
          . CIKM '10,
          <string-name>
            <surname>ACM</surname>
          </string-name>
          , New York, NY, USA (
          <year>2010</year>
          ). https://doi.org/10.1145/1871437.1871593
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Zipf</surname>
            ,
            <given-names>G.K.</given-names>
          </string-name>
          :
          <article-title>Selected Studies of the Principle of Relative Frequency in Language</article-title>
          . Harvard University Press (
          <year>1932</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Zobel</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , Moffat, A.:
          <article-title>Inverted files for text search engines</article-title>
          .
          <source>ACM Comput. Surv</source>
          .
          <volume>38</volume>
          (
          <issue>2</issue>
          ) (
          <year>Jul 2006</year>
          ). https://doi.org/10.1145/1132956.1132959
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>