<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Proximity Full-Text Search with a Response Time Guarantee by Means of Additional Indexes with Multi-Component Keys</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff1">
          <label>1</label>
          <institution>Proceedings of the XX International Conference “Data Analytics and Management in Data Intensive Domains” (DAMDID/RCDL'2018)</institution>
          ,
          <addr-line>Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <contrib contrib-type="author">
          <name>
            <surname>Veretennikov</surname>
            <given-names>Alexander B.</given-names>
          </name>
          <xref ref-type="aff" rid="aff2" />
        </contrib>
        <aff id="aff2">
          <label>2</label>
          <institution>Ural Federal University</institution>
          ,
          <addr-line>Yekaterinburg</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <fpage>123</fpage>
      <lpage>130</lpage>
      <abstract>
        <p>Full-text search engines are important tools for information retrieval. In a proximity full-text search, a document is relevant if it contains query terms near each other, especially if the query terms are frequently occurring words. For each word in the text, we use additional indexes to store information about nearby words that are at distances from the given word of less than or equal to MaxDistance, which is a parameter. We have shown that additional indexes with three-component keys can improve the average query execution time by up to a factor of 94.7 when the queries consist of frequently occurring words. In this paper, we present a new search algorithm with even greater performance gains. We also present the results of search experiments, which show that three-component key indexes enable much faster searches than two-component key indexes.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>A search query consists of several words. The search
result is a list of documents containing these words.</p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], we discussed a methodology for
high-performance proximity full-text searches and a search
algorithm. With the application of additional indexes
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], we improved the average query processing time by
a factor of 94.7 in the case when the queries consist of
high-frequency words.
      </p>
      <p>
        In this paper, we present the following new results.
1. We present a new search algorithm that improves the
performance even further than the algorithm in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
2. We present the results of search experiments showing
that three-component key indexes can improve the
average query execution time by up to a factor of
12.93 in comparison with two-component key indexes
when the queries consist of high-frequency words.
      </p>
      <p>
        In modern full-text approaches, it is important for a
document to contain search query words near each other
in order to be relevant to the context of the query,
especially if the query contains frequently used words.
The impact of term proximity is integrated into
modern information retrieval models [
        <xref ref-type="bibr" rid="ref19 ref2 ref8 ref9">2, 8, 9, 19</xref>
        ].
      </p>
      <p>
        Words appear in texts at different frequencies. The
typical word frequency distribution is described by
Zipf’s law [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. An example of a word occurrence
distribution is shown in Figure 1. The horizontal axis
represents different words in decreasing order of their
occurrence in texts. On the vertical axis, we plot the
number of occurrences of each word.
      </p>
      <p>
        The full-text search task can be solved with inverted
indexes [
        <xref ref-type="bibr" rid="ref10 ref21 ref6">6, 10, 21</xref>
        ]. With ordinary inverted indexes, for
each word in the indexed document, we store in the index
the record (ID, P), where ID is the identifier of the
document and P is the position of the word in the
document. Let P be an ordinal number of the word in the
document.
      </p>
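As an illustration, building an ordinary inverted index with (ID, P) postings can be sketched as follows; the whitespace tokenizer and the toy documents are assumptions made for the example, not part of our indexing pipeline.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """For each word occurrence, store the posting (ID, P): ID is the
    document identifier, P is the ordinal number of the word."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split()):
            index[word].append((doc_id, position))
    return index

docs = {1: "to be or not to be", 2: "who are you"}
index = build_inverted_index(docs)
```

Here the word “to” in document 1 yields the postings (1, 0) and (1, 4).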
      <p>For proximity full-text searches, we need to store the
(ID, P) record for all occurrences of any word in the
indexed document. These (ID, P) records are called
“postings”.</p>
      <p>To preserve the continuity of the user’s experience,
the query results must be produced within two seconds.
In this context, we present the following problem. It is
common to have a full-text search engine that can usually
evaluate a query within 1 sec. However, it works very
slowly, for example, requiring 10-30 sec., for a query that
contains frequently occurring words.</p>
      <p>We can illustrate this problem by the following
example. We downloaded pgdvd042010.iso from the
Project Gutenberg web page, which contains their files
as of April 2010, and indexed its content using Apache
Lucene 7.4.0 and Apache Tika 1.18. We indexed
approximately 64 thousand documents with a total length
of approximately 13 billion characters (a relatively
small collection). We indexed all words. Then, we
evaluated the following queries using the equipment
from section 4.1 of the current paper:</p>
      <p>"Prince Hamlet"~4 – this search took 172
milliseconds, and
"to be or not to be"~4 – this search took 21 seconds.</p>
      <p>The suffix “~4” instructs Lucene to search for texts
in which the queried words have no more than 4 other
words between them.</p>
      <p>
        To improve the search performance,
early-termination approaches can be applied [
        <xref ref-type="bibr" rid="ref1 ref4">1, 4</xref>
        ]. However,
early-termination methods are not effective in the case of
proximity full-text searches [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. It is difficult to
combine the early-termination approach with the
integration of term-proximity information into relevance
models.
      </p>
      <p>
        Another approach is to create additional indexes. In
[
        <xref ref-type="bibr" rid="ref18 ref3">3, 18</xref>
        ], the authors introduced some additional indexes
to improve the search performance, but they only
improved phrase searches.
      </p>
      <p>With our additional indexes, an arbitrary query can
be evaluated very quickly.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Lemmatization and lemma type</title>
      <sec id="sec-2-1">
        <title>2.1 Word type</title>
        <p>
          In [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], we defined three types of words.
        </p>
        <p>Stop words: Examples include “and”, “at”, “or”,
“not”, “yes”, “who”, “to”, and “be”. In a stop-words
approach, these words are excluded from consideration,
but we do not do so. In our approach, we include
information about all words in the indexes.</p>
        <p>
          We cannot exclude a word from the search because a
frequently occurring word can have a specific
meaning in the context of a specific query [
          <xref ref-type="bibr" rid="ref11 ref18">11, 18</xref>
          ];
therefore, excluding some words from consideration can
induce search quality degradation or unpredictable
effects [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ].
        </p>
        <p>Let us consider the query example “who are you
who”. The Who are an English rock band, and “Who are
You” is one of their songs. Therefore, the word “Who”
has a specific meaning in the context of this query.</p>
        <p>Frequently used words: These words are frequently
encountered but convey meaning. These words always
need to be included in the index.</p>
        <p>Ordinary words: This category contains all other
words.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2 Lemmatization</title>
        <p>We employ a morphological analyzer for lemmatization.
For each word in the dictionary, the analyzer provides a
list of numbers of lemmas (i.e., basic or canonical forms).
For a word that does not exist in the dictionary, its lemma
is the same as the word itself.</p>
        <p>Some words have several lemmas. For example, the
word “mine” has two lemmas, namely, “mine” and “my”.</p>
        <p>We define three types of lemmas: stop lemmas,
frequently used lemmas and ordinary lemmas. We sort
all lemmas in decreasing order of their occurrence
frequency in the texts. We call this sorted list the FL-list.
The number of a lemma in the FL-list is called its
FL-number. Let the FL-number of a lemma w be denoted by
FL(w).</p>
        <p>The first SWCount most frequently occurring lemmas
are stop lemmas. The second FUCount most frequently
occurring lemmas are frequently used lemmas. All other
lemmas are ordinary lemmas.</p>
        <p>SWCount and FUCount are the parameters. We use
SWCount = 700 and FUCount = 2100 in the experiments
presented.</p>
        <p>If an ordinary lemma q occurs in the text so rarely
that FL(q) is irrelevant, then we can say that FL(q) = ~.
We denote by “~” some large number.</p>
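The FL-list and the lemma classification can be sketched as follows; the toy frequency counts and the helper names build_fl_list and lemma_type are assumptions for the example.

```python
SWCount, FUCount = 700, 2100  # parameters used in our experiments

def build_fl_list(freq):
    """Sort lemmas in decreasing order of occurrence frequency;
    the index of a lemma in this list is its FL-number."""
    return sorted(freq, key=lambda lemma: -freq[lemma])

def lemma_type(fl_number):
    """Stop lemmas come first in the FL-list, then frequently
    used lemmas; all remaining lemmas are ordinary."""
    if fl_number < SWCount:
        return "stop"
    if fl_number < SWCount + FUCount:
        return "frequently used"
    return "ordinary"

# toy frequencies; real FL-numbers come from a large text collection
fl_list = build_fl_list({"be": 900, "who": 500, "you": 700})
FL = {lemma: n for n, lemma in enumerate(fl_list)}
```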
      </sec>
      <sec id="sec-2-3">
        <title>2.3 Index type</title>
        <p>We create indexes of different types for different types
of lemmas. Let MaxDistance be a parameter that can take
a value of 5, 7 or even greater.</p>
        <p>
          The expanded (f, s, t) index or three-component key
index [
          <xref ref-type="bibr" rid="ref11 ref16">11, 16</xref>
          ] is the list of occurrences of the lemma f for
which lemmas s and t both occur in the text at distances
less than or equal to MaxDistance from f.
        </p>
        <p>We create an expanded (f, s, t) index only for the case
in which f ≤ s ≤ t. Here, f, s, and t are all stop lemmas.</p>
        <p>Each posting includes the distance between f and s
in the text and the distance between f and t in the text.</p>
        <p>
          The expanded (w, v) index or two-component key
index [
          <xref ref-type="bibr" rid="ref13 ref14 ref15">13, 14, 15</xref>
          ] is the list of occurrences of the lemma
w for which lemma v occurs in the text at a distance less
than or equal to MaxDistance from w.
        </p>
        <p>The lemma types considered are as follows: for w,
frequently used, and for v, frequently used or ordinary.
Each posting includes the distance between w and v in
the text.</p>
        <p>
          Other types of additional indexes are described in
[
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3 A new search algorithm</title>
      <sec id="sec-3-1">
        <title>3.1 The general structure of the search algorithm</title>
        <p>Our search algorithm is described in Figure 2.</p>
        <p>Let us consider the search query “who are you who”.
After lemmatization, we have the following query:
[who] [are, be] [you] [who]. The word “are” has two
lemmas in our dictionary.</p>
        <p>
          With FL-numbers:
[who: 293] [are: 268, be: 21] [you: 47] [who: 293].
To use three-component key indexes, this query must be
divided into two subqueries [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]:
Q1: [who: 293] [are: 268], [you: 47] [who: 293], and
Q2: [who: 293] [be: 21], [you: 47] [who: 293].
        </p>
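The division into subqueries is a Cartesian product over the alternative lemmas of each query word and can be sketched as follows; the list-of-lists query representation is an assumption for the example.

```python
from itertools import product

def subqueries(lemmatized_query):
    """Produce one subquery per combination of alternative lemmas;
    a word with several lemmas multiplies the number of subqueries."""
    return [list(combination) for combination in product(*lemmatized_query)]

# "who are you who": the word "are" has the two lemmas "are" and "be"
query = [["who"], ["are", "be"], ["you"], ["who"]]
subs = subqueries(query)
```

This yields the subqueries Q1 = [who] [are] [you] [who] and Q2 = [who] [be] [you] [who].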
        <p>We can say that lemma “who” &gt; “you” because
FL(who) = 293, FL(you) = 47, and 293 &gt; 47. We use the
FL-numbers to establish the order of the lemmas in the
set of all lemmas.</p>
        <p>
          In [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], we defined several query types depending on
the types of lemmas they contain and different search
algorithms for these query types. The query does not
need to be divided into a set of subqueries for all query
types.
        </p>
        <p>In this paper, we consider subqueries that consist only
of stop lemmas.</p>
        <p>After step 2, we evaluate the subqueries in the loop.</p>
        <p>After all subqueries are evaluated, their results need
to be combined into the final result set.</p>
        <p>The algorithm of the subquery evaluation, when the
subquery consists of only stop lemmas, is described in
Figure 3.</p>
        <p>We need to select the three-component key indexes
required to evaluate the subquery.</p>
        <p>For all selected indexes, we need to create an iterator
object. The iterator object for the key (f, s, t) is used to
read the posting list of the (f, s, t) key from the start to
the end. The iterator object IT has the method IT.Next,
which reads the next record from the posting list.</p>
        <p>The iterator object IT has the property IT.Value that
contains the current record (ID, P, D1, D2).
Consequently, IT.Value.ID is the ID of the document
containing the key, and IT.Value.P is the position of the
key in the document.</p>
        <p>For two postings A = (A.ID, A.P, A.D1, A.D2) and B
= (B.ID, B.P, B.D1, B.D2), we define that A &lt; B when
one of the following conditions are met: A.ID &lt; B.ID or
(A.ID = B.ID and A.P &lt; B.P).</p>
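The posting order defined above can be sketched as follows, with postings represented as tuples (ID, P, D1, D2); the helper name posting_less is an assumption.

```python
def posting_less(a, b):
    """A posting precedes another if its document ID is smaller, or the
    IDs are equal and its position within the document is smaller."""
    return a[0] < b[0] or (a[0] == b[0] and a[1] < b[1])

A = (3, 10, 1, 2)   # (ID, P, D1, D2)
B = (3, 15, -3, -2)
```

Equivalently, Python’s lexicographic tuple comparison on the (ID, P) prefix of each posting gives the same order.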
        <p>The records (ID, P, D1, D2) are stored in the posting
list for the given key in increasing order.</p>
        <p>The goal of the Equalize procedure is to ensure that
all iterators have an equal value of Value.ID = DID.
After that, we can perform the search in the document.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2 Index selection</title>
        <p>To evaluate the subquery, we need to select keys for the
three-component key indexes.</p>
        <p>The query can be divided into a set of
three-component keys. Let the first three lemmas of the query
define the first key. Let the next three lemmas of the
query define the second key, and so on.</p>
        <p>For the cases when the length of the query is not an
exact multiple of 3, the last key is always defined by the
last three lemmas of the query.</p>
        <p>All selected keys must be normalized.</p>
        <p>For example, let us consider the subquery [who] [are]
[you] [who]. We can use keys (who, are, you) and (are*,
you*, who).</p>
        <p>For any three stop lemmas f, s and t, we have the (f,
s, t) index only for the case in which f ≤ s ≤ t. We call the
(f, s, t) key with the aforementioned condition the
normalized key.</p>
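The key selection and normalization described in this subsection can be sketched as follows; the “*” marker denotes a lemma already covered by a previous key, and the helper names are assumptions.

```python
def select_keys(lemmas):
    """Divide a subquery into groups of three lemmas. If the length is
    not a multiple of 3, the last key is formed from the last three
    lemmas, marking the overlapping lemmas with '*'."""
    keys, i = [], 0
    while i + 3 <= len(lemmas):
        keys.append(tuple(lemmas[i:i + 3]))
        i += 3
    if i < len(lemmas):                  # a tail of 1 or 2 lemmas remains
        covered = 3 - (len(lemmas) - i)  # overlap with the previous key
        keys.append(tuple(lemma + "*" if k < covered else lemma
                          for k, lemma in enumerate(lemmas[-3:])))
    return keys

def normalize(key, FL):
    """Reorder the components so that f <= s <= t by FL-number."""
    return tuple(sorted(key, key=lambda lemma: FL[lemma.rstrip("*")]))
```

For the subquery [who] [are] [you] [who], this produces the keys (who, are, you) and (are*, you*, who), which normalize to (you, are, who) and (you*, are*, who).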
        <p>The normalized keys here are (you, are, who) and
(you*, are*, who).</p>
        <p>Let us consider the search query “Who are you and
why did you say what you did” and its subquery [who]
[are] [you] [and] [why] [do] [you] [say] [what] [you]
[do].</p>
        <p>In fact, we can find this query in C. S. Forester’s
novel “Lord Hornblower”.</p>
        <p>We can use the (who, are, you), (and, why, do), (you,
say, what), and (what*, you, do) indexes.</p>
        <p>The normalized keys are (you, are, who), (and, do,
why), (you, what, say), and (you, what*, do).</p>
        <p>We mark “what” by “*” in the last key to denote that
this lemma has already been taken into account by a
previous key.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3 Search in the document</title>
        <p>The algorithm of searching in the document is described
in Figure 4.</p>
        <p>Let DID be an argument of the “Search in the
document” procedure. Let us define that DID is the
identifier of the current document.</p>
        <p>
          The main difference between the search algorithm
from [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] and the new search algorithm is described here.
        </p>
        <p>For any lemma of the search query, we create an
intermediate list of postings in memory.</p>
        <p>For example, let us consider the three-component
index (you, are, who) and its iterator object.</p>
        <p>We create three intermediate posting lists: IL(you),
IL(are), and IL(who).</p>
        <p>To fill these intermediate posting lists, we need to
read postings from the (you, are, who) iterator object.</p>
        <p>A record from the (you, are, who) iterator object has
the format (ID, P, D1, D2), where ID is the identifier of
the document, P is the position of “you” in the document,
D1 is the distance from “are” to “you” in the text, and D2
is the distance from “who” to “you” in the text.</p>
        <p>If the lemma “are” occurs in the text after the lemma
“you”, then D1 &gt; 0; otherwise, D1 &lt; 0.</p>
        <p>If the lemma “who” occurs in the text after the lemma
“you”, then D2 &gt; 0; otherwise, D2 &lt; 0.</p>
        <p>We need to read from the iterator object all records
with ID = DID.</p>
        <p>We can produce three records from the (ID, P, D1,
D2) record.</p>
        <p>We need to store the (P) record in the IL(you)
intermediate posting list.</p>
        <p>We need to store the (P + D1) record in the IL(are)
intermediate posting list.</p>
        <p>We need to store the (P + D2) record in the IL(who)
intermediate posting list.</p>
        <p>Let us consider the key (you, what*, do) with a
lemma marked by “*”. In this case, we create only two
intermediate posting lists, namely, IL(you) and IL(do).
The (what*) component is already taken into account by
a previous key.</p>
        <p>Since we have an intermediate posting list for each
lemma of the subquery, the search is straightforward
and similar to a search in an ordinary inverted file.</p>
        <p>We can also say that an intermediate posting list is a
kind of iterator object. The intermediate posting list
object IL has the method IL.Next, which reads the next
record from the posting list.</p>
        <p>The intermediate posting list object IL has the
property IL.Value that contains the current record (P),
where P is the position of its lemma in the document.</p>
        <p>Let IL.Value be equal to SIZE_MAX when all records
are read from the IL object, where SIZE_MAX is some
large number.</p>
        <p>In the loop, we perform the following steps.
1. Let MinIL be the intermediate posting list with the
minimum value of Value. Let S = MinIL.Value.
2. Let MaxIL be the intermediate posting list with the
maximum value of Value. Let E = MaxIL.Value.
3. If there are no more records in MinIL, then exit from
the search.
4. Execute MinIL.Next().
5. If MinIL.Value &gt; E, then execute Process(S, E).
6. Go to step 1.</p>
        <p>The Process(S, E) procedure adds the (DID, S, E)
record into the result set. S is the position of the start of
the fragment of text within the document that contains
the query. E is the position of the end of the fragment of
text within the document that contains the query.</p>
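Steps 1-6 of the loop above can be sketched as follows; representing each intermediate posting list as a sorted Python list with a SIZE_MAX sentinel is an assumption, and the MaxDistance filtering of fragments is omitted for brevity.

```python
SIZE_MAX = 10 ** 12  # sentinel: all records of a list have been read

def search_in_document(doc_id, lists, process):
    """Scan the intermediate posting lists of a document and report
    candidate fragments (DID, S, E) that contain the query."""
    iterators = {lemma: iter(pl) for lemma, pl in lists.items()}
    value = {lemma: next(it, SIZE_MAX) for lemma, it in iterators.items()}
    while True:
        min_lemma = min(value, key=value.get)                    # step 1
        start = value[min_lemma]
        end = max(value.values())                                # step 2
        if start == SIZE_MAX:                                    # step 3
            return
        value[min_lemma] = next(iterators[min_lemma], SIZE_MAX)  # step 4
        if value[min_lemma] > end:                               # step 5
            process(doc_id, start, end)                          # step 6: repeat

fragments = []
search_in_document(7, {"to": [0, 4], "be": [1, 5], "or": [2, 6]},
                   lambda did, s, e: fragments.append((did, s, e)))
```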
      </sec>
      <sec id="sec-3-4">
        <title>3.4 Intermediate posting list data ordering</title>
        <p>The records (P) must be stored in an intermediate posting
list for the given lemma in increasing order.</p>
        <p>From this requirement, the following problem arises.
Consider the text “to be or not to be or”.</p>
        <p>Let the position of a word in the text be its ordinal
number starting with zero.</p>
        <p>When we create the three-component key index, the
following records must be stored for the key (to, be, or).
The records in the format (position of “to”, position of
“be”, position of “or”) are presented below.</p>
        <p>(to, be, or): (0, 1, 2), (0, 5, 6), (4, 1, 2), and (4, 5, 6).
From this posting list, we can create the following three
intermediate posting lists.
(to): 0, 0, 4, 4; (be): 1, 5, 1, 5; and (or): 2, 6, 2, 6.</p>
        <p>Only the intermediate posting list for the first
component of the key is ordered in increasing order.</p>
        <p>Please note that the postings in the three-component
key index will actually be encoded in the (ID, P, D1, D2)
format. For the (to, be, or) key, we will write the
following posting list.</p>
        <p>(to, be, or): (ID, 0, 1, 2), (ID, 0, 5, 6), (ID, 4, –3, –2),
(ID, 4, 1, 2).</p>
        <p>
          To solve the aforementioned problem, we create two
binary heaps [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. The first binary heap we create for the
second component of the key. The second binary heap
we create for the third component of the key.
        </p>
        <p>Therefore, we will create the (be) binary heap and the
(or) binary heap.</p>
        <p>We limit the binary heap length by MaxDistance × 2.</p>
        <p>When we need to read postings from the (to, be, or)
posting list, we do the following in a loop.
1. Read the next posting (ID, P, D1, D2) from the
posting list (to, be, or).
2. Write (P) into the (to) intermediate posting list.
3. Write (P + D1) into the (be) binary heap.
4. If the length of the (be) binary heap &gt; MaxDistance
× 2, then remove the first element (the minimum
element) from this binary heap and write it into the
(be) intermediate posting list.
5. Write (P + D2) into the (or) binary heap.
6. If the length of the (or) binary heap &gt; MaxDistance ×
2, then remove the first element (the minimum
element) from this binary heap and write it into the
(or) intermediate posting list.
7. Go to step 1.</p>
        <p>Let us consider a key (f, s, t) and its posting list L.
To process L, we create three intermediate posting lists
and two binary heaps as follows.
1. Intermediate posting list F for f.
2. Intermediate posting list S and binary heap SH for s.
3. Intermediate posting list T and binary heap TH for t.</p>
        <p>Let us introduce the methods PopMin and Length of
a binary heap object. The PopMin method returns the
minimum element from the binary heap and removes this
element from the binary heap. The Length method
returns the length of the binary heap.</p>
        <p>In Figure 5, we present the UML diagram of the
posting list L reading process.</p>
        <p>After all postings from L are read, we need to write
all elements from the binary heaps to their intermediate
posting lists.</p>
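For the example text “to be or not to be or”, the reading process can be sketched as follows; Python’s heapq module plays the role of the binary heaps, and the record layout is the (ID, P, D1, D2) format above.

```python
import heapq

MaxDistance = 5

def fill_intermediate_lists(postings):
    """Read a (f, s, t) posting list for one document and build the
    sorted intermediate posting lists F, S, and T. Bounded binary
    heaps restore increasing order for the second and third
    components of the key."""
    F, S, T = [], [], []
    s_heap, t_heap = [], []
    limit = MaxDistance * 2
    for _id, p, d1, d2 in postings:
        F.append(p)                    # first component is already ordered
        heapq.heappush(s_heap, p + d1)
        if len(s_heap) > limit:        # emit the current minimum
            S.append(heapq.heappop(s_heap))
        heapq.heappush(t_heap, p + d2)
        if len(t_heap) > limit:
            T.append(heapq.heappop(t_heap))
    S.extend(sorted(s_heap))           # flush the remaining heap elements
    T.extend(sorted(t_heap))
    return F, S, T

# postings of the key (to, be, or) for "to be or not to be or"
postings = [(9, 0, 1, 2), (9, 0, 5, 6), (9, 4, -3, -2), (9, 4, 1, 2)]
F, S, T = fill_intermediate_lists(postings)
```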
      </sec>
      <sec id="sec-3-5">
        <title>3.5 Advantages of the new algorithm</title>
        <p>
          The new algorithm may require fewer
posting lists to evaluate a search query in comparison
with the algorithm from [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
Let Q be a subquery of m lemmas. Let n be the total
number of postings to read for the query evaluation.
        </p>
        <p>
          For each posting, we need to use it in the Equalize
procedure. In [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], the author states that the cost of such
usage is O(log(m)).
        </p>
        <p>For each posting, we need to add it to the three
intermediate posting lists. The cost of this process is
O(log(MaxDistance)) (see 3.4).</p>
        <p>For each posting, we need to use it when searching in
a document. The cost of this process is O(log(m)) (see
3.3).</p>
        <p>The final cost of the subquery evaluation is
O(n ∙ (log(m) + log(MaxDistance))) =
O(n ∙ log(max(m, MaxDistance))).</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Search experiments</title>
      <sec id="sec-4-1">
        <title>4.1 Search experiment environment</title>
        <p>
          All search experiments were conducted using a
collection of texts from [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. The total size of the text
collection is 71.5 GB. The text collection consists of
195 000 documents of plain text, fiction and magazine
articles. We use MaxDistance = 5, SWCount = 700, and
FUCount = 2100. The search experiments were
conducted using the experimental methodology from
[
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
        </p>
        <p>
          We assume that in typical texts, words are distributed
similarly, in accordance with Zipf’s law [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]. Therefore,
the results obtained with our text collection will be
relevant to other collections.
        </p>
        <p>We used the following computational resources:
CPU: Intel(R) Core(TM) i7 CPU 920 @ 2.67 GHz.
HDD: 7200 RPM. RAM: 24 GB.
OS: Microsoft Windows 2008 R2 Enterprise.</p>
        <p>We created the following indexes.</p>
        <p>
          Idx1: the ordinary inverted index without any
improvements, such as NSW records [
          <xref ref-type="bibr" rid="ref11 ref14">11, 14</xref>
          ]. The total
size is 95 GB.
        </p>
        <p>Idx2: our indexes, including the ordinary inverted
index with the NSW records and the (w, v) and (f, s, t)
indexes, where MaxDistance = 5. The total size is 746
GB.</p>
        <p>Please note that the total size of each type of index
includes the size of the repository (indexed texts in
compressed form), which is 47.2 GB.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2 Search results</title>
        <p>
          There were 975 queries, all of which consisted only of
stop lemmas. The query set was selected as in [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. All
searches were performed in a single program thread. We
searched all queries from the query set with different
types of indexes to estimate the performance gains of our
indexes.
        </p>
        <p>The query length was from 3 to 5 words.</p>
        <p>
          Studies by Spink et al. [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] have shown that queries
with lengths greater than 5 are very rare. In [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], query
logs of a search system were analyzed, and it was
established that queries with a length of 6 represent
approximately 1% of all queries and that fewer than 4%
of all queries had more than 6 terms.
        </p>
        <p>We performed the following experiments.</p>
        <p>SE1: all queries are evaluated using the standard inverted
index Idx1.</p>
        <p>
          SE2.1: all queries are evaluated using Idx2 and the
algorithm from [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
        </p>
        <p>SE2.2: all queries are evaluated using Idx2 and the new
algorithm presented in this paper.</p>
        <p>Average query times:
SE1: 31.27 sec., SE2.1: 0.33 sec., and SE2.2: 0.29 sec.</p>
        <p>Average data read sizes per query:
SE1: 745 MB, SE2.1: 8.45 MB, and SE2.2: 6.82 MB.</p>
        <p>Average numbers of postings per query:
SE1: 193 million, SE2.1: 765 thousand, and SE2.2: 559
thousand.</p>
        <p>We improved the query processing time by a factor
of 94.7 with the SE2.1 algorithm and by a factor of 107.8
with the SE2.2 algorithm (see Figure 6).</p>
        <p>The left-hand bar shows the average query execution
time with the standard inverted indexes. The subsequent
bars show the average query execution times with our
indexes with the SE2.1 and SE2.2 algorithms. Our bars
are much smaller than the left-hand bar because our
searches are very quick.</p>
        <p>We improved the data read size per query by a factor
of 88 with SE2.1 and by a factor of 109.2 with SE2.2 (see
Figure 7).</p>
        <p>Figure 7 Average data read sizes per query for SE1,
SE2.1, and SE2.2 (MB).</p>
        <p>The left-hand bar shows the average data read size per
query with SE1. The subsequent bars show the average
data read size per query with SE2.1 and SE2.2.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3 Comparison between three-component key indexes and two-component key indexes</title>
        <p>We created another additional index especially for this
experiment.</p>
        <p>Idx3: two-component key indexes (w, v), where
MaxDistance = 5, SWCount = 0, and FUCount = 700.</p>
        <p>The total index size is 275 GB.</p>
        <p>In this case, for any two lemmas w and v, where w ≤
v, FL(w) &lt; 700, and FL(v) &lt; 700, we have a
two-component key index (w, v).</p>
        <p>Each posting in this index includes the distance
between w and v in the text.</p>
        <p>Such w and v lemmas are stop lemmas for Idx2.</p>
        <p>We performed the following experiment:
SE3: all 975 aforementioned queries are evaluated using
Idx3 and the new algorithm presented in this paper,
adapted for two-component key indexes.</p>
        <p>We processed in SE3 the same query set that we
already processed in SE2.1 and SE2.2 but used
two-component key indexes instead of three-component
key indexes.</p>
        <p>Average query times: SE3: 3.75 sec. (see Figure 8).
Average data read sizes per query: SE3: 105.17 MB.
Average numbers of postings per query:
SE3: 12.761 million.</p>
        <p>In this experiment, we compared SE2.1 and SE2.2
against SE3. We improved the query processing time by
a factor of 11.36 with the SE2.1 algorithm and by a factor
of 12.93 with the SE2.2 algorithm in comparison with the
two-component key index (SE3) case (see Figure 8).</p>
        <p>
          The left-hand bar shows the average query execution
time with the three-component key indexes using the
algorithm from [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. The center bar shows the average
query execution time with the three-component key
indexes using the new algorithm from this paper. The
right-hand bar shows the average query execution time
with the two-component key indexes.
        </p>
        <p>The bars that related to the three-component key
indexes are much smaller than the right-hand bar because
the three-component key indexes enable much quicker
searches than the two-component key indexes.</p>
        <p>This experiment shows that three-component key
indexes are an order of magnitude more effective
than two-component indexes when evaluating queries
that consist of stop lemmas.</p>
        <p>We improved the data read size per query by a factor
of 12.44 with SE2.1 and by a factor of 15.42 with SE2.2
in comparison with the two-component key index (SE3)
case (see Figure 9).</p>
        <p>The left-hand bar shows the average data read size per
query with SE2.1. The subsequent bars show the average
data read size per query with SE2.2 and SE3.</p>
        <p>We show the average query execution time for all
experiments in Figure 10.</p>
        <p>The left-hand bar shows the average query execution
time with the standard inverted indexes. The two
subsequent bars show the average query execution times
with the three-component key indexes with the SE2.1
and SE2.2 algorithms. The right-hand bar shows the
average query execution time with the two-component
key indexes in experiment SE3.</p>
        <p>The author decided not to use logarithmic scales in
the diagrams because users do not think in logarithmic
scales when evaluating the search speed of a search
system.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5 Conclusion and future work</title>
      <p>A query that contains frequently occurring words
induces performance problems. To solve these
performance problems and to satisfy the demands of
users, we developed and elaborated three-component
key indexes.</p>
      <p>
        In this paper, we investigated searches with queries
that contain only stop lemmas. Other query types are
studied in [
        <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
        ].
      </p>
      <p>
        As we discussed in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], three-component key
indexes are an important and integral part of our
comprehensive full-text search methodology, which
comprises three-component key index search methods
and other search methods from [
        <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
        ].
      </p>
      <p>
        In this paper, we have introduced an optimized
algorithm for full-text searches in comparison with [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
These algorithms are novel, and to the best of our
knowledge, no alternative implementations exist.
      </p>
      <p>We have presented the results of experiments
showing that when queries contain only stop lemmas, the
average time of the query execution with our indexes is
107.8 times less (with the value of MaxDistance = 5) than
that required when using ordinary inverted indexes.</p>
      <p>We have presented the results of experiments
showing that when queries contain only stop lemmas, the
average time of the query execution with our indexes is
12.93 times less (with the value of MaxDistance = 5) than
that required when using two-component key indexes.</p>
      <p>The last experiment demonstrates that three-component
key indexes substantially outperform two-component key
indexes and cannot be replaced by them. This is the
reason why we implemented three-component key indexes
to solve the proximity full-text search task.</p>
      <p>
        In the future, it will be interesting to investigate other
types of queries in more detail and to optimize index
creation algorithms for larger values of MaxDistance. It
will also be important to investigate how the proposed
indexing structure can be used by modern ranking
algorithms. The author assumes that based on Zipf’s law
[
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], our test text collection is sufficient for
evaluating search performance. Nevertheless, to
investigate the behavior of ranking algorithms, we plan
to use collections such as TREC GOV and GOV2, which are
designed for evaluating search quality.
      </p>
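      <p>The Zipf's-law assumption above can be made concrete with a small back-of-the-envelope calculation (illustrative numbers only, not measurements from the test collection): if the frequency of the word of rank r is proportional to 1/r, then a small set of the most frequent words covers a large share of all word occurrences, which is why queries built from stop lemmas stress any search system regardless of the particular corpus.</p>

```python
# Zipf's law: frequency of the rank-r word is proportional to 1/r.
# Estimate the share of all occurrences covered by the top-k words
# in a vocabulary of n words (illustrative, not corpus data).

def harmonic(n):
    return sum(1.0 / r for r in range(1, n + 1))

def top_k_share(k, vocab_size):
    return harmonic(k) / harmonic(vocab_size)

share = top_k_share(100, 1_000_000)
print(f"top 100 of 1,000,000 words cover about {share:.0%} of occurrences")
```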
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>V.N.</given-names>
            <surname>Anh</surname>
          </string-name>
          , O. de Kretser, A. Moffat:
          <article-title>Vector-Space Ranking with Effective Early Termination</article-title>
          .
          <source>In: SIGIR '01 Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , New Orleans, Louisiana, USA, September 9-12,
          <year>2001</year>
          , pp.
          <fpage>35</fpage>
          -
          <lpage>42</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Buttcher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Clarke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lushman</surname>
          </string-name>
          :
          <article-title>Term proximity scoring for ad-hoc retrieval on very large text collections</article-title>
          .
          <source>In: SIGIR'</source>
          <year>2006</year>
          , pp.
          <fpage>621</fpage>
          -
          <lpage>622</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Bahle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.E.</given-names>
            <surname>Williams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zobel</surname>
          </string-name>
          .
          <article-title>Efficient Phrase Querying with an Auxiliary Index</article-title>
          .
          <source>In: SIGIR '02 Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Tampere, Finland, August 11-15</source>
          ,
          <year>2002</year>
          , pp.
          <fpage>215</fpage>
          -
          <lpage>221</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Garcia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.E.</given-names>
            <surname>Williams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cannane</surname>
          </string-name>
          :
          <article-title>Access-Ordered Indexes</article-title>
          .
          <source>In: ACSC '04 Proceedings of the 27th Australasian Conference on Computer Science</source>
          , Dunedin, New Zealand,
          <source>January 18-22</source>
          ,
          <year>2004</year>
          , pp.
          <fpage>7</fpage>
          -
          <lpage>14</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B.J.</given-names>
            <surname>Jansen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Spink</surname>
          </string-name>
          , T. Saracevic:
          <article-title>Real life, real users and real needs: A study and analysis of user queries on the Web</article-title>
          .
          <source>Information Processing and Management</source>
          ,
          <volume>36</volume>
          (
          <issue>2</issue>
          ):
          <fpage>207</fpage>
          -
          <lpage>227</lpage>
          (
          <year>2000</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R. W. P.</given-names>
            <surname>Luk</surname>
          </string-name>
          :
          <article-title>Scalable, statistical storage allocation for extensible inverted file construction</article-title>
          .
          <source>Journal of Systems and Software</source>
          ,
          <volume>84</volume>
          (
          <issue>7</issue>
          ):
          <fpage>1082</fpage>
          -
          <lpage>1088</lpage>
          (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R.B.</given-names>
            <surname>Miller</surname>
          </string-name>
          :
          <article-title>Response Time in Man-Computer Conversational Transactions</article-title>
          .
          <source>In Proceedings: AFIPS Fall Joint Computer Conference</source>
          . San Francisco, California, December 9-11,
          <year>1968</year>
          , vol
          <volume>33</volume>
          , pp.
          <fpage>267</fpage>
          -
          <lpage>277</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Rasolofo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Savoy</surname>
          </string-name>
          :
          <article-title>Term Proximity Scoring for Keyword-Based Retrieval Systems</article-title>
          .
          <source>In: European Conference on Information Retrieval (ECIR) 2003: Advances in Information Retrieval</source>
          , pp.
          <fpage>207</fpage>
          -
          <lpage>218</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R.</given-names>
            <surname>Schenkel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Broschart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hwang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Theobald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Weikum</surname>
          </string-name>
          :
          <article-title>Efficient text proximity search</article-title>
          .
          <source>In: String processing and information retrieval. 14th International Symposium. SPIRE 2007. October 29-31</source>
          ,
          <year>2007</year>
          . Lecture notes in computer science. Santiago de Chile, Chile, Springer, Berlin, Heidelberg, vol.
          <volume>4726</volume>
          , pp.
          <fpage>287</fpage>
          -
          <lpage>299</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Tomasic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Garcia-Molina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Shoens</surname>
          </string-name>
          .
          <article-title>Incremental updates of inverted lists for text document retrieval</article-title>
          .
          <source>SIGMOD '94 Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data. Minneapolis</source>
          , Minnesota, May 24-27,
          <year>1994</year>
          , pp.
          <fpage>289</fpage>
          -
          <lpage>300</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.B.</given-names>
            <surname>Veretennikov</surname>
          </string-name>
          :
          <article-title>Proximity full-text search with response time guarantee by means of three component keys</article-title>
          .
          <source>Bulletin of the South Ural State University. Series: Computational Mathematics and Software Engineering</source>
          .
          <volume>7</volume>
          (
          <issue>1</issue>
          ):
          <fpage>60</fpage>
          -
          <lpage>77</lpage>
          (
          <year>2018</year>
          )
          (In Russian)
          . doi: http://dx.doi.org/10.14529/cmse180105.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.B.</given-names>
            <surname>Veretennikov</surname>
          </string-name>
          :
          <article-title>O poiske fraz i naborov slov v polnotekstovom indekse (About phrases search in full-text index)</article-title>
          .
          <source>Control systems and information technologies</source>
          ,
          <volume>48</volume>
          (
          <issue>2</issue>
          .1):
          <fpage>125</fpage>
          -
          <lpage>130</lpage>
          (
          <year>2012</year>
          )
          (In Russian)
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.B.</given-names>
            <surname>Veretennikov</surname>
          </string-name>
          :
          <article-title>Ispol'zovanie dopolnitel'nykh indeksov dlya bolee bystrogo polnotekstovogo poiska fraz, vklyuchayushchikh chasto vstrechayushchiesya slova (Using additional indexes for fast full-text searching phrases that contains frequently used words)</article-title>
          .
          <source>Control Systems and Information Technologies</source>
          ,
          <volume>52</volume>
          (
          <issue>2</issue>
          ):
          <fpage>61</fpage>
          -
          <lpage>66</lpage>
          (
          <year>2013</year>
          )
          (In Russian)
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.B.</given-names>
            <surname>Veretennikov</surname>
          </string-name>
          :
          <article-title>Effektivnyi polnotekstovyi poisk s ispol'zovaniem dopolnitel'nykh indeksov chasto vstrechayushchikhsya slov (Efficient fulltext search by means of additional indexes of frequently used words)</article-title>
          .
          <source>Control Systems and Information Technologies</source>
          ,
          <volume>66</volume>
          (
          <issue>4</issue>
          ):
          <fpage>52</fpage>
          -
          <lpage>60</lpage>
          (
          <year>2016</year>
          )
          (In Russian)
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.B.</given-names>
            <surname>Veretennikov</surname>
          </string-name>
          :
          <article-title>Sozdanie dopolnitel'nykh indeksov dlya bolee bystrogo polnotekstovogo poiska fraz, vklyuchayushchikh chasto vstrechayushchiesya slova (Creating additional indexes for fast full-text searching phrases that contains frequently used words)</article-title>
          .
          <source>Control systems and information technologies</source>
          ,
          <volume>63</volume>
          (
          <issue>1</issue>
          ):
          <fpage>27</fpage>
          -
          <lpage>33</lpage>
          (
          <year>2016</year>
          )
          (In Russian)
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.B.</given-names>
            <surname>Veretennikov</surname>
          </string-name>
          :
          <article-title>Effektivnyi polnotekstovyi poisk s uchetom blizosti slov pri pomoshchi trekhkomponentnykh klyuchei (Efficient full-text proximity search by means of three component keys)</article-title>
          .
          <source>Control systems and information technologies</source>
          ,
          <volume>69</volume>
          (
          <issue>3</issue>
          ):
          <fpage>25</fpage>
          -
          <lpage>32</lpage>
          (
          <year>2017</year>
          )
          (In Russian)
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J.W.J.</given-names>
            <surname>Williams</surname>
          </string-name>
          :
          <article-title>Algorithm 232: Heapsort</article-title>
          .
          <source>Communications of the ACM</source>
          .
          <volume>7</volume>
          (
          <issue>6</issue>
          ):
          <fpage>347</fpage>
          -
          <lpage>348</lpage>
          (
          <year>1964</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>H.E.</given-names>
            <surname>Williams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zobel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bahle</surname>
          </string-name>
          .
          <article-title>Fast Phrase Querying with Combined Indexes</article-title>
          .
          <source>ACM Transactions on Information Systems (TOIS)</source>
          ,
          <volume>22</volume>
          (
          <issue>4</issue>
          ):
          <fpage>573</fpage>
          -
          <lpage>594</lpage>
          (
          <year>2004</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>H.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , T. Suel,
          <string-name>
            <given-names>J.-R.</given-names>
            <surname>Wen</surname>
          </string-name>
          :
          <article-title>Efficient Term Proximity Search with Term-Pair Indexes</article-title>
          .
          <source>In: CIKM '10 Proceedings of the 19th ACM International Conference on Information and Knowledge Management</source>
          , Toronto, ON, Canada,
          <source>October 26-30</source>
          ,
          <year>2010</year>
          , pp.
          <fpage>1229</fpage>
          -
          <lpage>1238</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>G.</given-names>
            <surname>Zipf</surname>
          </string-name>
          :
          <article-title>Relative Frequency as a Determinant of Phonetic Change</article-title>
          .
          <source>Harvard Studies in Classical Philology</source>
          ,
          <volume>40</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>95</lpage>
          ,
          <year>1929</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zobel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moffat</surname>
          </string-name>
          :
          <article-title>Inverted files for text search engines</article-title>
          .
          <source>ACM computing surveys</source>
          ,
          <year>2006</year>
          ,
          <volume>38</volume>
          (
          <issue>2</issue>
          ), Article 6.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>