=Paper=
{{Paper
|id=Vol-2790/paper30
|storemode=property
|title=Selection of Optimal Parameters in the Fast K-Word Proximity Search Based on Multi-component Key Indexes
|pdfUrl=https://ceur-ws.org/Vol-2790/paper30.pdf
|volume=Vol-2790
|authors=Alexander B. Veretennikov
|dblpUrl=https://dblp.org/rec/conf/rcdl/Veretennikov20
}}
==Selection of Optimal Parameters in the Fast K-Word Proximity Search Based on Multi-component Key Indexes==
Selection of Optimal Parameters in the Fast K-Word Proximity Search Based on Multi-component Key Indexes

Alexander B. Veretennikov [0000-0002-3399-1889]
Ural Federal University, 620002 Mira street, Yekaterinburg, Russia
alexander@veretennikov.ru

Abstract. Proximity full-text search is commonly implemented in contemporary full-text search systems. Let us assume that the search query is a list of words. It is natural to consider a document as relevant if the queried words are near each other in the document. The proximity factor is even more significant for the case where the query consists of frequently occurring words. Proximity full-text search requires the storage of information for every occurrence in documents of every word that the user can search. For every occurrence of every word in a document, we employ additional indexes to store information about nearby words, that is, the words that occur in the document at distances from the given word of less than or equal to the MaxDistance parameter. We showed in previous works that these indexes can be used to improve the average query execution time by up to 130 times for queries that consist of words occurring with high frequency. In this paper, we consider how both the search performance and the search quality depend on the value of MaxDistance and other parameters. The well-known GOV2 text collection is used in the experiments for reproducibility of the results. We propose a new index schema after the analysis of the results of the experiments.

Keywords: Full-text search · Inverted indexes · Proximity search · Term proximity · Query processing · DAAT.

1 Introduction

In full-text search, a query is a list of words. The result of the search is a list of documents containing these words. Consider a query that consists of words occurring with high frequency. In [16], we improved the average query processing time by a factor of 130 for these queries relative to traditional inverted indexes. A methodology for high-performance proximity full-text searches that covers different types of queries was discussed in [17].

The factor of proximity or nearness between the queried words in the indexed document plays an important role in modern information retrieval [19, 12, 3]. We assume that a document should contain query words near each other to be relevant for the user in the context of the search query. Taking this factor into account is essential if the query consists of frequently occurring words.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Some words occur in texts significantly more frequently than others. We can illustrate this [17] by referring to Zipf's law [20]. An example of a typical word occurrence distribution is presented in Fig. 1. The horizontal axis represents different words in decreasing order of their occurrence in texts. On the vertical axis, we plot the number of occurrences of each word. This peculiarity of language has a strong influence on the search performance.

The inverted index [21, 2] is a commonly used data structure for full-text searches. The traditional inverted index contains records (ID, P), where ID is the identifier of the document and P is the position of the word in the document, for example, the ordinal number of the word in the document. Such a record corresponds to an occurrence of a word in a document. These (ID, P) records are called "postings".
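To make this structure concrete, here is a minimal Python sketch of a word-level inverted index with (ID, P) postings; the function and the toy documents are our own illustration, not code from the paper:

```python
from collections import defaultdict

def build_word_level_index(documents):
    """Build a word-level inverted index: for every word, the list of
    (ID, P) postings, where ID identifies the document and P is the
    ordinal number of the word within that document."""
    index = defaultdict(list)
    for doc_id, text in enumerate(documents):
        for position, word in enumerate(text.lower().split()):
            index[word].append((doc_id, position))
    return index

docs = ["to be or not to be", "the question is hard"]
print(build_word_level_index(docs)["be"])  # [(0, 1), (0, 5)]
```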
The inverted index enables us to obtain the list of postings that corresponds to a given word. These lists of postings are used for the search.

For proximity full-text searches, we need to store an (ID, P) record for every occurrence of every word in the indexed document [15, 8, 9]. In other words, for proximity searches, we require a word-level inverted index instead of a document-level index [8]. Consequently, if a word occurs frequently in the texts, then its list of postings is long [17]. The query search time is proportional to the number of occurrences of the queried words in the indexed documents. To process a search query that contains words occurring with high frequency, a search system requires much more time, as shown on the left side of Fig. 1, than for a query that contains only ordinary words, as shown on the right side of Fig. 1.

Fig. 1. Example of a word frequency distribution.

According to [13], we can consider a full-text query as a "simple inquiry" [17]. In this case, we may require that the search results be provided within two seconds, as stated in [13], to prevent the interruption of the thought continuity of the user.

To enhance the performance, the following approaches can be used. Early-termination approaches can be employed for full-text searches [1, 11]. However, these methods are not effective in the case of proximity full-text searches [16]. Usually, early-termination approaches are applied to document-level indexes, and it is difficult to combine the early-termination approach with the incorporation of term-proximity data into relevance models.

Additional indexes can improve the search performance. In [14, 18], additional indexes were used to improve phrase searches. However, the approaches reported in [14, 18] cannot be used for proximity full-text searches; their area of application is limited to phrase searches. We have overcome this limitation. With our additional indexes, an arbitrary query can be executed very quickly [17].

In an example given in [16], we indexed a subset of the Project Gutenberg web site using Apache Lucene and Apache Tika and performed some searches. A query that contained words occurring with high frequency was evaluated within 21 seconds. On the other hand, another query that contained ordinary words was evaluated within 172 milliseconds. This difference is considerable, and our algorithms reported in [16, 17] can help address this issue.

The goals and questions of this paper are as follows:
1) We need to examine how search performance is improved with consideration of different values of MaxDistance and other parameters.
2) We need to investigate search performance on a commonly used text collection.
3) Does the search performance depend on the document size?
4) What other factors can affect the performance?
5) We evaluate the performance with respect to both short and long queries.

Key points of our research are the following. We use the word-level index. We use the DAAT approach [6, 10]. We include in the indexes information about all lemmas. Our indexes support incremental updates. We can store one posting list in several data streams. We read entire posting lists when searching, and no early termination is used.

2 Lemmatization and Lemma Type

For this paper, we use an English dictionary with approximately 92 thousand English lemmas. This dictionary is used by a morphology analyzer. The analyzer produces a list of numbers of lemmas, that is, basic or canonical forms, for every word from the dictionary.
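As a sketch of the analyzer's interface (the dictionary fragment and the lemma numbers below are hypothetical, chosen only to illustrate the word-to-lemma-numbers mapping):

```python
# Hypothetical fragment of the dictionary: word -> list of lemma numbers.
DICTIONARY = {
    "mine": [1042, 57],  # two lemmas, "mine" and "my" (see the example below)
    "days": [199],       # one lemma, "day"
}

def analyze(word):
    """Return the list of lemma numbers (basic or canonical forms) for a
    word, as the morphology analyzer described above would."""
    return DICTIONARY.get(word.lower(), [])

print(analyze("mine"))  # [1042, 57]
```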
Usually, a word has one lemma, but some words have several lemmas. For example, the word "mine" has two lemmas, namely, "mine" and "my".

Consider an array of all lemmas. Let us sort all lemmas in decreasing order of their occurrence frequency in the texts. We call the result of this sorting the FL-list [17]. The number of a lemma w in the FL-list is called its FL-number [17] and is denoted by FL(w). Let us say that lemma "earth" > "day" because FL(earth) = 309, FL(day) = 199, and 309 > 199. We use the FL-numbers to define the order of the lemmas in the collection of all lemmas.

In our search methodology [16], we defined three types of lemmas, namely, stop lemmas, frequently used lemmas and ordinary lemmas.

The first SWCount most frequently occurring lemmas are stop lemmas. Examples include "earth", "yes", "who", "day", "war", "time", "man" and "be". The next FUCount most frequently occurring lemmas are frequently used lemmas. Examples include "red", "beautiful", and "mountain". All other lemmas are ordinary lemmas, e.g., "fiber" and "undersea".

SWCount and FUCount are parameters. We use SWCount = 500 and FUCount = 1050 in the experiments described here. Let MaxDistance be a parameter that can take a value of 5, 7, 9 or even greater. The value of SWCount is very near the value of 421 from [7]. However, as we include information about all lemmas in the indexes, we can use very different values of these parameters.

If an ordinary lemma, q, occurs in the text so rarely that FL(q) is irrelevant, then we can write FL(q) = ∼. Here, "∼" denotes a large number.

3 Can Stop Words be Skipped?

Let us discuss the stop-word approach, in which some words occurring with high frequency, or their lemmas, are excluded from the search. We do not agree with this approach. A word cannot be excluded from the search because even a word occurring with high frequency can have a specific meaning in the context of a specific query [16, 18]. Therefore, excluding some words from the search can lead to search quality degradation or unpredictable effects [18]. Additionally, stop words are often employed in higher-order term proximity feature models [5].

Consider the query "who are you who" [17]. The Who are an English rock band, and "Who Are You" is one of their works. The word "Who" has a specific meaning in the context of this query. Therefore, in our approach, we include information about all of the words in the indexes. Moreover, we can easily see that modern search systems, such as Google, do not skip stop words in a search.

4 The Types of Additional Indexes

The three-component key (f, s, t) index [17] is the list of the occurrences of the lemma f for which lemmas s and t both occur in the text at distances that are less than or equal to MaxDistance from f. Each posting includes the distance between f and s in the text and the distance between f and t in the text. An (f, s, t) index is created only for the case in which f ≤ s ≤ t. Here, f, s, and t are all stop lemmas.

In [16], we considered queries consisting of words occurring with high frequency, and we showed that the average query execution time can be improved with three-component key indexes by up to 15.6 times relative to the time necessary using two-component key indexes only. Therefore, (f, s, t) indexes are required if we need to evaluate queries that contain words occurring with high frequency.
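A simplified sketch of how such (f, s, t) postings could be generated for one document; the helper names and the exact posting layout are our assumptions, not the paper's algorithm:

```python
from itertools import combinations

def build_fst_postings(doc_id, lemmas, is_stop, fl, max_distance):
    """For every occurrence of a stop lemma f, emit a posting under the
    key (f, s, t) for each pair of stop lemmas s, t occurring within
    max_distance of f; keys are kept only when FL(f) <= FL(s) <= FL(t).
    A posting stores the document ID, the position of f, and the
    distances from f to s and from f to t."""
    index = {}
    for i, f in enumerate(lemmas):
        if not is_stop(f):
            continue
        lo, hi = max(0, i - max_distance), min(len(lemmas), i + max_distance + 1)
        nearby = [j for j in range(lo, hi) if j != i and is_stop(lemmas[j])]
        for j, k in combinations(nearby, 2):
            (s, ds), (t, dt) = sorted(((lemmas[j], j - i), (lemmas[k], k - i)),
                                      key=lambda p: fl(p[0]))
            if fl(f) <= fl(s):  # enforce f <= s <= t by FL-number
                index.setdefault((f, s, t), []).append((doc_id, i, ds, dt))
    return index
```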
The two-component key (w, v) index [17] is the list of occurrences of the lemma w for which lemma v occurs in the text at a distance that is less than or equal to MaxDistance from w. Each posting includes the distance between w and v in the text. Here, w denotes a frequently used lemma, and v denotes a frequently used or ordinary lemma.

Let us consider the traditional index with near stop word (NSW) records. For each occurrence of each ordinary or frequently used lemma in each document, we include a posting record (ID, P, NSW) in the index. ID can be the ordinal number of the specific document, and P is the position of the word in the document, e.g., the ordinal number of the word in the document. The NSW record contains information about all high-frequency lemmas, that is, stop lemmas, occurring near position P in the document (at a distance that is less than or equal to MaxDistance from P). Examples of the indexes are given in [17].

The posting list of a key is stored in several data streams [17]. For the traditional index with NSW records, we can use up to three streams for one key: one stream for (ID), one for (P) and one for (NSW). On the other hand, for this index, we can use two streams: the first is (ID, P), and the second is (NSW). The actual choice depends on the length of the posting list. For short lists, we use two streams; for long lists, we use three streams. This architecture allows us to skip the NSW records when they are not required. For the (w, v) and (f, s, t) indexes, we use one or two data streams for every key.
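A sketch of the stream layout just described; the threshold separating "short" from "long" posting lists is our assumption, not a value from the paper:

```python
LONG_LIST_THRESHOLD = 1 << 16  # assumed boundary between short and long lists

def layout_streams(postings):
    """postings is a list of (doc_id, position, nsw_record) tuples.
    Short lists keep (ID, P) interleaved in one stream, with the NSW
    records in a second stream; long lists use three streams -- (ID),
    (P), (NSW) -- so a search that does not need the NSW records can
    skip that stream entirely."""
    nsw = [r for (_, _, r) in postings]
    if len(postings) < LONG_LIST_THRESHOLD:
        return {"ID,P": [(d, p) for (d, p, _) in postings], "NSW": nsw}
    return {"ID": [d for (d, _, _) in postings],
            "P": [p for (_, p, _) in postings],
            "NSW": nsw}
```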
5 Experiment

5.1 Indexes, Collection and Environment

We create the following indexes:

Idx0: the traditional inverted index without any enhancements, such as NSW records. The total size is 143 GB. This value includes the total size of the indexed texts in compressed form, which is 57.3 GB.

Idx5: our indexes, including the traditional inverted index with the NSW records and the (w, v) and (f, s, t) indexes, where MaxDistance = 5. The total size is 1.29 TB; the total size of the (w, v) index is 104 GB, the total size of the (f, s, t) index is 727 GB, and the total size of the traditional index with NSW records is 192 GB.

Idx7: our indexes, where MaxDistance = 7. The total size is 2.16 TB; the total size of the (w, v) index is 148 GB, the total size of the (f, s, t) index is 1.422 TB, and the total size of the traditional index with NSW records is 239 GB.

Idx9: our indexes, where MaxDistance = 9. The total size is 3.27 TB; the total size of the (w, v) index is 191 GB, the total size of the (f, s, t) index is 2.349 TB, and the total size of the traditional index with NSW records is 283 GB.

For the experiment, the GOV2 [4] text collection and the following queries are used: title queries from the TREC Robust Task 2004 (250 queries in total), title queries from the TREC Terabyte Task from 2004 to 2006 (150 queries in total), title queries from the TREC Web Task from 2009 to 2014 (300 queries in total), and queries from the TREC 2007 Million Query Track (10 000 queries in total). The total size of the query set after duplicate removal is 10 665 queries.

The GOV2 text collection contains 25 million documents. The total size of the collection is approximately 426 GB, and after HTML tag removal, there is approximately 167 GB of plain text. The average document text size is approximately 7 KB.

We used the following computational resources. CPU: Intel(R) Core(TM) i7 CPU 920 @ 2.67 GHz. HDD: 7200 RPM. RAM: 24 GB. OS: Microsoft Windows 2008 R2 Enterprise.

The query set can be divided into the following subsets depending on the lemmas in a concrete query [17]. All queries are evaluated within one program thread.

If we have a query with a length greater than MaxDistance, then we should divide it into several parts. For example, when the value of MaxDistance is 5, the query "to be or not to be that is the question" should be rewritten as "(to be or not to) (be that is the question)", and these two parts should be evaluated independently; then, the results should be combined.
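A minimal sketch of this splitting step (our own helper, reproducing the example above):

```python
def split_query(words, max_distance):
    """Divide a query longer than max_distance into consecutive parts of
    at most max_distance words; the parts are evaluated independently
    and the results are then combined."""
    return [words[i:i + max_distance] for i in range(0, len(words), max_distance)]

query = "to be or not to be that is the question".split()
print(split_query(query, 5))
# [['to', 'be', 'or', 'not', 'to'], ['be', 'that', 'is', 'the', 'question']]
```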
5.2 Q1. Only Stop Lemmas

Every query in this subset contains only stop lemmas. There are 119 queries in this subset. Examples include the following: "to be or not to be that is the question", "kids earth day activities". With FL-numbers, we have the following queries: [to: 9] [be: 7] [or: 38] [not: 64] [to: 9] [be: 7] [that: 40] [be: 7] [the: 1] [question: 305] and [kid: 447] [earth: 309] [day: 199] [activity: 247]. For these queries, the (f, s, t) indexes are used [16].

Average query processing times: Idx0: 51.4 s, Idx5: 0.82 s, Idx7: 0.86 s, Idx9: 1.05 s (see Fig. 2). Average data read sizes per query: Idx0: 1.3 GB, Idx5: 11.1 MB, Idx7: 15.6 MB, Idx9: 20.1 MB. Average numbers of postings per query: Idx0: 317.8 million, Idx5: 0.88 million, Idx7: 1.15 million, Idx9: 1.5 million.

We improved the query time by a factor of 62.7 with Idx5, by a factor of 59.4 with Idx7, and by a factor of 48.7 with Idx9 in comparison with Idx0.

Fig. 2. The average query execution times for Idx0, Idx5, Idx7, and Idx9 (seconds); the query subsets Q1, Q2, Q3.

5.3 Q2. Stop and Frequently Used and/or Ordinary Lemmas

Every query in this subset contains one or several stop lemmas. The query also contains some other lemmas that may be frequently used or ordinary. There are 7 244 queries in this subset. An example is the following: "History of Physicians in America". With FL-numbers, we have the following query: [history: 598] [of: 4] [physician: 1760] [in: 14] [America: 1391].

For these queries, we need to read one posting list with NSW records from the traditional index for some query lemma. This lemma is designated as the "main" lemma of the query. If there is a frequently used lemma in the query, then we can use two-component key indexes, like (physician, history), for the other lemmas. If the query consists of only stop and ordinary lemmas, then we use posting lists from the traditional index for the remaining ordinary lemmas, but without reading the NSW records. The details are described in [17].

Average query times: Idx0: 50 s, Idx5: 1.9 s, Idx7: 2 s, Idx9: 2.14 s. Average data read sizes per query: Idx0: 1.3 GB, Idx5: 64.3 MB, Idx7: 68.7 MB, Idx9: 76 MB. Average numbers of postings per query: Idx0: 324.7 million, Idx5: 3.7 million, Idx7: 3.4 million, Idx9: 3.3 million. The number of postings for Idx7 is less than that for Idx5 because fewer queries were divided into parts. Additionally, the sizes of the NSW records are different for Idx5, Idx7 and Idx9.

We improved the query processing time by a factor of 25.6 with Idx5, by a factor of 24.7 with Idx7, and by a factor of 23.3 with Idx9 in comparison with Idx0.

5.4 Q3. Only Frequently Used Lemmas

Every query in this subset contains only frequently used lemmas. There are 79 queries in this subset. An example is the following: "california mountain pass". With FL-numbers, we have the following query: [california: 518] [mountain: 704] [pass: 528]. Two-component (w, v) key indexes are used. For example, we can use the (california, pass) and (pass, mountain) two-component key indexes.

Average query times: Idx0: 2.5 s, Idx5: 0.32 s, Idx7: 0.33 s, Idx9: 0.34 s. Average data read sizes per query: Idx0: 53.3 MB, Idx5: 4.69 MB, Idx7: 4.79 MB, Idx9: 4.93 MB. Average numbers of postings per query: Idx0: 9.4 million, Idx5: 0.59 million, Idx7: 0.6 million, Idx9: 0.61 million.

We improved the query time by a factor of 7.83 with Idx5, by a factor of 7.58 with Idx7, and by a factor of 7.44 with Idx9 in comparison with Idx0.

5.5 Q4. Frequently Used Lemmas and Ordinary Lemmas

Every query in this subset contains frequently used lemmas and ordinary lemmas. There are 1 388 queries in this subset. An example is the following: "Scalable Vector Graphics". With FL-numbers, we have the following query: [scalable: ∼] [vector: 2953] [graphics: 921]. Each query contains a frequently used lemma; therefore, two-component key indexes can be used. For example, we can use the (graphics, scalable) and (graphics, vector) two-component key indexes.

Average query times: Idx0: 2.2 s, Idx5: 0.36 s, Idx7: 0.34 s, Idx9: 0.32 s. Average data read sizes per query: Idx0: 41.7 MB, Idx5: 1.5 MB, Idx7: 1.6 MB, Idx9: 1.8 MB. Average numbers of postings per query: Idx0: 8.1 million, Idx5: 0.17 million, Idx7: 0.17 million, Idx9: 0.17 million.

We improved the query time by a factor of 6 with Idx5, by a factor of 6.4 with Idx7, and by a factor of 6.8 with Idx9 in comparison with Idx0.

5.6 Q5. Only Ordinary Lemmas

Every query in this subset contains only ordinary lemmas. There are 1 835 queries in this subset. An example is the following: "Undersea Fiber Optic Cable". With FL-numbers, we have the following query: [undersea: 15873] [fiber: 3127] [optic: 2986] [cable: 2771]. We use the traditional index only for these queries. We do not need to read the NSW records because they are stored in separate data streams.

Average query times: Idx0: 0.789 s, Idx5: 0.819 s, Idx7: 0.8 s, Idx9: 0.81 s. Average data read sizes per query: Idx0: 14.2 MB, Idx5: 15.6 MB, Idx7: 15.7 MB, Idx9: 15.8 MB. Average numbers of postings per query: Idx0: 2.89 million, Idx5: 2.89 million, Idx7: 2.89 million, Idx9: 2.89 million. The query processing time for these queries does not need any improvement.

5.7 Results for the Entire Query Set

Average query times: Idx0: 34.9 s, Idx5: 1.51 s, Idx7: 1.57 s, Idx9: 1.66 s. Average data read sizes per query: Idx0: 0.93 GB, Idx5: 46.4 MB, Idx7: 49.5 MB, Idx9: 54.5 MB. Average numbers of postings per query: Idx0: 225.7 million, Idx5: 3.08 million, Idx7: 2.85 million, Idx9: 2.76 million.

We improved the query time by a factor of 23.1 with Idx5, by a factor of 22.4 with Idx7, and by a factor of 21 with Idx9 in comparison with Idx0.

5.8 Analysis of the Results

The average time needed for the search with Idx5, Idx7, and Idx9 is approximately 1-2 sec. for every query type. For Idx0, we need 2-2.5 sec. for the search if the query consists only of ordinary or frequently used lemmas. For the queries that contain any stop lemma (Q1, Q2), Idx0 requires approximately 50 sec. for the search on average.

This means that the search in Idx5, Idx7, and Idx9 is stable from the performance point of view. However, with Idx0, the search system has a performance problem if the query contains any high-frequency lemma. The difference in performance between Idx0 and the multi-component key indexes Idx5, Idx7 and Idx9 is larger in cases when more high-frequency lemmas occur in the queries.
In [16], we improved the average query execution time by up to 130 times for queries that consist of words occurring with high frequency. For TREC GOV2, we improved the average query execution time by up to 62.7 times for these queries (i.e., the query set Q1). Perhaps the difference in the "improvement factor" depends on the average document text size, which is approximately 7 KB for TREC GOV2 and 384 KB for the text collection that was used in [16]. The other factor is the length of the query. However, an improvement of 62.7 times in the average query execution time also seems good. Let us analyze this in more detail.

Let us consider the following example. We have two documents, D0 and D1, and a word w. For w, we have the following posting list in the traditional index: (0, 1), (0, 5), (0, 7), (1, 2), (1, 5). We use two very common encoding schemes.

The idea of the first scheme is as follows. We group the records that are related to a specific document and convert the original posting list into the following: (0, (1, 5, 7)), (1, (2, 5)). We do the same for the other kinds of indexes. Therefore, we store in the index the ID of the document and then the list of the word's positions.

The second scheme is delta-encoding. Consequently, we have the following: (0, (1, 4, 2)), (1, (2, 3)). For example, consider (1, 5, 7). Instead of 5, we store 4 = 5 − 1, where 1 is the previous value in the list.

The first scheme is much more effective for text collections that consist of large documents. This explains the difference in the "improvement factor" in the experiments with the two aforementioned collections. However, with our method, we can use any other encoding scheme and any other inverted index organization.
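Both schemes from this example can be written down in a few lines; a sketch (our own code) that reproduces the numbers above:

```python
def group_by_document(postings):
    """First scheme: convert (ID, P) postings into (ID, positions) groups."""
    grouped = []
    for doc_id, position in postings:
        if grouped and grouped[-1][0] == doc_id:
            grouped[-1][1].append(position)
        else:
            grouped.append((doc_id, [position]))
    return grouped

def delta_encode(grouped):
    """Second scheme: store each position as a delta from the previous one."""
    return [(doc_id, [ps[0]] + [b - a for a, b in zip(ps, ps[1:])])
            for doc_id, ps in grouped]

postings = [(0, 1), (0, 5), (0, 7), (1, 2), (1, 5)]
print(delta_encode(group_by_document(postings)))
# [(0, [1, 4, 2]), (1, [2, 3])] -- as in the example above
```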
In addition, the "improvement factor" can depend on the structure of the query set. To analyze this question, we performed an additional experiment. We formed another query set using the method from [16]: we selected a document from the TREC GOV2 collection and used its content to produce a set of 3 500 queries. We then performed our experiments again using this query set. The "improvement factor" was similar to the foregoing results that we have for TREC GOV2.

The next question is, how can we select the values of our parameters? The value of MaxDistance affects relevance and should be determined according to the selected relevance function [17]. The results of the experiments allow us to predict how a change of MaxDistance affects the search time and the index size. Let us now discuss how the values of SWCount and FUCount affect these metrics.

The value of SWCount is more important than the value of FUCount because it affects the Q1 and Q2 queries, which are the most complex from the performance point of view. Let us consider SWCount now, using the Q2 query subset.

We use Zipf's law [20] as a representation of our word occurrence distribution. According to this law, the second most frequently occurring lemma occurs half as often as the first, the third most frequently occurring lemma occurs 1/3 as often as the first, and so on. Let V be the number of occurrences of the most frequently occurring lemma. Therefore, our lemmas have the following numbers of occurrences: V, V/2, V/3, ...

Let us consider a query q = (q_1, q_2, ..., q_n) from Q2. Here, q_i is the number of the corresponding lemma in the FL-list. The query contains one or several stop lemmas and some other lemmas that may be frequently used or ordinary. To evaluate the query using the traditional index, we need to read $\sum_{i=1}^{n} V/q_i$ postings from the index. Let, without loss of generality, q_1 and q_2 be stop lemmas and q_n be the main lemma of the query. With the use of NSW records, we need to read $\sum_{i=3}^{n} V/q_i + (V/q_n) \times (NSWFactor - 1)$ postings. That means we do not need to read the posting lists for q_1 and q_2; however, for q_n, we need to read its posting list with NSW records. The size of the posting (ID, P, NSW) in bytes is up to NSWFactor = 4.5 times larger than the size of (ID, P); the experiments with MaxDistance = 5 show this. Therefore, we can calculate the "planned performance gain" as follows:

$PPG(q) = \left( \sum_{i=1}^{n} 1/q_i \right) \Big/ \left( \sum_{i=3}^{n} 1/q_i + (1/q_n) \times (NSWFactor - 1) \right)$.

If we have a query set Q, then we can calculate the average planned performance gain, $APPG(Q) = \frac{1}{|Q|} \sum_{q \in Q} PPG(q)$. Let us now use only Q2 queries for the estimations: if a query q is not a Q2 query, then let PPG(q) be 1. Let APPG(Q, SWCount) be the average planned performance gain calculated for a specific value of SWCount. For the query set that we use in this paper, i.e., 10 665 queries, we have the following: APPG(Q, 100) = 43, APPG(Q, 500) = 105, APPG(Q, 1000) = 153. However, this model does not take the (w, v) and (f, s, t) indexes into account. In the future, we plan to develop more precise models. Nevertheless, from this model, we can predict that SWCount should not be increased.
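A sketch of this estimate in code (the choice of the last non-stop lemma as the "main" lemma is our assumption for the illustration):

```python
NSW_FACTOR = 4.5  # size ratio of an (ID, P, NSW) posting to an (ID, P) posting

def ppg(stop_fls, other_fls):
    """Planned performance gain PPG(q) for a Q2 query, following the
    formula above. stop_fls holds the FL-numbers of the stop lemmas
    (their posting lists are not read); other_fls holds the remaining
    FL-numbers, the last one being the main lemma, whose posting list
    is read together with its NSW records."""
    numerator = sum(1.0 / q for q in stop_fls + other_fls)
    denominator = (sum(1.0 / q for q in other_fls)
                   + (1.0 / other_fls[-1]) * (NSW_FACTOR - 1))
    return numerator / denominator

# "History of Physicians in America" with the FL-numbers from Sect. 5.3:
print(ppg([4, 14], [598, 1391, 1760]))  # roughly 66
```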
The foregoing results need a more detailed examination. Let us consider a search query that consists of some set of lemmas. Let the Min-FL-number be the minimum FL-number among all lemmas of the query. A lower FL-number corresponds to a more frequently occurring lemma. If the Min-FL-number of a query is a small number, then the query can induce performance problems because the query contains some high-frequency lemma.

We then divide the entire query set into subsets based on the Min-FL-numbers of the queries. We select 100 as the division step. In the first subset, we include all queries with Min-FL-numbers from 0 to 99; in the second subset, we include all queries with Min-FL-numbers from 100 to 199; and so on. In the following diagrams, we consider the first 21 subsets.

Fig. 3. The average query execution times for Idx0 (seconds); the query set is divided based on the Min-FL-number with a step of 100.

In Fig. 3, the average query execution time for Idx0 in seconds is displayed for every subset. The first bar is significantly larger than the other bars, showing that the first subset induces some performance problems. The first 8 subsets have an average query execution time of more than two seconds.

Fig. 4. The average query execution times for Idx5 (seconds); the query set is divided based on the Min-FL-number with a step of 100.

In Fig. 4, the average query execution time for Idx5 in seconds is displayed for every subset. For the first subset, the average query execution time is about two seconds. For each following subset, the average query execution time is less than two seconds.

Fig. 5. The improvement factor for Idx5 in comparison with Idx0 (times); the query set is divided based on the Min-FL-number with a step of 100.

In Fig. 5, we show the improvement factor for Idx5 in comparison with Idx0. The first bar value is 33, which means that the average query execution time for the first subset is improved by a factor of 33. We also see how the performance changes when the Min-FL-number of a query crosses the SWCount value of 500. Based on this diagram, we can propose that the value of SWCount can be lowered to 100. Consequently, we created another index, Idx5/SW100, with SWCount = 100 and FUCount = 1450.

Fig. 6. The average query execution times for Idx5/SW100 (seconds); the query set is divided based on the Min-FL-number with a step of 100.

In Fig. 6, the average query execution time for Idx5/SW100 in seconds is displayed for every subset. For each subset, the average query execution time is less than two seconds. The results look more promising than those for Idx5.

However, let us consider the Q1 queries for the case in which the value of SWCount is 500, that is, the queries that consist only of lemmas with FL-number < 500. There are 119 queries in this subset, as discussed above. It was established that for this subset, the average query search time with Idx5 is 0.8 sec., but with Idx5/SW100, it is 6.7 sec., which is significantly larger. When this subset is evaluated with Idx5, three-component key indexes are used.

Fig. 7. The average query execution times for Idx0, Idx5, and Idx5/SW100 (seconds); the query set Q1 is divided based on the Min-FL-number with a step of 100.

Let us divide the query set Q1 into five subsets based on the Min-FL-number values of the concrete queries. In Fig. 7, the average query execution times for Idx0, Idx5 and Idx5/SW100 in seconds are displayed for every subset. We see that the search in Idx5/SW100 works significantly more slowly than that in Idx5. Therefore, we come to the following conclusions for different kinds of queries, and consequently we propose a new index schema (a code sketch of the resulting query types follows the list).

T1. A query consists of extreme high-frequency lemmas and high-frequency lemmas; that is, all lemmas of the query are extreme high-frequency or high-frequency. E.g., for every lemma of the query, FL-number < 500. The (f, s, t) indexes work better for these queries than do other types of indexes, as shown by Fig. 7. The search in Idx5/SW100 is significantly slower than that in Idx5. When the search is performed using Idx5, the (f, s, t) indexes are used for every query in T1. When the search is performed using Idx5/SW100, the (f, s, t) indexes are used only if FL(f) < 100, FL(s) < 100, and FL(t) < 100.

T2. A query contains a low-frequency lemma (ordinary or frequently used). Additionally, the query may contain high-frequency lemmas, but for every query lemma, FL-number ≥ 100. The (w, v) indexes allow one to achieve a better performance improvement, as shown in Fig. 6. For the subsets starting from FL-number = 100, no (f, s, t) indexes or NSW records are used when we search using Idx5/SW100. Here, Idx5/SW100 is better than Idx5.

T3. A query contains a low-frequency lemma (ordinary or frequently used) and some extreme high-frequency lemmas with FL-number < 100. The (w, v) indexes and NSW records are required. The first bar in Figures 3, 4 and 6 supports this, and our previous experiments in [16] confirm it.
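A compact sketch of this classification (the thresholds use the example values EHFCount = 100 and EHFCount + HFCount = 500 from the next subsection):

```python
EHF_COUNT = 100  # extreme high-frequency boundary (example value)
HF_LIMIT = 500   # EHFCount + HFCount: high-frequency boundary (example value)

def query_type(fl_numbers):
    """Classify a query into T1, T2 or T3 by the FL-numbers of its lemmas."""
    if all(fl < HF_LIMIT for fl in fl_numbers):
        return "T1"  # only (extreme) high-frequency lemmas: (f, s, t) indexes
    if all(fl >= EHF_COUNT for fl in fl_numbers):
        return "T2"  # low-frequency lemma present, no extreme ones: (w, v)
    return "T3"      # low-frequency plus extreme high-frequency: (w, v) + NSW

print(query_type([9, 7, 305]))       # T1, e.g. "to be or not to be ..."
print(query_type([518, 704, 528]))   # T2, e.g. "california mountain pass"
print(query_type([4, 598, 1760]))    # T3, e.g. "History of Physicians ..."
```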
5.9 The New Index Schema

The original schema can be represented by the following rules:
1) (f, s, t) indexes, where FL(f), FL(s), FL(t) < SWCount.
2) (w, v) indexes, where SWCount ≤ FL(w) < SWCount + FUCount, and SWCount ≤ FL(v).
3) Traditional indexes (x) with NSW records, where SWCount ≤ FL(x); the NSW records contain information about all lemmas y with the condition FL(y) < SWCount that occur near lemma x in the text.

For the new schema, we use the following parameters with example values: EHFCount = 100 for extreme high-frequency lemmas, HFCount = 400 for high-frequency lemmas, and FUCount = 1050 for frequently used lemmas. We propose using the following indexes:
1) (f, s, t) indexes that can be used for T1 queries, where FL(f), FL(s), FL(t) < EHFCount + HFCount = 500.
2) (w, v) indexes that can be used for T2 and T3 queries, where 100 = EHFCount ≤ FL(w) < EHFCount + HFCount + FUCount = 1450, and 100 = EHFCount ≤ FL(v).
3) Traditional indexes (x) with NSW records, where 100 = EHFCount ≤ FL(x); the NSW records contain information about all lemmas y with the condition FL(y) < EHFCount = 100 that occur near lemma x in the text. These indexes can be used for T3 queries.

The concrete values of the parameters are provided only as an example and can be different for different languages and text collections.
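The rules of the new schema can be stated as predicates over FL-numbers; a sketch (our own code) with the example parameter values above:

```python
EHF_COUNT, HF_COUNT, FU_COUNT = 100, 400, 1050  # example values from the text

def fst_key(f, s, t):
    """New-schema (f, s, t) keys, used for T1 queries: FL(f) <= FL(s) <=
    FL(t) < EHFCount + HFCount (the arguments are FL-numbers)."""
    return f <= s <= t < EHF_COUNT + HF_COUNT

def wv_key(w, v):
    """New-schema (w, v) keys, used for T2 and T3 queries."""
    return (EHF_COUNT <= w < EHF_COUNT + HF_COUNT + FU_COUNT
            and EHF_COUNT <= v)

def has_nsw_info(x, y):
    """Traditional index with NSW records, used for T3 queries: a lemma x
    with FL(x) >= EHFCount stores NSW information about every nearby
    lemma y with FL(y) < EHFCount."""
    return x >= EHF_COUNT and y < EHF_COUNT
```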
6 Conclusions and Future Work

In this paper, we investigated how multi-component key indexes help to improve search performance. We used the well-known GOV2 text collection. We proposed a method of analyzing the search performance by considering different types of queries. By following this method, we found that the performance can be improved further and proposed a new index schema.

We analyzed how the value of MaxDistance affects the search performance. With an increase in the value of MaxDistance from 5 to 9, the average search time using multi-component key indexes increased from 1.51 sec. to 1.66 sec. Therefore, the value of MaxDistance can be increased even further, and the main limitations here are the disk space and the time of indexing. Our multi-component indexes are relatively large for large values of MaxDistance. However, large hard disk drives are now available. In many cases, it would be preferable to spend several TB of disk space to increase the search speed by a factor of 20 or more.

We found that multi-component key indexes work significantly better on text collections with large documents (e.g., documents with sizes of approximately several hundred KB or more) than on text collections that consist of small documents; thus, an improvement factor of 20 can be considered a minimum improvement factor. In the future, it is important to consider how different compression technologies can reduce the total index size, which will increase the search speed even more.

The proposed indexes with multi-component keys have one limitation. If we have a document that contains the queried words and the distance between these words is greater than MaxDistance, then this document can be absent from the search results. This is usually not a problem. First, if the average document size in the text collection is relatively large, e.g., several hundreds of kilobytes, then after the proximity search with multi-component key indexes, we can run a search without a distance restriction; while the former requires the word-level index, the latter needs only the document-level index and works significantly faster. Second, in the majority of modern relevance models, the weight of the document is defined as inversely proportional to the square of the distance between the queried words in the document [19]. With a relatively large value of MaxDistance, we can be sure that all relevant documents will occur in the search results. While the first consideration is questionable for the GOV2 collection, because its documents are small, the second is still valid.

References

1. Anh, V.N., de Kretser, O., Moffat, A.: Vector-space ranking with effective early termination. In: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 35–42. SIGIR '01, ACM, New York, NY, USA (2001). https://doi.org/10.1145/383952.383957
2. Borodin, A., Mirvoda, S., Porshnev, S., Ponomareva, O.: Improving generalized inverted index lock wait times. Journal of Physics: Conference Series 944(1), 012022 (Jan 2018). https://doi.org/10.1088/1742-6596/944/1/012022
3. Broschart, A., Schenkel, R.: High-performance processing of text queries with tunable pruned term and term pair indexes. ACM Trans. Inf. Syst. 30(1) (Mar 2012). https://doi.org/10.1145/2094072.2094077
4. Büttcher, S., Clarke, C., Soboroff, I.: The TREC 2006 Terabyte Track. In: Proceedings of the Fifteenth Text REtrieval Conference, TREC 2006. pp. 128–141 (2006)
5. Crane, M., Culpepper, J.S., Lin, J., Mackenzie, J., Trotman, A.: A comparison of document-at-a-time and score-at-a-time query evaluation. In: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. pp. 201–210. WSDM '17, Association for Computing Machinery, New York, NY, USA (2017). https://doi.org/10.1145/3018661.3018726
6. Daoud, C.M., Silva de Moura, E., Carvalho, A., Soares da Silva, A., Fernandes, D., Rossi, C.: Fast top-k preserving query processing using two-tier indexes. Inf. Process. Manage. 52(5), 855–872 (Sep 2016). https://doi.org/10.1016/j.ipm.2016.03.005
7. Fox, C.: A stop list for general text. SIGIR Forum 24(1-2), 19–21 (Sep 1989). https://doi.org/10.1145/378881.378888
8. Gall, M., Brost, G.: K-word proximity search on encrypted data. In: 30th International Conference on Advanced Information Networking and Applications Workshops (WAINA). pp. 365–372. IEEE (2016). https://doi.org/10.1109/WAINA.2016.104
9. Imran Rafique, M., Hassan, M.: Utilizing distinct terms for proximity and phrases in the document for better information retrieval. In: 2014 International Conference on Emerging Technologies (ICET). pp. 100–105. IEEE (2014). https://doi.org/10.1109/ICET.2014.7021024
10. Jiang, D., Leung, K.W.T., Yang, L., Ng, W.: TEII: Topic enhanced inverted index for top-k document retrieval. Know.-Based Syst. 89(C), 346–358 (Nov 2015). https://doi.org/10.1016/j.knosys.2015.07.014
11. Lin, J., Trotman, A.: Anytime ranking for impact-ordered indexes. In: Proceedings of the 2015 International Conference on The Theory of Information Retrieval. pp. 301–304. ICTIR '15, Association for Computing Machinery, New York, NY, USA (2015). https://doi.org/10.1145/2808194.2809477
12. Lu, X., Moffat, A., Culpepper, J.S.: Efficient and effective higher order proximity modeling. In: Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval. pp. 21–30. ICTIR '16, ACM, New York, NY, USA (2016). https://doi.org/10.1145/2970398.2970404
13. Miller, R.B.: Response time in man-computer conversational transactions. In: Proceedings of the December 9-11, 1968, Fall Joint Computer Conference, Part I. pp. 267–277. AFIPS '68 (Fall, part I), ACM, USA (1968). https://doi.org/10.1145/1476589.1476628
14. Petri, M., Moffat, A.: On the cost of phrase-based ranking.
In: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 931–934. SIGIR '15, Association for Computing Machinery, New York, NY, USA (2015). https://doi.org/10.1145/2766462.2767769
15. Sadakane, K., Imai, H.: Fast algorithms for k-word proximity search. In: IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences. vol. 84, pp. 2311–2318 (2001)
16. Veretennikov, A.B.: Proximity full-text search by means of additional indexes with multi-component keys: In pursuit of optimal performance. In: Manolopoulos, Y., Stupnikov, S. (eds) Data Analytics and Management in Data Intensive Domains. DAMDID/RCDL 2018. Communications in Computer and Information Science, vol. 1003, pp. 111–130. Springer (2019). https://doi.org/10.1007/978-3-030-23584-0_7
17. Veretennikov, A.B.: Proximity full-text search with a response time guarantee by means of additional indexes. In: Arai, K., Kapoor, S., Bhatia, R. (eds) Intelligent Systems and Applications. IntelliSys 2018. Advances in Intelligent Systems and Computing, vol. 868, pp. 936–954. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-01054-6_66
18. Williams, H.E., Zobel, J., Bahle, D.: Fast phrase querying with combined indexes. ACM Trans. Inf. Syst. 22(4), 573–594 (Oct 2004). https://doi.org/10.1145/1028099.1028102
19. Yan, H., Shi, S., Zhang, F., Suel, T., Wen, J.R.: Efficient term proximity search with term-pair indexes. In: Proceedings of the 19th ACM International Conference on Information and Knowledge Management. pp. 1229–1238. CIKM '10, ACM, New York, NY, USA (2010). https://doi.org/10.1145/1871437.1871593
20. Zipf, G.K.: Selected Studies of the Principle of Relative Frequency in Language. Harvard University Press (1932)
21. Zobel, J., Moffat, A.: Inverted files for text search engines. ACM Comput. Surv. 38(2) (Jul 2006). https://doi.org/10.1145/1132956.1132959