<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>TOIS</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Selective Query Processing: A Risk-Sensitive Approach - Abstract</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Josiane Mothe</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Md Zia Ullah</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>INSPE</institution>
          ,
          <addr-line>UT2J, Univ. de Toulouse, IRIT UMR5505 CNRS, Toulouse</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Univ. de Toulouse, IRIT UMR5505 CNRS</institution>
          ,
          <addr-line>Toulouse</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>37</volume>
      <issue>2018</issue>
      <fpage>04</fpage>
      <lpage>07</lpage>
      <abstract>
        <p>While search engines apply a single optimised search strategy to any user query, selective search, like selective query expansion, aims to apply an adapted search strategy to each query. Search phases include query expansion, search-weighting model, and document ranking. A search strategy is defined by the combination of components and their hyperparameters in these phases. The number of possible search strategies is huge. In this paper, we describe a risk-sensitive approach to optimise the set of search strategies that should be included in a selective search approach. It solves the problem of which and how many search strategies to include in the system. We found that using 20 search strategies is an appropriate trade-off between effectiveness and system complexity. The significant effectiveness improvement is about 23% when compared to the L2R documents baseline and about 10% when compared to other selective approaches. This paper is an extended abstract of our paper at CIKM 2021 1.</p>
      </abstract>
      <kwd-group>
        <kwd>Information Systems</kwd>
        <kwd>Information Retrieval</kwd>
        <kwd>Models and ranking</kwd>
        <kwd>System effectiveness</kwd>
        <kwd>Adaptive information retrieval</kwd>
        <kwd>Query processing</kwd>
        <kwd>Per-query processing</kwd>
        <kwd>Query driven parameterisation</kwd>
        <kwd>Search engine parameters</kwd>
        <kwd>System selection</kwd>
        <kwd>Risk sensitive systems</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>• Search strategy pool: The system has a pool of search strategies it can choose from;
• Selective search strategy: The system selects the search strategy from the pool to be applied
to the current query using the "best fit" principle (a minimal sketch follows this list).</p>
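      <p>The sketch below pairs a pool of strategies with a per-query selector; the strategy names and the toy selection rule are hypothetical, not the system described in the paper.</p>
      <preformat>
# Minimal sketch of a selective search meta-system (hypothetical names):
# a pool of search strategies and a per-query selector that applies the
# "best fit" strategy to each incoming query.

def bm25_strategy(query):
    return f"BM25 ranking for: {query}"            # placeholder for a real run

def bm25_rm3_strategy(query):
    return f"BM25+RM3 ranking for: {query}"        # placeholder for a real run

STRATEGY_POOL = {
    "bm25": bm25_strategy,
    "bm25_rm3": bm25_rm3_strategy,
}

def select_strategy(query):
    """Toy selector: expand long queries only (stand-in for a learned model)."""
    return "bm25_rm3" if len(query.split()) > 3 else "bm25"

def search(query):
    return STRATEGY_POOL[select_strategy(query)](query)

print(search("global warming"))
print(search("effects of global warming on coral reefs"))
      </preformat>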
      <p>In related work, either the search strategy pool is limited from 2 to 8 search strategies, or it
contains from 20,000 to 80,000 search strategies. In the latter case, to be applicable in real world
systems, the pool needs to be restricted.</p>
      <p>The main purpose of this paper is to limit the pool of search strategies to the most
appropriate ones. The risk-sensitive approach we developed in this paper aims to define the
search strategies, and their number, to be considered in a selective search approach. This risk-sensitive
approach is grounded on Wang et al.'s risk-reward trade-off measure for learning to rank documents 2.
Its purpose is to decide on the document ranking for a given query. The risk here is defined
as "the risk [for the system] of performing a given particular query less effectively than a
given baseline system" 3.</p>
      <p>We have adapted this measure and defined Risk(s) as a function for selecting candidate search
strategies. This function measures the risk associated with selecting the search strategy s for a
given query rather than a reference search strategy s_ref. The risk relates to the effectiveness of
s_ref being greater than that of s.</p>
      <p>The risk function Risk(s) accumulates the risk relative to queries in terms of effectiveness
for the training query set. Risk(s) is defined in Eq. (1), where Q is the training query
set consisting of |Q| queries, M(q, s_ref) is the performance (effectiveness) of the reference search
strategy s_ref for the query q, and s is a search strategy from the initial pool. Risk(s) hence
accumulates the loss in effectiveness in relation to queries when the search strategy s is selected
whereas the reference search strategy s_ref would have been the better choice: it corresponds to
the maximum possible risk.</p>
      <p>Risk(s) = (1/|Q|) Σ_{q ∈ Q} max(0, M(q, s_ref) − M(q, s))   (1)</p>
      <p>In addition to the risk function, we have defined the corresponding reward function. It is
based on the potential increase in overall effectiveness (Eq. 2) using the search strategy s. The
reward function is defined as follows:</p>
      <p>Reward(s) = (1/|Q|) Σ_{q ∈ Q} max(0, M(q, s) − M(q, s_ref))   (2)</p>
      <p>The reward function aggregates the improvement in effectiveness that would be obtained if the
system only used s for the queries on which s performs better than the reference search strategy
s_ref.</p>
      <p>We adapted the formulas of Eqs. (1) and (2) to fit the problem of selecting a set of search strategies
to keep.</p>
      <p>2 L. Wang, P. N. Bennett, K. Collins-Thompson, Robust ranking models via risk-sensitive optimization, in: SIGIR,
2012, pp. 761–770.</p>
      <p>3 B. T. Dinçer, C. Macdonald, I. Ounis, Risk-sensitive evaluation and learning to rank using multiple baselines, in:
SIGIR, 2016, pp. 483–492.</p>
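      <p>A minimal Python sketch of Eqs. (1) and (2) follows; it assumes the per-query effectiveness scores M(q, s) are available in a nested dictionary, and the variable names and toy values are illustrative only.</p>
      <preformat>
# Risk (Eq. 1) and reward (Eq. 2) of a candidate strategy against a
# reference strategy, assuming effectiveness[q][s] holds M(q, s)
# (e.g. nDCG@10) for each training query q. Toy values for illustration.

effectiveness = {
    "q1": {"ref": 0.40, "s1": 0.55, "s2": 0.30},
    "q2": {"ref": 0.60, "s1": 0.35, "s2": 0.70},
}

def risk(strategy, reference, effectiveness):
    """Eq. (1): average loss of the strategy on queries where the reference wins."""
    return sum(max(0.0, scores[reference] - scores[strategy])
               for scores in effectiveness.values()) / len(effectiveness)

def reward(strategy, reference, effectiveness):
    """Eq. (2): average gain of the strategy on queries where it beats the reference."""
    return sum(max(0.0, scores[strategy] - scores[reference])
               for scores in effectiveness.values()) / len(effectiveness)

print(risk("s1", "ref", effectiveness))    # (0 + 0.25) / 2 = 0.125
print(reward("s1", "ref", effectiveness))  # (0.15 + 0) / 2 = 0.075
      </preformat>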
      <p>Let ℛ be the entire initial search strategy pool and S_{k−1} be the set of search strategies that
have already been selected at step k, with an initial S_0 = {s_ref}, which is the point of reference,
or first selected, search strategy. Q is the set of training queries and M(q, s) denotes the
retrieval effectiveness (e.g., nDCG@10) for the query q ∈ Q processed by the search strategy s.</p>
      <p>Given ℛ and S_{k−1}, we define the risk of selecting the new search strategy s ∈ ℛ ∖ S_{k−1} to
be added to S_{k−1} at step k using Eq. (1) in terms of effectiveness as follows:</p>
      <p>Risk(s, S_{k−1}) = (1/|Q|) Σ_{q ∈ Q} max(0, max_{s′ ∈ S_{k−1}} M(q, s′) − M(q, s))   (3)</p>
      <p>where max_{s′ ∈ S_{k−1}} M(q, s′) is the maximum effectiveness for the query q in relation to the set of
search strategies that have already been selected in S_{k−1}. In Eq. (3), the risk of adding the search
strategy s is measured as the cumulative decrease in effectiveness which the meta-system can
incur if it chooses s rather than the best search strategy in S_{k−1} for each of the training
queries: it therefore adapts Eq. (1).</p>
      <p>Likewise, we define the reward function using Eq. (2) in terms of effectiveness as follows:</p>
      <p>Reward(s, S_{k−1}) = (1/|Q|) Σ_{q ∈ Q} max(0, M(q, s) − max_{s′ ∈ S_{k−1}} M(q, s′))   (4)</p>
      <p>The overall gain for the search strategy s ∈ ℛ ∖ S_{k−1} in relation to the set of training queries
and already selected search strategies S_{k−1} is defined as:</p>
      <p>Gain(s, S_{k−1}) = Reward(s, S_{k−1}) − (1 + α) · Risk(s, S_{k−1})   (5)</p>
      <p>where the functions Risk(s, S_{k−1}) and Reward(s, S_{k−1}) refer to the effectiveness-based
Eqs. (3) and (4), respectively. The parameter α ≥ 0 is a risk-sensitivity parameter that controls the trade-off
between risk and reward. In our case, we set α to 0 to weight risk and reward equally. We keep
a statistical analysis of this risk-sensitivity parameter for future work.</p>
      <p>Finally, at step k we select the search strategy s* which maximises the overall gain according
to the following equation:</p>
      <p>s* = argmax_{s ∈ ℛ ∖ S_{k−1}} Gain(s, S_{k−1})   (6)</p>
      <p>We then update S_{k−1} as follows:</p>
      <p>S_k = S_{k−1} ∪ {s*}   (7)</p>
      <p>where S_k is the set of k risk-sensitive search strategies selected for the set of training queries Q.</p>
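      <p>The selection loop of Eqs. (3)–(7) can be sketched as the following greedy procedure; the function and variable names are illustrative, the code reuses the effectiveness[q][s] table from the earlier sketch, and α defaults to 0 as in the paper.</p>
      <preformat>
# Greedy selection of k search strategies (Eqs. 3-7), assuming
# effectiveness[q][s] = M(q, s) for every query q and every strategy s
# in the initial pool. Names are illustrative, alpha = 0 by default.

def risk_step(s, selected, effectiveness):
    """Eq. (3): average loss of s vs. the best already-selected strategy."""
    return sum(max(0.0, max(scores[p] for p in selected) - scores[s])
               for scores in effectiveness.values()) / len(effectiveness)

def reward_step(s, selected, effectiveness):
    """Eq. (4): average gain of s over the best already-selected strategy."""
    return sum(max(0.0, scores[s] - max(scores[p] for p in selected))
               for scores in effectiveness.values()) / len(effectiveness)

def gain(s, selected, effectiveness, alpha=0.0):
    """Eq. (5): reward minus (1 + alpha) times risk."""
    return (reward_step(s, selected, effectiveness)
            - (1.0 + alpha) * risk_step(s, selected, effectiveness))

def select_strategies(pool, reference, effectiveness, k, alpha=0.0):
    """Eqs. (6)-(7): start from the reference, greedily add the best strategy."""
    selected = [reference]
    for _ in range(k):
        candidates = [s for s in pool if s not in selected]
        if not candidates:
            break
        best = max(candidates, key=lambda s: gain(s, selected, effectiveness, alpha))
        selected.append(best)
    return selected

# Example with a toy pool and the effectiveness table sketched earlier:
# select_strategies(["ref", "s1", "s2"], "ref", effectiveness, k=2)
      </preformat>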
      <p>The risk-sensitive selection criterion we propose is generic enough to be applied to any selective
search strategy approach.</p>
      <p>For the selective search strategy part, we use learning to rank (L2R) algorithms to rank the
search strategies, as suggested in Deveaud et al. 4. The principle is thus to train a ranking model
h(q, s) = f(x_{q,s}) to assign a score to a given query-search strategy pair (q, s), i.e., a given
feature vector x_{q,s}. More generally, the ranking model can rank all the search strategies for
a given query q. In this case, the ranking model is h(q, S) = f(X_q), where S is the set of
search strategies and X_q = (x_{q,1}, x_{q,2}, · · · , x_{q,|S|}) is the set of feature vectors for the query-search
strategy pairs. Like the L2R documents model, the ranking model h(q, S) is learned from the
training data by minimising the loss function ℒ(h; X_q, y_q), where y_q are the effectiveness labels.</p>
      <p>4 R. Deveaud, J. Mothe, M. Z. Ullah, J.-Y. Nie, Learning to adaptively rank document retrieval system configurations,
ACM Transactions on Information Systems 37 (1), 2018.</p>
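      <p>As an illustration of this step, the sketch below trains a simple pointwise ranker with scikit-learn to score query-search strategy feature vectors and rank the strategies for one query; the library choice, array shapes, and random data are assumptions, not the paper's actual L2R setup.</p>
      <preformat>
# Pointwise stand-in for the L2R step: learn f(x_{q,s}) -> predicted
# effectiveness, then rank a query's candidate strategies by that score.
# Random features and labels are placeholders for the real training data.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 46))   # one LETOR-style vector per (query, strategy) pair
y_train = rng.uniform(size=500)        # label: effectiveness (e.g. nDCG@10) of the pair

model = GradientBoostingRegressor().fit(X_train, y_train)

def rank_strategies(model, query_strategy_features):
    """Rank a query's candidate strategies by predicted effectiveness."""
    scores = model.predict(query_strategy_features)    # one score per strategy
    return np.argsort(scores)[::-1]                    # indices, best strategy first

X_query = rng.normal(size=(20, 46))    # feature vectors of 20 candidate strategies
print(rank_strategies(model, X_query)[:3])
      </preformat>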
      <p>To evaluate our contributions, we considered three standard TREC collections from the Adhoc
tasks: TREC78 (about 500K newspaper articles), WT10G (1.6 million Web/blog page documents),
and GOV2 (25 million web pages).</p>
      <p>Search strategies were built by varying the IR components and some of their hyperparameters.
We considered the term weighting model and automatic query expansion components, for
which we considered different variants from the literature. We also considered different values
of the hyperparameter related to query expansion. This results in more than 20,000 search
strategies, which were the input for the E-risk selection model used to reduce this set based on
training queries.</p>
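      <p>To give a feel for how such a pool grows combinatorially, the sketch below crosses a few components and hyperparameter values; the lists of weighting models, expansion models, and values are illustrative, not the exact grid used in the paper.</p>
      <preformat>
# Sketch of generating a pool of search strategies by crossing IR
# components and hyperparameters. The component lists are illustrative.
from itertools import product

weighting_models = ["BM25", "PL2", "DirichletLM", "LGD", "TF_IDF"]
expansion_models = ["Bo1", "Bo2", "KL", "RM3"]
expansion_terms = [5, 10, 20, 50]      # number of expansion terms
expansion_docs = [3, 5, 10, 20]        # number of feedback documents

# Strategies without query expansion, plus every expansion combination.
pool = [{"weighting": wm, "expansion": None} for wm in weighting_models]
pool += [{"weighting": wm, "expansion": qe, "terms": t, "docs": d}
         for wm, qe, t, d in product(weighting_models, expansion_models,
                                     expansion_terms, expansion_docs)]

print(len(pool))  # 5 + 5*4*4*4 = 325; richer grids quickly reach tens of thousands
      </preformat>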
      <p>Query-search strategy training examples follow a vector-based representation: the features
x_{q,s} depend on the query (q), the search strategy (s), and a label. We opted for LETOR features
that have been used successfully for document ranking models. We calculated the LETOR features
directly from an initial search that we performed using a reference system (BM25). Finally, the
query-search strategy vectors were labelled by the effectiveness of the search strategy when
treating that query.</p>
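      <p>A sketch of how such training examples might be assembled is given below; letor_features() and ndcg_at_10() are placeholders for the paper's LETOR feature extraction over a reference BM25 run and for the effectiveness labels, not real APIs.</p>
      <preformat>
# Sketch of building (query, strategy) training examples: query features
# from a reference BM25 run, combined with a strategy identifier and
# labelled by that strategy's effectiveness on the query. Placeholders only.

def letor_features(query):
    """Placeholder: LETOR-style features computed from an initial BM25 run."""
    return [float(len(query.split())), 0.0, 0.0]

def ndcg_at_10(query, strategy):
    """Placeholder: effectiveness of the strategy on this training query."""
    return 0.5

def build_examples(queries, pool):
    examples = []
    for q in queries:
        q_feats = letor_features(q)
        for i, strategy in enumerate(pool):
            x = q_feats + [float(i)]        # query features + strategy identifier
            y = ndcg_at_10(q, strategy)     # label: effectiveness of the pair
            examples.append((x, y))
    return examples

examples = build_examples(["global warming", "airbus subsidies"], ["BM25", "BM25+RM3"])
print(len(examples))  # one training vector per query-strategy pair
      </preformat>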
      <p>We varied the number of search strategies to select and found that 20 is appropriate for real-world
environments (see Figure 1).</p>
      <p>Table 1 shows additional results on the GOV2 TREC adhoc collection. The first block shows
baselines that use a single search strategy for all queries; the second block shows trained
selective search strategy (SelSS) approaches, including ours (with 20 search strategies); the last block shows
oracles. The first oracle is the Best conf. row, where the best configuration of the pool is selected
a posteriori. The second line in this block is when, for each query, we select the best configuration
for that specific query; this is also an a posteriori selection. Effectiveness is reported in absolute value,
averaged over 3 draws, with the standard deviation in square brackets. The best values (excluding
the oracles) are in bold font. △ (resp. ↑) indicates a statistically significant improvement compared to
the L2R documents baseline (resp. Deveaud et al.), two-tailed paired t-test (p &lt; 0.05). Results on the
two other TREC collections were consistent.</p>
      <p>From these results, we can see that our method is better than any single query processing
strategy (i.e. system configuration), even if we could select it automatically (which is not the
case in real life). Our strategy is also better than selective query expansion, where the query
processing differs from one query to another but with just two possible choices. It is also
slightly but statistically significantly better than using the 20,000 configurations in the initial pool. This
is the most interesting and original result since it shows that we need a certain number of
query processing strategies to be effective, but not necessarily a huge number, which makes the
approach more realistic and feasible in real IR systems.</p>
      <p>The method we present here uses 20 search strategies that the system learned to choose
according to query features. This risk-sensitive selective search is effective in increasing overall
effectiveness. The significant effectiveness improvement is about 23% when compared to the L2R
documents baseline and about 10% when compared to other selective approaches on 3 adhoc TREC
collections. Selective query expansion used only two search strategies, which limits the system's
options. On the other hand, in another study, we used 20,000 search strategies, which limits its
practical usability. We show that the E-risk approach we presented is more appropriate, both
to provide enough options for the system to choose among and to keep the approach manageable in terms
of maintenance. This paper is an extended abstract of Mothe and Ullah's CIKM 2021 paper 1.
Moreover, this research is patented 5.</p>
      <p>5 EP3771996A1: Information retrieval device and method using set of search configurations pre-selected
using efficiency and risk functions. https://worldwide.espacenet.com/patent/search/family/067956648/publication/
EP3771996A1?q=19305984.7 Josiane Mothe &amp; Md Zia Ullah</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>