What Makes a Query Semantically Hard?
Guglielmo Faggioli1 , Stefano Marchesin1
1 University of Padova, Padova, Italy


Abstract

Traditional Information Retrieval (IR) models, also known as lexical models, are hindered by the semantic gap, which refers to the mismatch between different representations of the same underlying concept. To address this gap, semantic models have been developed. Semantic and lexical models exploit complementary signals that are best suited for different types of queries. For this reason, these model categories should not be used interchangeably, but should rather be properly alternated depending on the query. Therefore, it is important to identify queries where the semantic gap is prominent and thus semantic models prove effective. In this work, we quantify the impact of using semantic or lexical models on different queries, and we show that the interaction between queries and model categories is large. Then, we propose a labeling strategy to classify queries into semantically hard or easy, and we deploy a prototype classifier to discriminate between them.


1. Introduction

The semantic gap is a long-standing problem in Information Retrieval (IR) that refers to the difference between the machine-level description of document and query contents and the human-level interpretation of their meanings [1]. In other words, it represents the mismatch between users' queries and the way retrieval models understand such queries [2].

The semantic gap affects any domain, but it is prominent in medical search [3, 4, 1]. Within biomedical literature, the large presence of (quasi-)synonymous and polysemous terms – along with the use of acronyms and terminological variants – represents a critical challenge for retrieval models. For instance, a query containing the word "tumor" might not be effectively answered if the retrieval model does not identify the synonymy relationship between "tumor" and, for example, "neoplasm". Besides, given a query containing the term "cold", a retrieval model might retrieve erroneous documents if it does not distinguish between the different meanings the term "cold" assumes depending on the context, such as "common cold", "cold temperature", or even "Chronic Obstructive Lung Disease". These queries are known as semantically hard queries [5].

Traditional IR models, known as lexical models because they compute the relevance score using heuristics defined over the lexical overlap between queries and documents, fail to effectively address semantically hard queries. Semantic models were thus introduced to bridge the semantic gap [6] and to overcome the limitations of lexical models. However, semantic models have been shown to provide complementary signals to lexical models: signals that prove effective for semantically hard queries, but less so for other queries [7]. Thus, it becomes necessary to identify which category of models – lexical or semantic – best suits a user query given the document collection at hand. In other words, we need to understand which inherent features of queries and documents make lexical or semantic models more effective.

To this end, we address the following research questions:

RQ1 How and to what extent does the semantic gap impact query performance?

RQ2 What features determine the prominence of the semantic gap within queries?

For RQ1, we investigate and compare the impact of lexical and semantic models on different topics. How large is the interaction between topics and model categories? To what extent does this interaction reflect in the different topic formulations (i.e., queries)?

For RQ2, we explore different sets of well-known features that relate to lexical and semantic models. In particular, we seek to understand whether pre-retrieval features – based on corpus statistics or on synonymy/polysemy aspects – can be used to categorize queries as semantically easy or hard. In other words, how effective are well-known pre-retrieval features for category selection?

To address the research questions, we first perform statistical analyses quantifying the interaction between topics, queries, and the lexical and semantic categories using ANalysis Of VAriance (ANOVA) [8]. Based on the outcomes of the statistical analyses, we propose a labeling strategy to categorize queries into semantically easy or hard. The labeled queries are used to train a category selector. The selector serves as a proxy to evaluate the effectiveness of the considered pre-retrieval features in determining the prominence of the semantic gap within queries.
We conduct an experimental evaluation on two test collections for ad hoc medical retrieval: OHSUMED [9] and TREC-COVID (Round 1) [10]. For lexical models, we adopt standard state-of-the-art retrieval models. Regarding semantic models, we focus on first-stage semantic models, which are best suited to tackle the semantic gap [11]. In particular, we consider unsupervised first-stage semantic models, which have been shown to be competitive with lexical models on medical collections [5]. Besides, unsupervised semantic models rely on textual signals only – and not on relevance signals – thus allowing us to focus exclusively on semantic and lexical features.

The results of the experimental evaluation show that topics, queries, and model categories strongly interact in determining retrieval effectiveness. This evidence further highlights the need to adopt the proper model category to improve retrieval performance. Therefore, identifying the right features to distinguish between semantically easy and hard queries becomes necessary in domains where the semantic gap is prominent – and this work poses the cornerstone in this direction.

The rest of the paper is organized as follows: Section 2 reports related work; Section 3 presents the experimental analysis; and Section 4 concludes the paper and outlines future directions.

2. Related Work

The problem addressed in this work relates to two topics in IR: Model Selection and Query Performance Prediction (QPP). Below, we review prominent approaches in these areas and highlight the differences with our work.

One of the first approaches to model selection in IR was developed by He and Ounis [12], who proposed a query-based pre-retrieval approach. In [12], the authors cluster queries according to pre-retrieval features and link the best performing model to each cluster. Then, given a new query, they assign it to the closest cluster and use the model associated with that cluster to perform retrieval. Balasubramanian and Allan [13] proposed a learning approach for query-dependent model selection. The selection framework relies on rank-time features – available to retrieval models during ranking – to select between two models. Model selection approaches based on rank-time features have been further explored by Balasubramanian in [14]. Beyond model selection, Levi et al. [15] addressed the problem of selective cluster retrieval [16, 17, 18], where the objective is to decide, on a per-query basis, whether to apply cluster-based retrieval or standard document retrieval. In [15], the authors proposed different sets of features based on cluster-based rankers, query performance predictors, and cluster properties. The different sets of features are then used to decide between cluster-based and standard document retrieval.

Compared to the approaches reviewed for model selection, in this work we want to understand whether queries can be categorized as semantically easy or hard. In other words, we want to determine which model category – lexical or semantic – is best suited on a per-query basis. In this sense, our work shows similarities with that of Levi et al. [15], where the objective is to select the most effective approach between cluster-based and document-based retrieval given the query. However, we refrain from using rank-time or post-retrieval features in our analyses, as we want to keep the approach model-agnostic – and thus less dependent on the specific set of considered retrieval models.

QPP techniques are traditionally divided into pre-retrieval and post-retrieval. Pre-retrieval techniques [19, 20, 21, 22, 23] exploit the distribution of the query terms within the collection, providing coarse-grained information on the expected performance of a given query. On the other hand, post-retrieval techniques [24, 25, 26] leverage the information on the retrieval scores assigned by the retrieval model. Such techniques tend to perform better than pre-retrieval QPP [27], but are dependent on the considered models.

The typical task for a QPP model is ranking queries based on their expected performance [27]. Thus, QPP techniques cannot be directly applied to category selection. Nevertheless, the signals provided by QPP models can be used as input features for such a task. In this work, we want to identify a query as semantically easy or hard regardless of the retrieval model considered. Thus, we focus on pre-retrieval approaches and we adopt two types of features in our analyses: lexical- and semantic-oriented features. Regarding lexical-oriented features, we consider features proposed by He and Ounis [23] and by Zhao et al. [22]. He and Ounis [23] explore the possibility of using the distribution of the Inverse Document Frequency (IDF) over the query terms to determine the ability of lexical models to retrieve relevant documents. Similarly, Zhao et al. [22] propose a re-weighting schema based on IDF, called Similarity between Collection and Query (SCQ). As for semantic-oriented features, we adopt features similar to those proposed by Mothe and Tanguy [21], who consider linguistic aspects – such as synonymy and polysemy – linked to the query terms. Compared to [21], however, we consider signals from both the query and its interaction with documents.

3. Experimental Analysis

We consider two collections in the following analyses: OHSUMED [9] and TREC-COVID (Round 1) [10].
OHSUMED contains 349K documents and 63 topics. Topics in OHSUMED have two fields: title and description. We use the description as topic formulation, since the title field poorly describes the underlying information need. TREC-COVID (Round 1) has 30 topics and relies on the CORD-19 corpus [28], which includes around 51K papers. Each topic in TREC-COVID has three fields: a short keyword query, a description, and a narrative. In our experiments, we consider each field as a different formulation of the topic. We also include the concatenation of the keyword query and the description. Thus, the total number of queries we consider for TREC-COVID is equal to 120.

Regarding lexical and semantic models, we consider five different models for each category. The lexical models used are: TF-IDF [29]; BM25 [30]; the Query Likelihood Model with Dirichlet Smoothing (QLM) [31]; Divergence From Randomness (DFR) [32]; and Divergence From Independence (DFI) [33]. All lexical models perform stopword removal and stemming. As for semantic models, we adopt: a Word2Vec-based approach [34] where query and document representations are built by summing up the IDF-weighted representations of the words contained in them [35, 36]; the Neural Vector Space Model (NVSM) [11]; and three variants of the Semantic-Aware neural Framework for IR (SAFIR) [5]. The three variants of SAFIR are SAFIRsp, which integrates both polysemy and synonymy, SAFIRp, which integrates polysemy but not synonymy, and SAFIRs, which integrates synonymy but not polysemy. All semantic models have been trained for 10 epochs with parameters set as in [5].

We evaluate models using Average Precision (AP) at cutoff 1000, obtaining an experimental Grid of Points (GoP) as defined in [37]. The performance of the retrieval models in terms of AP is reported in Table 1 for both the OHSUMED and TREC-COVID collections.

Table 1
Mean Average Precision (MAP) of the models on the OHSUMED and TREC-COVID collections. Model performance is comparable both within and across model categories.

    Model       OHSUMED    TREC-COVID
    Lexical
    TF-IDF        0.524       0.362
    BM25          0.620       0.488
    QLM           0.577       0.434
    DFR           0.641       0.496
    DFI           0.592       0.467
    Semantic
    Word2Vec      0.568       0.482
    NVSM          0.595       0.455
    SAFIRs        0.604       0.463
    SAFIRp        0.610       0.461
    SAFIRsp       0.612       0.466

3.1. RQ1: Topic and Category Interaction

Several works have shown that queries strongly interact with retrieval models in determining their performance [38, 39]. This means that two models might have similar average performance on a set of queries but, when examined at the query level, their performance might vary greatly. A similar consideration also applies to lexical and semantic models. Some queries are best suited to semantic models, while others to lexical ones [7, 5]. We are thus interested in quantifying such an effect. In other words, we want to evaluate the interaction between queries and model categories.

To determine whether the model category – that is, lexical or semantic – has a significant effect on performance, we conduct an ANOVA on the runs obtained with the considered retrieval models. ANOVA is a well-known statistical technique that allows identifying statistically significant differences among experimental conditions. Several works in IR applied ANOVA to determine the effect of different factors on the overall performance of an IR system [38, 40, 39, 41]. ANOVA models the explained variable, which in our case is AP, as a linear combination of the effect of each factor in the experimental setup, plus an error component. The error term accounts for the variance in the data not explained by the model.

In our analyses we first consider the following model:

    y_{ijk} = μ + τ_i + γ_j + α_{k(j)} + (τγ)_{ij} + ε_{ijk},    (MD1)

where y_{ijk} is the performance (measured using AP) observed on the i-th topic using the k-th model of the j-th category; μ is the grand mean over all the data; τ_i is the effect of the i-th topic; γ_j is the effect of the j-th category; α_{k(j)} is the effect of the k-th model inside the j-th category; (τγ)_{ij} is the interaction between the i-th topic and the j-th category; and ε_{ijk} is the prediction error. Note that the model factor is nested inside the category one. In the above-mentioned ANOVA model, an IR model is meaningful only in relation to its category. In other words, since we cannot consider, for instance, BM25 inside the "semantic" category, nor can we consider NVSM in the "lexical" one, we define the model factor as nested inside the category, and thus each model contributes only to the variance of its category.

For each ANOVA, we report the Sum of Squares (SS), the Degrees of Freedom (DF), the Mean Squares (MS), the F-statistic (F), the p-value, and the Strength of Association (SOA), using the ω̂² indicator. The SOA indicates the impact of each factor on the variability of the data. Typically, a factor with 0.01 ≤ ω² < 0.06 is considered small-sized, 0.06 ≤ ω² < 0.14 indicates a medium-sized effect, and ω² ≥ 0.14 a large-sized effect. Table 2 reports the results of the ANOVA on OHSUMED using the above-mentioned GoP of runs.
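To make the analysis concrete, the following is a minimal sketch of how an MD1-style ANOVA could be fitted with standard Python tooling; the DataFrame layout, the formula encoding of the nested model factor, and the ω² computation are our assumptions, not the authors' original analysis code.

```python
# Minimal sketch (not the authors' code): fitting MD1 with statsmodels, assuming
# a pandas DataFrame `gop` with one row per (topic, model) run and columns
# "topic", "category" (lexical/semantic), "model", and "ap".
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

def anova_md1(gop: pd.DataFrame) -> pd.DataFrame:
    # The model factor nested inside the category is expressed as C(category):C(model).
    formula = ("ap ~ C(topic) + C(category) + C(category):C(model)"
               " + C(topic):C(category)")
    lm = ols(formula, data=gop).fit()
    table = sm.stats.anova_lm(lm, typ=2)

    # Strength of Association (omega squared) per factor:
    # w2 = (SS_factor - df_factor * MS_error) / (SS_total + MS_error)
    ms_error = table.loc["Residual", "sum_sq"] / table.loc["Residual", "df"]
    ss_total = table["sum_sq"].sum()
    table["omega_sq"] = (table["sum_sq"] - table["df"] * ms_error) / (ss_total + ms_error)
    return table
```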
Table 2
ANOVA summary table on the runs for the OHSUMED collection. Observe the large interaction between the topic and category factors. ω̂² for non-significant factors is ill-defined and thus not reported.

    Source              SS        DF     MS      F        p-value    ω̂²⟨fact⟩
    Topic               19.740    62     0.318   79.831   <1e-4      0.886
    Category            0.007     1      0.007   1.805    0.1797     —
    Model(Category)     0.584     8      0.073   18.306   <1e-4      0.180
    Topic*Category      1.583     62     0.026   6.403    <1e-4      0.347
    Error               1.978     496    0.004
    Total               23.892    629

From the results in Table 2 we observe that the effect of the sole model category is not significant (p-value > 0.05), which means that the lexical and semantic categories are not statistically significantly different. In other words, we cannot say that either lexical or semantic models perform best in absolute terms. Nevertheless, the interaction between topic and category is significant and the ω² value indicates a large effect. This means that the category significantly impacts how good the results on a specific topic will be. Such a finding suggests that the semantic gap is an inherent property of the topics, less related to the specific retrieval models and more to their category. To further support this intuition, the interaction between the topic and the category is larger than the effect of the sole model. Thus, if we understand when a topic is lexical or semantic, we can achieve large performance improvements.

As for TREC-COVID, each topic is represented by four different formulations: the keyword query, the description, the narrative, and the concatenation of query and description. Each formulation of a topic can only be used in relation to that topic, and therefore the formulations have to be treated as a nested factor inside the topic. Therefore, we define a second ANOVA model, called MD2:

    y_{iljk} = μ + τ_i + φ_{l(i)} + γ_j + α_{k(j)} + (τγ)_{ij} + (φγ)_{l(i)j} + ε_{iljk},    (MD2)

which also includes φ_{l(i)}, the effect of the l-th formulation, nested inside the i-th topic, and (φγ)_{l(i)j}, the interaction between the l-th formulation of the i-th topic and the j-th category. Table 3 summarizes the ANOVA results obtained with MD2 on TREC-COVID.
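Before turning to those results, note that the nested formulation factor of MD2 can be expressed by extending the MD1 formula; the sketch below reuses the assumptions of the MD1 example above, with an additional hypothetical "query" column holding the topic formulation.

```python
# Sketch (same assumed DataFrame as the MD1 example, plus a "query" column with
# the formulation): MD2 nests the formulation inside the topic and adds its
# interaction with the category.
formula_md2 = ("ap ~ C(topic) + C(topic):C(query) + C(category)"
               " + C(category):C(model) + C(topic):C(category)"
               " + C(topic):C(query):C(category)")
# lm = ols(formula_md2, data=gop).fit()  # then anova_lm and omega^2 as before
```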
Table 3
ANOVA summary table on the runs for the TREC-COVID collection. Observe the high ω̂² effect for the topic*category interaction, which shows the importance of selecting the proper model category for each topic.

    Source                   SS        DF      MS      F         p-value    ω̂²⟨fact⟩
    Topic                    24.100    29      0.831   301.291   <1e-4      0.879
    Query(Topic)             15.568    90      0.173   62.712    <1e-4      0.822
    Category                 0.074     1       0.074   26.732    <1e-4      0.021
    Model(Category)          1.470     8       0.184   66.628    <1e-4      0.304
    Topic*Category           2.200     29      0.076   27.506    <1e-4      0.390
    Query(Topic)*Category    1.060     90      0.012   4.270     <1e-4      0.197
    Error                    2.626     952     0.003
    Total                    47.098    1199

From the results on TREC-COVID we observe that both the topic and its formulations have a large effect. The importance of the formulation factor indicates that, with an appropriate topic formulation, the performance on the topic can change greatly. Similar to what we observed in Table 2, the interaction between the topic and the model category is large (ω² = 0.390), larger than the effect of both the sole category and the model. Also the interaction between the topic formulation and the model category is large (ω² = 0.197), although not as large as the one between topic and category. This suggests that the semantic gap relates more to the underlying information need than to the different topic formulations.

Overall, we hypothesize that the relation between topics and model categories, highlighted by ANOVA, links to the semantic gap and to the association of a topic with its relevant documents. For instance, if a topic has many relevant documents containing synonyms of the query terms, then a semantic model might be best suited to perform retrieval. In fact, in this case, most of the topic formulations will not contain all the possible query synonyms and will thus be affected by the semantic gap. Conversely, topics that can be easily represented by few keywords – likely to be found within relevant documents – will have less ambiguous formulations, which are best suited to lexical models.

3.2. RQ2: Feature Importance for the Semantic Gap

Section 3.1 showed the impact of choosing the proper model category depending on the query at hand. If we could classify queries as semantically hard or easy, we might also adopt an IR model from the right category. To properly train a classifier capable of doing that, we need i) to label queries as "semantic" or "lexical", and ii) to find a set of features that correlate with such aspects of the queries. The next two paragraphs tackle these challenges.

Labeling queries  The first aspect we address is the labeling of queries as "semantic" or "lexical". The absence of a rigorous definition of semantically hard or easy queries prevents us from manually labeling queries as "semantic" or "lexical". In this regard, also the definition of "hard" topic is a debated aspect [42]. Therefore, we propose to label queries according to how the two model categories perform on them. To the best of our knowledge, this is the first automatic approach to address this problem.

To this end, we first compute the average performance of each model. Then, for each query, we perform the following three steps. Firstly, we compute for each model the relative improvement over its average performance. Secondly, we determine whether the relative improvement is, on average, greater for lexical or semantic models. Finally, we label the considered query as "semantic" if the improvement over the average model performance is greater for semantic models than for lexical ones; vice versa, we label the query as "lexical".

Note that we do not consider absolute performance to label queries, since even a poorly performing lexical method like TF-IDF (cf. Table 1) might prove effective when the query is semantically easy. Thus, we focus on relative improvements, which provide signals that are more robust to performance outliers.

Let S be the set of models and Q the set of queries. We call AP_s(q) the AP observed for the model s on the query q, and MAP_s(Q) and std_s(Q) respectively the MAP and the standard deviation of the AP observed for the model s over the queries Q. We define

    Z_{s,q} = (AP_s(q) − MAP_s(Q)) / std_s(Q)

as the relative improvement over the mean performance. By standardizing relative improvements, we account for the variability in model performances. Then, let S_s be the set of semantic models, and S_l the set of lexical models.

Definition 3.1. A query q is labeled as "semantic" iff

    (1/|S_s|) Σ_{s ∈ S_s} Z_{s,q}  >_α  (1/|S_l|) Σ_{s ∈ S_l} Z_{s,q},

where >_α, with α ∈ [0.5, 1), indicates that the mean relative improvement for semantic models is statistically significantly higher than that for lexical models at significance level α. Queries are labeled as "lexical" using the opposite ordering relation (<_α).

Therefore, using the above-mentioned definition we can label queries as either "semantic" or "lexical" at a specific level of α. In practice, given a query q, we call Z_{q,sem} = {Z_{s,q} ∀ s ∈ S_s} the set of relative improvements of the semantic models for q, and Z_{q,lex} = {Z_{s,q} ∀ s ∈ S_l} the set of relative improvements of the lexical models for q. Using an unpaired t-test, we determine whether Z_{q,sem} has a statistically significantly greater mean than Z_{q,lex}. If so, then q is labeled as "semantic". On the other hand, if Z_{q,lex} has a statistically significantly greater mean than Z_{q,sem}, then q is labeled as "lexical". Otherwise, q is labeled as "neutral".
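A minimal sketch of this labeling procedure follows; the data layout (a per-model dictionary of AP values) and the use of SciPy's unpaired t-test are our assumptions about one possible implementation, not the authors' original code.

```python
# Sketch of the labeling strategy: standardize each model's per-query AP against
# its own mean and standard deviation, then compare semantic vs. lexical relative
# improvements with an unpaired t-test. `ap` maps model name -> {query_id: AP}.
import numpy as np
from scipy.stats import ttest_ind

def label_query(q, ap, semantic_models, lexical_models, alpha=0.90):
    def z(model):
        scores = np.array(list(ap[model].values()))
        return (ap[model][q] - scores.mean()) / scores.std()

    z_sem = [z(m) for m in semantic_models]
    z_lex = [z(m) for m in lexical_models]
    _, p_value = ttest_ind(z_sem, z_lex)      # unpaired, two-sided

    if p_value < 1 - alpha:                   # e.g., alpha = 0.90 -> p-value < 0.1
        return "semantic" if np.mean(z_sem) > np.mean(z_lex) else "lexical"
    return "neutral"
```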
Table 4
OHSUMED queries classification.

                        Confidence
    Label       α > 0.95    α > 0.90    α ≤ 0.90    Total
    Semantic        13           3          10        26
    Lexical         13           6          18        37
    Both            26           9          28        63

Table 5
TREC-COVID queries classification.

                        Confidence
    Label       α > 0.95    α > 0.90    α ≤ 0.90    Total
    Semantic        27           7          26        60
    Lexical         27           8          25        60
    Both            54          15          51       120

Tables 4 and 5 report the statistics of our labeling approach for the OHSUMED and TREC-COVID collections, respectively, at different levels of confidence. We can observe that, in both collections, queries labeled with confidence above α = 0.90 (p-value < 0.1) make up more than half of the total queries (i.e., 55.6% and 57.5%, respectively). Another interesting observation is that queries labeled with high confidence split evenly between the lexical and semantic categories. This confirms what we observed in Tables 2 and 3, where the effect of the sole category plays a marginal role on performance. Focusing on TREC-COVID queries, we observe that different formulations of the same topic are either always classified in the same category or, when this is not the case, such formulations are labeled with low confidence (we omit these statistics due to space reasons). This further explains the magnitude of the effects observed in Table 3, where the topic formulation showed a lower, although significant, interaction with the model category compared to that of the topic. The only exceptions are topics 16 and 23, where the narrative formulation is lexical while concatenation and query, for topic 16, and concatenation, description, and query, for topic 23, are semantic with confidence > 0.95. In this regard, it is interesting to note that, for both topics, the formulation labeled as "lexical" is always the narrative one. We attribute this to the richer linguistic structure of the narrative formulation, which, in both topics, presents a better description, as well as several relevant concepts, of the underlying information need – thus limiting the semantic gap and reducing the need for semantic models.

In the following, we restrict to queries labeled with confidence above 0.90, as we want to focus on queries that have been labeled with a high degree of confidence. Moreover, queries labeled as "neutral" for α = 0.90 have been discarded.

Features and Category Selection  To address the second aspect of RQ2 – that is, classifying a query as "semantic" or "lexical" – we explore two different sets of pre-retrieval features: lexical- and semantic-oriented features. Lexical-oriented features are based on query and corpus statistics and depend on the distribution of terms within the collection. Regarding semantic-oriented features, we first perform semantic indexing on the OHSUMED and TREC-COVID collections as in [5]. Then, we adopt features similar to those proposed by Mothe and Tanguy [21], but, instead of considering only query-based features, we take into account both query- and corpus-based features. The considered features are reported and described in Table 6; a sketch of how some of the lexical-oriented features can be computed follows the table.

We employ three well-known classification models to understand the effectiveness of the considered pre-retrieval features when used to classify queries into the lexical and semantic categories. The adopted models are: Decision Tree (DTr), Support Vector Machine (SVM), and Multi-Layer Perceptron (MLP). To perform experiments, we label queries using the process described above and we restrict to "semantic" and "lexical" queries labeled with confidence greater than 0.90. For each classifier, we perform grid search with cross-validation to obtain the best hyper-parameters. We adopt 5-fold cross-validation for TREC-COVID, whereas we use 3-fold cross-validation for OHSUMED, to avoid obtaining single-class folds due to the low number of samples. The results of the different classifiers are reported in Table 7, where we report the mean and standard deviation over the different folds. To determine result significance (marked as †), we apply a randomization test with Bonferroni correction for multiple comparisons [43]. A sketch of this experimental setup follows Table 7.

Regarding OHSUMED, we first highlight that MLP is the best performing method. However, MLP is also the method with the largest standard deviation for F1. This is likely due to the small number of samples – i.e., 35 queries labeled with confidence above 0.90. On top of this, none of the considered methods perform statistically better than the random classifier. Conversely, results for TREC-COVID are more stable – highlighting the impact the number of samples has on the stability of the classifiers' performance. Also in TREC-COVID, both SVM and MLP are not statistically better than the random classifier. On the other hand, however, DTr obtains preliminary yet promising performance (i.e., 67% accuracy and 66% F1) and is significantly better than the random classifier for both measures. This suggests the presence of underlying patterns within the data and the potential of the considered features to distinguish between semantically hard ("semantic") and easy ("lexical") queries.

Relying on the results of the decision tree, we further investigate the feature importance to determine which features correlate the most with the semantic gap, causing the query to be either semantically easy or hard. We only consider the decision tree built for TREC-COVID, since results on OHSUMED are not statistically significant. The first two features by importance are QDF (the number of documents containing at least one query term) and WSDF (the number of documents containing only query terms and no synonyms). Their importance is, respectively, 17.6% and 16.7%. These features are both related to the distribution of the query terms in the collection. For this reason, they are likely used by the classifier to identify semantically easy queries. Indeed, a large number of documents containing query terms is a potential indicator for the performance of lexical models. Besides, the fact that WSDF is the second most important feature is further evidence of this: if several documents contain query terms, but only few of them also present synonyms of such terms, then the semantic gap will likely be small and lexical models will be effective. The third feature by importance is meanSCQ (12.1%): a pre-retrieval score based on IDF. A query having a high meanSCQ score indicates that lexical models are likely to perform well. This is due to the fact that most of the lexical approaches rely on heuristics based on IDF. Note also that SCQ is considered a "low performing" feature for predicting query performance [27]. Nevertheless, in our scenario, it gains relevance in determining which model category performs best for the query.
Table 6
Pre-retrieval features considered for the category selection task.
 Name                   Description
                                                   Lexical-oriented features
QL                      Number of terms in the query [21]
{std,mean,max}IDF       Features based on the distribution of the IDF over the query terms [23]
{sum,mean,max}SCQ       Features based on the similarity between the collection and the query terms (SCQ) [22]
QDF                     Number of documents containing at least one query term
                                                  Semantic-oriented features
QPD                     Number of polysemous words within the query
{sum,std,max}NCQT       Sum, standard deviation, and max over the number of concepts related to query terms
{sum,std}NCPQT          Sum and standard deviation over the number of concepts related to polysemous query terms only
QSD                     Number of synonymous words within the query
{sum,std,max}NSEQC      Sum, standard deviation, and max over the number of different synset elements related to query concepts
{sum,std}NSQC           Sum and standard deviation over the number of different synonyms related to query concepts
SDF                     Number of documents containing at least one synonym of a query term
WSDF                    Number of documents containing at least one query term and no synonyms of the query terms
WTDF                    Number of documents containing at least one query synonym and no query terms
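To illustrate the lexical-oriented features, the following sketch computes the IDF-based statistics, the SCQ-based statistics, and QDF from basic corpus statistics; the data structures are our assumptions, and the SCQ formula shown is one common rendering of the definition in [22], so the exact formulation should be checked against the original paper.

```python
# Sketch: computing a few of the lexical-oriented features of Table 6 from basic
# corpus statistics. Assumed inputs: n_docs (collection size), df[t] (document
# frequency of term t), cf[t] (collection frequency of t), postings[t] (set of
# doc ids containing t).
import math
import statistics

def lexical_features(query_terms, n_docs, df, cf, postings):
    terms = [t for t in query_terms if df.get(t, 0) > 0]
    idf = [math.log(n_docs / df[t]) for t in terms]
    # One common rendering of SCQ(t): (1 + ln cf_t) * ln(1 + N / df_t)
    scq = [(1 + math.log(cf[t])) * math.log(1 + n_docs / df[t]) for t in terms]
    # QDF: number of documents containing at least one query term
    qdf_docs = set().union(*(postings.get(t, set()) for t in query_terms))
    return {
        "QL": len(query_terms),
        "meanIDF": statistics.mean(idf), "maxIDF": max(idf),
        "stdIDF": statistics.pstdev(idf),
        "sumSCQ": sum(scq), "meanSCQ": statistics.mean(scq), "maxSCQ": max(scq),
        "QDF": len(qdf_docs),
    }
```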


Table 7
Classifier performance. We report the mean and standard deviation over the 3 folds (OHSUMED) and 5 folds (TREC-COVID). † indicates statistical significance over the random classifier, according to a permutation test with significance 0.95 and Bonferroni correction.
                                          OHSUMED                             TREC-COVID
                                  Accuracy               F1             Accuracy              F1
                         DTr     0.626 (0.089)     0.586 (0.057)     0.668 (0.093)†      0.659 (0.141)†
                         SVM     0.687 (0.074)     0.611 (0.079)     0.623 (0.053)       0.610 (0.136)
                         MLP     0.740 (0.081)     0.675 (0.146)     0.628 (0.217)       0.590 (0.269)
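The classification setup summarized in Table 7 can be sketched as follows; the scikit-learn estimators, parameter grids, and feature-matrix layout are our assumptions, not the exact configuration used in our experiments.

```python
# Sketch of the category-selection experiments: grid search with stratified
# cross-validation over three classifiers. X holds the pre-retrieval features of
# Table 6, y the "semantic"/"lexical" labels; the grids shown are illustrative only.
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

def evaluate_classifiers(X, y, n_folds=5):
    cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=42)
    candidates = {
        "DTr": (DecisionTreeClassifier(random_state=42), {"max_depth": [2, 3, 5, None]}),
        "SVM": (SVC(), {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}),
        "MLP": (MLPClassifier(max_iter=2000, random_state=42),
                {"hidden_layer_sizes": [(16,), (32, 16)]}),
    }
    results = {}
    for name, (clf, grid) in candidates.items():
        search = GridSearchCV(clf, grid, cv=cv, scoring="f1_macro")
        scores = cross_validate(search, X, y, cv=cv, scoring=["accuracy", "f1_macro"])
        results[name] = {m: (scores[f"test_{m}"].mean(), scores[f"test_{m}"].std())
                         for m in ("accuracy", "f1_macro")}
    return results
```

Significance against a random baseline could then be assessed, for instance, with a permutation-based test (e.g., sklearn.model_selection.permutation_test_score) followed by a Bonferroni correction, in the spirit of the procedure described above.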



The fourth feature is stdNCPQT (the standard deviation over the number of concepts associated with each polysemous word in the query). This feature has importance 10.1%, which indicates the relevance of polysemy in determining the model category: having (several) query words with different associated concepts makes the query ambiguous, and semantic models are best suited to address it. The two subsequent features are sumNSEQC (8.8%) and maxNSEQC (7.3%). They represent, respectively, the sum and the maximum of the number of synset elements related to the query concepts. Both features are related to synonymy, which is another relevant aspect that identifies the presence of the semantic gap between queries and documents. Similarly to our intuition about polysemy, having query words with several synonyms suggests that semantic models are best suited to retrieve relevant documents. Other features with decreasing, but significant, importance are SDF (5.7%) and sumNCPQT (5.2%). The remaining features are negligible according to the classifier.

Thus, even though the results are preliminary and indicate that there is large room for improvement, they still highlight that the considered lexical- and semantic-oriented features relate to the model categories. Therefore, they can be used as a starting point to investigate the presence of the semantic gap within test collections and to build better approaches for category selection.

4. Conclusion

In this work, we investigated the impact of the semantic gap on query performance, which features can be used to determine this gap, and whether we can exploit them to classify queries as semantically easy ("lexical") or hard ("semantic"). Using ANOVA, we quantified the interaction between topics, queries, and model categories. The results showed that such interaction is large, highlighting the importance of choosing the proper model category for retrieval performance. Surprisingly, the analyses indicated that topics interact more than queries with model categories. This suggests that the semantic gap relates more to the underlying information need than to the different topic formulations. Then, we proposed a labeling strategy, based on relative improvements, to annotate queries as "semantic" or "lexical". Finally, we explored two different sets of pre-retrieval features and we deployed a prototype classifier to understand the effectiveness of such features when used to classify queries. We obtained promising results, which suggest a correlation between the considered features and the model categories.
As future work, we plan to further explore feature extraction and selection. In this regard, the preliminary results suggested that the considered features relate to model categories, but also highlighted that such correlation is weak and needs to be improved. Beyond pre-retrieval features, we also plan to investigate features related to retrieval models – thus getting closer to a post-retrieval setup. In this sense, we plan to adopt a pseudo-relevance strategy that considers retrieved documents and looks at the distribution of lexical- and semantic-oriented features in such documents. Finally, we plan to consider other domains besides the medical one, such as the news or Web domains.

Acknowledgments

The work was partially supported by the ExaMode project, as part of the European Union H2020 program under Grant Agreement no. 825292.

References

[1] B. Koopman, G. Zuccon, P. Bruza, L. Sitbon, M. Lawley, Information retrieval as semantic inference: a Graph Inference model applied to medical search, Inf. Retr. Journal 19 (2016) 6–37.
[2] R. Zhao, W. I. Grosky, Narrowing the semantic gap - improved text-based web document retrieval using visual features, IEEE Trans. Multimedia 4 (2002) 189–200.
[3] T. Edinger, A. M. Cohen, S. Bedrick, K. H. Ambert, W. R. Hersh, Barriers to Retrieving Patient Information from Electronic Health Record Data: Failure Analysis from the TREC Medical Records Track, in: AMIA 2012, American Medical Informatics Association Annual Symposium, AMIA, 2012.
[4] B. Koopman, G. Zuccon, Why Assessing Relevance in Medical IR is Demanding, in: Proc. of the Medical Information Retrieval Workshop at SIGIR co-located with the 37th annual international ACM SIGIR conference (ACM SIGIR 2014), volume 1276 of CEUR Workshop Proceedings, CEUR-WS.org, 2014, pp. 16–19.
[5] M. Agosti, S. Marchesin, G. Silvello, Learning Unsupervised Knowledge-Enhanced Representations to Reduce the Semantic Gap in Information Retrieval, ACM Trans. Inf. Syst. 38 (2020) 38:1–38:48.
[6] H. Li, J. Xu, Semantic Matching in Search, Found. Trends Inf. Retr. 7 (2014) 343–469.
[7] S. Marchesin, A. Purpura, G. Silvello, Focal elements of neural information retrieval models. An outlook through a reproducibility study, Inf. Process. Manag. 57 (2020) 102109.
[8] A. Rutherford, ANOVA and ANCOVA. A GLM Approach, 2nd ed., John Wiley & Sons, New York, USA, 2011.
[9] W. Hersh, C. Buckley, T. J. Leone, D. Hickam, Ohsumed: An interactive retrieval evaluation and new large test collection for research, in: Proc. of the 17th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, Dublin, Ireland, 3-6 July 1994, Springer London, London, 1994, pp. 192–201.
[10] E. Voorhees, T. Alam, S. Bedrick, D. Demner-Fushman, W. R. Hersh, K. Lo, K. Roberts, I. Soboroff, L. L. Wang, TREC-COVID: Constructing a Pandemic Information Retrieval Test Collection, SIGIR Forum 54 (2021).
[11] C. Van Gysel, M. De Rijke, E. Kanoulas, Neural vector spaces for unsupervised information retrieval, ACM Trans. Inf. Syst. 36 (2018) 1–25.
[12] B. He, I. Ounis, A Query-based Pre-retrieval Model Selection Approach to Information Retrieval, in: Proc. of Computer-Assisted Information Retrieval (Recherche d'Information et ses Applications) - RIAO 2004, 7th International Conference, University of Avignon, France, April 26-28, 2004, CID, 2004, pp. 706–719.
[13] N. Balasubramanian, J. Allan, Learning to Select Rankers, in: Proc. of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2010, Geneva, Switzerland, July 19-23, 2010, ACM, 2010, pp. 855–856.
[14] N. Balasubramanian, Query-Dependent Selection of Retrieval Alternatives, Ph.D. thesis, University of Massachusetts Amherst, 2011.
[15] O. Levi, F. Raiber, O. Kurland, I. Guy, Selective Cluster-Based Document Retrieval, in: Proc. of the 25th ACM International Conference on Information and Knowledge Management, CIKM 2016, Indianapolis, IN, USA, October 24-28, 2016, ACM, 2016, pp. 1473–1482.
[16] A. Griffiths, H. C. Luckhurst, P. Willett, Using Interdocument Similarity Information in Document Retrieval Systems, J. Am. Soc. Inf. Sci. 37 (1986) 3–11.
[17] X. Liu, W. B. Croft, Experiments on retrieval of optimal clusters, Technical Report, University of Massachusetts Amherst, 2006.
[18] A. Tombros, R. Villa, C. J. van Rijsbergen, The Effectiveness of Query-Specific Hierarchic Clustering in Information Retrieval, Inf. Process. Manag. 38 (2002) 559–582.
[19] S. Cronen-Townsend, Y. Zhou, W. B. Croft, Predicting Query Performance, in: Proc. of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, August 11-15, 2002, Tampere, Finland, 2002, pp. 299–306.
[20] C. Hauff, D. Hiemstra, F. de Jong, A Survey of Pre-Retrieval Query Performance Predictors, in: Proc. CIKM, 2008, pp. 1419–1420.
[21] J. Mothe, L. Tanguy, Linguistic Features to Predict Query Difficulty, in: Proc. of the Predicting query difficulty - methods and applications workshop, co-located with the ACM Conference on Research and Development in Information Retrieval, SIGIR 2005, 2005, pp. 7–10.
[22] Y. Zhao, F. Scholer, Y. Tsegay, Effective Pre-retrieval Query Performance Prediction Using Similarity and Variability Evidence, in: Proc. of the 30th European Conference on IR Research, ECIR 2008, Glasgow, UK, March 30 - April 3, 2008, 2008, pp. 52–64.
[23] B. He, I. Ounis, Inferring Query Performance Using Pre-retrieval Predictors, in: Proc. of String Processing and Information Retrieval, 11th International Conference, SPIRE 2004, Padova, Italy, October 5-8, 2004, 2004, pp. 43–54.
[24] J. A. Aslam, V. Pavlu, Query Hardness Estimation Using Jensen-Shannon Divergence Among Multiple Scoring Functions, in: Proc. of the 29th European Conference on IR Research, ECIR 2007, Rome, Italy, April 2-5, 2007, 2007, pp. 198–209.
[25] H. Roitman, Query Performance Prediction using Passage Information, in: Proc. of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR 2018, Ann Arbor, MI, USA, July 08-12, 2018, 2018, pp. 893–896.
[26] A. Shtok, O. Kurland, D. Carmel, Query Performance Prediction Using Reference Lists, ACM Trans. Inf. Syst. 34 (2016) 19:1–19:34.
[27] G. Faggioli, O. Zendel, J. S. Culpepper, N. Ferro, F. Scholer, An enhanced evaluation framework for query performance prediction, in: Proc. of the 43rd European Conference on IR Research, ECIR 2021, Virtual Event, March 28 - April 1, 2021, Springer International Publishing, Cham, 2021, pp. 115–129.
[28] L. L. Wang, K. Lo, Y. Chandrasekhar, R. Reas, J. Yang, D. Burdick, D. Eide, K. Funk, Y. Katsis, R. Kinney, Y. Li, Z. Liu, W. Merrill, P. Mooney, D. Murdick, D. Rishi, J. Sheehan, Z. Shen, B. Stilson, A. Wade, K. Wang, N. X. R. Wang, C. Wilhelm, B. Xie, D. Raymond, D. S. Weld, O. Etzioni, S. Kohlmeier, CORD-19: The COVID-19 Open Research Dataset, 2020. arXiv:2004.10706.
[29] W. B. Croft, D. Metzler, T. Strohman, Search Engines: Information Retrieval in Practice, Addison-Wesley, Reading (MA), USA, 2009.
[30] S. E. Robertson, H. Zaragoza, The Probabilistic Relevance Framework: BM25 and Beyond, Found. Trends Inf. Retr. 3 (2009) 333–389.
[31] C. Zhai, Statistical Language Models for Information Retrieval. A Critical Review, Found. Trends Inf. Retr. 2 (2008) 137–213.
[32] G. Amati, C. J. van Rijsbergen, Probabilistic Models of Information Retrieval based on measuring the Divergence From Randomness, ACM Trans. Inf. Syst. 20 (2002) 357–389.
[33] İ. Kocabaş, B. T. Dinçer, B. Karaoğlan, A nonparametric term weighting method for information retrieval based on measuring the divergence from independence, Information Retrieval 17 (2014) 153–176.
[34] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient Estimation of Word Representations in Vector Space, in: Proc. of the 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, 2013.
[35] I. Vulić, M. F. Moens, Monolingual and Cross-Lingual Information Retrieval Models Based on (Bilingual) Word Embeddings, in: Proc. of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2015, pp. 363–372.
[36] X. Liu, J. Y. Nie, A. Sordoni, Constraining Word Embeddings by Prior Knowledge - Application to Medical Information Retrieval, in: Proc. of the 12th Asia Information Retrieval Societies Conference, AIRS 2016, Springer, 2016, pp. 155–167.
[37] N. Ferro, D. Harman, CLEF 2009: Grid@CLEF Pilot Track Overview, in: Proc. CLEF, 2009.
[38] D. Banks, P. Over, N.-F. Zhang, Blind Men and Elephants: Six Approaches to TREC data, Information Retrieval 1 (1999) 7–34.
[39] N. Ferro, G. Silvello, Toward an Anatomy of IR System Component Performances, J. Assoc. Inf. Sci. Technol. 69 (2018) 187–200.
[40] E. Voorhees, D. Samarov, I. Soboroff, Using Replicates in Information Retrieval Evaluation, ACM Trans. Inf. Syst. 36 (2017) 12:1–12:21.
[41] G. Faggioli, N. Ferro, System effect estimation by sharding: A comparison between ANOVA approaches to detect significant differences, in: Proc. of the 43rd European Conference on IR Research, ECIR 2021, Virtual Event, March 28 - April 1, 2021, Springer International Publishing, Cham, 2021, pp. 33–46.
[42] J. S. Culpepper, G. Faggioli, N. Ferro, O. Kurland, Topic difficulty: Collection and query formulation effects, ACM Trans. Inf. Syst. 40 (2021). doi:10.1145/3470563.
[43] P. Sedgwick, Multiple significance tests: the Bonferroni correction, BMJ 344 (2012).