=Paper=
{{Paper
|id=Vol-2950/paper-06
|storemode=property
|title=What Makes a Query Semantically Hard?
|pdfUrl=https://ceur-ws.org/Vol-2950/paper-06.pdf
|volume=Vol-2950
|authors=Guglielmo Faggioli,Stefano Marchesin
|dblpUrl=https://dblp.org/rec/conf/desires/Faggioli021
}}
==What Makes a Query Semantically Hard?==
Guglielmo Faggioli¹, Stefano Marchesin¹
¹ University of Padova, Padova, Italy
Abstract
Traditional Information Retrieval (IR) models, also known as lexical models, are hindered by the semantic gap, which refers
to the mismatch between different representations of the same underlying concept. To address this gap, semantic models
have been developed. Semantic and lexical models exploit complementary signals that are best suited for different types of
queries. For this reason, these model categories should not be used interchangeably, but should rather be properly alternated
depending on the query. Therefore, it is important to identify queries where the semantic gap is prominent and thus semantic
models prove effective. In this work, we quantify the impact of using semantic or lexical models on different queries, and
we show that the interaction between queries and model categories is large. Then, we propose a labeling strategy to classify
queries into semantically hard or easy, and we deploy a prototype classifier to discriminate between them.
1. Introduction

The semantic gap is a long-standing problem in Information Retrieval (IR) that refers to the difference between the machine-level description of document and query contents and the human-level interpretation of their meanings [1]. In other words, it represents the mismatch between users' queries and the way retrieval models understand such queries [2].

The semantic gap affects any domain, but it is prominent in medical search [3, 4, 1]. Within biomedical literature, the large presence of (quasi-)synonymous and polysemous terms – along with the use of acronyms and terminological variants – represents a critical challenge for retrieval models. In this regard, a query containing the word "tumor" might not be effectively answered if the retrieval model does not identify the synonymy relationship between "tumor" and, for example, "neoplasm". Besides, given a query containing the term "cold", a retrieval model might retrieve erroneous documents if it does not distinguish between the different meanings the term "cold" assumes depending on the context, such as "common cold", "cold temperature", or even "Chronic Obstructive Lung Disease". These queries are known as semantically hard queries [5].

Traditional IR models, which are known as lexical models as they compute the relevance score using heuristics defined over the lexical overlap between queries and documents, fail to effectively address semantically hard queries. Semantic models were thus introduced to bridge the semantic gap [6] and to overcome the limitations of lexical models. However, semantic models have been shown to provide complementary signals to lexical models that prove effective for semantically hard queries, but less for other queries [7]. Thus, it becomes necessary to identify what category of models – between lexical and semantic – best suits a user query given the document collection at hand. In other words, we need to understand what the inherent features of queries and documents are that make lexical or semantic models more effective.

To this end, we address the following research questions:

RQ1 How and to what extent does the semantic gap impact query performance?

RQ2 What features determine the prominence of the semantic gap within queries?

For RQ1, we investigate and compare the impact of lexical and semantic models on different topics. How large is the interaction between topics and model categories? To what extent does this interaction reflect in the different topic formulations (i.e., queries)?

For RQ2, we explore a set of well-known features that relate to lexical and semantic models. In particular, we seek to understand whether pre-retrieval features – based on corpus statistics or synonymy/polysemy aspects – can be used to categorize queries as semantically easy or hard. In other words, how effective are well-known pre-retrieval features for category selection?

To address the research questions, we first perform statistical analyses quantifying the interaction between topics, queries, and lexical and semantic categories using ANalysis Of VAriance (ANOVA) [8]. Based on the outcomes of the statistical analyses, we propose a labeling strategy to categorize queries into semantically easy or hard. The labeled queries are used to train a category selector. The selector serves as a proxy to evaluate the effectiveness of the considered pre-retrieval features in determining the prominence of the semantic gap within queries.

We conduct an experimental evaluation on two test
collections for ad hoc medical retrieval: OHSUMED [9]
and TREC-COVID (Round 1) [10]. For lexical models, we adopt standard state-of-the-art retrieval models. Regarding semantic models, we focus on first-stage semantic models, which are best suited to tackle the semantic gap [11]. In particular, we consider unsupervised first-stage semantic models, which have been shown to be competitive with lexical models in medical collections [5]. Besides, unsupervised semantic models rely on textual signals only – and not on relevance signals – thus allowing us to focus exclusively on semantic and lexical features.

The results of the experimental evaluation show that topics, queries, and model categories strongly interact to determine retrieval effectiveness. This evidence further highlights the need to adopt the proper model category to improve retrieval performance. Therefore, identifying the right features to distinguish between semantically easy or hard queries becomes necessary in domains where the semantic gap is prominent – and this work poses the cornerstone towards this direction.

The rest of the paper is organized as follows: Section 2 reports related work; Section 3 presents the experimental analysis; and Section 4 concludes the paper and outlines the future directions.

2. Related Work

The problem addressed in this work relates to two topics in IR: Model Selection and Query Performance Prediction (QPP). Below, we review prominent approaches in these areas and we highlight differences with our work.

One of the first approaches to model selection in IR was developed by He and Ounis [12], who proposed a query-based pre-retrieval approach. In [12], the authors cluster queries according to pre-retrieval features and link the best performing model to each cluster. Then, given a new query, they assign it to the closest cluster and use the model associated to that cluster to perform retrieval. Balasubramanian and Allan [13] proposed a learning approach for query-dependent model selection. The selection framework relies on rank-time features – available to retrieval models during ranking – to select between two models. Model selection approaches based on rank-time features have been further explored by Balasubramanian in [14]. Beyond model selection, Levi et al. [15] addressed the problem of selective cluster retrieval [16, 17, 18], where the objective is to decide, on a per-query basis, whether to apply cluster-based retrieval or standard document retrieval. In [15], the authors proposed different sets of features based on cluster-based rankers, query performance predictors, and cluster properties. The different sets of features are then used to decide between cluster-based and standard document retrieval.

Compared to the approaches reviewed for model selection, in this work we want to understand whether queries can be categorized as semantically easy or hard. In other words, we want to determine which model category – lexical or semantic – is best suited on a per-query basis. In this sense, our work shows similarities with that of Levi et al. [15], where the objective is to select the most effective approach between cluster-based and document-based retrieval given the query. However, we refrain from using rank-time or post-retrieval features in our analyses, as we want to keep the approach model-agnostic – and thus less dependent on the specific set of considered retrieval models.

QPP techniques are traditionally divided into pre-retrieval and post-retrieval. Pre-retrieval techniques [19, 20, 21, 22, 23] exploit the distribution of the query terms within the collection, providing coarse-grained information on the expected performance of a given query. On the other hand, post-retrieval techniques [24, 25, 26] leverage the information on the retrieval scores assigned by the retrieval model. Such techniques tend to perform better compared to pre-retrieval QPP [27], but are dependent on the considered models.

The typical task for a QPP model is ranking queries based on their expected performance [27]. Thus, QPP techniques cannot be directly applied to category selection. Nevertheless, the signals provided by QPP models can be used as input features for such a task. In this work, we want to identify a query as semantically easy or hard regardless of the retrieval model considered. Thus, we focus on pre-retrieval approaches and we adopt two types of features in our analyses: lexical- and semantic-oriented features. Regarding lexical-oriented features, we consider features proposed by He and Ounis [23] and by Zhao et al. [22]. He and Ounis [23] explore the possibility of using the distribution of the Inverse Document Frequency (IDF) over query terms to determine the ability of lexical models to retrieve relevant documents. Similarly, Zhao et al. [22] propose a re-weighting schema based on IDF, called Similarity between Collection and Query (SCQ). As for semantic-oriented features, we adopt features similar to those proposed by Mothe and Tanguy [21], who consider linguistic aspects – such as synonymy and polysemy – linked to the query terms. Compared to [21], however, we consider signals from both the query and its interaction with documents.

3. Experimental Analysis

We consider two collections in the following analyses: OHSUMED [9] and TREC-COVID (Round 1) [10]. OHSUMED contains 349K documents and 63 topics. Topics in OHSUMED have two fields: title and description. We use the description as topic formulation since the title field
poorly describes the underlying information need. TREC-COVID (Round 1) has 30 topics and relies on the CORD-19 corpus [28], which includes around 51K papers. Each topic in TREC-COVID has three fields: a short keyword query, a description, and a narrative. In our experiments, we consider each field as a different formulation of the topic. We also include the concatenation of the keyword query and the description. Thus, the total number of queries we consider for TREC-COVID is equal to 120.

Regarding lexical and semantic models, we consider five different models for each category. The lexical models used are: TF-IDF [29]; BM25 [30]; Query Likelihood Model with Dirichlet Smoothing (QLM) [31]; Divergence From Randomness (DFR) [32]; and Divergence From Independence (DFI) [33]. All lexical models perform stopwords removal and stemming. As for semantic models, we adopt: a Word2Vec [34] based approach where query and document representations are built by summing up the IDF-weighted representations of the words contained in them [35, 36]; the Neural Vector Space Model (NVSM) [11]; and three variants of the Semantic-Aware neural Framework for IR (SAFIR) [5]. The three variants of SAFIR are SAFIRsp, which integrates both polysemy and synonymy, SAFIRp, which integrates polysemy but not synonymy, and SAFIRs, which integrates synonymy but not polysemy. All semantic models have been trained for 10 epochs with parameters set as in [5].

We evaluate models using Average Precision (AP) at cut-off 1000, obtaining an experimental Grid of Points (GoP) as defined in [37]. The performance of the retrieval models in terms of AP is reported in Table 1 for both the OHSUMED and TREC-COVID collections.

Table 1
Mean Average Precision (MAP) of the models on the OHSUMED and TREC-COVID collections. Model performance is comparable both within and across model categories.

Model      OHSUMED   TREC-COVID
Lexical
TF-IDF     0.524     0.362
BM25       0.620     0.488
QLM        0.577     0.434
DFR        0.641     0.496
DFI        0.592     0.467
Semantic
Word2Vec   0.568     0.482
NVSM       0.595     0.455
SAFIRs     0.604     0.463
SAFIRp     0.610     0.461
SAFIRsp    0.612     0.466

3.1. RQ1: Topic and Category Interaction

Several works have shown that queries strongly interact with retrieval models in determining their performance [38, 39]. This means that two models might have similar average performance on a set of queries but, when looked at the query level, their performance might vary greatly. A similar consideration also applies to lexical and semantic models. Some queries are best suited to semantic models, while some others to lexical ones [7, 5]. We are thus interested in quantifying such an effect. In other words, we want to evaluate the interaction between queries and model categories.

To determine whether the models category – that is, lexical or semantic – has a significant effect on performance, we conduct an ANOVA on the runs obtained with the considered retrieval models. ANOVA is a well-known statistical technique that allows identifying statistically significant differences among experimental conditions. Several works in IR applied ANOVA to determine the effect of different factors on the overall performance of an IR system [38, 40, 39, 41]. ANOVA models the explained variable, which in our case is AP, as a linear combination of the effect of each factor in the experimental setup, plus an error component. The error term accounts for the variance in the data unexplained by the model.

In our analyses we first consider the following model:

𝑦𝑖𝑗𝑘 = 𝜇... + 𝜏𝑖 + 𝛾𝑗 + 𝛼𝑘(𝑗) + 𝜏𝛾𝑖𝑗 + 𝜀𝑖𝑗𝑘 ,    (MD1)

where 𝑦𝑖𝑗𝑘 is the performance (measured using AP) observed on the 𝑖-th topic using the 𝑘-th model of the 𝑗-th class; 𝜇... is the grand mean over all the data; 𝜏𝑖 is the effect of the 𝑖-th topic; 𝛾𝑗 is the effect of the 𝑗-th class; 𝛼𝑘(𝑗) is the effect of the 𝑘-th model inside the 𝑗-th class; 𝜏𝛾𝑖𝑗 is the interaction between the 𝑖-th topic and the 𝑗-th class; and 𝜀𝑖𝑗𝑘 is the prediction error. Note that the model factor is nested inside the category one. In the above-mentioned ANOVA model, an IR model is meaningful only in relation to its category. In other words, since we cannot consider, for instance, BM25 inside the "semantic" category, nor can we consider NVSM in the "lexical" one, we define the model factor as nested inside the category, and thus each model contributes only to the variance of its category.

For each ANOVA, we report the Sum of Squares (SS), the Degrees of Freedom (DF), the Mean Squares (MS), the F-statistic (F), the p-value, and the Strength of Association (SOA), using the ω² indicator. The SOA indicates the impact of each factor on the variability of the data. Typically, a factor with 0.01 ≤ ω² < 0.06 is considered small-sized, while 0.06 ≤ ω² < 0.14 indicates a medium-sized effect, and ω² ≥ 0.14 a large-sized effect.
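The estimator behind the reported SOA is not restated here; the ω̂²⟨fact⟩ values in Tables 2 and 3 are consistent with the common F-based estimator

ω̂²⟨fact⟩ = DF_fact · (F_fact − 1) / (DF_fact · (F_fact − 1) + N),

where N is the total number of observations in the GoP. For example, for the Topic factor on OHSUMED, 62 · (79.831 − 1) / (62 · (79.831 − 1) + 630) ≈ 0.886, matching Table 2.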
Table 2 reports the results of the ANOVA on OHSUMED using the above-mentioned GoP of runs.

Table 2
ANOVA summary table on runs for the OHSUMED collection. Observe the large interaction between the topic factor and the category factor. ω̂² for non-significant factors is ill-defined and thus not reported.

Source           SS      DF   MS     F       p-value   ω̂²⟨fact⟩
Topic            19.740  62   0.318  79.831  <1e-4     0.886
Category         0.007   1    0.007  1.805   0.1797    —
Model(Category)  0.584   8    0.073  18.306  <1e-4     0.180
Topic*Category   1.583   62   0.026  6.403   <1e-4     0.347
Error            1.978   496  0.004
Total            23.892  629

From the results in Table 2 we observe that the effect of the sole models category is not significant (p-value > 0.05) – which means that lexical and semantic categories are not statistically significantly different. In other words, we cannot say that either lexical or semantic models perform best in absolute terms. Nevertheless, the interaction between topic and category is significant and the ω² value indicates a large effect. This means that the category significantly impacts how good the results on a specific topic will be. Such a finding suggests that the semantic gap is an inherent property of the topics, less related to the specific retrieval models and more to their category. To further support this intuition, the interaction between the topic and the category is larger than the effect of the sole model. Thus, if we understand when a topic is lexical or semantic, we can achieve large performance improvements.
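As an illustration of how MD1 can be fitted in practice, the sketch below runs the nested ANOVA with statsmodels on a long-format table of the GoP (one AP value per topic–category–model combination) and derives ω̂² from the F-statistics. The dataframe layout, column names, and file name are assumptions made for this example, not the authors' original code.

    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    # Long-format Grid of Points: one row per (topic, category, model) with its AP.
    gop = pd.read_csv("gop_ohsumed.csv")  # hypothetical file: columns topic, category, model, ap

    # MD1: topic + category + model nested within category + topic:category interaction.
    lm = smf.ols("ap ~ C(topic) + C(category) / C(model) + C(topic):C(category)",
                 data=gop).fit()
    anova = sm.stats.anova_lm(lm, typ=2)

    # Strength of association (omega squared) for each factor from its F-statistic.
    n_obs = len(gop)
    anova["omega2"] = (anova["df"] * (anova["F"] - 1)
                       / (anova["df"] * (anova["F"] - 1) + n_obs))
    print(anova)

The same sketch, with the formulation factor added as in MD2 below, applies to the TREC-COVID analysis.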
As for TREC-COVID, each topic is represented by four different formulations: the keyword query, the description, the narrative, and the concatenation of query and description. Each formulation of a topic can only be used in relation to that topic and therefore the formulations have to be treated as a nested factor inside the topic. Therefore, we define a second ANOVA model, called MD2:

𝑦𝑖𝑙𝑗𝑘 = 𝜇... + 𝜏𝑖 + 𝜑𝑙(𝑖) + 𝛾𝑗 + 𝛼𝑘(𝑗) + 𝜏𝛾𝑖𝑗 + 𝜑𝛾𝑙(𝑖)𝑗 + 𝜀𝑖𝑗𝑙𝑘 ,    (MD2)

which also includes 𝜑𝑙(𝑖), the effect of the 𝑙-th formulation, nested inside the 𝑖-th topic, and 𝜑𝛾𝑙(𝑖)𝑗, the interaction between the 𝑙-th formulation of the 𝑖-th topic with the 𝑗-th class. Table 3 summarizes the ANOVA results with MD2 on TREC-COVID.

Table 3
ANOVA summary table on runs for the TREC-COVID collection. Observe the high ω̂² effect for the interaction topic*category, which shows the importance of selecting the proper model category for each topic.

Source                 SS      DF   MS     F        p-value   ω̂²⟨fact⟩
Topic                  24.100  29   0.831  301.291  <1e-4     0.879
Query(Topic)           15.568  90   0.173  62.712   <1e-4     0.822
Category               0.074   1    0.074  26.732   <1e-4     0.021
Model(Category)        1.470   8    0.184  66.628   <1e-4     0.304
Topic*Category         2.200   29   0.076  27.506   <1e-4     0.390
Query(Topic)*Category  1.060   90   0.012  4.270    <1e-4     0.197
Error                  2.626   952  0.003
Total                  47.098  1199

From the results on TREC-COVID we observe that both the topic and its formulations have a large effect. The importance of the formulation factor indicates that, with an appropriate topic formulation, the performance on the topic can change greatly. Similar to what we observed in Table 2, the interaction between the topic and the models category is large (ω² = 0.390), larger than the effect of both the sole category and the model. Also the interaction between the topic formulation and the models category is large (ω² = 0.197), although not as large as the one between topic and category. This suggests that the semantic gap relates more to the underlying information need than the different topic formulations.

Overall, we hypothesize that the relation between topics and model categories, highlighted by ANOVA, links to the semantic gap and the association of a topic with its relevant documents. For instance, if a topic has many relevant documents containing synonyms of the query terms, then a semantic model might be best suited to perform retrieval. In fact, in this case, most of the topic formulations will not contain all the possible query synonyms and will thus be affected by the semantic gap. Conversely, topics that can be easily represented by few keywords – likely to be found within relevant documents – will have less ambiguous formulations, which are best suited to lexical models.

3.2. RQ2: Features Importance for the Semantic Gap

Section 3.1 showed the impact of choosing the proper models category depending on the query at hand. If we could classify queries as semantically hard or easy, we might also adopt an IR model from the right category. To properly train a classifier capable of doing that, we need i) to label queries as "semantic" or "lexical", and ii) to find a set of features that correlate with such aspects of the queries. The next two paragraphs tackle the above-mentioned challenges.
Labeling queries. The first aspect we address is the labeling of queries as "semantic" or "lexical". The absence of a rigorous definition of semantically hard or easy for a query prevents us from manually labeling queries as "semantic" or "lexical". In this regard, also the definition of "hard" topic is a debated aspect [42]. Therefore, we propose to label queries according to how the two models categories perform on them. To the best of our knowledge, this is the first automatic approach to address this problem.

To this end, we first compute the average performance of each model. Then, for each query, we perform the following three steps. Firstly, we compute for each model the relative improvement over its average performance. Secondly, we determine whether the relative improvement is, on average, greater for lexical or semantic models. Finally, we label the considered query as "semantic" if the improvement over the average model performance is greater for semantic models than for lexical ones; vice versa, we label the query as "lexical".

Note that we do not consider absolute performances to label queries, since even a poorly performing lexical method like TF-IDF (cf. Table 1) might prove effective when the query is semantically easy. Thus, we focus on relative improvements, which provide signals that are more robust to performance outliers.

Let 𝒮 be the set of models and 𝒬 the set of queries. We call 𝐴𝑃𝑠(𝑞) the AP observed for the model 𝑠 on the query 𝑞, and MAP𝑠(𝒬) and std𝑠(𝒬) respectively the MAP and the standard deviation of the AP observed for the model 𝑠 over the queries 𝒬. We define

𝑍𝑠,𝑞 = (𝐴𝑃𝑠(𝑞) − MAP𝑠(𝒬)) / std𝑠(𝒬)

as the relative improvement over the mean performance. By standardizing relative improvements, we account for the variability in models performances. Then, let 𝒮𝑠 be the set of semantic models, and 𝒮𝑙 the set of lexical models.

Definition 3.1. A query 𝑞 is labeled as "semantic" iff

(1/|𝒮𝑠|) · ∑𝑠∈𝒮𝑠 𝑍𝑠,𝑞  >𝛼  (1/|𝒮𝑙|) · ∑𝑠∈𝒮𝑙 𝑍𝑠,𝑞 ,

where >𝛼, with 𝛼 ∈ [0.5, 1), indicates that the mean relative improvement for semantic models is statistically significantly higher than that for lexical models at significance level 𝛼. Queries are labeled as "lexical" using the opposite ordering relation (<𝛼).

Therefore, using the above-mentioned definition we can label queries as either "semantic" or "lexical" at a specific level of 𝛼. In practice, given a query 𝑞, we call 𝒵𝑞,𝑠𝑒𝑚 = {𝑍𝑠,𝑞 ∀ 𝑠 ∈ 𝒮𝑠} the set of relative improvements of the semantic models for 𝑞, and 𝒵𝑞,𝑙𝑒𝑥 = {𝑍𝑠,𝑞 ∀ 𝑠 ∈ 𝒮𝑙} the set of relative improvements of the lexical models for 𝑞. Using an unpaired t-test, we determine whether 𝒵𝑞,𝑠𝑒𝑚 has greater mean than 𝒵𝑞,𝑙𝑒𝑥. If so, then 𝑞 is labeled as "semantic". On the other hand, if 𝒵𝑞,𝑙𝑒𝑥 has statistically significantly greater mean than 𝒵𝑞,𝑠𝑒𝑚, then 𝑞 is labeled as "lexical". Otherwise, 𝑞 is labeled as "neutral".
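A compact sketch of this labeling procedure is given below. It assumes a dictionary mapping each model name to its per-query AP scores and uses an unpaired one-sided t-test in both directions; a confidence level 𝛼 = 0.90 corresponds to a p-value threshold of 0.1. Function and variable names are illustrative, not taken from the paper's code.

    import numpy as np
    from scipy import stats

    def label_queries(ap_per_model, semantic_models, lexical_models, alpha=0.90):
        """Label each query as 'semantic', 'lexical', or 'neutral' (cf. Definition 3.1)."""
        # Standardized relative improvement Z_{s,q} for every model s and query q.
        z = {}
        for s, scores in ap_per_model.items():
            values = np.array(list(scores.values()))
            mean, std = values.mean(), values.std(ddof=1)
            z[s] = {q: (ap - mean) / std for q, ap in scores.items()}

        labels = {}
        for q in next(iter(ap_per_model.values())):
            z_sem = [z[s][q] for s in semantic_models]
            z_lex = [z[s][q] for s in lexical_models]
            # Unpaired one-sided t-tests in both directions.
            p_sem = stats.ttest_ind(z_sem, z_lex, alternative="greater").pvalue
            p_lex = stats.ttest_ind(z_lex, z_sem, alternative="greater").pvalue
            if p_sem < 1 - alpha:
                labels[q] = "semantic"
            elif p_lex < 1 - alpha:
                labels[q] = "lexical"
            else:
                labels[q] = "neutral"
        return labels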
Table 4
OHSUMED queries classification.

          Confidence
Label     𝛼 > 0.95  𝛼 > 0.90  𝛼 ≤ 0.90  Total
Semantic  13        3         10        26
Lexical   13        6         18        37
Both      26        9         28        63

Table 5
TREC-COVID queries classification.

          Confidence
Label     𝛼 > 0.95  𝛼 > 0.90  𝛼 ≤ 0.90  Total
Semantic  27        7         26        60
Lexical   27        8         25        60
Both      54        15        51        120

Tables 4 and 5 report the statistics of our labeling approach for OHSUMED and TREC-COVID collections, respectively, at different levels of confidence. We can observe that, in both collections, queries labeled with confidence above 𝛼 = 0.90 (p-value < 0.1) make up more than half of the total queries (i.e., 55.6% and 57.5% respectively). Another interesting observation is that queries labeled with high confidence split evenly between lexical and semantic categories. This confirms what we observed in Tables 2 and 3, where the effect of the sole category plays a marginal role on performance. Focusing on TREC-COVID queries, we observe that different formulations of the same topic are either classified always in the same category or, when this is not the case, such formulations are labeled with low confidence¹. This further explains the magnitude of the effects observed in Table 3, where the topic formulation showed a lower, although significant, interaction with the models category compared to that of the topic. The only exceptions are topics 16 and 23, where the narrative formulation is lexical while concatenation and query, for topic 16, and concatenation, description, and query, for topic 23, are semantic with confidence > 0.95. In this regard, it is interesting to note that, for both topics, the formulation labeled as "lexical" is always the narrative one. We attribute the reason for this to the richer linguistic structure of the narrative formulation, which, in both topics, presents a better description, as well as several relevant concepts, of the underlying information need – thus limiting the semantic gap and reducing the need for semantic models.

¹ We omit these statistics, due to space reasons.

In the following, we restrict to queries labeled with confidence above 0.90, as we want to focus on queries that have been labeled with a high degree of confidence. Moreover, queries labeled as "neutral" for 𝛼 = 0.90 have been discarded.

Features and Category Selection. To address the second aspect of RQ2 – that is, classifying a query as "semantic" or "lexical" – we explore two different sets of pre-retrieval features: lexical- and semantic-oriented features. Lexical-oriented features are based on query and corpus statistics and depend on the distribution of terms within the collection. Regarding semantic-oriented features, we first perform semantic indexing on the OHSUMED and TREC-COVID collections as in [5]. Then, we adopt features similar to those proposed by Mothe and Tanguy [21], but, instead of considering only query-based features, we take into account both query- and corpus-based features. The considered features are reported and described in Table 6.
Table 6
Pre-retrieval features considered for the category selection task.
Name Description
Lexical-oriented features
QL Number of terms in the query [21]
{std,mean,max}IDF Features based on the distribution of the IDF over the query terms [23]
{sum,mean,max}SCQ Features based on the similarity between the corpus and the query terms [22]
QDF Number of documents containing at least one query term
Semantic-oriented features
QPD Number of polysemous words within the query
{sum,std,max}NCQT Sum, standard deviation, and max over the number of concepts related to query terms
{sum,std}NCPQT Sum and standard deviation over the number of concepts related to polysemous query terms only
QSD Number of synonymous words within the query
{sum,std,max}NSEQC Sum, standard deviation, and max over the number of different synset elements related to query concepts
{sum,std}NSQC Sum and standard deviation over the number of different synonyms related to query concepts
SDF Number of documents containing at least one synonym of a query term
WSDF Number of documents containing at least one query term and no synonyms of the query terms
WTDF Number of documents containing at least one query synonym and no query terms
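To make the corpus-statistics side of Table 6 concrete, the sketch below computes QL, the IDF-based features, meanSCQ, and QDF from basic index statistics. The data structures (document-frequency and collection-frequency dictionaries, per-document term sets) and the exact SCQ formulation are assumptions made for this example, following our reading of Zhao et al. [22].

    import math
    import statistics

    def lexical_oriented_features(query_terms, df_counts, cf_counts, doc_term_sets, n_docs):
        """Compute some of the lexical-oriented features of Table 6 (illustrative sketch).
        df_counts: term -> document frequency; cf_counts: term -> collection frequency;
        doc_term_sets: iterable of per-document term sets."""
        terms = [t for t in query_terms if df_counts.get(t, 0) > 0]
        idf = [math.log(n_docs / df_counts[t]) for t in terms]
        # SCQ combines collection-frequency and IDF-like evidence for each query term.
        scq = [(1 + math.log(cf_counts[t])) * math.log(1 + n_docs / df_counts[t])
               for t in terms]
        qdf = sum(1 for doc in doc_term_sets if any(t in doc for t in query_terms))
        return {
            "QL": len(query_terms),
            "meanIDF": statistics.mean(idf) if idf else 0.0,
            "maxIDF": max(idf, default=0.0),
            "stdIDF": statistics.pstdev(idf) if idf else 0.0,
            "meanSCQ": statistics.mean(scq) if scq else 0.0,
            "QDF": qdf,
        }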
We employ three well-known classification models to understand the effectiveness of the considered pre-retrieval features when used to classify queries into lexical and semantic categories. The adopted models are: Decision Tree (DTr), Support Vector Machine (SVM), and Multi-Layer Perceptron (MLP). To perform experiments, we label queries using the process described above and we restrict to "semantic" and "lexical" queries that present a significance score greater than 0.90. For each classifier, we perform grid search with cross-validation to obtain the best hyper-parameters. We adopt 5-fold cross-validation for TREC-COVID, whereas we use 3-fold cross-validation for OHSUMED to avoid obtaining single-class folds due to the low number of samples. The results of the different classifiers are reported in Table 7, where we report mean and standard deviation over the different folds. To determine results significance (marked as †), we apply a randomization test with Bonferroni correction for multiple comparisons [43].

Table 7
Classifiers performance. We report mean and standard deviation over 3- and 5-folds for OHSUMED and TREC-COVID, respectively. † indicates statistical significance over the random classifier, according to a permutation test with significance 0.95 and Bonferroni correction.

       OHSUMED                        TREC-COVID
       Accuracy        F1             Accuracy         F1
DTr    0.626 (0.089)   0.586 (0.057)  0.668 (0.093)†   0.659 (0.141)†
SVM    0.687 (0.074)   0.611 (0.079)  0.623 (0.053)    0.610 (0.136)
MLP    0.740 (0.081)   0.675 (0.146)  0.628 (0.217)    0.590 (0.269)
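The classifier setup above can be reproduced in spirit with scikit-learn, as sketched below: a grid search with stratified k-fold cross-validation over the three models. The hyper-parameter grids are illustrative assumptions, since the paper does not report the exact grids; n_splits would be 3 for OHSUMED and 5 for TREC-COVID.

    from sklearn.model_selection import GridSearchCV, StratifiedKFold
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC
    from sklearn.neural_network import MLPClassifier
    from sklearn.tree import DecisionTreeClassifier

    def tune_category_selectors(features, labels, n_splits=5, seed=0):
        """Grid search with cross-validation for DTr, SVM, and MLP (illustrative grids)."""
        candidates = {
            "DTr": (DecisionTreeClassifier(random_state=seed),
                    {"max_depth": [2, 3, 5, None]}),
            "SVM": (make_pipeline(StandardScaler(), SVC()),
                    {"svc__C": [0.1, 1, 10]}),
            "MLP": (make_pipeline(StandardScaler(), MLPClassifier(max_iter=2000, random_state=seed)),
                    {"mlpclassifier__hidden_layer_sizes": [(16,), (32, 16)]}),
        }
        cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
        results = {}
        for name, (estimator, grid) in candidates.items():
            search = GridSearchCV(estimator, grid, cv=cv, scoring="f1_macro")
            search.fit(features, labels)
            results[name] = (search.best_params_, search.best_score_)
        return results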
Regarding OHSUMED, we first highlight that MLP is the best performing method. However, MLP is also the method with the largest standard deviation for F1. This is likely due to the small number of samples – i.e., 35 queries labeled with confidence above 0.90. On top of this, none of the considered methods perform statistically better than the random classifier. Conversely, results for TREC-COVID are more stable – highlighting the impact the number of samples has on the stability of the classifiers performance. Also in TREC-COVID, both SVM and MLP are not statistically better than the random classifier. On the other hand, however, DTr obtains preliminary yet promising performance (i.e., 67% for accuracy and 66% for F1) and it is significantly better than the random classifier for both measures. This suggests the presence of underlying patterns within data and the potential of the considered features to distinguish between semantically hard ("semantic") and easy ("lexical") queries.

Relying on the results of the decision tree, we further investigate the features importance to determine which features correlate the most with the semantic gap, causing the query to be either semantically easy or hard. We only consider the decision tree built for TREC-COVID, since results on OHSUMED are not statistically significant. The first two features by importance are QDF (number of documents containing at least one query term) and WSDF (number of documents containing only query terms and no synonyms). Their importance is, respectively, 17.6% and 16.7%. These features are both related to the distribution of the query terms in the collection. For this reason, they are likely used by the classifier to identify semantically easy queries. Indeed, a large number of documents containing query terms is a potential indicator for the performance of lexical models. Besides, the fact that WSDF is the second most important feature is further evidence of this: if several documents contain query terms, but only few of them also present synonyms of such terms, then the semantic gap will likely be small and lexical models will be effective. The third feature by importance is meanSCQ (12.1%): a pre-retrieval score based on IDF. A query having a high meanSCQ score indicates that lexical models are likely to perform well. This is due to the fact that most of the lexical approaches rely on heuristics based on IDF. Note also that SCQ is considered a "low performing" feature for predicting query performance [27]. Nevertheless, in our scenario, it gains relevance in determining which models category performs best for the query. The fourth feature is stdNCPQT (the standard deviation over the number of concepts for each polysemous word in the query). This feature has importance 10.1%, which indicates the relevance of polysemy in determining the models category: having (several) query words with different concepts associated makes the query ambiguous and semantic models best suited to address it. The two subsequent features are sumNSEQC (8.8%) and maxNSEQC (7.3%). They represent, respectively, the sum and the maximum of the number of synset elements related to the query concepts. Both features are related to synonymy, which is another relevant aspect that identifies the presence of the semantic gap between queries and documents. Similarly to our intuition about polysemy, having query words with several synonyms suggests that semantic models are best suited to retrieve relevant documents. Other features with decreasing, but significant, importance are SDF (5.7%) and sumNCPQT (5.2%). As for the remaining features, they are negligible according to the classifier.
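For reference, the per-feature importances discussed above can be obtained from the fitted decision tree as sketched below; the impurity-based importances are what scikit-learn exposes, and the variable names are placeholders for the TREC-COVID feature matrix and labels.

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    def rank_features(features, labels, feature_names, seed=0):
        """Rank pre-retrieval features by the impurity-based importance of a decision tree."""
        tree = DecisionTreeClassifier(random_state=seed).fit(features, labels)
        importance = pd.Series(tree.feature_importances_, index=feature_names)
        return importance.sort_values(ascending=False)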
Thus, even though the results are preliminary and indicate there is large room for improvement, they still highlight that the considered lexical- and semantic-oriented features relate with models categories. Therefore, they can be used as a starting point to investigate the presence of the semantic gap within test collections and to build better approaches for category selection.

4. Conclusion

In this work, we investigated the impact of the semantic gap on query performance, which features can be used to determine this gap, and whether we can exploit them to classify queries as semantically easy ("lexical") or hard ("semantic"). Using ANOVA, we quantified the interaction between topics, queries, and models categories. The results showed that such interaction is large, highlighting the importance of choosing the proper models category for retrieval performance. Surprisingly, the analyses indicated that topics interact more than queries with models categories. This suggests that the semantic gap relates more to the underlying information need than the different topic formulations. Then, we proposed a labeling strategy, based on relative improvements, to annotate queries as "semantic" or "lexical". Finally, we explored two different sets of pre-retrieval features and we deployed a prototype classifier to understand the effectiveness of such features when used to classify queries. We obtained promising results, which suggest a correlation between the considered features and the models categories.
As future work, we plan to further explore feature extraction and selection. In this regard, the preliminary results suggested that the considered features relate to models categories, but also highlighted that such correlation is weak and needs to be improved. Beyond pre-retrieval features, we also plan to investigate features related to retrieval models – thus getting closer to a post-retrieval setup. In this sense, we plan to adopt a pseudo-relevance strategy that considers retrieved documents and looks at the distribution of lexical- and semantic-oriented features in such documents. Finally, we plan to consider other domains besides the medical one, such as the news or Web domains.

Acknowledgments

The work was partially supported by the ExaMode project, as part of the European Union H2020 program under Grant Agreement no. 825292.

References

[1] B. Koopman, G. Zuccon, P. Bruza, L. Sitbon, M. Lawley, Information retrieval as semantic inference: a Graph Inference model applied to medical search, Inf. Retr. Journal 19 (2016) 6–37.
[2] R. Zhao, W. I. Grosky, Narrowing the semantic gap - improved text-based web document retrieval using visual features, IEEE Trans. Multimedia 4 (2002) 189–200.
[3] T. Edinger, A. M. Cohen, S. Bedrick, K. H. Ambert, W. R. Hersh, Barriers to Retrieving Patient Information from Electronic Health Record Data: Failure Analysis from the TREC Medical Records Track, in: AMIA 2012, American Medical Informatics Association Annual Symposium, AMIA, 2012.
[4] B. Koopman, G. Zuccon, Why Assessing Relevance in Medical IR is Demanding, in: Proc. of the Medical Information Retrieval Workshop at SIGIR co-located with the 37th annual international ACM SIGIR conference (ACM SIGIR 2014), volume 1276 of CEUR Workshop Proceedings, CEUR-WS.org, 2014, pp. 16–19.
[5] M. Agosti, S. Marchesin, G. Silvello, Learning Unsupervised Knowledge-Enhanced Representations to Reduce the Semantic Gap in Information Retrieval, ACM Trans. Inf. Syst. 38 (2020) 38:1–38:48.
[6] H. Li, J. Xu, Semantic Matching in Search, Found. Trends Inf. Retr. 7 (2014) 343–469.
[7] S. Marchesin, A. Purpura, G. Silvello, Focal elements of neural information retrieval models. An outlook through a reproducibility study, Inf. Process. Manag. 57 (2020) 102109.
[8] A. Rutherford, ANOVA and ANCOVA. A GLM Approach, 2nd ed., John Wiley & Sons, New York, USA, 2011.
[9] W. Hersh, C. Buckley, T. J. Leone, D. Hickam, Ohsumed: An interactive retrieval evaluation and new large test collection for research, in: Proc. of the 17th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, Dublin, Ireland, 3-6 July 1994, Springer London, London, 1994, pp. 192–201.
[10] E. Voorhees, T. Alam, S. Bedrick, D. Demner-Fushman, W. R. Hersh, K. Lo, K. Roberts, I. Soboroff, L. L. Wang, TREC-COVID: Constructing a Pandemic Information Retrieval Test Collection, SIGIR Forum 54 (2021).
[11] C. Van Gysel, M. De Rijke, E. Kanoulas, Neural vector spaces for unsupervised information retrieval, ACM Trans. Inf. Syst. 36 (2018) 1–25.
[12] B. He, I. Ounis, A Query-based Pre-retrieval Model Selection Approach to Information Retrieval, in: Proc. of Computer-Assisted Information Retrieval (Recherche d'Information et ses Applications) - RIAO 2004, 7th International Conference, University Of Avignon, France, April 26-28, 2004, CID, 2004, pp. 706–719.
[13] N. Balasubramanian, J. Allan, Learning to Select Rankers, in: Proc. of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2010, Geneva, Switzerland, July 19-23, 2010, ACM, 2010, pp. 855–856.
[14] N. Balasubramanian, Query-Dependent Selection of Retrieval Alternatives, Ph.D. thesis, University of Massachusetts Amherst, 2011.
[15] O. Levi, F. Raiber, O. Kurland, I. Guy, Selective Cluster-Based Document Retrieval, in: Proc. of the 25th ACM International Conference on Information and Knowledge Management, CIKM 2016, Indianapolis, IN, USA, October 24-28, 2016, ACM, 2016, pp. 1473–1482.
[16] A. Griffiths, H. C. Luckhurst, P. Willett, Using Interdocument Similarity Information in Document Retrieval Systems, J. Am. Soc. Inf. Sci. 37 (1986) 3–11.
[17] X. Liu, W. B. Croft, Experiments on retrieval of optimal clusters, Technical Report, University of Massachusetts Amherst, 2006.
[18] A. Tombros, R. Villa, C. J. van Rijsbergen, The Effectiveness of Query-Specific Hierarchic Clustering in Information Retrieval, Inf. Process. Manag. 38 (2002) 559–582.
[19] S. Cronen-Townsend, Y. Zhou, W. B. Croft, Predicting Query Performance, in: Proc. of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, August 11-15, 2002, Tampere, Finland, 2002, pp. 299–306.
[20] C. Hauff, D. Hiemstra, F. de Jong, A Survey of Pre-Retrieval Query Performance Predictors, in: Proc. CIKM, 2008, pp. 1419–1420.
[21] J. Mothe, L. Tanguy, Linguistic Features to Predict Query Difficulty, in: Proc. of the Predicting query difficulty - methods and applications workshop, co-located with the ACM Conference on Research and Development in Information Retrieval, SIGIR 2005, 2005, pp. 7–10.
[22] Y. Zhao, F. Scholer, Y. Tsegay, Effective Pre-retrieval Query Performance Prediction Using Similarity and Variability Evidence, in: Proc. of the 30th European Conference on IR Research, ECIR 2008, Glasgow, UK, March 30-April 3, 2008, 2008, pp. 52–64.
[23] B. He, I. Ounis, Inferring Query Performance Using Pre-retrieval Predictors, in: Proc. of the String Processing and Information Retrieval, 11th International Conference, SPIRE 2004, Padova, Italy, October 5-8, 2004, 2004, pp. 43–54.
[24] J. A. Aslam, V. Pavlu, Query Hardness Estimation Using Jensen-Shannon Divergence Among Multiple Scoring Functions, in: Proc. of the 29th European Conference on IR Research, ECIR 2007, Rome, Italy, April 2-5, 2007, 2007, pp. 198–209.
[25] H. Roitman, Query Performance Prediction using Passage Information, in: Proc. of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR 2018, Ann Arbor, MI, USA, July 08-12, 2018, 2018, pp. 893–896.
[26] A. Shtok, O. Kurland, D. Carmel, Query Performance Prediction Using Reference Lists, ACM Trans. Inf. Syst. 34 (2016) 19:1–19:34.
[27] G. Faggioli, O. Zendel, J. S. Culpepper, N. Ferro, F. Scholer, An enhanced evaluation framework for query performance prediction, in: Proc. of the 43rd European Conference on IR Research, ECIR 2021, Virtual Event, March 28 - April 1, 2021, Springer International Publishing, Cham, 2021, pp. 115–129.
[28] L. L. Wang, K. Lo, Y. Chandrasekhar, R. Reas, J. Yang, D. Burdick, D. Eide, K. Funk, Y. Katsis, R. Kinney, Y. Li, Z. Liu, W. Merrill, P. Mooney, D. Murdick, D. Rishi, J. Sheehan, Z. Shen, B. Stilson, A. Wade, K. Wang, N. X. R. Wang, C. Wilhelm, B. Xie, D. Raymond, D. S. Weld, O. Etzioni, S. Kohlmeier, CORD-19: The COVID-19 open research dataset, 2020. arXiv:2004.10706.
[29] W. B. Croft, D. Metzler, T. Strohman, Search Engines: Information Retrieval in Practice, Addison-Wesley, Reading (MA), USA, 2009.
[30] S. E. Robertson, H. Zaragoza, The Probabilistic Relevance Framework: BM25 and Beyond, Found. Trends Inf. Retr. 3 (2009) 333–389.
[31] C. Zhai, Statistical Language Models for Information Retrieval. A Critical Review, Found. Trends Inf. Retr. 2 (2008) 137–213.
[32] G. Amati, C. J. van Rijsbergen, Probabilistic Models of Information Retrieval based on measuring the Divergence From Randomness, ACM Trans. Inf. Syst. 20 (2002) 357–389.
[33] İ. Kocabaş, B. T. Dinçer, B. Karaoğlan, A nonparametric term weighting method for information retrieval based on measuring the divergence from independence, Information Retrieval 17 (2014) 153–176.
[34] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient Estimation of Word Representations in Vector Space, in: Proc. of the 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, 2013.
[35] I. Vulić, M. F. Moens, Monolingual and Cross-Lingual Information Retrieval Models Based on (Bilingual) Word Embeddings, in: Proc. of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2015, pp. 363–372.
[36] X. Liu, J. Y. Nie, A. Sordoni, Constraining Word Embeddings by Prior Knowledge - Application to Medical Information Retrieval, in: Proc. of the 12th Asia Information Retrieval Societies Conference, AIRS 2016, Springer, 2016, pp. 155–167.
[37] N. Ferro, D. Harman, CLEF 2009: Grid@CLEF Pilot Track Overview, in: Proc. CLEF, 2009.
[38] D. Banks, P. Over, N.-F. Zhang, Blind Men and Elephants: Six Approaches to TREC data, Information Retrieval 1 (1999) 7–34.
[39] N. Ferro, G. Silvello, Toward an Anatomy of IR System Component Performances, J. Assoc. Inf. Sci. Technol. 69 (2018) 187–200.
[40] E. Voorhees, D. Samarov, I. Soboroff, Using Replicates in Information Retrieval Evaluation, ACM Trans. Inf. Syst. 36 (2017) 12:1–12:21.
[41] G. Faggioli, N. Ferro, System effect estimation by sharding: A comparison between ANOVA approaches to detect significant differences, in: Proc. of the 43rd European Conference on IR Research, ECIR 2021, Virtual Event, March 28 - April 1, 2021, Springer International Publishing, Cham, 2021, pp. 33–46.
[42] J. S. Culpepper, G. Faggioli, N. Ferro, O. Kurland, Topic difficulty: Collection and query formulation effects, ACM Trans. Inf. Syst. 40 (2021). doi:10.1145/3470563.
[43] P. Sedgwick, Multiple significance tests: the Bonferroni correction, BMJ 344 (2012).