
Assessing the Semantic Difficulty of Queries*

Discussion Paper

Guglielmo Faggioli

Stefano Marchesin

University of Padova, Padova, Italy

2021

Traditional Information Retrieval (IR) models, also known as lexical models, are hindered by the semantic gap, which refers to the mismatch between different representations of the same underlying concept. To address this gap, semantic models have been developed. Semantic and lexical models exploit complementary signals that are best suited for different types of queries. For this reason, these model categories should not be used interchangeably, but should rather be properly alternated depending on the query. Therefore, it is important to identify queries where the semantic gap is prominent and thus semantic models prove effective. In this work, we quantify the impact of using semantic or lexical models on different queries, and we show that the interaction between queries and model categories is large. Then, we propose a labeling strategy to classify queries into semantically hard or easy, and we deploy a prototype classifier to discriminate between them.

1. Introduction

The semantic gap is a long-standing problem in Information Retrieval (IR) that refers to the difference between the machine-level description of document and query contents and the human-level interpretation of their meanings [2]. In other words, it represents the mismatch between users’ queries and the way retrieval models understand such queries [3].

The semantic gap affects any domain, but it is prominent in medical search [4, 5, 2]. For instance, a query containing the word “tumor” might not be effectively answered if the retrieval model does not identify the synonymy relationship between “tumor” and, for example, “neoplasm”. Conversely, given a query containing the term “cold”, a retrieval model might retrieve erroneous documents if it does not distinguish between the different meanings the term “cold” assumes depending on the context. These queries are known as semantically hard queries [6].

Traditional IR models, which are known as lexical models, fail to effectively address semantically hard queries. Semantic models were thus introduced to bridge the semantic gap [7] and to overcome the limitations of lexical models. However, semantic models have been shown to provide signals complementary to those of lexical models, which prove effective for semantically hard queries but less so for other queries [8]. Thus, it becomes necessary to identify which category of models, lexical or semantic, best suits a user query given the document collection at hand. In other words, we need to understand which inherent features of queries and documents make lexical or semantic models more effective. To this end, we address the following research questions:

RQ1 How and to what extent does the semantic gap impact query performance?

RQ2 What features determine the prominence of the semantic gap within queries?

For RQ1, we investigate and compare the impact of lexical and semantic models on different topics. How large is the interaction between topics and model categories? To what extent is this interaction reflected in the different topic formulations (i.e., queries)? For RQ2, we explore a set of well-known features that relate to lexical and semantic models. In particular, we seek to understand whether pre-retrieval features can be used to categorize queries as semantically easy or hard.

2. Experimental Analysis

We consider two collections in the following analyses: OHSUMED [9] and TREC-COVID (Round 1) [10]. Regarding lexical models, we consider TF-IDF [11], BM25 [12], QLM [13], DFR [14], and DFI [15]. As for semantic ones, we consider W2V [16], NVSM [17], and the three variants of SAFIR [6]. We evaluate models using Average Precision (AP). Table 1 reports the performance on both collections.

2.1. RQ1: Topic and Category Interaction

Several works have shown that queries strongly interact with retrieval models in determining their performance [18, 19, 20]. This means that two models might have similar average performance on a set of queries but, when examined at the query level, their performance might vary greatly. This consideration also applies to lexical and semantic models: some queries are best suited to semantic models, others to lexical ones [8, 6]. We are thus interested in quantifying the interaction between queries and model categories.

To determine whether the model category, that is, lexical or semantic, has a significant effect on performance, we conduct an ANOVA on the runs obtained with the considered retrieval models. ANOVA is a well-known statistical technique for identifying statistically significant differences among experimental conditions. Several works in IR applied ANOVA to determine the effect of different factors on the overall performance of an IR system [18, 21, 19, 22]. ANOVA models the explained variable, which in our case is Average Precision (AP), as a linear combination of the effect of each factor in the experimental setup, plus an error component. The error term accounts for the variance in the data unexplained by the model. From the ANOVA on our data, we observe that the effect of the model category alone is not significant (p-value > 0.05), which means that the lexical and semantic categories are not statistically significantly different. We therefore cannot say that either lexical or semantic models perform best in absolute terms. The topic-category interaction, instead, is significant, and the ω² value for the strength of association of 34.7% indicates a large effect. This means that the category significantly affects how good the results on a specific topic will be. Such a finding suggests that the semantic gap is an inherent property of the topics, related less to the specific retrieval models and more to their category.
To further support this intuition, the interaction between the topic and the category is larger than the effect of the model alone. Thus, if we understand when a topic is lexical or semantic, we can achieve large performance improvements.

As for TREC-COVID, each topic is represented by four different formulations: query, description, narrative, and the concatenation of query and description. Each formulation of a topic can only be used in relation to that topic; thus, formulations have to be treated as a factor nested inside the topic. From the results on TREC-COVID we observe that both the topic and its formulations have a large effect. The importance of the formulation factor indicates that, with an appropriate topic formulation, the performance on the topic can change greatly. ANOVA shows that the interaction between the topic and the model category is large (ω² = 39%), larger than the effect of both the category alone (2.1%) and the model (30.4%). The interaction between the topic formulation and the model category is also large (ω² = 19.7%), although not as large as the one between topic and category. This suggests that the semantic gap relates more to the underlying information need than to the different topic formulations.
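
The kind of analysis described above can be illustrated with a minimal sketch. It is not the authors' setup: it assumes a simplified balanced two-way design (topic and category, with the models of each category acting as replicates), ignores the model and formulation factors, and runs on made-up AP values. The effect size uses one common estimator, ω² = df·(F − 1) / (df·(F − 1) + N); whether the paper uses exactly this estimator is an assumption. The function name `two_way_anova` and the toy data are hypothetical.

```python
from statistics import mean


def two_way_anova(scores):
    """Balanced two-way ANOVA with interaction (illustrative sketch).

    scores maps (topic, category) -> list of AP values, one per model of that
    category; every cell must hold the same number of values (at least two).
    Returns {effect: {"F": ..., "omega2": ...}}.
    """
    topics = sorted({t for t, _ in scores})
    cats = sorted({c for _, c in scores})
    a, b = len(topics), len(cats)
    n = len(scores[(topics[0], cats[0])])  # replicates (models) per cell
    total = a * b * n                      # total number of observations

    grand = mean(v for cell in scores.values() for v in cell)
    t_mean = {t: mean(v for c in cats for v in scores[(t, c)]) for t in topics}
    c_mean = {c: mean(v for t in topics for v in scores[(t, c)]) for c in cats}
    cell_mean = {key: mean(cell) for key, cell in scores.items()}

    # Sums of squares for the two main effects and their interaction.
    ss = {
        "topic": b * n * sum((t_mean[t] - grand) ** 2 for t in topics),
        "category": a * n * sum((c_mean[c] - grand) ** 2 for c in cats),
        "topic:category": n * sum(
            (cell_mean[(t, c)] - t_mean[t] - c_mean[c] + grand) ** 2
            for t in topics for c in cats
        ),
    }
    df = {"topic": a - 1, "category": b - 1, "topic:category": (a - 1) * (b - 1)}

    # Error term: variance left unexplained inside each (topic, category) cell.
    ss_err = sum((v - cell_mean[key]) ** 2 for key, cell in scores.items() for v in cell)
    ms_err = ss_err / (a * b * (n - 1))

    out = {}
    for effect in ss:
        f = (ss[effect] / df[effect]) / ms_err
        # Effect-size estimate; negative values are usually read as "no effect".
        omega2 = df[effect] * (f - 1) / (df[effect] * (f - 1) + total)
        out[effect] = {"F": f, "omega2": omega2}
    return out


# Made-up AP values: t1 favours lexical models, t2 favours semantic ones.
ap = {
    ("t1", "lexical"): [0.60, 0.62, 0.58],
    ("t1", "semantic"): [0.30, 0.32, 0.28],
    ("t2", "lexical"): [0.20, 0.22, 0.18],
    ("t2", "semantic"): [0.50, 0.52, 0.48],
}
anova = two_way_anova(ap)
for effect, stats in anova.items():
    print(f"{effect:15s} F={stats['F']:8.2f} omega2={stats['omega2']:.3f}")
```

In this toy data the two categories have the same mean AP, so the category effect vanishes while the topic-category interaction dominates, which mirrors the qualitative pattern reported above. Computing p-values would additionally require the F distribution, which the sketch leaves out.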

We hypothesize that the relation between topics and model categories, highlighted by ANOVA, links to the semantic gap and to the association of a topic with its relevant documents. For instance, if a topic has many relevant documents containing synonyms of the query terms, then a semantic model might be best suited. In fact, in this case, most of the topic formulations do not contain all the possible query synonyms and will thus be affected by the semantic gap. Conversely, topics that can be easily represented by few keywords, which are likely present in relevant documents, have less ambiguous formulations, which are best suited to lexical models.

2.2. RQ2: Features Importance for the Semantic Gap

Section 2.1 showed the impact of choosing the proper category depending on the target query. If we could classify queries as semantically hard or easy, we might also adopt an IR model from the right category. To train a classifier for doing that, we need i) to label queries as “semantic” or “lexical”, and ii) to find a set of features that correlate with such aspects of the queries.

The first aspect we address is the labeling of queries as “semantic” or “lexical”. The absence of a rigorous definition of semantically hard or easy for a query prevents us from manually labeling queries as “semantic” or “lexical”. Therefore, we propose to label queries according to how the two model categories perform on them. To this end, we first compute the average performance of each model. Then, for each query, we perform the following three steps. First, we compute for each model the relative improvement over its average performance. Second, we determine whether the relative improvement is, on average, greater for lexical or semantic models. Finally, we label the considered query as “semantic” if the improvement over the average model performance is greater for semantic models than for lexical ones; otherwise, we label the query as “lexical”. Note that we do not consider absolute performance to label queries, since even a poorly performing lexical method like TF-IDF (cf. Table 1) might prove effective when the query is semantically easy. Thus, we focus on relative improvements, which provide signals that are more robust to performance outliers.

To address the second aspect of RQ2, we explore two different sets of pre-retrieval features: lexical- and semantic-oriented features. Lexical-oriented features are based on query and corpus statistics and depend on the distribution of terms within the collection. Regarding semantic-oriented features, we first perform semantic indexing on the OHSUMED and TREC-COVID collections as in [6]. Then, we adopt features similar to those proposed by Mothe and Tanguy [23], but, instead of considering only query-based features, we take into account both query- and corpus-based features. The considered features are reported and described in the original paper [1].
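
The three labeling steps described above can be sketched as follows. This is an illustration under assumptions rather than the authors' code: relative improvement is taken to be (AP on the query − model average) / model average, ties are assigned to “lexical”, and the function name `label_queries`, the query identifiers, and the AP values are all hypothetical.

```python
from statistics import mean


def label_queries(ap, category):
    """Label each query as "semantic" or "lexical".

    ap: {model: {query: AP}}, with the same queries for every model.
    category: {model: "lexical" | "semantic"}.
    Assumes every model has a non-zero average AP.
    """
    # Average performance of each model over all queries.
    avg = {m: mean(scores.values()) for m, scores in ap.items()}
    queries = list(next(iter(ap.values())))

    labels = {}
    for q in queries:
        # Step 1: relative improvement of each model over its own average.
        rel = {m: (ap[m][q] - avg[m]) / avg[m] for m in ap}
        # Step 2: average the relative improvement within each category.
        lex = mean(r for m, r in rel.items() if category[m] == "lexical")
        sem = mean(r for m, r in rel.items() if category[m] == "semantic")
        # Step 3: the category with the larger average improvement wins
        # (ties go to "lexical", which is an assumption of this sketch).
        labels[q] = "semantic" if sem > lex else "lexical"
    return labels


# Made-up AP values for two lexical and two semantic models.
category = {"BM25": "lexical", "TF-IDF": "lexical", "W2V": "semantic", "NVSM": "semantic"}
ap = {
    "BM25": {"q1": 0.6, "q2": 0.2},
    "TF-IDF": {"q1": 0.3, "q2": 0.1},
    "W2V": {"q1": 0.2, "q2": 0.4},
    "NVSM": {"q1": 0.1, "q2": 0.3},
}
labels = label_queries(ap, category)
print(labels)
```

Because each model is compared with its own average, a weak model such as TF-IDF in this toy data still contributes a clear signal: it improves on q1 by the same relative amount as BM25, even though its absolute AP is half as large.
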
We then employ three well-known classification models to understand the effectiveness of the considered pre-retrieval features when used to classify queries into lexical and semantic categories. The adopted models are: Decision Tree (DTr), Support Vector Machine (SVM), and Multi-Layer Perceptron (MLP). To perform experiments, we label queries using the process described above. For each classifier, we perform grid search with cross-validation to obtain the best hyper-parameters. We adopt 5-fold cross-validation for TREC-COVID, whereas we use 3-fold cross-validation for OHSUMED to avoid obtaining single-class folds due to the low number of samples. The results of the different classifiers are reported in Table 2, where we report mean and standard deviation over the different folds. To determine the significance of the results (marked as †), we apply a randomization test with Bonferroni correction for multiple comparisons [24].
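
A setup of this kind could look like the following scikit-learn sketch. Everything specific in it is an assumption made for illustration: the features are synthetic random numbers standing in for the pre-retrieval features, the hyper-parameter grids are arbitrary, accuracy is used as the score because the source does not name the measure, and stratified folds are used as one way to keep both classes in every fold.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for pre-retrieval features (60 queries, 4 features)
# and for the "lexical" (0) / "semantic" (1) labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=60) > 0).astype(int)

# Classifier name -> (estimator, hyper-parameter grid); grids are illustrative.
candidates = {
    "DTr": (DecisionTreeClassifier(random_state=0), {"max_depth": [2, 4, None]}),
    "SVM": (
        make_pipeline(StandardScaler(), SVC()),
        {"svc__C": [0.1, 1, 10], "svc__kernel": ["linear", "rbf"]},
    ),
    "MLP": (
        make_pipeline(StandardScaler(), MLPClassifier(max_iter=2000, random_state=0)),
        {"mlpclassifier__hidden_layer_sizes": [(8,), (16,)]},
    ),
}

# Stratified folds keep both classes in every fold; n_splits=3 would be the
# analogue of the smaller OHSUMED setting.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, cv=cv, scoring="accuracy")
    search.fit(X, y)
    best = search.best_index_
    results[name] = (
        search.best_params_,
        search.cv_results_["mean_test_score"][best],  # mean over folds
        search.cv_results_["std_test_score"][best],   # std over folds
    )

for name, (params, mean_acc, std_acc) in results.items():
    print(f"{name}: {mean_acc:.3f} +/- {std_acc:.3f} with {params}")
```

Reporting the cross-validated score of the configuration selected on the same folds is slightly optimistic; a nested cross-validation would separate model selection from evaluation, at the cost of more data per fold.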

The preliminary, yet promising, results highlight that the considered lexical- and semantic-oriented features relate to the model categories. Therefore, they can be used as a starting point to investigate the presence of the semantic gap within test collections and to build better approaches for category selection.
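
The significance marking mentioned for Table 2 relies on a randomization test with Bonferroni correction. The source does not state which quantities are compared or which test statistic is used, so the sketch below assumes paired per-fold scores of two systems and a two-sided sign-flip test on the mean difference. The function names and the scores are hypothetical.

```python
import random


def paired_randomization_test(a, b, n_permutations=10000, seed=0):
    """Two-sided paired randomization (sign-flip) test on the mean difference.

    a, b: paired scores of two systems (e.g., one score per fold).
    Returns an estimated p-value.
    """
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs) / len(diffs))
    hits = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the sign of each paired difference is arbitrary.
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs) / len(diffs)
        if abs(permuted) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_permutations + 1)


def bonferroni(p_values, alpha=0.05):
    """Flag the p-values that stay significant after Bonferroni correction."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


# Made-up paired scores for a classifier and a baseline.
classifier_scores = [0.71, 0.68, 0.74, 0.70, 0.69, 0.73, 0.72, 0.70]
baseline_scores = [0.55, 0.57, 0.54, 0.58, 0.56, 0.55, 0.57, 0.54]
p_value = paired_randomization_test(classifier_scores, baseline_scores)
print(p_value, bonferroni([p_value, 0.03, 0.2]))
```

With only a handful of folds the number of distinct sign patterns is small (2^k for k pairs), which bounds how small the p-value can get; this is worth keeping in mind when the correction divides the significance level by the number of comparisons.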

3. Conclusion

We investigated the impact of the semantic gap on query performance, which features can be used to determine this gap, and whether we can exploit them to classify queries as semantically easy (“lexical”) or hard (“semantic”). Using ANOVA we studied the interaction between IR models and information need, observing that the semantic gap relates more to the underlying information need than to the different topic formulations. Then, we proposed a labeling strategy, based on relative improvements, to annotate queries as “semantic” or “lexical”. Finally, we explored two different sets of pre-retrieval features and we deployed a prototype classifier to understand the effectiveness of such features when used to classify queries. We obtained promising results, which suggest a link between the used features and the model categories.

References

[1] G. Faggioli, S. Marchesin, What makes a query semantically hard?, in: Proc. of the Second International Conference on Design of Experimental Search & Information REtrieval Systems, Padova, Italy, September 15-18, 2021, volume 2950 of CEUR Workshop Proceedings, CEUR-WS.org, 2021, pp. 61-69.

[2] Koopman, G. Zuccon, Bruza, Sitbon, Lawley, Information retrieval as semantic inference: a Graph Inference model applied to medical search, Inf. Retr. Journal 19 (2016) 6-37.

[3] Zhao, W. I. Grosky, Narrowing the semantic gap - improved text-based web document retrieval using visual features, IEEE Trans. Multimedia 4 (2002) 189-200.

[4] Edinger, A. M. Cohen, Bedrick, K. H. Ambert, W. R. Hersh, Barriers to Retrieving Patient Information from Electronic Health Record Data: Failure Analysis from the TREC Medical Records Track, in: AMIA 2012, American Medical Informatics Association Annual Symposium, AMIA, 2012.

[5] Koopman, G. Zuccon, Why Assessing Relevance in Medical IR is Demanding, in: Proc. of the Medical Information Retrieval Workshop at SIGIR co-located with the 37th annual international ACM SIGIR conference (ACM SIGIR 2014), volume 1276 of CEUR Workshop Proceedings, CEUR-WS.org, 2014, pp. 16-19.

[6] Agosti, Marchesin, G. Silvello, Learning Unsupervised Knowledge-Enhanced Representations to Reduce the Semantic Gap in Information Retrieval, ACM Trans. Inf. Syst. 38 (2020) 38:1-38:48.

[7] Li, Xu, Semantic Matching in Search, Found. Trends Inf. Retr. 7 (2014) 343-469.

[8] Marchesin, Purpura, G. Silvello, Focal elements of neural information retrieval models. An outlook through a reproducibility study, Inf. Process. Manag. 57 (2020) 102109.

[9] Hersh, Buckley, T. J. Leone, Hickam, Ohsumed: An interactive retrieval evaluation and new large test collection for research, in: Proc. of the 17th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, Dublin, Ireland, 3-6 July 1994, Springer London, London, 1994, pp. 192-201.

[10] Voorhees, Alam, Bedrick, Demner-Fushman, W. R. Hersh, Lo, Roberts, I. Soboroff, L. L. Wang, TREC-COVID: Constructing a Pandemic Information Retrieval Test Collection, SIGIR Forum 54 (2021).

[11] W. B. Croft, Metzler, T. Strohman, Search Engines: Information Retrieval in Practice, Addison-Wesley, Reading (MA), USA, 2009.

[12] S. E. Robertson, H. Zaragoza, The Probabilistic Relevance Framework: BM25 and Beyond, Found. Trends Inf. Retr. 3 (2009) 333-389.

[13] Zhai, Statistical Language Models for Information Retrieval. A Critical Review, Found. Trends Inf. Retr. 2 (2008) 137-213.

[14] Amati, C. J. van Rijsbergen, Probabilistic Models of Information Retrieval based on measuring the Divergence From Randomness, ACM Trans. Inf. Syst. 20 (2002) 357-389.

[15] İ. Kocabaş, B. T. Dinçer, B. Karaoğlan, A nonparametric term weighting method for information retrieval based on measuring the divergence from independence, Information Retrieval 17 (2014) 153-176.

[16] Mikolov, Chen, G. Corrado, Dean, Efficient Estimation of Word Representations in Vector Space, in: Proc. of the 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, 2013.

[17] C. Van Gysel, M. De Rijke, E. Kanoulas, Neural vector spaces for unsupervised information retrieval, ACM Trans. Inf. Syst. 36 (2018) 1-25.

[18] D. Banks, P. Over, N.-F. Zhang, Blind Men and Elephants: Six Approaches to TREC data, Information Retrieval 1 (1999) 7-34.

[19] N. Ferro, G. Silvello, Toward an Anatomy of IR System Component Performances, J. Assoc. Inf. Sci. Technol. 69 (2018) 187-200.

[20] J. S. Culpepper, G. Faggioli, N. Ferro, O. Kurland, Topic difficulty: Collection and query formulation effects, ACM Transactions on Information Systems 40 (2021).

[21] E. Voorhees, D. Samarov, I. Soboroff, Using Replicates in Information Retrieval Evaluation, ACM Trans. Inf. Syst. 36 (2017) 12:1-12:21.

[22] G. Faggioli, N. Ferro, System effect estimation by sharding: A comparison between anova approaches to detect significant differences, in: Proc. of the 43rd European Conference on IR Research, ECIR 2021, Virtual Event, March 28 - April 1, 2021, Springer International Publishing, Cham, 2021, pp. 33-46.

[23] J. Mothe, L. Tanguy, Linguistic Features to Predict Query Difficulty, in: Proc. of the Predicting query difficulty - methods and applications workshop, co-located with the ACM Conference on research and Development in Information Retrieval, SIGIR 2005, 2005, pp. 7-10.

[24] P. Sedgwick, Multiple significance tests: the Bonferroni correction, BMJ 344 (2012).