Assessing the Semantic Difficulty of Queries*
Discussion Paper

Guglielmo Faggioli¹, Stefano Marchesin¹
¹ University of Padova, Padova, Italy

Abstract
Traditional Information Retrieval (IR) models, also known as lexical models, are hindered by the semantic gap, which refers to the mismatch between different representations of the same underlying concept. Semantic models have been developed to address this gap. Semantic and lexical models exploit complementary signals that are best suited for different types of queries. For this reason, the two model categories should not be used interchangeably, but rather alternated depending on the query. It is therefore important to identify queries where the semantic gap is prominent and semantic models thus prove effective. In this work, we quantify the impact of using semantic or lexical models on different queries, and we show that the interaction between queries and model categories is large. We then propose a labeling strategy to classify queries as semantically hard or easy, and we deploy a prototype classifier to discriminate between them.

* The full paper was originally presented at DESIRES 2021 [1].
IIR 2021 – 12th Italian Information Retrieval Workshop, June 29th – July 1st, 2022, Milano, Italy
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

The semantic gap is a long-standing problem in Information Retrieval (IR) that refers to the difference between the machine-level description of document and query contents and the human-level interpretation of their meanings [2]. In other words, it represents the mismatch between users’ queries and the way retrieval models understand such queries [3]. The semantic gap affects any domain, but it is especially prominent in medical search [4, 5, 2]. For instance, a query containing the word “tumor” might not be answered effectively if the retrieval model does not identify the synonymy relationship between “tumor” and, for example, “neoplasm”. Conversely, given a query containing the term “cold”, a retrieval model might retrieve erroneous documents if it does not distinguish between the different meanings the term assumes depending on the context. Such queries are known as semantically hard queries [6].

Traditional IR models, known as lexical models, fail to effectively address semantically hard queries. Semantic models were thus introduced to bridge the semantic gap [7] and to overcome the limitations of lexical models. However, semantic models have been shown to provide signals complementary to lexical models, which prove effective for semantically hard queries but less so for other queries [8]. Thus, it becomes necessary to identify which category of models – lexical or semantic – best suits a user query given the document collection at hand. In other words, we need to understand which inherent features of queries and documents make lexical or semantic models more effective.
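To make the mismatch concrete, the following toy sketch (illustrative only, not taken from the paper; the query, document, and synonym map are made up) shows that a purely lexical matcher assigns zero similarity when query and document use different surface forms of the same concept, and that bridging the vocabulary mismatch is enough to recover a non-zero score.

```python
# Toy illustration of the semantic gap: a lexical matcher scores zero when the
# query and document use different surface forms for the same concept.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

query = "tumor treatment"
document = "therapies for patients diagnosed with a neoplasm"

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform([query, document])
print(cosine_similarity(tfidf[0], tfidf[1]))  # 0.0: no shared terms

# A hand-made synonym map stands in for the relations a semantic model would learn.
synonyms = {"neoplasm": "tumor", "therapies": "treatment"}
normalized_doc = " ".join(synonyms.get(w, w) for w in document.split())
tfidf = vectorizer.fit_transform([query, normalized_doc])
print(cosine_similarity(tfidf[0], tfidf[1]))  # > 0 once the gap is bridged
```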
To this end, we address the following research questions:

RQ1: How and to what extent does the semantic gap impact query performance?
RQ2: What features determine the prominence of the semantic gap within queries?

For RQ1, we investigate and compare the impact of lexical and semantic models on different topics. How large is the interaction between topics and model categories? To what extent is this interaction reflected in the different topic formulations (i.e., queries)? For RQ2, we explore a set of well-known features that relate to lexical and semantic models. In particular, we seek to understand whether pre-retrieval features can be used to categorize queries as semantically easy or hard.

2. Experimental Analysis

We consider two collections in the following analyses: OHSUMED [9] and TREC-COVID (Round 1) [10]. Regarding lexical models, we consider TF-IDF [11], BM25 [12], QLM [13], DFR [14], and DFI [15]. As for semantic ones, we consider W2V [16], NVSM [17], and the three variants of SAFIR [6]. We evaluate models using Average Precision (AP). Table 1 reports the performance on both collections.

Table 1
Mean Average Precision (MAP) of the models on the OHSUMED and TREC-COVID collections.

              Lexical                                  Semantic
              TF-IDF  BM25   QLM    DFR    DFI        W2V    NVSM   SAFIR_s  SAFIR_p  SAFIR_sp
OHSUMED       0.524   0.620  0.577  0.641  0.592      0.568  0.595  0.604    0.610    0.612
TREC-COVID    0.362   0.488  0.434  0.496  0.467      0.482  0.455  0.463    0.461    0.466

2.1. RQ1: Topic and Category Interaction

Several works have shown that queries strongly interact with retrieval models in determining their performance [18, 19, 20]. This means that two models might have similar average performance on a set of queries but, when examined at the query level, their performance might vary greatly. The same consideration applies to lexical and semantic models: some queries are best suited to semantic models, others to lexical ones [8, 6]. We are thus interested in quantifying the interaction between queries and model categories.

To determine whether the model category – that is, lexical or semantic – has a significant effect on performance, we conduct an ANOVA on the runs obtained with the considered retrieval models. ANOVA is a well-known statistical technique that identifies statistically significant differences among experimental conditions, and several works in IR have applied it to determine the effect of different factors on the overall performance of an IR system [18, 21, 19, 22]. ANOVA models the explained variable, in our case Average Precision (AP), as a linear combination of the effects of the factors in the experimental setup, plus an error component that accounts for the variance in the data left unexplained by the model.

From the ANOVA on OHSUMED, we observe that the effect of the model category alone is not significant (p-value > 0.05): the lexical and semantic categories are not statistically significantly different, and we cannot say that either performs best in absolute terms. The topic-category interaction, instead, is significant, and its ω² strength of association of 34.7% indicates a large effect. This means that the category significantly impacts how good the results on a specific topic will be. Such a finding suggests that the semantic gap is an inherent property of the topics, related less to the specific retrieval models and more to their category. Further supporting this intuition, the interaction between the topic and the category is larger than the effect of the model alone. Thus, if we understand whether a topic is lexical or semantic, we can achieve large performance improvements.
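The analysis above can be reproduced with standard statistical tooling. The sketch below is a minimal illustration, not the authors' code: it assumes a hypothetical per_topic_ap.csv file with one AP value per (topic, system) pair and hypothetical column names, fits a two-way model with topic and model category as factors, and derives the ω² strength of association from the resulting ANOVA table.

```python
# Minimal sketch of a two-way ANOVA on per-topic AP, with topic and model
# category (lexical vs. semantic) as factors. File and column names are hypothetical.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# df has one row per (topic, system) pair:
#   topic    - topic identifier
#   category - "lexical" or "semantic"
#   ap       - Average Precision of that system on that topic
df = pd.read_csv("per_topic_ap.csv")

model = ols("ap ~ C(topic) + C(category) + C(topic):C(category)", data=df).fit()
anova = sm.stats.anova_lm(model, typ=2)

# Omega squared (strength of association) for each factor:
#   w2 = (SS_factor - df_factor * MS_error) / (SS_total + MS_error)
ms_error = anova.loc["Residual", "sum_sq"] / anova.loc["Residual", "df"]
ss_total = anova["sum_sq"].sum()
anova["omega_sq"] = (anova["sum_sq"] - anova["df"] * ms_error) / (ss_total + ms_error)
print(anova[["sum_sq", "df", "PR(>F)", "omega_sq"]])
```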
As for TREC-COVID, each topic is represented by four different formulations: query, description, narrative, and the concatenation of query and description. Each formulation can only be used in relation to its own topic; formulations thus have to be treated as a factor nested inside the topic. From the results on TREC-COVID, we observe that both the topic and its formulations have a large effect. The importance of the formulation factor indicates that, with an appropriate topic formulation, the performance on the topic can change greatly. ANOVA shows that the interaction between the topic and the model category is large (ω² = 39%) – larger than the effect of both the category alone (2.1%) and the model (30.4%). The interaction between the topic formulation and the model category is also large (ω² = 19.7%), although not as large as the one between topic and category. This suggests that the semantic gap relates more to the underlying information need than to the different topic formulations.

We hypothesize that the relation between topics and model categories, highlighted by ANOVA, links to the semantic gap and to the association of a topic with its relevant documents. For instance, if a topic has many relevant documents containing synonyms of the query terms, then a semantic model might be best suited: in this case, most of the topic formulations do not contain all the possible query synonyms and will thus be affected by the semantic gap. Conversely, topics that can be easily represented by a few keywords – likely present in relevant documents – have less ambiguous formulations, which are best suited to lexical models.

2.2. RQ2: Feature Importance for the Semantic Gap

Section 2.1 showed the impact of choosing the proper category depending on the target query. If we could classify queries as semantically hard or easy, we could also adopt an IR model from the right category. To train a classifier for this task, we need i) to label queries as “semantic” or “lexical”, and ii) to find a set of features that correlate with these aspects of the queries.

The first aspect we address is the labeling of queries as “semantic” or “lexical”. The absence of a rigorous definition of semantically hard or easy queries prevents us from labeling them manually. Therefore, we propose to label queries according to how the two model categories perform on them. To this end, we first compute the average performance of each model. Then, for each query, we perform three steps: first, we compute for each model the relative improvement over its average performance; secondly, we determine whether the relative improvement is, on average, greater for lexical or semantic models; finally, we label the query as “semantic” if the improvement over the average model performance is greater for semantic models than for lexical ones, and as “lexical” otherwise. Note that we do not consider absolute performance to label queries, since even a poorly performing lexical method like TF-IDF (cf. Table 1) might prove effective when the query is semantically easy. Thus, we focus on relative improvements, which provide signals more robust to performance outliers.
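A minimal sketch of this labeling procedure is given below. It assumes a hypothetical ap dictionary mapping each model (names follow Table 1) to its per-query AP values, and it illustrates the three steps above rather than reproducing the authors' implementation.

```python
# Sketch of the labeling strategy: a query is labeled "semantic" if semantic
# models improve more, relative to their own average performance, than lexical
# models do; otherwise it is labeled "lexical". Data structures are hypothetical.
import numpy as np

LEXICAL = ["TF-IDF", "BM25", "QLM", "DFR", "DFI"]
SEMANTIC = ["W2V", "NVSM", "SAFIR_s", "SAFIR_p", "SAFIR_sp"]

def label_queries(ap: dict[str, dict[str, float]]) -> dict[str, str]:
    # ap[model][query] = Average Precision of `model` on `query`
    mean_ap = {m: np.mean(list(qs.values())) for m, qs in ap.items()}
    labels = {}
    queries = next(iter(ap.values())).keys()
    for q in queries:
        # relative improvement of each model over its own average performance
        rel = {m: (ap[m][q] - mean_ap[m]) / mean_ap[m] for m in ap}
        lex = np.mean([rel[m] for m in LEXICAL])
        sem = np.mean([rel[m] for m in SEMANTIC])
        labels[q] = "semantic" if sem > lex else "lexical"
    return labels
```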
To address the second aspect of RQ2, we explore two different sets of pre-retrieval features: lexical- and semantic-oriented features. Lexical-oriented features are based on query and corpus statistics and depend on the distribution of terms within the collection. Regarding semantic-oriented features, we first perform semantic indexing on the OHSUMED and TREC-COVID collections as in [6]. Then, we adopt features similar to those proposed by Mothe and Tanguy [23], but, instead of considering only query-based features, we take into account both query- and corpus-based features. The considered features are reported and described in the original paper [1].

We then employ three well-known classification models to understand the effectiveness of the considered pre-retrieval features when used to classify queries into lexical and semantic categories: Decision Tree (DTr), Support Vector Machine (SVM), and Multi-Layer Perceptron (MLP). To perform the experiments, we label queries using the process described above. For each classifier, we perform grid search with cross-validation to obtain the best hyper-parameters. We adopt 5-fold cross-validation for TREC-COVID, whereas we use 3-fold cross-validation for OHSUMED to avoid obtaining single-class folds due to the low number of samples. The results of the different classifiers are reported in Table 2, where we report the mean and standard deviation over the different folds. To determine the significance of the results (marked with †), we apply a randomization test with Bonferroni correction for multiple comparisons [24].

Table 2
Classifiers performance. We report mean and standard deviation over 3 and 5 folds for OHSUMED and TREC-COVID, respectively. † indicates statistical significance over the random classifier.

        OHSUMED                            TREC-COVID
        Accuracy        F1                 Accuracy         F1
DTr     0.626 (0.089)   0.586 (0.057)      0.668 (0.093)†   0.659 (0.141)†
SVM     0.687 (0.074)   0.611 (0.079)      0.623 (0.053)    0.610 (0.136)
MLP     0.740 (0.081)   0.675 (0.146)      0.628 (0.217)    0.590 (0.269)

The preliminary – yet promising – results highlight that the considered lexical- and semantic-oriented features relate to the model categories. Therefore, they can be used as a starting point to investigate the presence of the semantic gap within test collections and to build better approaches for category selection.
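As a rough illustration of this setup, the following scikit-learn sketch runs grid search with stratified cross-validation for the three classifiers. The feature matrix X, the labels y produced by the strategy above, and the parameter grids are all assumptions made for the example, not the configuration used in the paper.

```python
# Illustrative sketch of the classification setup: grid search with
# cross-validation over the three classifiers considered in the paper.
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

classifiers = {
    "DTr": (DecisionTreeClassifier(), {"max_depth": [3, 5, None]}),
    "SVM": (SVC(), {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}),
    "MLP": (MLPClassifier(max_iter=1000), {"hidden_layer_sizes": [(16,), (32,)]}),
}

def evaluate(X: np.ndarray, y: np.ndarray, n_folds: int) -> None:
    # 3 folds for OHSUMED, 5 for TREC-COVID, as in the paper
    cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=42)
    for name, (clf, grid) in classifiers.items():
        search = GridSearchCV(clf, grid, scoring="f1_macro", cv=cv)
        search.fit(X, y)
        print(name, search.best_params_, round(search.best_score_, 3))
```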
3. Conclusion

We investigated the impact of the semantic gap on query performance, which features can be used to determine this gap, and whether we can exploit them to classify queries as semantically easy (“lexical”) or hard (“semantic”). Using ANOVA, we studied the interaction between IR models and information needs, observing that the semantic gap relates more to the underlying information need than to the different topic formulations. Then, we proposed a labeling strategy, based on relative improvements, to annotate queries as “semantic” or “lexical”. Finally, we explored two different sets of pre-retrieval features and deployed a prototype classifier to understand the effectiveness of such features when used to classify queries. We obtained promising results, which suggest a link between the used features and the model categories.

References

[1] G. Faggioli, S. Marchesin, What makes a query semantically hard?, in: Proc. of the Second International Conference on Design of Experimental Search & Information REtrieval Systems, Padova, Italy, September 15-18, 2021, volume 2950 of CEUR Workshop Proceedings, CEUR-WS.org, 2021, pp. 61–69.
[2] B. Koopman, G. Zuccon, P. Bruza, L. Sitbon, M. Lawley, Information retrieval as semantic inference: a Graph Inference model applied to medical search, Inf. Retr. Journal 19 (2016) 6–37.
[3] R. Zhao, W. I. Grosky, Narrowing the semantic gap - improved text-based web document retrieval using visual features, IEEE Trans. Multimedia 4 (2002) 189–200.
[4] T. Edinger, A. M. Cohen, S. Bedrick, K. H. Ambert, W. R. Hersh, Barriers to Retrieving Patient Information from Electronic Health Record Data: Failure Analysis from the TREC Medical Records Track, in: AMIA 2012, American Medical Informatics Association Annual Symposium, AMIA, 2012.
[5] B. Koopman, G. Zuccon, Why Assessing Relevance in Medical IR is Demanding, in: Proc. of the Medical Information Retrieval Workshop at SIGIR co-located with the 37th Annual International ACM SIGIR Conference (ACM SIGIR 2014), volume 1276 of CEUR Workshop Proceedings, CEUR-WS.org, 2014, pp. 16–19.
[6] M. Agosti, S. Marchesin, G. Silvello, Learning Unsupervised Knowledge-Enhanced Representations to Reduce the Semantic Gap in Information Retrieval, ACM Trans. Inf. Syst. 38 (2020) 38:1–38:48.
[7] H. Li, J. Xu, Semantic Matching in Search, Found. Trends Inf. Retr. 7 (2014) 343–469.
[8] S. Marchesin, A. Purpura, G. Silvello, Focal elements of neural information retrieval models. An outlook through a reproducibility study, Inf. Process. Manag. 57 (2020) 102109.
[9] W. Hersh, C. Buckley, T. J. Leone, D. Hickam, OHSUMED: An interactive retrieval evaluation and new large test collection for research, in: Proc. of the 17th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, Dublin, Ireland, 3-6 July 1994, Springer London, London, 1994, pp. 192–201.
[10] E. Voorhees, T. Alam, S. Bedrick, D. Demner-Fushman, W. R. Hersh, K. Lo, K. Roberts, I. Soboroff, L. L. Wang, TREC-COVID: Constructing a Pandemic Information Retrieval Test Collection, SIGIR Forum 54 (2021).
[11] W. B. Croft, D. Metzler, T. Strohman, Search Engines: Information Retrieval in Practice, Addison-Wesley, Reading (MA), USA, 2009.
[12] S. E. Robertson, H. Zaragoza, The Probabilistic Relevance Framework: BM25 and Beyond, Found. Trends Inf. Retr. 3 (2009) 333–389.
[13] C. Zhai, Statistical Language Models for Information Retrieval. A Critical Review, Found. Trends Inf. Retr. 2 (2008) 137–213.
[14] G. Amati, C. J. van Rijsbergen, Probabilistic Models of Information Retrieval based on measuring the Divergence From Randomness, ACM Trans. Inf. Syst. 20 (2002) 357–389.
[15] İ. Kocabaş, B. T. Dinçer, B. Karaoğlan, A nonparametric term weighting method for information retrieval based on measuring the divergence from independence, Information Retrieval 17 (2014) 153–176.
[16] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient Estimation of Word Representations in Vector Space, in: Proc. of the 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013.
[17] C. Van Gysel, M. de Rijke, E. Kanoulas, Neural vector spaces for unsupervised information retrieval, ACM Trans. Inf. Syst. 36 (2018) 1–25.
[18] D. Banks, P. Over, N.-F. Zhang, Blind Men and Elephants: Six Approaches to TREC data, Information Retrieval 1 (1999) 7–34.
[19] N. Ferro, G. Silvello, Toward an Anatomy of IR System Component Performances, J. Assoc. Inf. Sci. Technol. 69 (2018) 187–200.
[20] J. S. Culpepper, G. Faggioli, N. Ferro, O. Kurland, Topic difficulty: Collection and query formulation effects, ACM Transactions on Information Systems 40 (2021).
[21] E. Voorhees, D. Samarov, I. Soboroff, Using Replicates in Information Retrieval Evaluation, ACM Trans. Inf. Syst. 36 (2017) 12:1–12:21.
[22] G. Faggioli, N. Ferro, System effect estimation by sharding: A comparison between ANOVA approaches to detect significant differences, in: Proc. of the 43rd European Conference on IR Research, ECIR 2021, Virtual Event, March 28 - April 1, 2021, Springer International Publishing, Cham, 2021, pp. 33–46.
[23] J. Mothe, L. Tanguy, Linguistic Features to Predict Query Difficulty, in: Proc. of the Predicting Query Difficulty - Methods and Applications Workshop, co-located with the ACM Conference on Research and Development in Information Retrieval, SIGIR 2005, 2005, pp. 7–10.
[24] P. Sedgwick, Multiple significance tests: the Bonferroni correction, BMJ 344 (2012).