=Paper=
{{Paper
|id=Vol-2741/paper-13
|storemode=property
|title=Some Reflections on the Use of Structural Equation
Modeling for Investigating the
Causal Relationships that Affect Search Engine Results
|pdfUrl=https://ceur-ws.org/Vol-2741/paper-13.pdf
|volume=Vol-2741
|authors=Massimo Melucci
|dblpUrl=https://dblp.org/rec/conf/sigir/Melucci20
}}
==Some Reflections on the Use of Structural Equation Modeling for Investigating the Causal Relationships that Affect Search Engine Results==
Massimo Melucci (massimo@unipd.it), University of Padua

Abstract. Search engines and recommender systems pervade everyday life and continuously make decisions, on the user’s behalf, regarding what information should be retrieved and how it should be ranked in order to meet the user’s information needs. Unfortunately, bias affects automated decision systems and, as a consequence, fairness cannot be taken for granted. Understanding whether and how bias affects search results can be a necessary and useful step for every user and designer who aims to investigate why the systems fail or succeed. In this paper, we discuss whether Structural Equation Modeling (SEM) can be a useful methodology to investigate the causal relationships between the variables describing the content representation and retrieval processes of search engines and recommender systems. Understanding how and why a retrieval system retrieves certain documents can help understand when the system provides biased results. To this end, we provide a general illustration of the issues and the potential of SEM for causal discovery in Information Retrieval.

===1 Introduction===
The evidence of the widespread support provided by search engines and other web applications for human activities should cause all of us to feel frightened by the possible bias occurring in the search engine result pages which provide information relevant to the end user’s information needs [1]. The possible bias on the web calls for theories, methods, data structures and algorithms that help end users recognize unfair results and find alternative sources of information. Recent scientific initiatives such as research workshops [6, 14] and legislative initiatives [7] signal the importance of fair and transparent search and recommendation systems.
The search for the reasons why a search or recommendation system, and in general an Information Retrieval (IR) system, provides a certain result page to the end user suggested to us, as well as to other researchers, that we should frame the problem within a causality scheme. The internal mechanism of a search engine that ranks a document can be viewed as a cause and the actual document ranking as an effect. In this paper, we suggest that Structural Equation Modeling (SEM) can be considered as a possible framework for searching for the causes which can produce an observed effect, that is, the internal mechanisms of a search engine producing an observed document ranking.

A structural equation model may be an appropriate conceptual instrument for investigating the causal relationships between the mechanics of a search engine and the search engine result pages for two reasons: these pages can be represented as matrices of unit-feature pairs, i.e. data matrices, and the features that are observed in the pages can correspond to the manifest variables of the structural equation model. The regression coefficients and the inter-variable covariances can represent possible causal relationships which will be tested for a given sample of data, i.e. the observed covariance matrix. In [13] we illustrated how SEM can be utilized to describe the mechanisms underlying retrieval. This paper instead discusses some aspects regarding causality and SEM in the context of IR.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). BIRDS 2020, 30 July 2020, Xi’an, China (online).

(Figure 1 diagram labels: World Wide Web, pages, Indexing and Content Representation, index, Retrieval and Document Ranking, result page, End User, queries, clicks; Context: culture, bias, language, …; Design Decisions; Context: intent, needs, language, ….)

Fig. 1.
The architecture of a retrieval system and how design and context may affect web page retrieval and ranking and ultimately the end user’s experience.

===2 How Context can Affect an IR===
An IR system is a computer system designed and implemented to perform IR activities, i.e. those activities aiming to deliver all and only relevant information to meet a user’s information need. A search engine is the most popular implementation of such a system, yet IR technology pervades any device such as desktop computers and smartphones. Figure 1 depicts the general architecture of an IR system and how it relates to contextual factors and design decisions. The web pages that are crawled from the web are indexed in order to build indexes and a representation of the content. The web page content is then retrieved and ranked to answer the end user’s queries, respond to the clicks and eventually meet the end user’s information needs.

Because of the size of the index, a system has to decide which tiny subset of pages should be retrieved and how these pages should be ranked and displayed on the user’s device screen. Of course, the system cannot decide on its own, since it is just an implementation of data structures and algorithms designed by programmers, engineers and scientists. The theoretical models, such as deep neural networks, that are utilized as the basis of the system implicitly decide what to retrieve and how to rank the indexed web pages. These models might be so complex that even their designers may not be aware of all the internal mechanisms driving the system toward a certain ranking. Such ignorance may be a source of bias since the designers of a system may unknowingly bring some hidden selection mechanisms into operation.

Fig. 2. A pictorial representation of a structural equation model (diagram labels: Manifest Variables, Latent Variables).
===3 What is SEM===
SEM refers to the family of multivariate statistical methods that aim to specify, estimate and fit a system of linear equations to a dataset [3]. The variables of the linear equations can be either exogenous or endogenous and, at the same time, either manifest or latent, thus yielding four types of variable (latent endogenous, latent exogenous, manifest endogenous, manifest exogenous). A structural equation model can be specified in general terms as follows (Figure 2):

:<math>\eta = B\eta + \Gamma\xi + \zeta \qquad (1)</math>

:<math>y = \Lambda_y \eta + \epsilon, \qquad x = \Lambda_x \xi + \delta \qquad (2)</math>

where Eq. 1 is called the “latent model” and Eq. 2 is called the “measurement model”. In particular, <math>\eta</math> is a vector of endogenous latent variables, <math>\xi</math> is a vector of exogenous latent variables, <math>x</math> is a vector of exogenous manifest variables, and <math>y</math> is a vector of endogenous manifest variables. <math>B</math>, <math>\Gamma</math>, <math>\Lambda_y</math> and <math>\Lambda_x</math> are coefficient matrices, whereas <math>\zeta</math>, <math>\epsilon</math> and <math>\delta</math> are vectors of errors uncorrelated with the variables. A particular linear model is defined by imposing constraints on the coefficient matrices. The constraints imposed on a structural equation model correspond to the “causal” relationships; for example, a null coefficient means that no causal relationship can be assumed between two variables.

Fig. 3. How a structural equation model relates to a retrieval system.

===4 How to use SEM to Understand IR Systems===
The first step of the procedure to understand how a retrieval system decides about retrieval and ranking is the collection of the data of the manifest variables.
The manifest variables that should be collected at this step are related to two main conceptual entities of the search process, i.e. documents and users; Figure 3 provides a generic illustration of the relationships between the variables of a structural equation model and the components of a retrieval system under scrutiny.

The documents, such as web pages, are the containers of the information delivered to the end user by the retrieval system. The documents are mainly a source of exogenous variables, since they are input data for the retrieval system, which is neither allowed nor required to update the document content. The data that is observed from documents may be
* structured data such as time and location, e.g. the Uniform Resource Locator (URL) and the embedded metadata,
* semi-structured data such as logical organization using titles, sections and paragraphs, and
* unstructured data such as keywords, which are measured in terms of Term Frequency × Inverse Document Frequency (TFIDF) and Best Match 25 (BM25) weights.
The amount and the quality of the document features depend on the degree to which the applications performing causal analysis are allowed to access the index(es). In the event the applications cannot access the indexes, all the document features can only be extracted from the search engine result pages as illustrated in [13]. Moreover, the effectiveness of document parsing may be crucial; for example, the document author’s gender may be inferred, thus providing the data necessary to check whether the retrieval system is biasing the results according to gender. Document quality can be considered one of the latent variables that are associated with the documents and are a source of retrieval bias; see for example [2, 18].

The users can be viewed as sources of streams of data rather than containers of data. They express their own information needs mainly as (streams of) queries, clicks and display or dwelling time intervals.
The amount and the quality of the data that can be gathered from the users depend on the type of experimental setting prepared for the investigation of the cause-effect relationships; in the event of controlled experiments, the users can be selected and trained by the scientists and the data can be gathered in a laboratory environment; otherwise, the search engine query logs are the main source of data. The design of a user study can be a complex task depending on the aims and the available resources [9, 10]; for example, user profiles can be built and utilized for the purposes of the causal analysis to understand whether some user features, such as gender, affect how he or she formulates queries and, in turn, how document retrieval is affected. User intent can be considered one of the latent variables that can be associated with the users and that can be a source of retrieval bias. In this respect, some notable research in user simulation was carried out in Information Science [5].

The definition of a structural equation model is perhaps the most crucial step because it is the step when the analyst can add constraints to the structural equation model and, in this way, express the possible causes and effects under investigation. However, the use of SEM might be complicated. The highly supervised nature of SEM should be regarded as both a strength and a weakness. When specifying a structural equation model, an analyst imposes her own viewpoint on the mechanics of a retrieval system; the addition of one constraint or the removal of another constraint is definitely a subjective decision. The mechanical procedure of model fitting is nothing but a computational procedure providing a measure of fit and the significance level thereof.
As discussed in [3] and [11], the lack of rejection of a structural equation model for a given sample of data or sample covariance matrix cannot be regarded as the sign that the model is the true and only one; there might be other, equally acceptable structural equation models which might be significantly different from the tested model. Nevertheless, the supervision exercised by the analyst guarantees that the discovery of causal relationships is not completely entrusted to an automated system, which might in turn be affected by the bias which is supposed to affect the scrutinized retrieval system.

===5 About Causality, SEM and IR===
In this section, we discuss the relationships between interpretability, explainability and causality within the domain of utilization of SEM in IR. The main aim of this discussion is to understand the way structural equation models may explain the principles that govern the mechanics of a retrieval system and eventually the reasons behind the production of a certain search engine result page. Interpretability, explainability and causality are three broad concepts which appear to be interrelated and, in some cases, largely overlapping; there are many other related concepts, yet they were already addressed in [12], for example, and we will not address them further in this paper. Despite the overlap, we consider interpretability, explainability and causality as three distinct yet related notions.

In particular, cause-effect relationships cannot be explained without interpretation. In the context of IR, an interpreter of the mechanics of a retrieval system is necessary to explain the reasons behind the production of a certain search engine result page. Our argument that cause-effect relationships cannot be explained without interpretation rests on the meaning of the words.
Causality (from the Latin causa, i.e. a thing which should be regarded as fact or event) is the relationship between two things where one thing makes the other happen, as if there were a sort of physical action between these two things. On the other hand, interpretability (from the Latin inter, i.e. between, pretĭum, i.e. price, and habĭle, i.e. led by hand) refers to a broker which is able to find an agreement between two parties who are trading goods and services. The implicit assumption is that (1) each party is using its own language that cannot be understood by the other party and (2) the interpreter has the ability to perform a sort of translation. In the context of automated decision systems, the broker, i.e. the interpreter, is thus the agent which translates the model’s language to the user’s language. Therefore, we consider interpretation in the sense that the model might not be directly understandable at all and an interpreter is needed to make the model understandable for the user. Instead, explainability (from the Latin ex, i.e. out of, and planus, i.e. plain) refers to the ability to remove the folds in order to make the internal meaning and content explicit. The action of removing the folds, i.e. explaining the causes and the effects, can only be performed by an interpreter which is able to understand the languages of both parties. Indeed, an automated decision system such as a retrieval system cannot by assumption explain what caused it to produce a certain search result page; furthermore, the end user of a retrieval system cannot be asked to understand the causes because of the high complexity of the retrieval system. We argue that a structural equation model may play the role of an interpreter between a retrieval system and the end user, thus making an explanation of the system’s internal mechanics possible and explicit.
A structural equation model may play the role of an interpreter for the following reasons.
# First, the model can organize latent and manifest variables as well as exogenous and endogenous variables within a network of paths, also known as a path diagram, making possible “causes and effects” easily readable. In a path diagram, each symbol has a well-defined meaning; ovals represent latent variables, boxes represent observed variables, and oriented edges represent the “causal” relationship between the variable at the base of the edge and the variable at the head of the edge [15–17]. It is the graphical and visual feature of path diagrams that makes a structural equation model an effective means of explanation to human users.
# Second, the graphical representation of a structural equation model has a dual representation consisting of a system of linear equations where (a) each node of the path diagram corresponds to a variable, (b) each edge between two nodes corresponds to the co-occurrence of the corresponding variables in at least one equation, (c) the weight of an edge corresponds to one coefficient of an equation, and (d) the direction of an edge matters because the coefficient with which one variable contributes to another in an equation would in general differ from the coefficient obtained if the two variables were swapped.
# Moreover, whether a variable is endogenous or exogenous can be induced and represented by the linear system, since the dependent variables are endogenous whereas the independent variables are exogenous.
# Finally, latent variables can be expressed with a structural equation model because the variables in a system of linear equations do not necessarily correspond to observable quantities; from a mathematical point of view, it is only required that they be defined over the real field.
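The duality between a path diagram and a system of linear equations described in the list above can be sketched in a few lines of Python. This is only an illustrative sketch: the variable names, the edges and the coefficients below are hypothetical assumptions rather than estimates taken from any retrieval system, and NumPy is assumed to be available. Each oriented edge fills one entry of the coefficient matrices B or Γ of the latent model, and the direction of the edge decides which entry is filled.

```python
import numpy as np

# Hypothetical path diagram: each edge is (source, target, coefficient).
# Names and weights are illustrative assumptions, not estimates from data.
latent_endogenous = ["relevance", "rank_score"]    # components of eta
latent_exogenous = ["doc_quality", "user_intent"]  # components of xi
edges = [
    ("doc_quality", "relevance", 0.6),
    ("user_intent", "relevance", 0.3),
    ("relevance", "rank_score", 0.8),
]


def build_matrices(edges, endo, exo):
    """Return (B, Gamma) such that eta = B @ eta + Gamma @ xi + zeta."""
    B = np.zeros((len(endo), len(endo)))
    Gamma = np.zeros((len(endo), len(exo)))
    for source, target, weight in edges:
        row = endo.index(target)  # the head of an edge is always endogenous
        if source in endo:
            B[row, endo.index(source)] = weight
        else:
            Gamma[row, exo.index(source)] = weight
    return B, Gamma


def reduced_form(B, Gamma, xi, zeta):
    """Solve eta = B eta + Gamma xi + zeta for eta, i.e. (I - B) eta = Gamma xi + zeta."""
    identity = np.eye(B.shape[0])
    return np.linalg.solve(identity - B, Gamma @ xi + zeta)


B, Gamma = build_matrices(edges, latent_endogenous, latent_exogenous)
xi = np.array([1.0, 2.0])  # hypothetical values of the exogenous latent variables
zeta = np.zeros(2)         # errors set to zero to keep the example deterministic
eta = reduced_form(B, Gamma, xi, zeta)
```

In this sketch, a missing edge leaves a zero in the corresponding matrix entry, which is the null-coefficient constraint stating that no causal relationship is assumed between the two variables; reversing an edge would move its weight to a different entry and therefore yield a different system of equations.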
Some misconceptions surrounding SEM might negatively affect the utilization of structural equation models to explain the mechanics of a retrieval system to the end user. In [4] the authors explained that SEM had often been underestimated or even misunderstood and tried to clarify the false beliefs and uncover the potential of structural equation models. The belief that structural equation models aim to establish causal relations from associations alone is one significant misconception with respect to the use of structural equation models to explain the mechanics of a retrieval system. In contrast, a structural equation model has a different objective, since it provides the methods to test the hypothesis that a sample fits a structural equation model, the latter being the hypothetical explanation provided by the expert user a priori. As a consequence, SEM can never establish causal relations from associations alone because those relations are already encoded into the structural equation model postulated by the user. The demystification of the role played by SEM in establishing causal relations is important within the context of a retrieval system, since a structural equation model would be a means to explain how a retrieval system works and, as a consequence, a means to understand whether and to what degree the system follows some fairness guidelines in delivering the search results to a certain query. If a structural equation model were the outcome of an automated causal discovery algorithm, the assessment of the degree to which the system follows some fairness guidelines would be moved from the level of retrieval and search to the level of causal discovery, and the problem would thus be recreated there.
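The idea that SEM tests whether a sample fits a model postulated a priori can be illustrated with the maximum likelihood discrepancy function commonly used in the SEM literature (see for example [3]), which compares a sample covariance matrix S with the covariance matrix Σ implied by the model. The sketch below is a minimal illustration under stated assumptions: the one-factor measurement model, its loadings, the error variances and the sample size are hypothetical, all parameters are fixed rather than estimated, and NumPy is assumed to be available. In practice the free parameters would be estimated by minimizing the same function.

```python
import numpy as np


def implied_covariance(Lambda, Phi, Theta):
    """Covariance of the manifest variables x = Lambda xi + delta,
    where Phi = cov(xi) and Theta = cov(delta)."""
    return Lambda @ Phi @ Lambda.T + Theta


def ml_discrepancy(S, Sigma):
    """Maximum likelihood fit function:
    log|Sigma| + tr(S Sigma^-1) - log|S| - p, which is 0 when Sigma equals S."""
    p = S.shape[0]
    _, logdet_sigma = np.linalg.slogdet(Sigma)
    _, logdet_s = np.linalg.slogdet(S)
    return logdet_sigma + np.trace(S @ np.linalg.inv(Sigma)) - logdet_s - p


# Hypothetical one-factor measurement model with three manifest variables.
Lambda = np.array([[1.0], [0.8], [0.6]])  # loadings (assumed values)
Phi = np.array([[1.0]])                   # variance of the latent variable
Theta = np.diag([0.5, 0.4, 0.3])          # error variances (assumed values)
Sigma = implied_covariance(Lambda, Phi, Theta)

# Two hypothetical "observed" covariance matrices.
S_matching = Sigma.copy()  # a sample that the model reproduces exactly
S_uncorrelated = np.eye(3) # a sample with no covariance between features

fit_matching = ml_discrepancy(S_matching, Sigma)
fit_uncorrelated = ml_discrepancy(S_uncorrelated, Sigma)

# With a hypothetical sample size n, (n - 1) * F is the usual test statistic.
n = 200
statistic_uncorrelated = (n - 1) * fit_uncorrelated
```

The discrepancy is zero for the sample that the postulated model reproduces exactly and strictly positive for the other one. The computation only measures how well the data agree with the relations the analyst has already encoded; it does not discover those relations.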
The fact that SEM can only assess whether the observed data fit a certain structural equation model allows the end user to keep control of the most delicate step of the process, which is understanding the causes of the production of a search result. Thus, only the mere fitting is left to the computational methods.

The notion of causality or causation can be another source of issues when using SEM in general and within IR in particular. It is well known that correlation does not imply causation; for example, an increase in weight does not cause an increase in height although they are two correlated variables when measured in a population. Moreover, causality cannot be framed only within the situations in which manipulation is considered the sole source of cause, i.e. the statement “no causation without manipulation” [8] can hardly be taken for granted [4]. However, manipulation can play a role in the context of retrieval systems in more than one way:
* First, the end user can manipulate the structural equation model and, as a result, the possible causes and effects; indeed, the addition or removal of nodes or edges corresponds to the process of imposing constraints on the system of linear equations and, more importantly, to stating that a non-zero coefficient means a possible cause-effect relation between two variables.
* Second, manipulation is physically possible in the case of a retrieval system. If a structural equation model fits a certain sample of data, it is possible to investigate the effects on the endogenous variables, such as the rank of a retrieved document, of variations of the exogenous variables, such as the frequency of a query term.
In other words, if the end user observed that the rank of relevant documents improved because of an increase in the frequency of a query term, the retrieval model could be tailored to this observation and the retrieval system could change the retrieval score and, as a result, the rank of the documents matching the query term.

===6 Future Directions===
The size of the data processed to fit a structural equation model can be a significant issue for future research. In the context of IR, the primary source of data consists of the search engine result page. Such a page constitutes a non-random sample, i.e. the sample cannot be drawn uniformly from the collection of the pages crawled by the search engine. The sample cannot be random because the cause-effect mechanisms underlying the performance of a retrieval system, and their impact in terms of bias on the user’s information needs, can only be observed from the top-ranked retrieved pages, since the top-ranked hits are the ones accessed by the users. The size of the data processed to fit a structural equation model is not a computational issue; it is rather a statistical issue, since sample size affects model estimation and significance testing. The issue of size, coupled with the fact that the top-ranked hits should be considered, implies that the sampled pages are not equal; in particular, the top ten ranked pages, which usually correspond to the displayed “blue links”, are the most frequently accessed by the end user. How to consider these top-ten or top-twenty hits should be addressed in future work.

===References===
# Baeza-Yates, R.: Bias on the web. Communications of the ACM 61(6), 54–61 (2018)
# Bendersky, M., Croft, W.B., Diao, Y.: Quality-biased ranking of web documents. In: Proceedings of the Fourth ACM International Conference on Web Search and Data Mining. pp. 95–104. WSDM ’11, ACM, New York, NY, USA (2011). https://doi.org/10.1145/1935826.1935849, http://doi.acm.org/10.1145/1935826.1935849
# Bollen, K.A.: Structural Equations with Latent Variables. Wiley (1989)
# Bollen, K.A., Pearl, J.: Eight myths about causality and structural equation models. In: Morgan, S.L. (ed.) Handbook of causal analysis for social research, pp. 301–328. Springer (2013)
# Borlund, P.: A study of the use of simulated work task situations in interactive information retrieval evaluations: A meta-evaluation. Journal of Documentation 72(3), 394–413 (2016). https://doi.org/10.1108/JD-06-2015-0068, https://dblp.org/rec/journals/jd/Borlund16
# Cuzzocrea, A., Bonchi, F., Gunopulos, D.: CIKM 2018 co-located workshops summary. In: Proceedings of CIKM. pp. 2309–2311. CIKM ’18, ACM, New York, NY, USA (2018). https://doi.org/10.1145/3269206.3274267, http://doi.acm.org/10.1145/3269206.3274267
# European Parliament, Council of the European Union: Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation). https://eur-lex.europa.eu/eli/reg/2016/679/oj (2016)
# Holland, P.W.: Statistics and causal inference. Journal of the American Statistical Association 81(396), 945–960 (1986), http://www.jstor.org/stable/2289064
# Kelly, D.: Measuring online information seeking context. Part 1: Background and method. Journal of the American Society for Information Science and Technology 57(13), 1729–1739 (2006). https://doi.org/http://dx.doi.org/10.1002/asi.v57:13
# Kelly, D.: Measuring online information seeking context. Part 2: Findings and discussion. Journal of the American Society for Information Science and Technology 57(13), 1862–1874 (2006). https://doi.org/http://dx.doi.org/10.1002/asi.v57:14
# Kline, R.B.: Principles and Practice of Structural Equation Modeling. The Guilford Press, fourth edn. (2015)
# Lipton, Z.C.: The mythos of model interpretability. Queue 16(3), 30:31–30:57 (Jun 2018). https://doi.org/10.1145/3236386.3241340, http://doi.acm.org/10.1145/3236386.3241340
# Melucci, M., Paggiaro, A.: Evaluation of information retrieval systems using structural equation modeling. Computer Science Review 31, 1–98 (2019)
# Olteanu, A., Garcia-Gathright, J., de Rijke, M., Ekstrand, M.D.: Workshop on fairness, accountability, confidentiality, transparency, and safety in information retrieval (FACTS-IR). In: Proceedings of SIGIR. pp. 1423–1425. SIGIR ’19, ACM, New York, NY, USA (2019). https://doi.org/10.1145/3331184.3331644, http://doi.acm.org/10.1145/3331184.3331644
# Wright, S.: On the nature of size factors. Genetics pp. 367–374 (1918)
# Wright, S.: Correlation and causation. Journal of Agricultural Research 20, 557–585 (1921)
# Wright, S.: The method of path coefficients. Annals of Mathematical Statistics 5, 161–215 (1934)
# Zhou, Y., Croft, W.B.: Document quality models for web ad hoc retrieval. In: Proceedings of CIKM. pp. 331–332. CIKM ’05, ACM, New York, NY, USA (2005). https://doi.org/10.1145/1099554.1099652, http://doi.acm.org/10.1145/1099554.1099652