=Paper=
{{Paper
|id=Vol-2067/paper19
|storemode=property
|title=Considering Importance of Information Sources during Aggregation of Alternative Rankings
|pdfUrl=https://ceur-ws.org/Vol-2067/paper19.pdf
|volume=Vol-2067
|authors=Vitaliy V. Tsyganok,Sergii V. Kadenko,Oleh V. Andriichuk
|dblpUrl=https://dblp.org/rec/conf/its2/TsyganokKA17
}}
==Considering Importance of Information Sources during Aggregation of Alternative Rankings==
Vitaliy V. Tsyganok, Sergii V. Kadenko, Oleh V. Andriichuk

Institute for Information Recording of the National Academy of Sciences of Ukraine, Kyiv, Ukraine

tsyganok@ipri.kiev.ua, seriga2009@gmail.com, oleh.andriichuk@i.ua

Abstract

The paper outlines several approaches to the aggregation of rankings of alternatives provided by a set of individual information sources (IS). It is shown that, depending on the dimensionality of the set of alternatives, different aggregation methods can be applied. Particular attention is given to weighted aggregation of alternative rankings provided by different IS. Several intuitive assumptions concerning IS weight calculation are set forth. It is assumed that an IS weight should reflect both the quantity and the quality of the information it provides, as well as the expert estimate of IS credibility based on previous experience of IS usage. Based on these assumptions, expressions for IS weight calculation are suggested. The approaches suggested in the paper can be applied to information provided by sources of different nature: experts, search engines, paper and online documents, etc.

Keywords: alternative ranking, information source, rank ordering, expert competence, relative weight, aggregate ranking.

1 Introduction

In the process of informational and analytical research, a common problem often faced by analysts is compiling a ranking of a set of objects or alternatives (products, electoral candidates, political parties, etc.) according to some criterion. This criterion may be explicitly formulated by a decision-maker or presented in the form of a particular topical query (such as “top comic actors”, “best online footwear stores”, or “downtown restaurants”).
Even these randomly selected formulations demonstrate that it is problematic to select the alternatives satisfying the respective queries based solely on numeric data, as there is no quantitative criterion according to which the alternatives could be measured. A common approach, summarized most explicitly, perhaps, by Thomas Saaty ([1]), is as follows: if you cannot measure the alternatives, the best way to describe them numerically is to compare them among themselves.

Conceptually, comparisons of alternatives according to some pre-defined criterion can be classified as either cardinal or ordinal estimates. Cardinal pair-wise comparisons of alternatives bear information about the numeric relation between them according to the given criterion, while ordinal comparisons indicate only the ordering of these alternatives. Cardinal pair-wise comparisons are suitable for relatively small numbers of alternatives lying within the same order of magnitude. Ordinal pair-wise comparisons are easier to obtain and process, and they can be used to compile rankings of large numbers of alternatives. The amount of data used in informational and analytical research is usually large, which is why ordering of alternatives is used more frequently. For example, this is one of the possible reasons why web search engines operate mostly with rankings (orderings) of references and not with numerically expressed ratings of some kind.

In order to compile a ranking of alternatives based on all available information, rankings coming from several information sources need to be taken into consideration and aggregated in some way. The authors of [2] show that the information space is influenced by multiple factors of a qualitative nature, which are difficult to formalize and describe in numeric terms. Based on these characteristics, it was shown in [3] that the information space is a typical example of a weakly structured subject domain.
In order to provide at least some analytical description of a weakly structured domain, data coming from any available sources needs to be taken into consideration. In this paper we try to address the most general case, when information sources can include paper and online documentation, web search results, and expert judgments. The problem with aggregating data coming from different information sources is that, in the general case, these sources have different weights, depending, again, on several factors. For example, it is reasonable to assume that a more credible information source, providing a larger amount of information, should be assigned a greater weight prior to data aggregation. While there are roughly a dozen rank aggregation methods (proposed by Borda, Condorcet, Kemeny, and others) that are commonly known and described in a multitude of publications, the problem of defining the weights of data sources (voters, judges, experts, search engines, etc.) during data aggregation is still relevant and open to discussion. So, in the following sections of our paper, we set forth a method that allows the analyst to aggregate data presented in the form of alternative rankings coming from different information sources, focusing particularly on different aspects of the information source weight definition problem.

2 Literature review

The problem of rank aggregation became relevant back in ancient times, when democratic societies and voting systems started to emerge. An extensive analysis of voting rules and their evolution is provided, for example, in [4]. The first rank aggregation methods that emerged in the process of democratic voting evolution, and that are still relevant today, were suggested in the late 18th century by Borda ([5]) and Condorcet ([6]). Since then, multiple approaches to the aggregation of individual preferences using different social welfare functions have been devised.
We find it most appropriate to mention the 1950s and 1960s, the time when Arrow’s impossibility theorem was formulated (see [7]) and attempts were made to bypass its constraints (particularly by Kemeny [8] and Copeland [9]). F. Aleskerov summarizes Arrovian aggregation rules in his book [10]. In the 2000s, with the growing popularity of online search engines, the problem of rank aggregation became even more relevant, as it became evident that, besides data obtained from voters, experts, and analysts, it was necessary to take online information sources into account (sometimes with minimal human involvement). In this context, we should mention several papers from the 2000s, particularly [11-13], in which the authors suggest and compare several approaches targeted at the aggregation of data from online information sources.

The definition of information source weights and the weighted aggregation of rankings is a separate matter. The methods of Borda and Condorcet can easily be extrapolated to the case when information sources have different weights, as shown by Totsenko in [14]. Kemeny’s median is a more problematic case when information sources have different weights. Ordinal factorial analysis methods, described in [15, 16], allow analysts to calculate weights based on available sets of alternative rankings previously provided by experts. This approach can be extrapolated to the calculation of weights of online information sources as well, if evaluators provide some global alternative ranking a priori. Dwork et al. in [11] stress the importance of involving human evaluators in the process of preliminary information source weight definition. However, their paper focuses mostly on comparative analysis of rank aggregation methods and does not address the problem of IS weight calculation directly.
As for IS weights, we can again mention Thomas Saaty ([1]) and his academic school (including Ramanathan & Ganesh [17], Forman & Peniwati [18], as well as Yang et al. [19]). However, their methods are primarily targeted at the aggregation of expert estimates represented in the form of pair-wise comparison matrices (PCM) provided in ratio scales (and not rankings). In the current paper we suggest information source weight calculation methods based on both preliminary human evaluation of information source credibility and the amount and quality of information provided by a given information source.

3 The problem statement

What is given: n information sources (IS1 – ISn). Each IS provides a ranking Ri that includes mi alternatives (i = 1..n). Information sources can be experts, search engines, analysts who monitor online content, paper or online documents, etc. We should stress that the numbers of alternatives provided by different information sources are, in the general case, different (mi ≠ mj; i, j = 1..n). We should find an aggregated (global) ranking of alternatives.

4 Solution ideas for the case of equal weights of information sources

We suggest building a solution algorithm based on available methods of aggregation of individual rankings (ordinal estimates). The most common ranking aggregation methods were proposed by Borda, Condorcet and, perhaps, Kemeny. We should stress that if we are dealing with experts, the number of alternatives an expert can analyze “in one go” is no more than 7±2. However, search engines can return rankings of hundreds of links. That is why the usage of domination-matrix-based methods is not recommended for such dimensionalities. For example, if we are dealing with 10 IS providing rankings of 1000 alternatives each, then we have to analyze and process 10 matrices with 1000×1000 cells. In such cases the Borda method seems to be the most appropriate option, as it operates with ranking vectors and not domination matrices.
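The cell-count argument can be checked with a short sketch (illustrative code, not from the paper; the function name and example data are ours) that builds the ordinal domination matrix induced by a single ranking and compares its size with the ranking vector:

```python
def domination_matrix(ranking):
    """Ordinal domination matrix of one ranking (best alternative first).

    Cell [i][j] is 1 if the i-th listed alternative ranks above the j-th.
    """
    pos = {alt: p for p, alt in enumerate(ranking)}
    return [[1 if pos[a] < pos[b] else 0 for b in ranking] for a in ranking]

ranking = ["A", "B", "C", "D"]
matrix = domination_matrix(ranking)

# A ranking vector of M alternatives stores M entries,
# while its domination matrix stores M * M cells:
M = len(ranking)
print(M, M * M)  # 4 16

# For the example of 10 IS ranking 1000 alternatives each, matrix-based
# methods have to process 10 * 1000 ** 2 = 10,000,000 cells, versus
# 10 * 1000 = 10,000 entries for Borda-style ranking vectors.
```

This quadratic growth in M is exactly why vector-based aggregation is preferable at search-engine scale.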
Besides, it is easier to extrapolate the Borda method to the case when IS have different weights (while, for instance, the extrapolation of Markov chain-based methods looks more problematic). If all IS have the same weight, we can use the following algorithm.

1) Find the number M of unique alternatives featured across the individual rankings. Denoting by A(i) = { Aj(i); j = 1..mi } the set of alternatives in ranking Ri, we have M = card( A(1) ∪ A(2) ∪ … ∪ A(n) ).

2) Form a unified set of alternatives. For this purpose, to each ranking Ri of mi alternatives we should add all the remaining alternatives ( A(1) ∪ … ∪ A(n) ) \ A(i), assigning each of them the rank mi + 1. During the unification of alternatives provided by different IS, we should check whether the alternatives are conceptually the same. If the IS are experts, we can use for this purpose the methods of semantic similarity determination suggested in [20].

3) As a result, each ranking Ri will include all M alternatives: mi alternatives with different ranks and (M − mi) alternatives with the same rank (mi + 1). (A similar approach was mentioned by Dwork et al. in [11].)

4) Add up the ranks of each alternative across all IS.

5) Sort (order) the alternatives by increasing individual rank sums. This sorting yields the rank order we are looking for:

Σj=1..n rkj ≤ Σj=1..n rlj ⇒ rk ≤ rl, k, l = 1..M, (1)

where rkj and rlj are the ranks of alternatives Ak and Al in the ranking provided by IS number j.

5 Usage of different methods under smaller cardinality of the set of alternatives

If the number of alternatives amounts to one or two dozen, or if the decision-maker (analyst) is interested only in the rank order of the first few alternatives provided by the IS, then it makes sense to use other ranking methods (besides or instead of Borda) to aggregate the individual ranking results. For example, let us assume we have 10-15 IS, and each of them provides a ranking of alternatives, from which we are particularly interested in the top 10-20 items.
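Returning to the equal-weights algorithm of the previous section, a minimal Python sketch of its five steps might look as follows (an illustration, not the authors' code; it assumes the conceptual matching of alternatives from step 2 has already been done, so identical strings denote identical alternatives):

```python
from collections import defaultdict

def aggregate_rankings(rankings):
    """Equal-weights Borda-style aggregation of rankings of different lengths.

    rankings: list of rankings, each a list of alternatives, best first.
    Follows the five steps in the text: unify the sets of alternatives,
    assign rank m_i + 1 to every alternative missing from ranking R_i,
    add up the ranks per alternative, and sort by increasing rank sum.
    """
    # Steps 1-2: the unified set of M unique alternatives.
    universe = set().union(*[set(r) for r in rankings])
    rank_sums = defaultdict(int)
    for ranking in rankings:
        m_i = len(ranking)
        ranks = {alt: r for r, alt in enumerate(ranking, start=1)}
        for alt in universe:
            # Step 3: alternatives absent from R_i all get rank m_i + 1.
            rank_sums[alt] += ranks.get(alt, m_i + 1)
    # Steps 4-5: order by increasing sum of ranks, as in expression (1).
    return sorted(universe, key=lambda alt: rank_sums[alt])

# Three sources providing rankings of different lengths:
print(aggregate_rankings([["a", "b", "c"],
                          ["b", "a"],
                          ["a", "c", "d"]]))  # ['a', 'b', 'c', 'd']
```

Note that expression (1) only constrains the relative order of alternatives with different rank sums; ties are left in arbitrary relative order here.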
If all alternatives (items) are different, then we have at most 300 (15·20) different items. In fact (especially if we are talking about search engines), many items featured in individual rankings across different IS are the same. So, if we have 15 IS and each of them provides a ranking of 20 alternatives, then the cardinality of the set of unique items will amount to approximately 100 alternatives. Consequently, if we use aggregation methods operating with ordinal pair-wise comparison matrices (PCM) (such as the Condorcet rule or Kemeny’s median) and not with ranking vectors (like Borda or Markov chain-based methods), we will have to analyze 15 square matrices containing 100×100 cells. As the strict rank order relation is mutually symmetric (if alternative A1 ranks higher than A2, then A2 ranks lower than A1), during aggregation of PCMs we will have to analyze just the matrix elements above the principal diagonal {dik, i