Unanimity-Aware Gain for Highly Subjective Assessments

Tetsuya Sakai, Waseda University, tetsuyasakai@acm.org

Copying permitted for private and academic purposes. EVIA 2017, co-located with NTCIR-13, Tokyo, Japan. © 2017 Copyright held by the author.

ABSTRACT
IR tasks have diversified: human assessments of items such as social media posts can be highly subjective, in which case it becomes necessary to hire many assessors per item to reflect their diverse views. For example, the value of a tweet for a given purpose may be judged by (say) ten assessors, and their ratings could be summed up to define its gain value for computing a graded-relevance evaluation measure. In the present study, we propose a simple variant of this approach, which takes into account the fact that some items receive unanimous ratings while others are more controversial. We generate simulated ratings based on data from a real social-media-based IR task to examine the effect of our unanimity-aware approach on the system ranking and on statistical significance. Our results show that incorporating unanimity can affect statistical significance test results even when its impact on the gain value is kept to a minimum. Moreover, since our simulated ratings do not consider the correlation present in the assessors' actual ratings, our experiments probably underestimate the effect of introducing unanimity into evaluation. Hence, if researchers accept that unanimous votes should be valued more highly than controversial ones, then our proposed approach may be worth incorporating.

CCS CONCEPTS
• Information systems → Retrieval effectiveness;

KEYWORDS
effect sizes; evaluation measures; inter-assessor agreement; p-values; social media; statistical significance

1 INTRODUCTION
In traditional test-collection-based IR experiments, we often rely on our experience, which says that system rankings would remain stable even if the set of document relevance assessments were replaced by another [13]. However, IR tasks have diversified: human assessments of items such as social media posts can be highly subjective, in which case it becomes necessary to hire many assessors per item to reflect their diverse views. For example, the value of a tweet for a given purpose may be judged by (say) ten assessors, and their ratings could be summed up to define its gain value for computing a graded-relevance evaluation measure (e.g. [8, 11]). In the present study, we propose a simple variant of this approach, which takes into account the fact that some items receive unanimous ratings while others are more controversial. We generate simulated ratings based on data from a real social-media-based IR task to examine the effect of our unanimity-aware approach on the system ranking and on statistical significance. Our results show that incorporating unanimity can affect statistical significance test results even when its impact on the gain value is kept to a minimum. Moreover, since our simulated ratings do not consider the correlation present in the assessors' actual ratings, our experiments probably underestimate the effect of introducing unanimity into evaluation. Hence, if researchers accept that unanimous votes should be valued more highly than controversial ones, then our proposed approach may be worth incorporating.

2 RELATED WORK

2.1 Document Relevance Assessments
Due to lack of space, we refer the reader to Sakai [7] for a short overview of studies related to inter-assessor agreement. Below, we briefly discuss two studies that help us explain the novelty of our approach to utilising multiple relevance assessments.

Megorskaya et al. [4] studied the benefit of communication between multiple assessors in the context of gamified relevance assessment for web search evaluation. The premise in their work is that every document needs to finally receive a single relevance level, as a result of a consensus between the assessors, an overruling by a "referee," and so on. This is in contrast to our work, where we are interested in assessment tasks for which there may be no such thing as the correct assessment, so that it is important to preserve different subjective views in the data and in the evaluation.

Turpin et al. [12] propose to use magnitude estimation in document relevance assessments in order to obtain ratio-scale judgments instead of the traditional ordinal- or interval-scale ones, and to interpret the ratio-scale judgments directly as the gain values for computing normalised discounted cumulative gain (nDCG) and expected reciprocal rank (ERR). This is achieved by instructing the assessor to give an arbitrary score to his first document and subsequently to give a "relative" score to each of the remaining documents, where "relative" means "in comparison to the preceding document." While their approach and ours both produce continuous relevance assessments, unanimity across judges was not within the scope of their study.

2.2 Social Media Assessments
Wang et al. [14] examined the effect of assessor differences in the context of the TREC Tweet Timeline Generation task by devising two sets of tweet equivalence classes constructed by different assessors. Their conclusion is similar to that of Voorhees [13], who examined the effect of document relevance assessor differences: despite the substantial differences between the two sets of clusters, the system rankings and the absolute evaluation measure scores based on these two sets were very similar. Sakai et al. [8] used graded-relevance measures to evaluate a community QA answer ranking task; each answer was assessed by four assessors, and its gain value for computing the measures was determined as the sum of the assessors' grades.

More recently, Shang et al. [11] reported on the NTCIR-12 Short Text Conversation task, which is basically a tweet retrieval task: in their Japanese subtask, the sum of scores from ten assessors was used to define the gain value of each tweet. Note that these studies do not take into account whether the assessors are unanimous or not; the sum is all that matters.
2.3 Li and Yoshikawa
The recent work of Li and Yoshikawa [2] is similar in spirit to ours, and deserves a detailed explanation. They consider the problem of assessing the similarity between two documents using many assessors, and propose to incorporate what they call "confusability" into measures such as Pearson's correlation. Specifically, when computing a correlation value, they propose to weight each labelled item i by 1 − c_i, where c_i is a normalised measure of confusability based on the difference (D_i) between the highest and the lowest ratings for item i, and so on.¹ Li and Yoshikawa remark that the same idea can be applied to other measures such as nDCG, although they do not provide any details: here, let us try to faithfully apply their idea to ranked retrieval evaluation based on a group of assessors. Let N be the number of assessors per item, and suppose that each assessor assigns to each item a rating on a scale from 0, 1, ..., D_max. A straightforward way to define the final relevance level or the actual gain value for each item would be to just sum up the ratings [8, 11]: then we would have relevance levels from 0 to N D_max. For any item i with N independent assessments, let RawG_i denote the gain value thus obtained. The above approach of Li and Yoshikawa suggests that we modify each gain value as follows:

    WG_i = (1 − c_i) RawG_i = (1 − D_i / D_max) RawG_i .    (1)

¹ Li and Yoshikawa [2] also considered using standard deviation and entropy to quantify c_i, but the present study focusses on the simplest case that relies on D_i, as we believe that evaluation methods should be as simple as possible.

Table 1: Examples of RawG_i, WG_i, UG_i when D_max = 3.

  Item i   Ratings (N = 5)   RawG_i   D_i   WG_i   UG_i (p = 0.2)   UG_i (p = 0.1)
  Item 1   2 2 2 2 2           10      0    10.0        13              11.5
  Item 2   1 1 2 3 3           10      2     3.3        11              10.5
  Item 3   0 2 2 3 3           10      3     0.0        10              10
  Item 4   1 1 1 1 1            5      0     5.0         8               6.5
  Item 5   0 0 0 0 3            3      3     0.0         3               3
  Item 6   0 0 0 0 2            2      2     0.7         3               2.5
  Item 7   0 0 0 0 1            1      1     0.7         3               2

Let us consider Items 1-3 shown in Table 1, with N = 5 and D_max = 3. Clearly, RawG_1 = RawG_2 = RawG_3 = 10, and, according to Eq. 1, WG_1 = 10, WG_2 = 3.3, WG_3 = 0. Thus Items 2 and 3 are considered worse than (say) Item 4 in Table 1, since WG_4 = RawG_4 = 5. Clearly, a more careful consideration is in order.
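As a concrete illustration, the following minimal Python sketch reproduces the RawG_i, D_i and WG_i columns of Table 1 from the raw ratings. This is an illustrative sketch only; the identifier names are introduced here for exposition and are not part of any released toolkit.

```python
# Illustrative sketch: RawG_i, D_i and WG_i (Eq. 1) for the Table 1 items.
D_MAX = 3

table1_ratings = {
    "Item 1": [2, 2, 2, 2, 2],
    "Item 2": [1, 1, 2, 3, 3],
    "Item 3": [0, 2, 2, 3, 3],
    "Item 4": [1, 1, 1, 1, 1],
    "Item 5": [0, 0, 0, 0, 3],
    "Item 6": [0, 0, 0, 0, 2],
    "Item 7": [0, 0, 0, 0, 1],
}

def raw_gain(ratings):
    """RawG_i: the sum of the N raw ratings."""
    return sum(ratings)

def weighted_gain(ratings, d_max=D_MAX):
    """WG_i of Eq. 1: RawG_i down-weighted by the normalised max-min spread D_i."""
    d_i = max(ratings) - min(ratings)
    return (1.0 - d_i / d_max) * raw_gain(ratings)

for item, ratings in table1_ratings.items():
    d_i = max(ratings) - min(ratings)
    print(item, raw_gain(ratings), d_i, round(weighted_gain(ratings), 1))
```

Running the sketch prints, for instance, 10 2 3.3 for Item 2 and 10 3 0.0 for Item 3, matching Table 1.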
2.4 Maddalena et al.
More recently, at ICTIR 2017, Maddalena et al. [3] proposed an evaluation approach whose motivation is almost identical to ours: they also claim that the distribution of the scores from different assessors should be utilised for IR evaluation. More specifically, they propose to replace the gain value of a document with an interval of gain values, or even with a distribution of gain values, so that the final evaluation measures are also intervals or distributions. They call their measures agreement-aware measures.

In contrast to their novel approaches, our proposal simply utilises the original score distribution across assessors to adjust the gain value of each document so that a traditional evaluation measure can be computed. It remains to be seen how the interval and distribution measures of Maddalena et al. can effectively be utilised in IR evaluation venues such as CLEF, NTCIR and TREC.

3 PROPOSED METHOD
Our proposal is very simple and highly intuitive. Given a constant p (0 ≤ p ≤ 1), let us define the unanimity-aware gain as follows:

    UG_i = RawG_i + p N (D_max − D_i)    (2)

if RawG_i > 0; otherwise UG_i = RawG_i = 0. Here, D_max − D_i is a simple measure of unanimity, where D_i is, as before, the difference between the maximum and the minimum among the N ratings.² The constant p controls the impact of unanimity on the gain. Thus, while we mainly want to reflect RawG_i in our evaluation, we apply an "upgrade" according to the degree of unanimity. When the ratings of an item are perfectly unanimous (i.e., D_i = 0), we give it an extra p N D_max; that is, we pretend that pN extra assessors gave it the highest possible rating.

² Variants are possible, of course: for example, we could obtain the maximum and minimum values after removing outlier ratings.

For Items 1-3 shown in Table 1, p = 0.2 implies that UG_1 = 13, UG_2 = 11, UG_3 = 10; this is clearly more intuitive than the WG_i values. On the other hand, consider Items 5-7 in Table 1: note that if p = 0.2, UG_5 = UG_6 = UG_7 = 3. If this is not desirable, p = 0.1 may be used instead, as shown in the same table; however, we shall discuss how to set an appropriate p elsewhere, with real assessments, in our future work. Hereafter, we only consider a modest impact by letting p = 0.2, as the focus of the present study is to demonstrate that our approach has a practical impact on experimental results even with a small p.
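The unanimity-aware gain of Eq. 2 can be added to the same illustrative sketch as follows (again for exposition only; the function name is ours):

```python
# Illustrative sketch: the unanimity-aware gain UG_i of Eq. 2.
def unanimity_aware_gain(ratings, p=0.2, d_max=3):
    """UG_i = RawG_i + p * N * (D_max - D_i) if RawG_i > 0; otherwise 0."""
    raw_g = sum(ratings)
    if raw_g == 0:
        return 0.0
    n = len(ratings)
    d_i = max(ratings) - min(ratings)
    return raw_g + p * n * (d_max - d_i)

print(unanimity_aware_gain([1, 1, 2, 3, 3], p=0.2))  # 11.0 for Item 2, as in Table 1
print(unanimity_aware_gain([1, 1, 2, 3, 3], p=0.1))  # 10.5 for Item 2, as in Table 1
```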
Our approach suggests a slight departure from traditional IR evaluation at the implementation level as well. In traditional IR, we usually prepare discrete relevance levels (e.g. relevant, highly relevant, etc.) to define the gold standard: we know the number of relevance levels in advance, and we map each relevance level to a gain value at the time of measure calculation. In contrast, our approach suggests that we retain the individual ratings in the test collection, from which gain values can be computed on the fly; there is no longer the notion of a predefined set of relevance levels. It is easy to see that the highest possible value of UG_i is (1 + p) N D_max. Fortunately, there is a readily available IR evaluation tool that accommodates not only relevance-level-based computation but also direct gain-value-based computation, as we shall discuss below.

4 EXPERIMENTS
Let us demonstrate the effect of introducing the unanimity-aware gain in an IR task where the ratings of the items can be highly subjective. To this end, we chose to use the recent NTCIR-12 Short Text Conversation (STC) Chinese subtask data [11] for the following reasons: (1) STC requires the system to return a "reasonable" tweet as a response to a human tweet, and the assessments are expected to be highly subjective; (2) STC was the largest task of NTCIR-12, with 44 runs from 16 teams for the Chinese subtask.³ The STC Chinese test collection contains 100 topics (i.e., input Weibo tweets) with relevance assessments ("qrels") containing the following relevance levels: L0 (judged nonrelevant), L1 (relevant), and L2 (highly relevant).

³ The Japanese subtask actually collected N = 10 individual ratings for each tweet, but had only 25 runs from seven teams. We plan to use this data set as well in a follow-up study.

From the official qrels, we created 15 simulated variants with D_max ∈ {2, 4, 8} and N ∈ {5, 10, 20, 40, 80}, as follows. For each judged tweet of each topic, L0 is replaced with (0, 0, ...), whereas both L1 and L2 are replaced with N simulated ratings obtained by random sampling from a uniform distribution over [0, D_max]. We then compute the unanimity-aware gains using Eq. 2, and evaluate up to the top 10 Weibo tweets from each run. Note that sampling N ratings at random implies that, as N gets large, we are more likely to obtain both 0 and D_max among the N observations, and therefore D_i is more likely to be D_max; i.e., UG_i is more likely to reduce to RawG_i (Eq. 2). In contrast, real ratings from different assessors are probably correlated with one another, which should generally make D_i smaller than in our simulated ratings. Hence, this experiment probably underestimates the impact of introducing the unanimity-aware gain into evaluation.
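For clarity, the simulation step can be sketched as follows. This is our reading of the above procedure, assuming integer ratings drawn independently and uniformly from {0, 1, ..., D_max} and a simple in-memory qrels representation; the actual STC qrels files use a different format.

```python
import random

def simulate_ratings(official_level, n, d_max, rng):
    """Replace one official STC relevance level with N simulated ratings.

    L0 becomes a vector of zeros; L1 and L2 both become N independent
    draws from the uniform distribution over {0, 1, ..., d_max}.
    """
    if official_level == 0:                    # L0: judged nonrelevant
        return [0] * n
    return [rng.randint(0, d_max) for _ in range(n)]

def simulated_qrels_variant(official_qrels, n, d_max, seed=0):
    """official_qrels: {topic_id: {tweet_id: level}} with level in {0, 1, 2}."""
    rng = random.Random(seed)
    return {
        topic: {tweet: simulate_ratings(level, n, d_max, rng)
                for tweet, level in tweets.items()}
        for topic, tweets in official_qrels.items()
    }

# One of the 15 variants (here D_max = 4, N = 10); UG_i is then computed
# from each rating vector using Eq. 2:
# variant = simulated_qrels_variant(official_qrels, n=10, d_max=4)
```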
We use the three official measures from STC: nG@1 (normalised gain at rank 1),⁴ P+ (see below), and nERR (normalised expected reciprocal rank) [11]. While the official STC Chinese subtask used the NTCIREVAL⁵ toolkit by giving a gain value of 1 to each L1-relevant tweet and 3 to each L2-relevant one, we utilised an alternative functionality of the same tool, which enables us to feed the gain values of relevant items directly to it, without considering the number of relevance levels. This feature was already available in NTCIREVAL for the purpose of accommodating the global gain proposed by Sakai and Song [9], an idea for obtaining a real-valued gain for each relevant web page in search result diversification evaluation. In their work, global gain was computed from intent-aware probabilities and per-intent graded relevance assessments.

⁴ Note that neither the Discounting nor the Cumulation of "nDCG" applies at rank 1.
⁵ http://research.nii.ac.jp/ntcir/tools/ntcireval-en.html

P+, an official measure from STC but nevertheless less well known than nDCG and nERR, deserves a brief explanation here. Just like nERR, it is a measure suitable for navigational intents. Just as Average Precision (AP) employs (binary) precision to measure the utility of the top r documents for a user group who abandon the ranked list at r, P+ employs the blended ratio, which combines precision and cumulative gain, for the same purpose. Furthermore, just as AP assumes that the distribution of users abandoning the ranked list is uniform across all relevant documents (even if some of them are not retrieved) [5], P+ assumes that the distribution is uniform over all relevant documents ranked at or above r_p, the preferred rank [11]. Given a ranked list, the preferred rank is the rank of the most relevant document that is closest to the top. In our case, the preferred rank is the rank of the document in the res file that has the highest UG_i value and is closest to the top. Both AP and P+ represent the expected utility over their respective user abandonment distributions.
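To make the above description concrete, the following sketch computes P+ for a single topic according to our reading of the measure, with the blending parameter of the blended ratio set to 1; it is an illustration only, not the NTCIREVAL implementation.

```python
def p_plus(gains, ideal_gains, beta=1.0):
    """Illustrative P+ for one topic (our reading of the measure; beta assumed to be 1).

    gains:       gain values (e.g. UG_i) of the ranked documents, top to bottom,
                 with 0 for documents judged nonrelevant.
    ideal_gains: gain values of all relevant documents, sorted in descending order
                 (used for the ideal cumulative gain).
    """
    if not any(g > 0 for g in gains):
        return 0.0
    # Preferred rank r_p: the highest-gain document closest to the top.
    best = max(gains)
    r_p = next(r for r, g in enumerate(gains, start=1) if g == best)

    score, num_rel, cg, cg_ideal = 0.0, 0, 0.0, 0.0
    for r, g in enumerate(gains[:r_p], start=1):
        cg += g
        cg_ideal += ideal_gains[r - 1] if r - 1 < len(ideal_gains) else 0.0
        if g > 0:
            num_rel += 1
            # Blended ratio at rank r: combines precision and cumulative gain.
            score += (num_rel + beta * cg) / (r + beta * cg_ideal)
    return score / num_rel
```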
Table 2: Comparing the system rankings based on RawG_i vs. UG_i in terms of Kendall's τ, with 95% confidence intervals.

(a) Mean nG@1
  D_max   N=5                 N=10                N=20                N=40                N=80
  2       .987 [.968, 1.007]  .989 [.970, 1.007]  1 [.995, 1.005]     1 [.995, 1.005]     1 [.995, 1.005]
  4       .992 [.976, 1.005]  .983 [.962, 1.004]  .996 [.985, 1.006]  1 [.995, 1.005]     1 [.995, 1.005]
  8       .985 [.968, 1.003]  .985 [.963, 1.008]  .998 [.994, 1.005]  .992 [.978, 1.008]  1 [.995, 1.005]

(b) Mean P+
  D_max   N=5                 N=10                N=20                N=40                N=80
  2       .985 [.964, 1.006]  .994 [.982, 1.005]  .998 [.991, 1.005]  1 [1, 1]            1 [1, 1]
  4       .983 [.965, 1.002]  .992 [.977, 1.005]  .998 [.994, 1.004]  1 [.997, 1.003]     1 [.997, 1.003]
  8       .989 [.972, 1.007]  .992 [.981, 1.005]  .994 [.982, 1.005]  .996 [.986, 1.006]  1 [1, 1]

(c) Mean nERR
  D_max   N=5                 N=10                N=20                N=40                N=80
  2       .996 [.985, 1.005]  .994 [.982, 1.005]  1 [1, 1]            1 [1, 1]            1 [1, 1]
  4       .996 [.986, 1.006]  .994 [.982, 1.005]  .998 [.990, 1.005]  1 [1, 1]            1 [1, 1]
  8       .989 [.977, 1.005]  .992 [.978, 1.005]  .998 [.991, 1.005]  .998 [.990, 1.005]  1 [1, 1]

5 RESULTS AND DISCUSSIONS
Table 2 compares the system rankings based on RawG_i vs. UG_i in terms of Kendall's τ with 95% confidence intervals, for nG@1, P+ and nERR averaged over the 100 official Chinese STC topics. It can be observed that all of the confidence intervals contain τ = 1, meaning that the system rankings based on RawG_i and UG_i are statistically equivalent. However, except where the 95% CIs are "[1, 1]," the two rankings are not identical, even with p = 0.2. Recall also that we should expect to see lower rank correlations if we use real assessors' ratings, with correlations among them.

Probably a more practical concern than the change in the overall system ranking is: does the proposed method affect statistical significance test results? If a researcher is interested in comparing every system pair, then conducting a pairwise test such as the paired t-test repeatedly (without correcting α) is not the correct approach: one elegant solution is to use the randomised Tukey HSD (Honestly Significant Difference) test [1], which is free from distributional assumptions and ensures that the familywise error rate (i.e., the probability of incorrectly obtaining a statistically significant difference for at least one system pair) is α. We use the Discpower⁶ toolkit to conduct the randomised Tukey HSD test on each topic-by-run score matrix, with B = 5,000 trials for each test. The STC Chinese subtask had 16 participating teams, and one run from each team (specifically, the best run in terms of the official Mean nG@1 score) is considered in this analysis, giving us 16 × 15/2 = 120 comparisons. Do RawG_i and UG_i give us similar p-values and similar research conclusions?

⁶ http://research.nii.ac.jp/ntcir/tools/discpower-en.html
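The randomised Tukey HSD procedure itself can be sketched as follows; this is a simplified stand-in for the Discpower toolkit based on our understanding of Carterette [1], and assumes a numpy array as the topic-by-run matrix.

```python
import numpy as np

def randomised_tukey_hsd(scores, b=5000, seed=0):
    """Simplified sketch of the randomised Tukey HSD test.

    scores: topic-by-run matrix (n_topics x n_runs) of per-topic scores.
    Returns an (n_runs x n_runs) matrix of p-values for all run pairs.
    """
    rng = np.random.default_rng(seed)
    means = scores.mean(axis=0)
    observed = np.abs(means[:, None] - means[None, :])  # observed mean differences

    count = np.zeros_like(observed)
    for _ in range(b):
        # Permute the scores within each topic (row) across runs.
        permuted = np.array([rng.permutation(row) for row in scores])
        permuted_means = permuted.mean(axis=0)
        # Largest pairwise difference between run means in the permuted matrix.
        max_diff = permuted_means.max() - permuted_means.min()
        count += (max_diff >= observed)
    return count / b

# p_values = randomised_tukey_hsd(topic_by_run_matrix, b=5000)
# A run pair (i, j) is declared significantly different at alpha = 0.05
# if p_values[i, j] < 0.05; the familywise error rate is controlled at alpha.
```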
Table 3 summarises the discrepancies between the significance test results with RawG_i and those with UG_i: these are the comparisons where the difference is statistically significant at α = 0.05 according to one gain setting but not according to the other. The p-values, absolute score differences (|d_XY|), and effect sizes (ES_HSD) are also shown; ES_HSD is computed by dividing |d_XY| by the residual standard deviation of each experimental condition [6].⁷

⁷ This form of effect size measures the difference between two systems in standard deviation units; unlike the p-value, it is not a function of the sample size.

Table 3: Discrepancies at α = 0.05: p-values (those below α are statistically significant), absolute differences, and effect sizes.

                                               (I) RawG_i                  (II) UG_i
  D_max  N   System pair                      p-value  |d_XY|  ES_HSD     p-value  |d_XY|  ES_HSD
  (a) nG@1
  2      5   MSRSC-C-R1 vs. Grad1-C-R1        .064     .1373   1.808      .034     .1361   2.045
             ICL00-C-R1 vs. Grad1-C-R1        .062     .1378   1.814      .029     .1377   2.069
             cyut-C-R1 vs. HITSZ-C-R1         .064     .1375   1.811      .033     .1363   2.048
         10  ICL00-C-R1 vs. PolyU-C-R1        .043     .1564   1.723      .057     .1452   1.756
         20  Nders-C-R1 vs. picl-C-R2         .051     .1622   1.629      .045     .1632   1.650
  4      5   cyut-C-R1 vs. HITSZ-C-R1         .073     .1390   1.752      .038     .1418   1.953
  8      5   MSRSC-C-R1 vs. Grad1-C-R1        .048     .1481   1.797      .057     .1413   1.821
             ICL00-C-R1 vs. Grad1-C-R1        .052     .1469   1.782      .049     .1434   1.848
             cyut-C-R1 vs. HITSZ-C-R1         .054     .1466   1.779      .024     .1515   1.952
         10  ICL00-C-R1 vs. PolyU-C-R1        .035     .1666   1.668      .055     .1568   1.653
  (b) P+
  2      5   MSRSC-C-R1 vs. PolyU-C-R1        .037     .1272   2.340      .051     .1176   2.431
             Grad1-C-R1 vs. HITSZ-C-R1        .037     .1273   2.342      .059     .1150   2.377
         10  ICL00-C-R1 vs. Grad1-C-R1        .047     .1311   2.251      .070     .1207   2.242
         20  ICL00-C-R1 vs. Grad1-C-R1        .049     .1330   2.193      .053     .1319   2.188
  4      5   MSRSC-C-R1 vs. Grad1-C-R1        .048     .1240   2.332      .068     .1146   2.343
             PolyU-C-R1 vs. HITSZ-C-R1        .076     .1186   2.230      .047     .1197   2.447
  8      5   PolyU-C-R1 vs. HITSZ-C-R1        .082     .1181   2.196      .045     .1210   2.400
         10  ICL00-C-R1 vs. Grad1-C-R1        .032     .1363   2.298      .086     .1223   2.159
             Nders-C-R1 vs. PolyU-C-R1        .035     .1357   2.288      .057     .1272   2.245
  (c) nERR
  2      5   BUPTTeam-C-R4 vs. ITNLP-C-R3     .045     .1347   2.208      .052     .1268   2.283
             MSRSC-C-R1 vs. Grad1-C-R1        .060     .1312   2.151      .046     .1280   2.375
  4      5   OKSAT-C-R1 vs. PolyU-C-R1        .046     .1380   2.157      .055     .1319   2.191
  8      5   MSRSC-C-R1 vs. Grad1-C-R1        .037     .1436   2.151      .052     .1369   2.139
         10  ICL00-C-R1 vs. Grad1-C-R1        .042     .1531   2.005      .052     .1480   2.008

It can be observed that the effect of introducing the unanimity-aware gain cannot be overlooked, even with p = 0.2. For example, when D_max = 2 and N = 5, there are three discrepancies between nG@1 based on RawG_i and nG@1 based on UG_i among the 120 comparisons. On the other hand, as was anticipated in Section 4, the impact of introducing unanimity is not observed for N = 40, 80. Again, with real ratings, which tend to resemble one another and thus make D_i smaller than these random ratings do, we will probably observe a more substantial impact of introducing the unanimity-aware gain into evaluation.

6 CONCLUSIONS AND FUTURE WORK
We proposed a simple and intuitive approach to incorporating the assessors' subjective yet unanimous decisions into gain-value-based retrieval evaluation, and demonstrated that this can affect experimental outcomes. Our results show that incorporating unanimity can affect statistical significance test results even when its impact on the gain value is kept to a minimum. Moreover, since our simulated ratings do not consider the correlation present in the assessors' actual ratings, our experiments probably underestimate the effect of introducing the unanimity-aware gain into evaluation. Hence, if researchers accept that unanimous votes should be valued more highly than controversial ones, then our proposed approach may be worth incorporating. We also demonstrated how the proposed approach of directly feeding gain values to an existing evaluation tool can be accomplished, bypassing the notion of discrete relevance levels.

Following the present study, the proposed unanimity-aware gain approach was applied to the recent NTCIR-13 Short Text Conversation (STC-2) Chinese subtask [10], with p = 0.2. There, according to the randomised Tukey HSD test, three extra statistically significantly different system pairs were obtained by using unanimity-aware nG@1 instead of the traditional nG@1, and one extra statistically significantly different system pair was obtained by using unanimity-aware P+ instead of the traditional P+. Thus the sets of statistically significantly different system pairs according to the unanimity-aware approach were supersets of the corresponding sets based on the traditional gain values. However, there were only N = 3 assessors per topic.

In future work, we would like to apply our approach to diverse social-media-related tasks with many assessors (i.e., a large N), where, unlike in our simulated ratings, correlations among the assessors are present. With real ratings, we expect to observe a larger impact of introducing the unanimity-aware gain on the system ranking and on statistical significance than we did in our simulations.

REFERENCES
[1] Ben Carterette. 2012. Multiple Testing in Statistical Analysis of Systems-Based Information Retrieval Experiments. ACM TOIS 30, 1 (2012).
[2] Jiyi Li and Masatoshi Yoshikawa. 2016. Evaluation with Confusable Ground Truth. In Proceedings of AIRS 2016 (LNCS 9994). 363–369.
[3] Eddy Maddalena, Kevin Roitero, Gianluca Demartini, and Stefano Mizzaro. 2017. Considering Assessor Agreement in IR Evaluation. In Proceedings of ACM ICTIR 2017. 75–82.
[4] Olga Megorskaya, Vladimir Kukushkin, and Pavel Serdyukov. 2015. On the Relation between Assessor's Agreement and Accuracy in Gamified Relevance Assessment. In Proceedings of ACM SIGIR 2015. 605–614.
[5] Stephen E. Robertson. 2008. A New Interpretation of Average Precision. In Proceedings of ACM SIGIR 2008. 689–690.
[6] Tetsuya Sakai. 2014. Statistical Reform in Information Retrieval? SIGIR Forum 48, 1 (2014), 3–12.
[7] Tetsuya Sakai. 2017. The Effect of Inter-Assessor Disagreement on IR System Evaluation: A Case Study with Lancers and Students. In Proceedings of EVIA 2017.
[8] Tetsuya Sakai, Daisuke Ishikawa, Noriko Kando, Yohei Seki, Kazuko Kuriyama, and Chin-Yew Lin. 2011. Using Graded-Relevance Metrics for Evaluating Community QA Answer Selection. In Proceedings of ACM WSDM 2011. 187–196.
[9] Tetsuya Sakai and Ruihua Song. 2011. Evaluating Diversified Search Results Using Per-Intent Graded Relevance. In Proceedings of ACM SIGIR 2011. 1043–1052.
[10] Lifeng Shang, Tetsuya Sakai, Hang Li, Ryuichiro Higashinaka, Yusuke Miyao, Yuki Arase, and Masako Nomoto. 2017. Overview of the NTCIR-13 Short Text Conversation Task. In Proceedings of NTCIR-13.
[11] Lifeng Shang, Tetsuya Sakai, Zhengdong Lu, Hang Li, Ryuichiro Higashinaka, and Yusuke Miyao. 2016. Overview of the NTCIR-12 Short Text Conversation Task. In Proceedings of NTCIR-12. 473–484.
[12] Andrew Turpin, Falk Scholer, Stefano Mizzaro, and Eddy Maddalena. 2015. The Benefits of Magnitude Estimation Relevance Assessments for Information Retrieval Evaluation. In Proceedings of ACM SIGIR 2015. 565–574.
[13] Ellen M. Voorhees. 1998. Variations in Relevance Judgments and the Measurement of Retrieval Effectiveness. In Proceedings of ACM SIGIR 1998. 315–323.
[14] Yulu Wang, Garrick Sherman, Jimmy Lin, and Miles Efron. 2015. Assessor Differences and User Preferences in Tweet Timeline Generation. In Proceedings of ACM SIGIR 2015. 615–624.