Quality-aware Argument Retrieval with Topical Clustering
Notebook for the Touché Lab on Argument Retrieval at CLEF 2021

Lukas Gienapp1
1 Leipzig University, Leipzig, 04109, Germany


Abstract
We present a specialized approach to argument retrieval which combines both the general argumentative quality of texts and the latent semantic (topic) space of the document collection as boost factors for a general-purpose retrieval model, addressing the specific domain requirements of argument search. This setup aims to satisfy our three hypothesized aspects of an argumentative information need: quality-aware result ranking, near-complete topical coverage, and text proximity to the query.

Keywords
information retrieval, argument retrieval, argument quality, latent semantic clustering, CEUR-WS




1. Introduction
Searching the web, where information on virtually any topic can be accessed, has become a
highly influential factor in everyday decision making. However, in many cases, an information
need can be presumed that is best addressed not by a single correct answer, or an unfiltered list
of similar documents, but by a faceted view of different aspects of the search topic at hand. For
such needs, traditional approaches to web search serve only a diminished purpose, which is why
specialized retrieval systems have to be developed for this domain, generating insights that
support the user in forming well-justified opinions.
   The first task of the Touché Shared Task organized by Bondarenko et al. [1] supports such everyday decision
making by incentivizing the development of specialized systems for argument retrieval for
controversial questions. The aim of such systems is to retrieve argumentative texts relevant
to controversial topics of general societal interest, which should be useful in conversations,
debates, or forming an individual’s opinion on the topic at hand. In this paper, we contribute
such a retrieval system, based on three hypothesized aspects of an argumentative information
need: quality-aware result ranking, near-complete topical coverage, and text proximity to the
query.
   In contrast to established general-purpose retrieval models, our proposed method does not
only rank by term proximity to the query using the Dirichlet language model for retrieval,
but additionally takes into account the argumentative quality of text snippets,

CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania
" lukas.gienapp@uni-leipzig.de (L. Gienapp)
 0000-0001-5707-3751 (L. Gienapp)
                                       © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
as estimated using a support vector regression model, and the latent semantic space of the
document collection, calculated by performing clustering on phrase embeddings. In Section 2,
we review existing approaches to argument retrieval, and derive our three information need
facets from related work. In Section 3, we introduce our method and provide detailed information
on each of the components of our argument retrieval model. Section 4 provides a first insight
into the model’s performance, while Section 5 gives concluding remarks.


2. Related Work
This section provides an overview of work that focuses specifically on the retrieval and ranking
of argumentative documents. Throughout, we assume a static document collection, namely the
args.me corpus [2], comprised of 387,740 arguments crawled from online debate portals. Based
on this, Section 2.1 describes several existing retrieval approaches. Section 2.2 reviews different
viewpoints on what an argument search system should achieve, influencing the design decisions
made throughout this paper.

2.1. Argument Retrieval Models
Bondarenko et al. [3] identify three central components of an argument retrieval system, based
on a review of systems submitted to the Touché Shared Task: (1) an initial retrieval strategy;
(2) an augmentation component, where results are extended by either expanding the query set,
or directly based on document features in the initially retrieved document set; (3) a (re)ranking
component based on a primary document feature, influencing the final document scoring. We
structure the literature review around each of these components, drawing inspiration for our
own system at each step.

Initial Retrieval. As one of the first publicly available systems focusing on argument search,
Wachsmuth et al. [4] present Args1 , implementing a fulltext search engine based on the args.me
corpus utilizing the Okapi BM25 retrieval model [5]. In addition to BM25, Potthast et al. [6]
evaluate three other general retrieval models for argument search, taking argument quality
into account besides relevance, and find the DirichletLM model [7] to perform best on average.
This is corroborated by Bondarenko et al. [3], where the stock DirichletLM baseline system
placed among the top systems evaluated. Dumani and Schenkel [8] use a parameter-free
divergence-from-randomness model for initial retrieval in their pipeline, yet they do not provide
an ablative evaluation characterizing the baseline performance of this step alone. Beyond
traditional retrieval models, employing large transformer-based language models for argument
search has been successfully demonstrated by Akiki and Potthast [9], who use 512-dimensional
phrase embeddings produced by the Universal Sentence Encoder (USE) [10] to calculate query
proximity. Beyond argument search, the USE has been applied to general information retrieval
[11] and numerous NLP tasks [10].


    1 www.args.me. Unless otherwise noted, all URLs in this paper have been last accessed on June 29, 2021 and were archived in the Wayback Machine.
Result Augmentation. The result augmentation step aims at adding arguments to the result
set that were not identified by the initial retrieval. One particular method of achieving such
augmentation, first applied by Boltuzic and Snajder [12], is clustering—the general idea being
to include all arguments that are members of the same (precomputed) clusters as documents
already present in the initial results. Dumani and Schenkel [8] exploit the dual structure of
the args.me corpus and group together arguments that share an identical claim. A notable
shortcoming here is the strict identity of conclusions as clustering criterion, possibly leading
to very small groupings. Dumani et al. [13] improve on this, utilizing phrase embeddings as
calculated by models like Sentence-BERT [14], or InferSent [15] to project argument snippets
into a clusterable vector space. Akiki and Potthast [9] follow a similar approach, using KMeans
clustering on USE-embeddings to obtain semantic clusters of arguments in the args.me corpus.

Reranking. Since the result augmentation introduces a large number of previously unconsidered
arguments into the result set, a reranking is warranted to expand the scoring beyond initial query
similarity. In one of the top-scoring systems at the first Touché Shared Task, Bundesmann et al.
[16] propose argument quality as a reranking feature, predicted using support vector regression.
Other proposed reranking features include sentiment scoring [17], author credibility [3], and
readability [3], but all with only limited success.

2.2. Considerations on the Goals of Argument Search
To provide a motivation for the design choices made in Section 3, we consider different aspects
of what a useful argument search system should provide. The underlying assumption here is
that the system is used for conversational argument search: it is to assist users in
collecting argumentative evidence on various societal topics, to either provide debate assistance
or fulfill a personal information need [3]. Besides this general goal, more specific requirements
are placed on an argument search system that extend beyond general information retrieval.
Following the propositions made by Wachsmuth et al. [18, 4], Potthast et al. [6] argue that
the evaluation of argument retrieval models should not only incorporate the classic evaluation
criterion of relevance, but also include argument quality as an additional evaluation feature, as
differences between relevance-oriented effectiveness and quality-oriented effectiveness can be
observed. This in turn means that argument search should maximize not only the relevance,
but also the argumentative quality of its results.
   Another issue which is partly raised by Boltuzic and Snajder [12] is the wording of argu-
mentative text. They observe language variability, i.e., the same abstract argument can be
expressed in nearly infinitely many ways, which may impair the retrieval
quality of term-based ranking models. The authors tackle this issue by applying semantic
clustering. Bundesmann et al. [16] further comment on result diversity, and integrate a measure
of heterogeneity to increase the diversity of viewpoints within their top-ranked results. This
diversity can be related to a cluster-based retrieval as well, as one cluster may contain many
different and diverse viewpoints for a particular topic. Therefore, a topic-aware ranking model
might also yield improved results.
3. Methodological Approach
Our method for argument retrieval is composed of three components, integrating the notions
of (1) textual relevance (Section 3.1), i.e., the relevance as indicated by a term-frequency based
retrieval model; (2) topical relevance (Section 3.2), i.e., the relevance as indicated by a semantic
space, independent of term occurrences; and (3) argumentative relevance (Section 3.3), i.e., the
argumentative quality of the results. This is similar to the three steps described by Bondarenko
et al. [3]: textual relevance is akin to the initial retrieval, topical relevance relates to the result
augmentation step, and argumentative relevance can be seen as a reranking feature. However,
the critical difference here is that we do not model these components as successive steps in a
retrieval pipeline, but rather as complementary parts of a final relevance score.

3.1. Textual Component
The textual component is modeled by a classic and domain-independent information retrieval
model relying on term statistics of documents to infer the proximity, i.e., potential relevance, of
each document to the text query. Given the popularity and very favorable performance of the
DirichletLM retrieval model in the prior Touché Shared Task, we rely on it to calculate textual
relevance scores. We use the Lucene implementation of the DirichletLM model2 , which closely
follows the original paper [7].
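   As an illustrative sketch (not the exact configuration of the submitted runs), the DirichletLM
similarity can be enabled per field in an Elasticsearch index, which Section 4 names as the retrieval
backend; the index name, field names, and the mu value below are assumptions.

    # Hypothetical index setup: full-text fields are scored with Lucene's
    # DirichletLM similarity; cluster id and quality score are stored as
    # pre-computed per-document features (see Sections 3.2 and 3.3).
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")
    es.indices.create(
        index="args",
        settings={"similarity": {"dirichlet_lm": {"type": "LMDirichlet", "mu": 2000}}},
        mappings={
            "properties": {
                "premise": {"type": "text", "similarity": "dirichlet_lm"},
                "conclusion": {"type": "text", "similarity": "dirichlet_lm"},
                "cluster_id": {"type": "integer"},
                "quality": {"type": "float"},
            }
        },
    )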

3.2. Topical Component
We embed all argument conclusions in the document collection into a 512-dimensional vector
space using the Universal Sentence Encoder [10]. We choose this embedder over other phrase
embedding models due to its widespread application, high usability, favorable performance,
and previous usage in the field of argument search [9]. While Akiki and Potthast [9] use
USE-embeddings of the complete argument texts to perform exhaustive nearest-neighbor
lookup for individual arguments at retrieval time, we instead utilize the embedding vectors to
perform KMeans clustering on only the arguments’ conclusions, to allow for coherent clusters
of topically similar arguments. The feasibility of this approach has been demonstrated by
Akiki and Potthast [9], who conduct a similar clustering approach to verify the accuracy of
their embedding space and find that the clusters obtained are both coherent (syntactically and
semantically) and meaningful (encompassing specific topics). After clustering the conclusion
space with 𝑘 = 300 (a parameter choice that yielded accurate results upon manual review), each
argument is associated with its cluster’s centroid. Each argument is scored by the cosine similarity
between the query and its cluster centroid. The centroid is chosen over the individual proximity of
arguments to equally boost the ranking score of all arguments in a cluster/topic, which enables
ranking arguments within a topic by a secondary feature, such as quality.
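   A minimal sketch of this component, assuming the TensorFlow Hub release of the Universal
Sentence Encoder and scikit-learn’s KMeans; the corpus loading and variable names are placeholders,
not the exact code used.

    # Embed all conclusions with USE, cluster them with KMeans (k = 300), and
    # derive a per-argument topical score from the query-to-centroid similarity.
    import numpy as np
    import tensorflow_hub as hub
    from sklearn.cluster import KMeans
    from sklearn.metrics.pairwise import cosine_similarity

    embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

    conclusions = [...]  # placeholder: one conclusion string per argument
    conclusion_vecs = embed(conclusions).numpy()   # shape: (n_arguments, 512)

    kmeans = KMeans(n_clusters=300, random_state=0).fit(conclusion_vecs)
    centroids = kmeans.cluster_centers_            # shape: (300, 512)
    cluster_of = kmeans.labels_                    # cluster id per argument

    def topical_scores(query: str) -> np.ndarray:
        """C(q, d): cosine similarity between the query and each argument's centroid."""
        query_vec = embed([query]).numpy()                           # (1, 512)
        centroid_sims = cosine_similarity(query_vec, centroids)[0]   # (300,)
        return centroid_sims[cluster_of]           # one score per argument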

3.3. Argumentative Component
We follow the considerations of Bundesmann et al. [16] who predict argument quality using
a support vector regression (SVR) model. While they note that reliable quality prediction
is difficult to achieve, the overall retrieval effectiveness achievable by incorporating such

    2 https://lucene.apache.org/core/8_7_0/core/org/apache/lucene/search/similarities/LMDirichletSimilarity.html
predictions is still sufficiently high. To improve on their method, we introduce a classification
step prior to the quality prediction, which decides whether a given text span is argumentative or
not; non-arguments are then automatically assigned the minimal quality score, while arguments
are passed on to the predictor to infer a rating for argumentative quality. We train both the
argumentative classification and the quality prediction model using the Webis ArgQuality 20
dataset [19]. It contains argumentative quality ratings and a binary label indicating whether
a text is an argument or not for a subset of 1,271 arguments from the args.me corpus. We
rescale quality scores to a range of [0, 1], convert arguments to lowercase, remove English
stopwords, and vectorize them as TF/IDF vectors.
   First, a support vector machine (SVM) is used for a binary classification to determine if a
sample is an argument or not. Then, valid arguments receive their quality rating as estimated by
an SVR model. The classifier is trained on the complete (binary-labeled) data, while the SVR is
trained only on the argument subset. Both models are evaluated with 10-fold cross-validation. The
SVM classifier achieves an F1 score of 0.88. The SVR model achieves a mean squared
error (MSE) of 0.1949. Both models can thus be deemed reasonably accurate. The combined
model is then applied to predict a quality score for each premise contained in the args.me
corpus. Texts classified as non-arguments receive a score of 0, while others receive their quality
prediction in the [0, 1] range.
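   The following is a minimal sketch of this two-stage model with scikit-learn, under the assumption
of a hypothetical loader load_webis_argquality20() that returns texts, binary argument labels, and
quality ratings already rescaled to [0, 1]; the scores reported above come from the actual experiments,
not from this sketch.

    # Two-stage quality model: an SVM decides whether a text is an argument,
    # an SVR predicts a quality score for texts classified as arguments.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import SVC, SVR

    texts, is_argument, quality = load_webis_argquality20()  # hypothetical loader

    # Both stages share the same preprocessing: lowercasing, English stopword
    # removal, and TF/IDF vectorization.
    clf = make_pipeline(TfidfVectorizer(lowercase=True, stop_words="english"), SVC())
    reg = make_pipeline(TfidfVectorizer(lowercase=True, stop_words="english"), SVR())

    # 10-fold cross-validation, as described above.
    f1 = cross_val_score(clf, texts, is_argument, cv=10, scoring="f1").mean()
    arg_texts = [t for t, a in zip(texts, is_argument) if a]
    arg_quality = [q for q, a in zip(quality, is_argument) if a]
    mse = -cross_val_score(reg, arg_texts, arg_quality, cv=10,
                           scoring="neg_mean_squared_error").mean()

    # Fit on the full data before applying the combined model to the corpus.
    clf.fit(texts, is_argument)
    reg.fit(arg_texts, arg_quality)

    def predict_quality(text: str) -> float:
        """Q(d): non-arguments receive 0, arguments receive the SVR estimate."""
        if clf.predict([text])[0]:
            return float(reg.predict([text])[0])
        return 0.0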

3.4. Final Scoring
Given the three components of our retrieval system described above, the final score 𝑆(𝑞,𝑑) of a
document 𝑑 for a query 𝑞 is given by
                     𝑆(𝑞,𝑑) = 𝑅(𝑞, 𝑑) · (1 + 𝜔𝐶 · 𝐶(𝑞, 𝑑)) · (1 + 𝜔𝑄 · 𝑄(𝑑))                    (1)
   where 𝑅(𝑞, 𝑑) is the initial document relevance score as produced by the DirichletLM model,
𝐶(𝑞, 𝑑) is the cosine similarity between the query embedding and the topical cluster centroid
𝑑 is associated with, and 𝑄(𝑑) is the predicted quality score for 𝑑 (independent of the query).
𝜔𝐶 and 𝜔𝑄 are weighting factors to fine-tune the model. Both 𝐶(𝑞, 𝑑) and 𝑄(𝑑) are in [0, 1] and thus
boost the initial score, but never decrease it.
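   Expressed as code, the combined scoring is a straightforward multiplicative boost; the default
weights shown correspond to the first submitted run (cf. Table 1).

    # Equation 1: the DirichletLM score R(q, d) is boosted by the topical
    # similarity C(q, d) and the predicted quality Q(d), both in [0, 1].
    def final_score(r_qd: float, c_qd: float, q_d: float,
                    w_c: float = 10.0, w_q: float = 5.0) -> float:
        """S(q, d) = R(q, d) * (1 + w_C * C(q, d)) * (1 + w_Q * Q(d))"""
        return r_qd * (1 + w_c * c_qd) * (1 + w_q * q_d)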


4. Evaluation
We implement the described method as an Elasticsearch-based retrieval system. Quality ratings
as well as cluster centroids are pre-computed for efficient retrieval. At retrieval time, all
documents in the collection are scored by DirichletLM, and for each document, the cosine
similarity between its cluster centroid and the USE-embedded query is calculated. Documents
are then ranked by the scoring formula in Equation 1. We submit five runs for evaluation,
differing in the applied weighting factors. All setups are summarized in Table 1. The first
three weighting schemes are used to test whether a higher, lower, or equal influence of topical
relevance and quality ratings is beneficial. With the last two setups, we fix the initial relevance
score at 1 to investigate whether the topical relevance alone (since it is query-dependent, as
opposed to quality) can provide meaningful and accurate results without depending on a
term-frequency-based model at all. To generate
and submit runs for Touché 2021, the Tira platform was used [20].
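   The retrieval-time combination of the three components could then look roughly as follows;
this sketch reuses the names introduced in the earlier sketches (es, embed, centroids, final_score),
retrieves a large DirichletLM-scored candidate set instead of exhaustively scoring the whole
collection, and is an assumption about the implementation rather than the exact submitted code.

    # Rerank DirichletLM-scored candidates with the boosts of Equation 1, using
    # the query-to-centroid similarities and the stored per-document features.
    from sklearn.metrics.pairwise import cosine_similarity

    def search(query: str, w_c: float = 10.0, w_q: float = 5.0, size: int = 1000):
        query_vec = embed([query]).numpy()
        centroid_sims = cosine_similarity(query_vec, centroids)[0]  # one per cluster
        response = es.search(
            index="args",
            query={"match": {"premise": query}},   # scored with DirichletLM
            size=size,
        )
        ranked = []
        for hit in response["hits"]["hits"]:
            c_qd = centroid_sims[hit["_source"]["cluster_id"]]
            q_d = hit["_source"]["quality"]
            ranked.append((hit["_id"], final_score(hit["_score"], c_qd, q_d, w_c, w_q)))
        return sorted(ranked, key=lambda item: item[1], reverse=True)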
Table 1
Weighting schemes and resulting nDCG scores for both relevance- and quality-based evaluation. Max-
imum per column marked in bold.
          𝜔𝑐      𝜔𝑞                  nDCG@5 (Relevance)   nDCG@5 (Quality)   Remark
          10.0    5.0                       0.645               0.839
          10.0   10.0                       0.639               0.841
           5.0   10.0                       0.637               0.833
          Touché Dirichlet Baseline         0.626               0.796
           0.1    5.0                       0.004               0.767         𝑅(𝑞, 𝑑) = 1
           0.01   5.0                       0.000               0.749         𝑅(𝑞, 𝑑) = 1



   The first three runs, enabling the Dirichlet-based textual component, show strong overall
performance. For relevance-based evaluation, the added topical component yields a net increase
in ranking performance compared to the Dirichlet-only Touché baseline. The ranking performance
also correlates with the parameter choice for 𝜔𝐶, as a higher value results in a higher nDCG@5.
Overall, for relevance, our best approach places 9th among teams. For quality-based evaluation,
the same trend can be observed: the quality-based scoring factor has a tremendous impact on
improving the argumentative quality of the results. Once again, a higher choice of 𝜔𝑄 results
in higher ranking performance, however only in conjunction with a high value of 𝜔𝐶 as
well. In terms of quality evaluation, the three approaches place first among all runs submitted
to Touché. The two-stage prediction model can thus be deemed highly effective.
   The latter two approaches, where the Dirichlet-based textual component has been turned off,
turn out to be unusable in practice. With an nDCG score of zero (for relevance), they provide
effectively no use to a user. One possible reason for this is that the embedding space was
constructed on arguments’ conclusions only, which is not sufficient to ensure relevant search
results. However, regarding argumentative quality, the system yields acceptable results, too.


5. Conclusion
We proposed a new approach to argument retrieval, combining several parts of existing systems
that have previously shown favorable performance. The retrieval model is centered around three
components: a classic term-frequency-based retrieval model (DirichletLM) and two boosting
factors, incorporating topical relevance as indicated by a semantic clustering of the underlying
data, and a quality prediction model. The approach can be deemed successful. For both relevance
and quality as evaluation dimensions, the system yields useful results. For quality, it places
highest among the participants of this year’s Touché lab. The evaluation has also shown room
for future improvements: specifically, the topical component performs sub-par and needs to be
revisited. Extending the embeddings to include not only conclusions but also premises, maybe
even in terms of a dual embedding space, promises better results. Parameter fine-tuning for the
Dirichlet model also promises an increase in ranking performance and will be made possible by
the increased availability of relevance judgements from this year’s iteration of Touché.
References
 [1] A. Bondarenko, L. Gienapp, M. Fröbe, M. Beloucif, Y. Ajjour, A. Panchenko, C. Biemann,
     B. Stein, H. Wachsmuth, M. Potthast, M. Hagen, Overview of Touché 2021: Argument
     Retrieval, in: D. Hiemstra, M.-F. Moens, J. Mothe, R. Perego, M. Potthast, F. Sebastiani
     (Eds.), Advances in Information Retrieval. 43rd European Conference on IR Research (ECIR
     2021), volume 12036 of Lecture Notes in Computer Science, Springer, Berlin Heidelberg New
      York, 2021, pp. 574–582. doi:10.1007/978-3-030-72240-1_67.
 [2] Y. Ajjour, H. Wachsmuth, J. Kiesel, M. Potthast, M. Hagen, B. Stein, Data Acquisition for
     Argument Search: The args.me corpus, in: C. Benzmüller, H. Stuckenschmidt (Eds.), 42nd
     German Conference on Artificial Intelligence (KI 2019), Springer, Berlin Heidelberg New
      York, 2019, pp. 48–59. doi:10.1007/978-3-030-30179-8_4.
 [3] A. Bondarenko, M. Fröbe, M. Beloucif, L. Gienapp, Y. Ajjour, A. Panchenko, C. Biemann,
     B. Stein, H. Wachsmuth, M. Potthast, M. Hagen, Overview of Touché 2020: Argument
     Retrieval, in: L. Cappellato, C. Eickhoff, N. Ferro, A. Névéol (Eds.), Working Notes Papers
     of the CLEF 2020 Evaluation Labs, volume 2696 of CEUR Workshop Proceedings, 2020.
 [4] H. Wachsmuth, M. Potthast, K. Al-Khatib, Y. Ajjour, J. Puschmann, J. Qu, J. Dorsch,
     V. Morari, J. Bevendorff, B. Stein, Building an Argument Search Engine for the Web, in:
     K. Ashley, C. Cardie, N. Green, I. Gurevych, I. Habernal, D. Litman, G. Petasis, C. Reed,
     N. Slonim, V. Walker (Eds.), 4th Workshop on Argument Mining (ArgMining 2017) at
     EMNLP, Association for Computational Linguistics, 2017, pp. 49–59.
 [5] S. E. Robertson, S. Walker, S. Jones, M. Hancock-Beaulieu, M. Gatford, Okapi at TREC-
     3, in: D. K. Harman (Ed.), Proceedings of The Third Text REtrieval Conference, TREC
     1994, Gaithersburg, Maryland, USA, November 2-4, 1994, volume 500-225 of NIST Special
     Publication, National Institute of Standards and Technology (NIST), 1994, pp. 109–126.
 [6] M. Potthast, L. Gienapp, F. Euchner, N. Heilenkötter, N. Weidmann, H. Wachsmuth, B. Stein,
     M. Hagen, Argument Search: Assessing Argument Relevance, in: 42nd International ACM
     Conference on Research and Development in Information Retrieval (SIGIR 2019), ACM,
     2019. doi:10.1145/3331184.3331327.
 [7] C. Zhai, J. D. Lafferty, A study of smoothing methods for language models applied to
     information retrieval, ACM Trans. Inf. Syst. 22 (2004) 179–214. doi:10.1145/984321.
     984322.
 [8] L. Dumani, R. Schenkel, Quality-aware ranking of arguments, in: M. d’Aquin, S. Dietze,
     C. Hauff, E. Curry, P. Cudré-Mauroux (Eds.), CIKM ’20: The 29th ACM International
     Conference on Information and Knowledge Management, Virtual Event, Ireland, October
     19-23, 2020, ACM, 2020, pp. 335–344. doi:10.1145/3340531.3411960.
 [9] C. Akiki, M. Potthast, Exploring Argument Retrieval with Transformers, in: L. Cappellato,
     C. Eickhoff, N. Ferro, A. Névéol (Eds.), Working Notes Papers of the CLEF 2020 Evaluation
     Labs, volume 2696, 2020.
[10] D. Cer, Y. Yang, S. Kong, N. Hua, N. Limtiaco, R. S. John, N. Constant, M. Guajardo-
     Cespedes, S. Yuan, C. Tar, Y. Sung, B. Strope, R. Kurzweil, Universal sentence encoder,
     CoRR abs/1803.11175 (2018). URL: http://arxiv.org/abs/1803.11175. arXiv:1803.11175.
[11] Y. Yang, D. Cer, A. Ahmad, M. Guo, J. Law, N. Constant, G. H. Ábrego, S. Yuan, C. Tar,
     Y. Sung, B. Strope, R. Kurzweil, Multilingual universal sentence encoder for semantic
     retrieval, in: A. Çelikyilmaz, T. Wen (Eds.), Proceedings of the 58th Annual Meeting of the
     Association for Computational Linguistics: System Demonstrations, ACL 2020, Online,
     July 5-10, 2020, Association for Computational Linguistics, 2020, pp. 87–94.
[12] F. Boltuzic, J. Snajder, Identifying prominent arguments in online debates using semantic
     textual similarity, in: Proceedings of the 2nd Workshop on Argumentation Mining,
     ArgMining@HLT-NAACL 2015, June 4, 2015, Denver, Colorado, USA, The Association for
     Computational Linguistics, 2015, pp. 110–115. doi:10.3115/v1/w15-0514.
[13] L. Dumani, C. K. Kreutz, M. Biertz, A. Witry, R. Schenkel, Segmenting and clustering noisy
     arguments, in: D. Trabold, P. Welke, N. Piatkowski (Eds.), Proceedings of the Conference
     "Lernen, Wissen, Daten, Analysen", Online, September 9-11, 2020, volume 2738 of CEUR
     Workshop Proceedings, CEUR-WS.org, 2020, pp. 23–34.
[14] N. Reimers, I. Gurevych, Sentence-bert: Sentence embeddings using siamese bert-networks,
     in: K. Inui, J. Jiang, V. Ng, X. Wan (Eds.), Proceedings of the 2019 Conference on Empirical
     Methods in Natural Language Processing and the 9th International Joint Conference on
     Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7,
     2019, Association for Computational Linguistics, 2019, pp. 3980–3990. doi:10.18653/v1/
     D19-1410.
[15] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, A. Bordes, Supervised learning of universal
     sentence representations from natural language inference data, in: M. Palmer, R. Hwa,
     S. Riedel (Eds.), Proceedings of the 2017 Conference on Empirical Methods in Natural Lan-
     guage Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, Association
     for Computational Linguistics, 2017, pp. 670–680. doi:10.18653/v1/d17-1070.
[16] M. Bundesmann, L. Christ, M. Richter, Creating an argument search engine for online
     debates, in: L. Cappellato, C. Eickhoff, N. Ferro, A. Névéol (Eds.), Working Notes of CLEF
     2020 - Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, September
     22-25, 2020, volume 2696 of CEUR Workshop Proceedings, CEUR-WS.org, 2020.
[17] C. Staudte, L. Lange, Sentarg: A hybrid doc2vec/dph model with sentiment analysis
     refinement, in: L. Cappellato, C. Eickhoff, N. Ferro, A. Névéol (Eds.), Working Notes
     of CLEF 2020 - Conference and Labs of the Evaluation Forum, Thessaloniki, Greece,
     September 22-25, 2020, volume 2696 of CEUR Workshop Proceedings, CEUR-WS.org, 2020.
[18] H. Wachsmuth, N. Naderi, I. Habernal, Y. Hou, G. Hirst, I. Gurevych, B. Stein, Argumenta-
     tion quality assessment: Theory vs. practice, in: R. Barzilay, M. Kan (Eds.), Proceedings of
     the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Van-
     couver, Canada, July 30 - August 4, Volume 2: Short Papers, Association for Computational
     Linguistics, 2017, pp. 250–255. doi:10.18653/v1/P17-2039.
[19] L. Gienapp, B. Stein, M. Hagen, M. Potthast, Efficient pairwise annotation of argument
     quality, in: D. Jurafsky, J. Chai, N. Schluter, J. R. Tetreault (Eds.), Proceedings of the 58th
     Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July
     5-10, 2020, Association for Computational Linguistics, 2020, pp. 5772–5781.
[20] M. Potthast, T. Gollub, M. Wiegmann, B. Stein, TIRA Integrated Research Architecture,
     in: N. Ferro, C. Peters (Eds.), Information Retrieval Evaluation in a Changing World, The
      Information Retrieval Series, Springer, Berlin Heidelberg New York, 2019. doi:10.1007/978-3-030-22948-1_5.