=Paper=
{{Paper
|id=Vol-2696/paper_210
|storemode=property
|title=Retrieving Comparative Arguments using Deep Pre-trained Language Models and NLU
|pdfUrl=https://ceur-ws.org/Vol-2696/paper_210.pdf
|volume=Vol-2696
|authors=Viktoriia Chekalina,Alexander Panchenko
|dblpUrl=https://dblp.org/rec/conf/clef/ChekalinaP20
}}
==Retrieving Comparative Arguments using Deep Pre-trained Language Models and NLU==
Retrieving Comparative Arguments using Deep Pre-trained Language Models and NLU
Notebook for the Touché Lab on Argument Retrieval at CLEF 2020

Viktoriia Chekalina‡,† and Alexander Panchenko‡
‡ Skolkovo Institute of Science and Technology, Moscow, Russia
† Philips Research Lab RUS, Moscow, Russia

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

Abstract. In this paper, we present our submission to the CLEF-2020 shared task on Comparative Argument Retrieval. We propose several approaches based on state-of-the-art NLP techniques such as Seq2Seq, Transformer, and BERT embeddings. In addition to these models, we use features that describe comparative structures and the comparability of a text. For the set of given topics, we retrieve the corresponding responses and rank them using these approaches. The presented solutions could help to improve the processing of comparative queries in information retrieval and dialogue systems.

1 Introduction

People are faced with a multitude of choice problems on a daily basis. Such questions can be related to products, e.g., which milk producer to trust, which fruit contains less sugar, or which laptop brand is more reliable. Another popular type of comparison is related to travel destinations, e.g., which cities or national parks to visit. Comparative questions can also involve more complex objects or matters of comparison, e.g., which country is safer to raise children in, Germany or the United States? Finally, the fuzziness of comparative questions can go even further into philosophical questions with possibly no definitive answer, e.g., which political system is better to maximize the overall average happiness of a population. The comparative information need is therefore an omnipresent type of information need of users.

While for some categories of products, e.g., mobile phones and digital cameras, tools for side-by-side comparison of features are available, for many domains, e.g., programming languages or databases, this information is not well structured. On the other hand, the Web contains a vast number of opinions and objective arguments that can facilitate the comparative decision-making process. The goal of our work is to develop methods for the retrieval of such textual documents, which are highly relevant for fulfilling the various comparative information needs of users. Recent research on this topic touched upon some aspects of comparative question answering, e.g., a human-computer interaction interface for comparative queries [17], classification of comparative questions [3], or extraction of objects and aspects from comparative texts [1], inter alia. However, the quality of retrieval of comparative answers has not been evaluated to date.

More specifically, this notebook describes the approach used in our submission to the CLEF-2020 shared task on Comparative Argument Retrieval (https://events.webis.de/touche-20/shared-task-2.html), including all details necessary to reproduce our results. The source code and data used in our submission are also available online (https://github.com/skoltech-nlp/touche). The contribution of our work is three-fold:
1. We are the first to use various deep pre-trained language models, such as ULMFiT and Transformer-based models, for the task of comparative argument retrieval.
2. We are the first to experiment with features based on specialized sequence taggers of comparative structures (detection of objects, predicates, and aspects of comparison) that implement a shallow Natural Language Understanding (NLU).
3. We are the first to experiment with features based on the density of comparative sentences in a text (based on a pre-trained classifier of comparative sentences [13]).

The remainder of this paper is organized as follows: Section 2 introduces the task, then Section 3 presents several variations of the proposed method. In Section 4, the results of the experiments based on the manual evaluation are discussed. Finally, Section 5 summarizes the main findings and directions for future work.

Figure 1. An example of a topic which specifies a user's comparative information need. Such topics are the inputs to our methods. The goal of our methods is to retrieve Web documents which fulfil the information need specified in such topics and help to make an informed choice.

2 Task: Retrieval of Comparative Arguments on the Web

The Touché track [4] suggests the following goal: given a set of topics, one needs to retrieve and rank documents according to their relevance. The relevant documents are those which are helpful in making the comparative decision, i.e., those which directly compare the target objects, facing their pros and cons. A topic contains a question implying a comparison of two objects, e.g., "What is better, a laptop or a desktop?", "Which is better, Canon or Nikon?", "Should I buy or rent?". An example of a topic is presented in Figure 1. Each topic consists of a title, i.e., a short query similar to what a user could enter into an information retrieval engine, but also contains two additional fields: description and narrative. These fields specify more closely the context and semantics of the topic and are actually used by human annotators to judge the retrieved documents. In our experiments, we only used the "title" field.

We use the topic title as a query to the ChatNoir search engine [15] (https://www.chatnoir.eu/doc), which extracts documents from the ClueWeb12 corpus (https://lemurproject.org/clueweb12). In response to the query, ChatNoir returns a set of documents containing titles, body texts, document identifiers, and search engine scores. We try to retrieve 1000 unique documents, but for some queries the system returns fewer. The goal of our methods is to find, in this set of pre-retrieved candidates, the documents that most reliably and completely answer the query question. In other words, a document should be relevant to the topic, be trustworthy, and give a complete and reasonable comparison.
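To make this retrieval step concrete, a minimal Python sketch is given below. It is an illustration rather than the code of our submission: the XML tag names follow the topic format shown in Figure 1, while the ChatNoir endpoint, request parameters, and response fields are assumptions that should be checked against the API documentation (https://www.chatnoir.eu/doc).

```python
# Illustrative sketch: load topics and query ChatNoir for candidate documents.
# The XML tag names follow the topic format shown in Figure 1; the endpoint,
# request parameters, and response fields are assumptions to be verified
# against https://www.chatnoir.eu/doc.
import xml.etree.ElementTree as ET
import requests

CHATNOIR_API = "https://www.chatnoir.eu/api/v1/_search"  # assumed endpoint

def load_topics(path="topics.xml"):
    """Return a list of topics with the fields shown in Figure 1."""
    topics = []
    for topic in ET.parse(path).getroot().iter("topic"):
        topics.append({
            "number": topic.findtext("number"),
            "title": topic.findtext("title"),
            "description": topic.findtext("description"),
            "narrative": topic.findtext("narrative"),
        })
    return topics

def retrieve_candidates(title, api_key, size=1000):
    """Ask ChatNoir for up to `size` ClueWeb12 candidates for a topic title."""
    payload = {"apikey": api_key, "query": title,
               "index": ["cw12"], "size": size}        # parameter names assumed
    response = requests.post(CHATNOIR_API, json=payload, timeout=60)
    response.raise_for_status()
    # Each hit is expected to carry a title, a document identifier, and a score.
    return response.json().get("results", [])
```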
3 Methodology

The main objective of the experiments is to develop a method that finds, among the retrieved documents, those that meet the comparative criteria most fully and reasonably. In addition to the search system's scoring, we employ pre-trained state-of-the-art language models and methods for estimating the degree of comparativeness of a document.

Figure 2. Overview of our methods: candidate documents and their scores are obtained using a traditional inverted index-based IR system (ChatNoir: an inverted index with BM25) and then re-ranked on the basis of topic-document similarity measures specific to each of the proposed approaches; the documents are finally sorted by similarity score.

This section contains short descriptions of the approaches used to compute the score of a single document in the search engine's response. All of the approaches described below are run on the TIRA system [14]. The computation of scores for the entire set of responses by a given method is schematically shown in Figure 2. The ranking process is completed by sorting the documents by these values. Ultimately, each of the presented methods computes a similarity score s_ij between a topic t_i and a candidate document d_j from a candidate set:

s_ij = sim(t_i, d_j).  (1)

The goal of every presented method is to compute for each topic t_i a vector of scores s_i = (s_i1, ..., s_iN_i), where N_i is the number of candidate documents for the topic t_i. Overall, we submitted six different solutions for the test topic titles. Since no training data were provided, we could not evaluate the performance of the suggested strategies during development. Below, we describe all proposed approaches in detail.

3.1 Baseline based on an inverted index

In our experiments, we utilize the ChatNoir system [2] as a candidate document extractor, which was provided (as a baseline) by the organizers. ChatNoir is an Elasticsearch-based (https://www.elastic.co) engine providing access to nearly 3 billion web pages from the ClueWeb and Common Crawl corpora. Query processing is shared across several search nodes, which allows reaching response times comparable to commercial systems. Text relevance estimation is based on a custom BM25 scoring function (https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-similarity.html), which ranks the set of texts depending on which of the query's tokens occur in each response document.

In response to the question in the topic title, the search system returns documents together with their titles and scores. We take the scores it provides and create a document ranking based on them, so the similarity score is

s_ij = cn_ij,  (2)

where cn_ij is the score provided by ChatNoir for the j-th document retrieved for the i-th topic title. It should be noted that the system's output may contain similar documents, so we look through the response and remove documents with duplicated titles. We also clean the documents' bodies from HTML tags and markup.
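A minimal sketch of this baseline re-ranking step could look as follows; it assumes that each candidate is represented as a dictionary with "title", "body", and "score" keys (an assumption about the data layout, not our exact submission code).

```python
# Sketch of the baseline re-ranking (Section 3.1): deduplicate by title, strip
# HTML from the body, and sort by the ChatNoir score (Eq. 2).  Candidates are
# assumed to be dictionaries with "title", "body", and "score" keys.
import re

def clean_html(text):
    """Crude removal of HTML tags and markup from a document body."""
    return re.sub(r"<[^>]+>", " ", text)

def baseline_ranking(candidates):
    seen_titles, unique_docs = set(), []
    for doc in candidates:
        if doc["title"] in seen_titles:
            continue                      # drop documents with duplicated titles
        seen_titles.add(doc["title"])
        unique_docs.append(dict(doc, body=clean_html(doc["body"])))
    # s_ij = cn_ij: the similarity score is the score provided by ChatNoir.
    return sorted(unique_docs, key=lambda d: d["score"], reverse=True)
```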
3.2 Language model LSTM ULMFiT

The simplest way to estimate the relevance of documents is to map the query and the response into the same vector space. The relevance is then defined as the cosine similarity between the retrieved objects.

We assume that the hidden state of a recurrent network implicitly contains information about the whole processed sequence. Feeding the topic title and a response document's body to an LSTM yields, in the hidden state of the last step, their compressed representations. The modification of the hidden state at each step depends on the parameters of the model. We employ the weights from the pre-trained Universal Language Model Fine-tuning (ULMFiT) model [10]: a state-of-the-art language model, AWD-LSTM [11], tuned on Wikitext-103 [12], which collects 28,595 Wikipedia articles and 103 million words. The model consists of three LSTM layers, an encoder and a decoder, dropouts, and a linear layer. We use the class definition from the fastai library (https://github.com/fastai/fastai), extract only the LSTM layers, and apply them to texts. We pass the query and the document body through these layers; the input tokens are converted into vectors using the bert-as-service library (https://github.com/hanxiao/bert-as-service). The similarity score for this method is computed as follows:

s_ij = cos(h_i, h_j),  (3)

where h_i is the hidden state of the LSTM fed with the i-th topic's title and h_j is the hidden state for the j-th response's body.
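The following PyTorch sketch illustrates Equation (3). Here `lstm` stands in for the LSTM layers extracted from the pre-trained ULMFiT model (a torch.nn.LSTM interface is assumed) and `embed` for a token embedder such as bert-as-service; both names are placeholders for illustration, not our actual implementation.

```python
# Sketch of Eq. (3): encode title and body with a pre-trained LSTM and compare
# the final hidden states by cosine similarity.  `lstm` is a stand-in for the
# LSTM layers extracted from the ULMFiT model (a torch.nn.LSTM is assumed) and
# `embed` maps a token string to a vector, e.g. via bert-as-service.
import torch
import torch.nn.functional as F

def encode(lstm, embed, text):
    """Return the last-layer hidden state after reading the embedded tokens."""
    tokens = text.split()                               # naive tokenization
    vectors = torch.stack([embed(t) for t in tokens])   # (seq_len, emb_dim)
    _, (h_n, _) = lstm(vectors.unsqueeze(1))            # (seq_len, batch=1, emb_dim)
    return h_n[-1, 0]

def ulmfit_score(lstm, embed, title, body):
    """s_ij = cos(h_i, h_j) for the i-th title and the j-th document body."""
    h_i = encode(lstm, embed, title)
    h_j = encode(lstm, embed, body)
    return F.cosine_similarity(h_i, h_j, dim=0).item()
```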
3.3 Attention of a Transformer-based language model

The Transformer is a neural encoder-decoder model used as an alternative to recurrent neural networks such as the LSTM, e.g., the ULMFiT model described in the previous section. The key innovation of the Transformer is the attention mechanism, which at each step calculates the importance of each word of the input sequence. Information from pre-trained attention layers can be used to analyze the closeness of the query and the response. A Transformer can deal with a pair of input sequences separated by a special token; the attention layer returns the mutual weights of every word of this pair. Since we are interested in the relation between the topic and the retrieved document, the input pair is composed of them. The input is "[CLS]" + query + "[SEP]" + response document's body + "[SEP]", where "[CLS]" and "[SEP]" are special symbols used when processing sentences with a Transformer.

Selecting the appropriate Transformer heads. The attention layer in the standard Transformer provides 12 outputs, called heads. Each head captures its own, not predefined, relation. For every token in an encoded sequence, one head gives weights for all input tokens. If we compute the attention of the input to itself, we get a matrix of pairwise weights for each input word.

Using the obtained matrix, we can build a map of attention. Figure 3 shows these maps for the input "[CLS]" + Which is better, a laptop or a desktop? + "[SEP]" + Laptop is preferable than desktop because it is more portable. + "[SEP]". The bright vertical stripe corresponds to the separation token and should be excluded from consideration. To estimate the interconnection of words from different sentences, only the upper-left and lower-right corners of the map should be taken into account; the remaining "non-diagonal" upper-right and lower-left parts describe the attention of each sentence in the input pair to itself.

Figure 3. Maps of the 12 attention heads for the input "[CLS]" + Which is better, a laptop or a desktop? + "[SEP]" + Laptop is preferable than desktop because it is more portable. + "[SEP]".

To use the Transformer efficiently, we need to select those outputs that provide information relevant to response ranking. As can be observed in Figure 4, the third head picks out similar words in the sequence pair, so we take it for scoring. To select other suitable heads, we designed a sandbox experiment. We take the query "Which is better, a laptop or a desktop?" and make a set of four one-sentence documents. Two of these documents are retrieved from top Google results for the query above and are marked as relevant. The other two are taken from "The Hunting of the Snark" by Lewis Carroll and are considered irrelevant. The query and the obtained sentences are shown in Table 1. The idea of the experiment is to pass the paired input through the Transformer attention layer and observe for which outputs the value of the sum for the relevant and irrelevant documents differs the most. We apply the Transformer to the query merged with each of the four sentences. For each of the 12 Transformer heads, we count the sum of weights from the upper-left and lower-right corners. The most significant variation appears in heads 4, 10, and 11 (Table 2). These heads are taken into consideration when the similarity score is computed.

Query: Which is better, a laptop or a desktop?
Document 1: Laptop is preferable than desktop because it is more portable.
Document 2: If you need portability, the laptop is the best option than desktop.
Document 3: The crew was complete: it included boots, a maker of bonnets and hoods.
Document 4: Just the pace for snark, the Bellman cried.
Table 1. A possible user request with semantically similar and distant responses for the sandbox experiment in which we select the appropriate Transformer heads.

Input | h0 | h1 | h2 | h3 | h4 | h5 | h6 | h7 | h8 | h9 | h10 | h11
Query + Document 1 | 1.62 | 3.70 | 0.001 | 5.54 | 7.79 | 2.04 | 4.58 | 1.38 | 3.46 | 6.01 | 6.05 | 4.38
Query + Document 2 | 1.70 | 4.27 | 0.001 | 5.08 | 7.53 | 1.93 | 4.69 | 1.25 | 3.62 | 6.37 | 6.81 | 5.55
Query + Document 3 | 1.80 | 4.20 | 0.001 | 4.87 | 4.97 | 1.72 | 3.34 | 1.35 | 3.09 | 4.97 | 3.73 | 1.11
Query + Document 4 | 1.67 | 3.49 | 0.001 | 2.83 | 4.33 | 1.90 | 3.30 | 1.40 | 3.42 | 4.47 | 3.92 | 3.01
Table 2. Sums of the attention heads' outputs for relevant and unrelated responses. Every cell contains the value counted over the upper-left and lower-right corners of the attention map. Columns with a high variation between close and random answers indicate heads that are sensitive to proximity.

Figure 4. The third attention head for the input "[CLS]" + Which is better, a laptop or a desktop? + "[SEP]" + Laptop is preferable than desktop because it is more portable. + "[SEP]". The biggest weights correspond to similar words. The areas highlighted in red describe the attention of tokens from one sentence to the other and are taken into account when calculating the score, corresponding to the two respective sums in Equation 4.

Computing the similarity score using the attention layer. The response scoring consists of two steps: first, we concatenate the query with the candidate document and process the pair through the Transformer attention layers; second, we sum the appropriate parts of the attention maps for heads 3, 4, 10, and 11. Thereby, the similarity score for the i-th topic title and the j-th retrieved document is calculated as

s_ij = Σ_{h ∈ {3,4,10,11}} ( Σ_{l=1}^{Q−1} Σ_{m=Q+1}^{Q+R−1} w_lm + Σ_{l=Q+1}^{Q+R−1} Σ_{m=1}^{Q−1} w_lm ),  (4)

where Q is the length of the query (the i-th topic title), R is the length of the j-th document body, and w_lm is the attention weight from the l-th to the m-th token of the input; the ranges of the indices l and m describe the proper parts of the attention map, and the considered attention heads are enumerated by h. The idea of this approach is illustrated for the third head in Figure 4: the zones highlighted in red correspond to the similarity of words from the query to those from a candidate document, while the other parts represent the self-similarity of the query and of the document and thus have a somewhat trivial sparsity pattern.
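The sketch below illustrates Equation (4) using the HuggingFace transformers library as a stand-in for our Transformer; using the first attention layer of bert-base-uncased is an assumption made purely for illustration.

```python
# Sketch of Eq. (4) with the HuggingFace `transformers` library as a stand-in
# for our Transformer; using the first attention layer of bert-base-uncased is
# an assumption made purely for illustration.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

HEADS = (3, 4, 10, 11)

def attention_score(query, body):
    enc = tokenizer(query, body, return_tensors="pt", truncation=True)
    with torch.no_grad():
        attentions = model(**enc).attentions      # one tensor per layer
    attn = attentions[0][0]                       # first layer, batch 0: (heads, len, len)
    q = enc["input_ids"][0].tolist().index(tokenizer.sep_token_id)  # Q = first [SEP]
    total = 0.0
    for h in HEADS:
        w = attn[h]
        # query tokens attending to document tokens and vice versa (Eq. 4)
        total += w[1:q, q + 1:-1].sum() + w[q + 1:-1, 1:q].sum()
    return float(total)
```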
3.4 Bidirectional Encoder Representations from Transformers (BERT)

The architecture of Bidirectional Encoder Representations from Transformers (BERT) [9] is based on the Transformer and is pre-trained on masked language modeling tasks. The result is a bidirectional language model that provides distributed representations of words that take contextual information into account. We employ the bert-as-service library to obtain word embeddings from BERT. This library uses pre-trained weights from the large uncased model with 340M parameters and encodes every word in the query and in the document title from ChatNoir's response. The vectors corresponding to the query and the title are the averages over all word embeddings in the respective sequences. The similarity score between query and title is defined as

s_ij = cos(e_query, e_title),  (5)

where e_query is the average over the embedded query tokens and e_title is the average over the embedded tokens of the responded document's title.
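A minimal sketch of this scoring step with the bert-as-service client is shown below. It assumes a running bert-as-service server whose default REDUCE_MEAN pooling strategy returns the average of the token embeddings for each input text, which corresponds to Equation (5).

```python
# Sketch of Eq. (5): cosine similarity between mean-pooled BERT embeddings of
# the query and the document title, assuming a running bert-as-service server
# with the default REDUCE_MEAN pooling strategy.
import numpy as np
from bert_serving.client import BertClient

def bert_title_score(query, title, client=None):
    bc = client or BertClient()
    e_query, e_title = bc.encode([query, title])   # one pooled vector per text
    return float(np.dot(e_query, e_title) /
                 (np.linalg.norm(e_query) * np.linalg.norm(e_title)))
```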
3.5 Comparative feature extraction

The scores computed by the approaches described above estimate the relevance of a response to the topic as the closeness of the two texts; in other words, they show how plausible and appropriate the given response is. The closeness is calculated in the context of well-known models (BERT, ULMFiT) trained on a huge amount of natural-language text. Such methods allow us to select documents that are similar in meaning, but they do not evaluate the quality of the comparison explicitly. In order to evaluate a document as argumentation, we combine one of the previously described methods with an approach that provides information about the document's argumentativeness. The resulting similarity score is the product of the score provided by the chosen method and a comparativeness term r. This term is computed per document and represents a composition of features relying on the density of comparative sentences and features derived from the number of comparative parameters present in the text. Initially, r is equal to 1.

The comparative degree of a document depends on the number of comparative sentences it contains. To detect comparative sentences, we use the method described in [13]: it encodes a sentence with the InferSent embedder [8] and then applies a gradient boosted decision tree (XGBoost) [6] classifier to the resulting features. The XGBoost model is pre-trained on the multi-domain Comparative Sentences Corpus 2019, formed by 7,199 sentences, and determines the probability that a considered sentence is regular or comparative. If the comparative probability is greater than 0.2, the counter of comparative sentences is incremented. After applying the classifier of comparatives, r increases by the number of revealed sentences n:

r = r + n.  (6)

To check whether the document matches exactly what the user wants, we formalize the comparative parameters. We determine the two comparison objects, predicates (comparison conditions, for example, "cheaper"), and comparison features, i.e., aspects ("for children", "for deep learning"). Tagging these parameters in a given sentence leads to a sequence labeling problem. State-of-the-art solutions provide low performance on comparative cases, so we created and trained our own sequence-labeling module to achieve acceptable quality. Our model consists of a single-layer LSTM with 200 hidden units, following [7]. The recurrent network takes BERT embeddings of words as input. We train the BERT and LSTM parts of the model jointly with learning rates of 0.00001 and 0.01, respectively. As training data, we use a custom dataset structurally similar to that of Arora et al. [1] (https://github.com/uhh-lt/comparely), composed of 3,967 labeled comparative sentences from different domains.

To compute the comparative-parameter part of the term r, we first process the query with the sequence labeling model described above. The model extracts objects, aspects, and predicates and thereby formalizes the user's request. Then we concatenate the document's title and body and, by a simple text search, try to detect the extracted parameters in it. The term changes according to the following rule:

r = 1.2 ∗ r if we find one of the objects, and r = 1.5 ∗ r if we find two objects.  (7)

If the objects are present, the appearance of one of the predicates or aspects in the text additionally adds 1 to r:

r = r + l,  (8)

where l is the number of predicates or aspects found in the document.

3.6 Combination of the baseline, the number of comparative sentences, and comparative structure extraction

For every document in the response, we compute the term r and multiply it by the engine's score. The resulting similarity score is

s_ij = cn_ij ∗ r_ij,  (9)

where cn_ij is the score issued by the ChatNoir system and r_ij is the comparativeness term for the i-th title and the j-th document, calculated as described above. To produce the answer, we rank the documents by the resulting values.

3.7 Combination of ULMFiT, the number of comparative sentences, and comparative structure extraction

In this method, we do the same as in the previous section, with the only difference that the scores computed by the method from Section 3.2 are used as the base value. We again compute the term r_ij for the i-th title and the j-th document, and the resulting score is

s_ij = ulm_ij ∗ r_ij,  (10)

where ulm_ij is the score of the approach based on ULMFiT.
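The following sketch summarizes how the term r is built and combined with a base score (Equations (6)-(10)). The callables `is_comparative` and `tag_query` are hypothetical stand-ins for the comparative sentence classifier of [13] and for our sequence-labeling model, respectively; their interfaces are assumptions made for illustration.

```python
# Sketch of the comparativeness term r (Eqs. 6-8) and its combination with a
# base score (Eqs. 9-10).  `is_comparative` returns the probability that a
# sentence is comparative (stand-in for the InferSent + XGBoost classifier of
# [13]); `tag_query` returns (objects, predicates, aspects) extracted from the
# query (stand-in for our sequence-labeling model).  Both interfaces are
# assumptions made for illustration.

def comparativeness_term(query, document, is_comparative, tag_query):
    r = 1.0
    # Eq. (6): count sentences whose comparative probability exceeds 0.2.
    r += sum(1 for s in document.split(".") if s.strip() and is_comparative(s) > 0.2)
    # Eq. (7): boost r if the comparison objects from the query occur in the text.
    objects, predicates, aspects = tag_query(query)
    text = document.lower()
    found_objects = [o for o in objects if o.lower() in text]
    if len(found_objects) == 1:
        r *= 1.2
    elif len(found_objects) >= 2:
        r *= 1.5
    # Eq. (8): if objects were found, add 1 per predicate or aspect in the text.
    if found_objects:
        r += sum(1 for p in predicates + aspects if p.lower() in text)
    return r

def combined_score(base_score, r):
    """Eqs. (9) and (10): multiply the ChatNoir or ULMFiT score by r."""
    return base_score * r
```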
4 Results

4.1 Evaluation

Each retrieved document from each of the seven approaches tested by our team was manually evaluated on a 0-1-2 scale, where "0" means not relevant, "1" means the document contains relevant information, e.g., characteristics of one of the objects, and "2" means very relevant, i.e., the document directly compares the objects mentioned in the topic in the required context. In addition to assessing relevance, for every response the evidence provided in the document was estimated with a support retrieval model [5]. Based on these judgments, the official NDCG score was computed for each submission. The results are discussed in the following section. The correspondence between the methods described above and the experiment run tags is given in the first two columns of Table 3.

Method | Run tag | NDCG@5
§ 3.1 Baseline based on an inverted index | MyBaselineFilterResponse | 0.564
§ 3.6 Combination of the baseline and comparative features | Baseline_CAM_OBJ | 0.553
§ 3.7 Combination of ULMFiT and comparative features | ULMFIT_LSTM_CAM_OBJ | 0.464
§ 3.4 Bidirectional Encoder Representations from Transformers (BERT) | myBertSimilarity | 0.404
§ 3.3 Attention of a Transformer-based language model | MethodAttentionFilterResponse | 0.223
§ 3.2 Language model LSTM ULMFiT | ULMFIT_LSTM | 0.200
Table 3. Results of the ranking quality measure (NDCG@5) for the described methods.

4.2 Discussion

The NDCG@5 scores for the proposed approaches are given in Table 3. The table shows that the approaches using only pre-trained language models give the lowest scores. This can be explained by the fact that the information stored in a SOTA language model is sufficient to estimate the appropriateness of a text but not enough to assess how complete, persuasive, and supportive the document is. As in many other tasks, the attention-based model performs better than ULMFiT (0.223 against 0.200). This is due to the fact that the attention mechanism allows considering meaningful context located at a distance from the current word, which makes the model more expressive. The BERT-based model is a bidirectional extension of the attention layer; therefore, its application increases the performance to 0.404.

Overall, a combination of an approach with comparative information shows better performance than the same method without comparative terms. Thus, consideration of comparative structures and sense improves the results for ULMFiT from 0.200 to 0.464. The best quality is provided by the baseline model cleaned from document duplicates. Its scoring function is based on the BM25 ranking formula but uses a more efficient way of calculating term frequencies [16]. It is able to consider information from all parts of the document, title and body, which gives it an advantage over methods that process only the title or only the document's body. It should be noted that the baseline gives an NDCG@5 of 0.564, while the baseline with CAM and object extraction gives 0.553. One reason for the decrease in quality when complementary information is added is the choice of the weights with which we take the CAM information and the number of comparative structures into account.

The main take-aways are as follows. First, the methods for re-ranking the candidate documents that do not rely on the original baseline score but instead completely replace it with similarity scores based on language models do not yield results superior to the baseline; therefore, the original scores should be used. Among all such completely baseline-free methods, the BERT-based similarity yielded the best results. Second, a combination of the custom features based on the density of comparative structures in text with the baseline yields better results. Since no training data was provided in this version of the shared task, it was not possible to test various combinations of the tested features; given such supervised training data, a promising way to further improve the results is to combine the various signals using a supervised machine learning model.

5 Conclusion

In this paper, we present our solution to the argument retrieval shared task. Our main innovations are (i) the use of large pre-trained language models, (ii) the use of features based on natural language understanding of comparative sentences, and (iii) the use of features based on the density of comparative sentences. It should be noted that modern language models capture response relevance quite well, but to assess the comparability and argumentation of the answer, we need to add external features. Overall, according to the experimental results, the baseline information retrieval model proved to be a hard baseline: among all 11 evaluated runs in the shared task, only one outperformed the baseline by more than 0.5%, which is a substantial difference (https://events.webis.de/touche-20/shared-task-2.html#results). The results suggest that, when the score takes claim support and existing evidence into account, models based on SOTA language models do not work as well as models that combine comparative structure and comparative sentiment in sentences. We conclude that in future work, more combinations of baseline IR models with comparative features shall be investigated.

References

1. J. Arora, S. Agrawal, P. Goyal, and S. Pathak. Extracting entities of interest from comparative product reviews. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, CIKM '17, pages 1975–1978, New York, NY, USA, 2017. ACM.
2. J. Bevendorff, B. Stein, M. Hagen, and M. Potthast. Elastic ChatNoir: Search Engine for the ClueWeb and the Common Crawl. In L. Azzopardi, A. Hanbury, G. Pasi, and B. Piwowarski, editors, Advances in Information Retrieval. 40th European Conference on IR Research (ECIR 2018), Lecture Notes in Computer Science, Berlin Heidelberg New York, Mar. 2018. Springer.
3. A. Bondarenko, P. Braslavski, M. Völske, R. Aly, M. Fröbe, A. Panchenko, C. Biemann, B. Stein, and M. Hagen. Comparative web search questions. In Proceedings of the 13th International Conference on Web Search and Data Mining, pages 52–60, 2020.
4. A. Bondarenko, M. Fröbe, M. Beloucif, L. Gienapp, Y. Ajjour, A. Panchenko, C. Biemann, B. Stein, H. Wachsmuth, M. Potthast, and M. Hagen. Overview of Touché 2020: Argument Retrieval. In Working Notes Papers of the CLEF 2020 Evaluation Labs, Sept. 2020.
5. L. Braunstain, O. Kurland, D. Carmel, I. Szpektor, and A. Shtok. Supporting human answers for advice-seeking questions in CQA sites. In N. Ferro, F. Crestani, M.-F. Moens, J. Mothe, F. Silvestri, G. M. Di Nunzio, C. Hauff, and G. Silvello, editors, Advances in Information Retrieval, pages 129–141, Cham, 2016. Springer International Publishing.
6. T. Chen and C. Guestrin. XGBoost: A scalable tree boosting system. In B. Krishnapuram, M. Shah, A. J. Smola, C. Aggarwal, D. Shen, and R. Rastogi, editors, KDD, pages 785–794. ACM, 2016.
7. A. Chernodub, O. Oliynyk, P. Heidenreich, A. Bondarenko, M. Hagen, C. Biemann, and A. Panchenko. TARGER: Neural argument mining at your fingertips. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 195–200, Florence, Italy, July 2019. Association for Computational Linguistics.
8. A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670–680, Copenhagen, Denmark, Sept. 2017. Association for Computational Linguistics.
9. J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.
10. J. Howard and S. Ruder. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339, Melbourne, Australia, July 2018. Association for Computational Linguistics.
11. S. Merity, N. S. Keskar, and R. Socher. Regularizing and optimizing LSTM language models. In ICLR (Poster). OpenReview.net, 2018.
12. S. Merity, C. Xiong, J. Bradbury, and R. Socher. Pointer sentinel mixture models. In ICLR (Poster). OpenReview.net, 2017.
13. A. Panchenko, A. Bondarenko, M. Franzek, M. Hagen, and C. Biemann. Categorizing comparative sentences. In Proceedings of the 6th Workshop on Argument Mining, pages 136–145, Florence, Italy, Aug. 2019. Association for Computational Linguistics.
14. M. Potthast, T. Gollub, M. Wiegmann, and B. Stein. TIRA Integrated Research Architecture. In N. Ferro and C. Peters, editors, Information Retrieval Evaluation in a Changing World, The Information Retrieval Series. Springer, Sept. 2019.
15. M. Potthast, M. Hagen, B. Stein, J. Graßegger, M. Michel, M. Tippmann, and C. Welsch. ChatNoir: A search engine for the ClueWeb09 corpus. In Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1004–1004, 2012.
16. S. Robertson, H. Zaragoza, and M. Taylor. Simple BM25 extension to multiple weighted fields. In Proceedings of the 13th ACM International Conference on Information and Knowledge Management (CIKM '04), pages 42–49, 2004.
17. M. Schildwächter, A. Bondarenko, J. Zenker, M. Hagen, C. Biemann, and A. Panchenko. Answering comparative questions: Better than ten-blue-links? In Proceedings of the 2019 Conference on Human Information Interaction and Retrieval, pages 361–365, 2019.

Figure 1 (full text of the example topic, number 38):
Title: What are the differences between MySQL and PostgreSQL in performance?
Description: Before starting a new DB-related project, a software developer wants to find some tips about when to use which database. Back in their studies some years ago, they had learnt that the choice of a database management system is important when starting a new project. Not having too much experience with database-related software development, the user just remembers that historically, MySQL had been a default choice for creating and maintaining databases. Even though the performance differences between MySQL and PostgreSQL have been largely eliminated in recent versions, there still are differences worth considering like usage of indices, default installation, what types of replication / clustering are available and so on. Getting to know these issues will help the developer make a choice for the project.
Narrative: Highly relevant documents discuss differences between MySQL and PostgreSQL focusing on their performance such as query throughput on what hardware or time to write large amounts of data, but also development and support effort, etc. Highly relevant documents should provide a conclusion on main differences between the two database management systems and about typical scenarios that would favor either option. Relevant documents may help to form an opinion on the performance of either database by for instance describing the usage of one of the systems in some specific scenario(s). Documents discussing the history of MySQL and PostgreSQL, providing materials to learn SQL syntax, tutorials about database usage, etc. that do not elaborate on MySQL or PostgreSQL performance are not relevant.