<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Retrieving Comparative Arguments using Deep Pre-trained Language Models and NLU</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Viktoriia Chekalina</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexander Panchenko</string-name>
        </contrib>
      </contrib-group>
      <abstract>
        <p>In this paper, we present our submission to the CLEF-2020 shared task on Comparative Argument Retrieval. We propose several approaches based on state-of-the-art NLP techniques, such as Seq2Seq, Transformer, and BERT embeddings. In addition to these models, we use features that describe the comparative structures and the comparability of a text. For the set of given topics, we retrieve the corresponding responses and rank them using these approaches. The presented solutions could help improve the processing of comparative queries in information retrieval and dialogue systems.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>People are faced with a multitude of choice problems on a daily basis. Such
questions can be related to products, e.g., which milk producer to trust, which fruit contains
less sugar, or which laptop brand is more reliable. Another popular type of
comparison is related to travel destinations, e.g., which cities or national parks to visit. Comparative
questions can also involve more complex objects or matters of comparison,
e.g., which country is safer to raise children in, Germany or the United States? Finally,
the fuzziness of comparative questions can go even further, into philosophical
questions with possibly no definitive answer, e.g., which political system is better for
maximizing the overall average happiness of a population. The comparative
information need is thus an omnipresent type of user information need.</p>
      <p>
        While for some categories of products, e.g., mobile phones and digital cameras,
tools for side-by-side comparison of features are available, for many domains, e.g.,
programming languages or databases, this information is not well structured. On the
other hand, the Web contains a vast number of opinions and objective arguments that
can facilitate the comparative decision-making process. The goal of our work is to
develop methods for the retrieval of such textual documents, which are highly relevant
for fulfilling the various comparative information needs of users. Recent research
on this topic has touched on some aspects of comparative question answering, e.g.,
a retrieval-based human-computer interaction interface for comparative queries [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ],
classification of comparative questions [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], or extraction of objects and aspects from comparative
texts [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], inter alia. However, the quality of retrieval of comparative answers has not been
evaluated to date.
      </p>
      <p>More specifically, this notebook contains a description of our approach used in the
submission to the CLEF-2020 shared task on Comparative Argument Retrieval
(https://events.webis.de/touche-20/shared-task-2.html), including all details necessary to reproduce our results. The source code and data used in our
submission are also available online (https://github.com/skoltech-nlp/touche). The contribution of our work is three-fold:
1. We are the first to use various deep pre-trained language models, such as ULMFiT and</p>
      <p>
        Transformer-based ones, for the task of comparative argument retrieval.
2. We are the first to experiment with features based on specialized sequence taggers of
comparative structures (detection of objects, predicates, and aspects of comparison)
that implement shallow Natural Language Understanding (NLU).
3. We are the first to experiment with features based on the density of comparative
sentences in a text (based on a pre-trained classifier of comparative sentences [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]).
      </p>
      <p>The remainder of this paper is organized as follows: Section 2 introduces the task,
Section 3 describes our methodology, Section 4 presents and discusses the results, and Section 5 concludes the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Task: Retrieval of Comparative Arguments on the Web</title>
      <p>
        The track Touché [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] suggests the following goal: given a set of topics, one needs to
retrieve and rank documents according to their relevance. The relevant documents are
those which are helpful in making the comparative decision, i.e., those which directly
compare the target objects, weighing their pros and cons.
      </p>
      <p>The topic contains a question implying a comparison of two objects, e.g., “What is
better, a laptop or a desktop?”, “Which is better, Canon or Nikon?”, “Should I buy or
rent?”. An example of a topic is presented in Figure 1. Each topic consists of a title,
i.e., a short description similar to what a user could enter into an information
retrieval engine, but also contains two additional fields: description and narrative. These
fields specify more closely the context and semantics of the topic, and they are actually
used by human annotators to judge the retrieved documents. In our
experiments, we only used the “title” field.</p>
      <p>
        We use the topic title as a query to the ChatNoir search engine [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] (https://www.chatnoir.eu/doc), which extracts
documents from the ClueWeb12 corpus (https://lemurproject.org/clueweb12). In response to the query, ChatNoir returns a set of
documents containing titles, body texts, document identifiers, and search engine
scores. We try to retrieve 1,000 unique documents, but for some queries the system returns
fewer.
      </p>
      <p>The goal of our methods is to find, in this set of pre-retrieved candidates, the documents that most reliably and completely
answer the query question. In other words, the
document should be relevant to the topic, be trustworthy, and give a complete and
reasonable comparison.</p>
    </sec>
    <sec id="sec-3">
      <title>3 Methodology</title>
      <p>The main objective of the experiments is to develop a method that finds, among the retrieved
documents, those that meet the comparative criteria most fully and reasonably.</p>
      <p>In addition to the search system’s scoring, we employ pre-trained state-of-the-art
language models and methods for estimating the degree of a document’s comparativeness.</p>
      <p>Fig. 2 (schematic): topic (query) → ChatNoir (inverted index + BM25) → candidate documents and their scores → topic-document similarity computation → similarity scores → documents sorted by similarity score.</p>
      <p>
        This section contains short descriptions of the approaches for computing the
score of one document in the search engine’s response. All of the approaches described
below are run on the TIRA system [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. The computation of scores for the entire set of
responses by a certain method is schematically shown in Figure 2. The ranking process
is completed by sorting documents by these values. Ultimately, each of the presented
methods computes a similarity score s_{ij} between a topic t_i and a candidate document
d_j from a candidate set:
s_{ij} = \mathrm{sim}(t_i, d_j).    (1)
      </p>
      <p>The goal of every presented method is to compute, for each topic t_i, a vector of scores
s_i = (s_{i1}, ..., s_{iN_i}), where N_i is the number of candidate documents for the topic t_i.</p>
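      <p>To make this procedure concrete, the following minimal Python sketch shows the generic re-ranking loop shared by all of the methods below. The candidate-document format and the sim callback are illustrative assumptions, not the exact interface of our implementation.</p>
      <preformat>
def rank_candidates(topic_title, candidates, sim):
    """Compute s_ij = sim(t_i, d_j) for every candidate and sort by it.

    candidates: list of dicts with "title" and "body" fields (assumed format);
    sim: any of the similarity functions described in this section.
    """
    scored = [(sim(topic_title, doc), doc) for doc in candidates]
    # Ranking is completed by sorting documents by the similarity scores
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored
      </preformat>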
      <p>Overall, we submitted six different solutions on the test topic titles. Since no training data
were provided, it was not possible to evaluate the performance of the suggested strategies
during development. Below, we describe all proposed approaches in detail.</p>
      <sec id="sec-3-1">
        <title>3.1 Baseline based on an inverted index</title>
        <p>
          In our experiments, we utilize the ChatNoir system [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] as a candidate document extractor,
which was provided (as a baseline) by the organizers. ChatNoir is an Elasticsearch-based
(https://www.elastic.co) engine providing access to nearly 3 billion web pages from the ClueWeb and Common
Crawl corpora. Query processing is shared across several search nodes, which allows reaching
response times comparable to commercial systems. Text relevance estimation is based
on a custom BM25 scoring function
(https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-similarity.html), which ranks the set of texts depending on which of the query’s
tokens occur in each responded document.
        </p>
        <p>In response to the question in the topic title, the search system returns
documents with their titles and scores. We take the scores it provides and create a document ranking
based on them; thus, the similarity score is
s_{ij} = cn_{ij},    (2)
where cn_{ij} is the score provided by ChatNoir for the i-th title and the j-th responded document.</p>
        <p>It should be noted that the system’s output may contain similar documents. We look
through the response and remove documents with duplicated titles. We also clean the
documents’ bodies from HTML tags and markup.</p>
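        <p>A minimal sketch of this de-duplication and cleaning step, assuming the BeautifulSoup library for HTML stripping and a simple list-of-dicts response format (both are our assumptions, not the ChatNoir API):</p>
        <preformat>
from bs4 import BeautifulSoup  # assumed choice of HTML cleaner

def clean_response(docs):
    """Drop documents with duplicated titles and strip HTML from bodies."""
    seen_titles, unique_docs = set(), []
    for doc in docs:  # docs arrive ordered by the ChatNoir score cn_ij
        if doc["title"] in seen_titles:
            continue  # a near-duplicate of an earlier document
        seen_titles.add(doc["title"])
        # Remove HTML tags and markup, keeping plain text only
        doc["body"] = BeautifulSoup(doc["body"], "html.parser").get_text(" ", strip=True)
        unique_docs.append(doc)
    return unique_docs
        </preformat>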
      </sec>
      <sec id="sec-3-2">
        <title>3.2 Language model LSTM ULMFiT</title>
        <p>The simplest way to estimate the relevance of documents is to map the query and
the response into the same vector space. The relevance is then defined as the cosine similarity
between the retrieved objects.</p>
        <p>We assume that the hidden state of a recurrent network implicitly contains
information about the whole processed sequence. Providing the topic title and a response document’s
body to an LSTM as input yields, in the hidden state of the last step, their compressed
representations.</p>
        <sec id="sec-3-2-1">
          <title>5 https://www.elastic.co</title>
          <p>6
https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modulessimilarity.html</p>
        <p>
          The modification of the hidden state at each step depends on the parameters of
the model. We employ the weights of the pre-trained Universal Language Model
Fine-tuning (ULMFiT) [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] model. The state-of-the-art language model AWD-LSTM [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] is tuned
on Wikitext-103 [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], which collects 28,595 Wikipedia articles and 103 million words.
The model class consists of 3 LSTM layers, an encoder-decoder, dropouts, and a linear layer.
We use the class definition from the fastai library (https://github.com/fastai/fastai), extract only the LSTM layers, and apply them to
texts.
        </p>
        <p>We pass the query and the document body through these layers. The input tokens
are transformed into vectors using the bert-as-service library (https://github.com/hanxiao/bert-as-service).</p>
        <p>The similarity score for this method is computed as follows:
s_{ij} = \cos(h_i, h_j),    (3)
where h_i is the hidden state of the LSTM fed with the i-th topic’s title and h_j is
the hidden state for the j-th response’s body.</p>
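        <p>The following sketch illustrates this computation, with a plain PyTorch LSTM standing in for the pre-trained AWD-LSTM layers and an embed callback standing in for the bert-as-service token vectors (both stand-ins are assumptions):</p>
        <preformat>
import torch
import torch.nn.functional as F

def lstm_similarity(lstm, embed, query_tokens, body_tokens):
    """Cosine similarity between the last hidden states, Eq. (3)."""
    def last_hidden(tokens):
        x = embed(tokens).unsqueeze(0)   # (1, seq_len, emb_dim)
        _, (h_n, _) = lstm(x)            # h_n: (num_layers, 1, hidden_dim)
        return h_n[-1, 0]                # last layer's state after the last step
    h_i = last_hidden(query_tokens)      # topic title representation
    h_j = last_hidden(body_tokens)       # document body representation
    return F.cosine_similarity(h_i, h_j, dim=0).item()
        </preformat>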
      </sec>
      <sec id="sec-3-3">
        <title>3.3 Attention of a Transformer-based language model</title>
        <p>The Transformer is a neural-network encoder-decoder model used as an
alternative to recurrent neural networks such as LSTM, e.g., the ULMFiT model described in
the previous section. The key innovation of the Transformer is the attention mechanism,
which at each step calculates the importance of each word of the input sequence.</p>
        <p>Information from pre-trained attention layers can be used to analyze the closeness
of the query and the response. A Transformer can deal with a pair of input sequences
separated by a special character. The attention layer returns the mutual weights of
every word of this pair. Since we are interested in the relation between the topic and the retrieved
document, the input pair is composed of them. The input is “[CLS]” + query
+ “[SEP]” + response document’s body + “[SEP]”, where “[CLS]” and “[SEP]” are
special symbols used when processing sentences with a Transformer.</p>
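        <p>As an illustration of this paired input, the per-head attention maps can be obtained, for instance, with the HuggingFace transformers library; the library choice and model name here are our assumptions, not necessarily the exact implementation used in the experiments.</p>
        <preformat>
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)

def attention_maps(query, body):
    """Encode "[CLS]" + query + "[SEP]" + body + "[SEP]" and return attentions."""
    enc = tokenizer(query, body, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**enc)
    # out.attentions: one (1, num_heads, L, L) tensor per layer
    return out.attentions
        </preformat>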
      </sec>
      <sec id="sec-3-4">
        <title>Selecting the appropriate Transformer heads</title>
        <p>The attention layer in the standard
Transformer provides 12 outputs, named heads. Each head captures its own, not
predefined, meaning. For every token in an encoded sequence, one head gives weights
for all input tokens. If we encode the input against itself, we get a matrix of adjacency weights
for each input word.</p>
        <p>Using the obtained matrix, we can build a map of attention. In Figure 3, these
structures are shown for the input “[CLS]” + Which is better, a laptop or a desktop? + “[SEP]”
+ Laptop is preferable than desktop because it is more portable. + “[SEP]”. The bright
vertical stripe corresponds to the separation token and should be excluded from
consideration. To estimate the interconnection of words from different sentences, only the
upper left and lower right corners of the map should be taken into account; the
“nondiagonal” upper right and lower left parts describe the response of one sentence of the
input pair to itself.</p>
        <sec id="sec-3-4-1">
          <title>7 https://github.com/fastai/fastai 8 https://github.com/hanxiao/bert-as-service</title>
          <p>To use the Transformer efficiently, we need to select those outputs that provide
information relevant to response ranking. As can be observed in Figure 4, the third
head determines similar words in a pair of sequences, so we take it for scoring.</p>
          <p>To select other suitable heads, we design a sandbox experiment. We take the query
“Which is better, a laptop or a desktop?” and make a set of 4 documents consisting of
one sentence each. Two of these documents are retrieved from top Google results for the query
defined above and are marked as relevant. The other two are taken from “The Hunting of
the Snark” by Lewis Carroll and are considered unreasonable. The query and
the obtained sentences are in Table 1. The idea of the experiment is to process the paired
input with the Transformer attention layer and observe at which outputs the value of the
sum differs most between the relevant and irrelevant documents.</p>
          <p>We apply the Transformer to the query merged with one of the four sentences. For each
of the 12 Transformer heads, we count the sum of the weights from the upper left and lower
right corners. The most significant variation appears in heads 4, 10, and 11 (Table 2). These heads
are taken into consideration when the similarity score is created.</p>
      </sec>
      <sec id="sec-3-5">
        <title>Counting the similarity score using the attention layer</title>
        <p>The response scoring consists of two steps: first, we concatenate the query with the examined document and process it by the
Transformer attention layers; second, we count the sum of the appropriate parts of the maps
for heads 3, 4, 10, and 11. Thereby, the similarity score for the i-th topic title and the j-th retrieved
document is calculated as
s_{ij} = \sum_{h \in \{3,4,10,11\}} \left[ \sum_{l=1}^{Q-1} \sum_{m=Q+1}^{Q+R-1} w_{lm} + \sum_{l=Q+1}^{Q+R-1} \sum_{m=1}^{Q-1} w_{lm} \right],    (4)
where Q is the length of the query (the i-th topic title) and R is the length of the j-th document
body. w_{lm} is the attention weight from the l-th to the m-th token of the input; the ranges of the indices l
and m select the proper part of the attention map. The considered attention heads are
enumerated by h. The idea of this approach for the third head is illustrated in Figure 4.
Namely, the zones highlighted in red correspond to the similarity of words from the
query to those from a candidate document. The other parts represent the self-similarity of the
query and the document and thus have a somewhat trivial sparsity pattern.</p>
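        <p>A sketch of Eq. (4) over such an attention map follows; the 0-based head and token indexing and the exact block boundaries are our assumptions.</p>
        <preformat>
import numpy as np

HEADS = [3, 4, 10, 11]  # heads selected by the sandbox experiment

def attention_score(attn, Q, R):
    """Sum the two cross blocks of the attention map, Eq. (4).

    attn: array of shape (num_heads, L, L) with weights w_lm for the input
    "[CLS]" + query + "[SEP]" + body + "[SEP]"; Q and R are the query and
    body lengths.
    """
    score = 0.0
    for h in HEADS:
        w = attn[h]
        # query tokens (rows) attending to document tokens (columns)
        score += w[1:Q, Q + 1:Q + R].sum()
        # document tokens (rows) attending to query tokens (columns)
        score += w[Q + 1:Q + R, 1:Q].sum()
    return score
        </preformat>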
        <p>Table 1. The query and the one-sentence candidate documents of the sandbox experiment.
Query: Which is better, a laptop or a desktop?
Document 1: Laptop is preferable than desktop because it is more portable.
Document 2: If you need portability, the laptop is the best option than desktop.
Document 3: The crew was complete: it included boots, a maker of bonnets and hoods.</p>
        <p>Table 2. Sums of the attention heads’ outputs for relevant and unrelated responses. In every
cell, there is a value counted over the upper left and lower right corners of the attention map. Bold
columns have a high variation between close and random answers, which means that they are sensitive
to the proximity. For example, the row for Query + Document 4 reads:
1.67, 3.49, 0.001, 2.83, 4.33, 1.90, 3.30, 1.40, 3.42, 4.47, 3.92, 3.01.</p>
      </sec>
      <sec id="sec-3-8">
        <title>3.4 Bidirectional encoder representations from Transformer (BERT)</title>
        <p>
          The architecture of Bidirectional Encoder Representations from Transformers (BERT)
[
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] is based on the Transformer and is pre-trained on masked language modeling tasks. The
result is a bidirectional language model that can give distributed representations of words
that take contextual information into account.
        </p>
        <p>We employ the bert-as-service library to obtain word embeddings from BERT. This
library uses pre-trained weights of the large uncased model with 340M parameters
and encodes every word in the query and in the document title from ChatNoir’s response. The vectors
corresponding to the query and the title are averages over all word embeddings in the respective
sequences. The similarity score between the query and the title is defined as
s_{ij} = \cos(e_{query}, e_{title}),    (5)
where e_{query} is the average of the embedded query tokens and e_{title} is the average of the
embedded tokens of the responded document’s title.</p>
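        <p>A minimal sketch of Eq. (5) with the bert-as-service client; with the server’s default REDUCE_MEAN pooling, encode() already returns the average over token embeddings, which matches e_{query} and e_{title} (the running server and its pooling setting are assumptions).</p>
        <preformat>
import numpy as np
from bert_serving.client import BertClient  # assumes a bert-as-service server is running

bc = BertClient()

def bert_title_similarity(query, title):
    """Cosine similarity between averaged BERT embeddings, Eq. (5)."""
    # With default REDUCE_MEAN pooling each row is already the token average
    e_query, e_title = bc.encode([query, title])
    return float(np.dot(e_query, e_title)
                 / (np.linalg.norm(e_query) * np.linalg.norm(e_title)))
        </preformat>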
      </sec>
      <sec id="sec-3-9">
        <title>3.5 Comparative feature extraction</title>
        <p>The scores produced by the approaches described above estimate the relevance of the topic and the
response as the closeness of the texts; in other words, they show how plausible and
appropriate the given response is. The closeness is calculated in the context of well-known
models (BERT, ULMFiT) trained on a huge amount of natural language text.
Such methods allow us to select documents that are similar in meaning but do not
evaluate the quality of the comparison explicitly.</p>
        <p>In order to evaluate the document as argumentation, we use a combination of
one of the previously described methods and an approach providing information about the
document’s argumentativeness. The resulting similarity score is the product of the
score provided by the chosen method and an additive term r. This term is computed per
document and represents a composition of features relying on the density of comparative
sentences and features derived from the number of comparative parameters present in
the text. Initially, r is equal to 1.</p>
        <p>
          The comparative degree of a document depends on the number of comparative
sentences it contains. To detect comparative sentences, we use the method described in [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
It encodes the sentence with the InferSent embedder [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], then applies a gradient boosted
decision trees (XGBoost) [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] classifier to the resulting features. The XGBoost model is
pre-trained on the multi-domain Comparative Sentences Corpus 2019, formed by 7,199
sentences. It determines the probability of the considered sentence being regular or
comparative. If the comparative probability is greater than 0.2, the counter of
comparative sentences is incremented.
        </p>
        <p>After applying the classifier of comparatives, r is increased by the number of revealed
sentences n:
r = r + n.    (6)</p>
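        <p>A sketch of this counting step, where embed_sentences stands in for the InferSent encoder and clf for the pre-trained XGBoost classifier (both placeholders are assumptions about the interface, not the actual objects):</p>
        <preformat>
def comparative_term(sentences, embed_sentences, clf, threshold=0.2):
    """Initialize the additive term and apply Eq. (6): r = r + n."""
    r = 1.0                                 # initial value of r
    feats = embed_sentences(sentences)      # InferSent sentence features
    probs = clf.predict_proba(feats)[:, 1]  # P(sentence is comparative)
    n = int((probs > threshold).sum())      # comparative-sentence counter
    return r + n
        </preformat>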
        <p>To check whether the document matches exactly what the user wants, we formalize
the comparative parameters. We determine the two comparison objects, the predicates
(comparison conditions, for example, “cheaper”), and the comparison features, or aspects (“for
children”, “for deep learning”). Tagging these parameters in a given sentence leads to
a sequence labeling problem. State-of-the-art solutions provide low performance on
comparative cases, so we created and trained our own sequence labeling module to
achieve acceptable quality.</p>
        <p>
          Our model consists of a single-layer LSTM with 200 hidden units from [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. As
input to the recurrent network, we feed BERT embeddings of words. We train the
BERT and LSTM parts of the model together with learning rates of 0.00001 and 0.01,
respectively. As a target, we use a custom dataset structurally similar to that of Arora et al.
[
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] (https://github.com/uhh-lt/comparely), composed of 3,967 labeled comparative sentences from different domains.
        </p>
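        <p>A structural sketch of this tagger in PyTorch; the tag inventory size and the joint-training details are assumptions beyond what is described above.</p>
        <preformat>
import torch.nn as nn

class ComparativeTagger(nn.Module):
    """BERT word embeddings -> single-layer LSTM (200 units) -> token tags."""
    def __init__(self, emb_dim=768, hidden_dim=200, num_tags=4):
        super().__init__()
        # num_tags: object / predicate / aspect / other (assumed inventory)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=1, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_tags)

    def forward(self, bert_embeddings):      # (batch, seq_len, emb_dim)
        h, _ = self.lstm(bert_embeddings)    # (batch, seq_len, hidden_dim)
        return self.out(h)                   # per-token tag scores
        </preformat>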
        <p>To compute the comparative-parameters part of the additive term, we first process the
query with the sequence labeling model described above. The model extracts objects, aspects,
and predicates, formalizing the user’s request. Then we combine the document’s title
and body and, by a simple text search, try to detect the extracted parameters in it.</p>
        <p>The additive term changes according to the following law:
r = 1.2 \cdot r if we find one object, r = 1.5 \cdot r if we find two objects.    (7)</p>
        <p>The appearance in the text of one of the predicates or aspects, provided the objects are present,
additionally adds 1 to r:
r = r + l,    (8)
where l is the number of predicates or aspects found in the document.</p>
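        <p>The following sketch combines Eqs. (7) and (8); the plain substring matching mirrors the simple text search described above, while the argument format is an assumption.</p>
        <preformat>
def apply_structure_features(r, text, objects, predicates, aspects):
    """Update the additive term r with comparative-structure matches."""
    text = text.lower()
    found_objects = sum(obj.lower() in text for obj in objects)
    if found_objects == 1:
        r *= 1.2                  # Eq. (7): one comparison object found
    elif found_objects >= 2:
        r *= 1.5                  # Eq. (7): both comparison objects found
    if found_objects:             # predicates/aspects count only if objects occur
        l = sum(p.lower() in text for p in predicates + aspects)
        r += l                    # Eq. (8): r = r + l
    return r
        </preformat>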
      </sec>
      <sec id="sec-3-10">
        <title>3.6 Combination of Baseline, number of comparative sentences and comparative structure extraction</title>
        <p>For every document in the response, we compute the additive term and multiply it by the
engine’s score. The resulting similarity score is
s_{ij} = cn_{ij} \cdot r_{ij},    (9)
where cn_{ij} is the score issued by the ChatNoir system and r_{ij} is the additive term for the i-th title and the
j-th document, calculated as described above. To produce the answer, we rank the documents
by the resulting values.</p>
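        <p>In code, the combination is a per-document product; the same one-liner gives Eq. (10) below when ULMFiT scores are passed instead of ChatNoir scores (the list-based inputs are an assumption).</p>
        <preformat>
def combined_scores(base_scores, additive_terms):
    """Eq. (9): s_ij = cn_ij * r_ij (or Eq. (10) with ULMFiT scores)."""
    return [s * r for s, r in zip(base_scores, additive_terms)]
        </preformat>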
      </sec>
      <sec id="sec-3-11">
        <title>3.7 Combination of ULMFiT, number of comparative sentences and comparative structure extraction</title>
        <p>In this method, we do the same as in the previous section, with the only difference that
the scores counted by the method from Section 3.2 are used as the base value. We also
compute the additive term r_{ij} for the i-th title and the j-th document, and the resulting score is
s_{ij} = ulm_{ij} \cdot r_{ij},    (10)
where ulm_{ij} is the score of the ULMFiT-based approach.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Results</title>
      <sec id="sec-4-1">
        <title>4.1 Evaluation</title>
        <p>Each document retrieved by each of the seven approaches tested by our team was
manually evaluated on a 0-1-2 scale, where “0” means not relevant, “1” means the
document contains relevant information, e.g., characteristics of one of the objects, and
“2” means very relevant, i.e., the document directly compares the objects mentioned in
the topic in the required context.</p>
        <p>
          In addition to assessing the relevance, for every response, we estimate the pieces of
evidence provided in the document with a support retrieval model [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. Based on these
judgments, the official NDCG score was computed for each submission. The results are
discussed in the following section. The correspondence between the names of the methods
described above and the experiment run tags is given in the first and second columns of
Table 3.
        </p>
        <p>The top-5 discounted cumulative gain (DCG@5) scores for the proposed approaches are
given in Table 3.</p>
        <p>Table 3. Methods, running tags, and DCG@5 scores of the submitted runs.
§ 3.1 Baseline based on an inverted index: MyBaselineFilterResponse, 0.564
§ 3.6 Combination of Baseline and comparative features: Baseline_CAM_OBJ, 0.553
§ 3.7 Combination of ULMFiT and comparative features: ULMFIT_LSTM_CAM_OBJ, 0.464
§ 3.4 Bidirectional Encoder Representations from Transformer (BERT): myBertSimilarity, 0.405
§ 3.3 Attention of a Transformer-based language model: MethodAttentionFilterResponse, 0.223</p>
        <p>Table 3 shows that the approaches using only pre-trained language models give the
smallest scores. This can be explained by the fact that the information stored in a SOTA
language model is sufficient to estimate the appropriateness of the text but not enough
to assess how complete, persuasive, and supportive the document is. As in many other
tasks, the attention-based model performs better than ULMFiT: 0.223 against
0.200. This is due to the fact that the attention mechanism allows us to consider
meaningful context located at a distance from the current word, which makes
the model more expressive. The BERT-based model is a bidirectional extension of the
attention layer; therefore, its application increases the performance to 0.405.</p>
        <p>Overall, a combination of an approach with comparative information shows
better performance than the same method without comparative terms. Thus, consideration
of comparative structures and sense improves the results for ULMFiT from 0.200 to
0.464.</p>
        <p>
          The best quality is provided by the baseline model cleaned from document
duplicates. Its scoring function is based on the BM25 ranking formula but uses a more
efficient way of calculating term frequencies [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. It provides the ability to consider
information from all parts of the document, title and body, which gives it superiority over
methods that process only the title or only the document’s body. It should be noted
that the baseline gives an NDCG@5 of 0.565, and the baseline with CAM and object extraction
0.554. One reason for the decrease in quality when complementary information is added
is the choice of the weight with which we consider the CAM information and the number of
comparative structures.
        </p>
        <p>The main take-aways are as follows. First, the methods for re-ranking the
candidate documents that do not rely on the original baseline score, but instead completely
replace it with similarity scores based on language models, do not yield results superior
to the baseline; therefore, the original scores shall be used. Among all such
completely baseline-free methods, BERT-based similarity yielded the best results. Second,
a combination of custom features based on the density of comparative structures in
text with the baseline yields better results. Since no training data was provided
in this version of the shared task, it was not possible to test various combinations of the
features, but given such supervised training data, a promising way to further
improve the results is to combine the various signals using a supervised machine learning
model.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5 Conclusion</title>
      <p>In this paper, we present our solution to the Argument Retrieval shared task. Our main
innovations are (i) the use of large pre-trained language models, (ii) the use of features
based on natural language understanding of comparative sentences, and (iii) the use of
features based on the density of comparative sentences. It should be noted that modern
language models capture response relevance quite well, but to assess the comparability
and argumentation of the answer, we need to add external features.</p>
      <p>Overall, according to the experimental results, the baseline information retrieval
model proved to be a hard baseline. In fact, among all 11 evaluated runs in the shared
task, only one outperformed the baseline, by more than 0.5%, which is a substantial
difference (https://events.webis.de/touche-20/shared-task-2.html#results). The results suggest that, when the score takes into account claim support
and existing evidence, models based on SOTA language models do not work as well
as models that combine comparative structure and comparative sentiment in
sentences. We conclude that, in future work, more combinations of baseline IR models
with comparative features shall be investigated.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>J.</given-names>
            <surname>Arora</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Goyal</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Pathak</surname>
          </string-name>
          .
          <article-title>Extracting entities of interest from comparative product reviews</article-title>
          .
          <source>In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management</source>
          ,
          <source>CIKM '17</source>
          , pages
          <fpage>1975</fpage>
          -
          <lpage>1978</lpage>
          , New York, NY, USA,
          <year>2017</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>J.</given-names>
            <surname>Bevendorff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hagen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          .
          <article-title>Elastic ChatNoir: Search Engine for the ClueWeb and the Common Crawl</article-title>
          . In L. Azzopardi,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hanbury</surname>
          </string-name>
          , G. Pasi, and B. Piwowarski, editors,
          <source>Advances in Information Retrieval. 40th European Conference on IR Research (ECIR</source>
          <year>2018</year>
          ), Lecture Notes in Computer Science, Berlin Heidelberg New York, Mar.
          <year>2018</year>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>A.</given-names>
            <surname>Bondarenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Braslavski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Völske</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Aly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Panchenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Biemann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Hagen</surname>
          </string-name>
          .
          <article-title>Comparative web search questions</article-title>
          .
          <source>In Proceedings of the 13th International Conference on Web Search and Data Mining</source>
          , pages
          <fpage>52</fpage>
          -
          <lpage>60</lpage>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>A.</given-names>
            <surname>Bondarenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Beloucif</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gienapp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ajjour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Panchenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Biemann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wachsmuth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Hagen</surname>
          </string-name>
          .
          <article-title>Overview of Touché 2020: Argument Retrieval</article-title>
          .
          <source>In Working Notes Papers of the CLEF 2020 Evaluation Labs, Sept</source>
          .
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>L.</given-names>
            <surname>Braunstain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Kurland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Carmel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Szpektor</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Shtok</surname>
          </string-name>
          .
          <article-title>Supporting human answers for advice-seeking questions in cqa sites</article-title>
          . In N. Ferro,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.-F. Moens</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Mothe</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Silvestri</surname>
            ,
            <given-names>G. M.</given-names>
          </string-name>
          <string-name>
            <surname>Di Nunzio</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Hauff</surname>
          </string-name>
          , and G. Silvello, editors,
          <source>Advances in Information Retrieval</source>
          , pages
          <fpage>129</fpage>
          -
          <lpage>141</lpage>
          , Cham,
          <year>2016</year>
          . Springer International Publishing.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>T.</given-names>
            <surname>Chen</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Guestrin</surname>
          </string-name>
          .
          <article-title>Xgboost: A scalable tree boosting system</article-title>
          . In B. Krishnapuram,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Smola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Aggarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Shen</surname>
          </string-name>
          , and R. Rastogi, editors,
          <source>KDD</source>
          , pages
          <fpage>785</fpage>
          -
          <lpage>794</lpage>
          . ACM,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>A.</given-names>
            <surname>Chernodub</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Oliynyk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Heidenreich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bondarenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hagen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Biemann</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Panchenko</surname>
          </string-name>
          . TARGER:
          <article-title>Neural argument mining at your fingertips</article-title>
          .
          <source>In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations</source>
          , pages
          <fpage>195</fpage>
          -
          <lpage>200</lpage>
          , Florence, Italy,
          <year>July 2019</year>
          . Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>A.</given-names>
            <surname>Conneau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kiela</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schwenk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Barrault</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Bordes</surname>
          </string-name>
          .
          <article-title>Supervised learning of universal sentence representations from natural language inference data</article-title>
          .
          <source>In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing</source>
          , pages
          <fpage>670</fpage>
          -
          <lpage>680</lpage>
          , Copenhagen, Denmark, Sept.
          <year>2017</year>
          . Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          . BERT:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          .
          <source>In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers), pages
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          , Minneapolis, Minnesota,
          <year>June 2019</year>
          . Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>J.</given-names>
            <surname>Howard</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Ruder</surname>
          </string-name>
          .
          <article-title>Universal language model fine-tuning for text classification</article-title>
          .
          <source>In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          , pages
          <fpage>328</fpage>
          -
          <lpage>339</lpage>
          , Melbourne, Australia,
          <year>July 2018</year>
          . Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>S.</given-names>
            <surname>Merity</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. S.</given-names>
            <surname>Keskar</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          .
          <article-title>Regularizing and optimizing lstm language models</article-title>
          .
          <source>In ICLR (Poster)</source>
          .
          <source>OpenReview.net</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>S.</given-names>
            <surname>Merity</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bradbury</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          .
          <article-title>Pointer sentinel mixture models</article-title>
          .
          <source>In ICLR (Poster)</source>
          .
          <source>OpenReview.net</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>A.</given-names>
            <surname>Panchenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bondarenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Franzek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hagen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Biemann</surname>
          </string-name>
          .
          <article-title>Categorizing comparative sentences</article-title>
          .
          <source>In Proceedings of the 6th Workshop on Argument Mining</source>
          , pages
          <fpage>136</fpage>
          -
          <lpage>145</lpage>
          , Florence, Italy, Aug.
          <year>2019</year>
          . Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gollub</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegmann</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          . TIRA Integrated Research Architecture. In N. Ferro and C. Peters, editors,
          <source>Information Retrieval Evaluation in a Changing World, The Information Retrieval Series</source>
          . Springer, Sept.
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hagen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Graßegger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Michel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tippmann</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Welsch</surname>
          </string-name>
          .
          <article-title>Chatnoir: a search engine for the clueweb09 corpus</article-title>
          .
          <source>In Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval</source>
          , pages
          <fpage>1004</fpage>
          -
          <lpage>1004</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>S.</given-names>
            <surname>Robertson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zaragoza</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Taylor</surname>
          </string-name>
          .
          <article-title>Simple BM25 extension to multiple weighted fields</article-title>
          .
          <source>In Proceedings of the 13th ACM International Conference on Information and Knowledge Management (CIKM '04)</source>
          , pages
          <fpage>42</fpage>
          -
          <lpage>49</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <given-names>M.</given-names>
            <surname>Schildwächter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bondarenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zenker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hagen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Biemann</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Panchenko</surname>
          </string-name>
          .
          <article-title>Answering comparative questions: Better than ten-blue-links?</article-title>
          <source>In Proceedings of the 2019 Conference on Human Information Interaction and Retrieval</source>
          , pages
          <fpage>361</fpage>
          -
          <lpage>365</lpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>