Question Answering for Comparative Questions with GPT-2
Notebook for Touché at CLEF 2020

Bjarne Sievers
University of Leipzig, Germany
b.sievers@studserv.uni-leipzig.de

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

Abstract. Finding the best answer for comparative questions is a difficult problem in the field of Information Retrieval. Since the best answer ideally not only covers both subjects of the query but also puts them into relation with each other, keyword-based search alone has difficulty understanding the user's intention correctly. Language models like GPT-2, on the other hand, can distinguish fine nuances of intention in a query or sentence and generate a text conditioned on that phrase. We try to leverage this by generating substitute answer articles conditioned on the given query. Search results are then re-ranked by their textual similarity to the generated substitute answer articles.

1 Introduction

Nowadays, search engines are not only used for looking up pure facts, but also for opinions, discussions and arguments. Users now turn to search engines with questions like "Which is healthiest: coffee, green tea or black tea and why?" or "Do you prefer tampons or pads?". In these cases, the user is likely more interested in the arguments or personal testimonies than in the final answer. Platforms like Quora often list the exact question and several answers to it, but depending on the answer, little context might be given. Discussion forums or comments on private blogs might contain more interesting debates and replies.

However, search engines still struggle with this type of question. They usually either return websites that list facts for parts of the query or websites where the exact same question was posted. If a question is phrased in a different way, most search engines have trouble returning good results. Language models like GPT-2, on the other hand, have proven to precisely capture the semantics of a given prompt. Conditioned on such a prompt, they can generate almost arbitrary lengths of coherent text while understanding what the prompt is about. This has been demonstrated to be useful for summarization, translation and question answering [13].

The second task of the "1st Shared Task on Argument Retrieval" at Touché@CLEF 2020 [4] is about returning the best results for comparative questions from the ClueWeb12 corpus [1]. For this task, we propose using a language model like GPT-2 to generate the hypothetically perfect answer to a question. We can then compare this answer with actual search results from a classical search engine. By re-ranking them according to their similarity with the generated answer, we produce more useful search results.

2 Related Work

2.1 Argument Retrieval

Argument Retrieval is a part of Information Retrieval that is focused on providing the best arguments, either for debates or for comparative questions. Most research to date is concerned with mining arguments from the web, identifying argument units [6] and making them searchable [18]. The Comparative Argumentative Machine [15] can search the web for comparisons between two entities and summarize the tendency of online publications towards either of them as a percentage.

2.2 Deep Learning

Since the success of Deep Learning with Convolutional Neural Networks [10] at the ImageNet competitions, Deep Learning has been used very successfully in many areas.
In the field of language processing, Recurrent Neural Networks (RNNs) such as LSTMs [9] are used. Long-range dependencies in natural language can be a challenge here: words can refer to phrases or sentences that lie much further back in the sequence. To counter this problem, architectures are often used that can dynamically decide which previous tokens to focus their attention on [17]. These Transformers dispense with recurrence and instead rely on a modification of attention, the so-called self-attention. Transformers thus form the basis of almost all current generative language models, for example GPT-2 [13] and T-NLG [2].

2.3 Text Similarity

There are different approaches to capturing the similarity between texts. [8] divides similarity metrics into three main categories: similarity at the character or word level, similarity obtained by comparison with large amounts of text, and semantic similarity calculated over semantic word graphs. Sentence-BERT [14] is a neural network based on BERT [7] and pretrained on SNLI [5] and MultiNLI [19]. It generates semantic vectors at the sentence level that can be compared to each other, for example via their cosine similarity.

3 Approach

The architecture used in this approach consists of three steps. First, multiple queries are generated from the given query by replacing words with synonyms, one at a time. Second, results for all queries are retrieved with ChatNoir [3] and accumulated. Third, the retrieved results are re-ranked by their similarity to GPT-2-generated texts conditioned on the original query.

Figure 1: Architecture of the proposed approach.

3.1 Language Model

In order to best satisfy the information deficit of the users of our system, we have to provide an answer that is precisely tailored to the question. Let us suppose we understood exactly what the information deficit is and had the information and knowledge to give the answer directly. Even if we do not have this knowledge, we would probably use very similar words and expressions in our answer as a document representing the perfect answer would. A system that formulates an answer to a given search query which at first glance does not differ significantly from a human answer thus offers the possibility of reducing the problem of finding the best answer to a classical information retrieval problem. The interpretation of which aspect of the question the user meant and which answer style they expect is already decided by this system. The generated mock document can now be compared with the corpus (here ClueWeb12 [1]) by classical search engines, and existing documents can be returned as an answer.

Such systems have become conceivable due to the success of Deep Learning (DL) in Natural Language Processing (NLP). GPT-2 [13] from OpenAI and T-NLG [2] from Microsoft prove that DL systems are already able to distinguish fine nuances of meaning even in short text snippets and to generate and summarize suitable texts or even answer questions. Our approach uses the GPT-2 model to generate mock responses conditioned on the query. These generated texts are then used to find matching, similar documents in the corpus. We show that by using suitable suffixes on the question, certain answer formats can be favored, thus influencing the results.

In order to find personal opinions, field reports or subjective impressions of the question, one can, for example, append generic sentences to the question that one would typically read in forums or on question-answer platforms, as in the sketch below.
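The following is a minimal sketch of this conditioning step using the Hugging Face transformers library; it is an assumption of how such a call could look, not the code used for the runs. The query plus a forum-style suffix from Table 1 forms the prompt, GPT-2-M ("gpt2-medium") generates the mock answer; the sampling parameters are assumptions.

# Hypothetical sketch: generate a mock answer with GPT-2 conditioned on query + suffix.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")   # GPT-2-M, 345 million parameters
model = GPT2LMHeadModel.from_pretrained("gpt2-medium")

query = "Do you prefer tampons or pads?"
suffix = "\n\n I am really wondering what the answer is."  # suffix no. 3 from Table 1
prompt = query + suffix

input_ids = tokenizer.encode(prompt, return_tensors="pt")
output_ids = model.generate(
    input_ids,
    do_sample=True,                        # sampling instead of greedy decoding
    max_length=1024,                       # GPT-2's context window, prompt tokens included
    top_k=40,                              # assumed sampling hyperparameters
    temperature=0.8,
    pad_token_id=tokenizer.eos_token_id,
)
mock_answer = tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True)
print(mock_answer)

In the actual pipeline, ten such texts are generated per topic and generation continues until more than 700 characters are produced (see Section 3.2).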
If, on the other hand, one is more interested in a scientific treatment of the topic or in journalistic texts, appropriate phrases can be used instead. An overview of the tested suffixes can be found in Table 1.

3.2 Implementation

The code for this approach can be found on GitHub (github.com/bjrne/gpt-question-answering).

Before querying ChatNoir with the topic, each query is duplicated and adapted several times. This is done by removing all stopwords from the query and generating n queries, where n is the number of remaining words. In each query, one of the words is substituted by its closest neighbor in the GloVe graph [12] that has a Levenshtein distance greater than three. The Levenshtein distance is used in order to retrieve words that are neither singular or plural forms nor spelling variants of the original word. ChatNoir is then queried with each query, and the results are downloaded and accumulated.

For each original query, the system uses the GPT-2-M model to generate ten documents, each until a text length of more than 700 characters has been reached. This is necessary because especially the smaller GPT-2 models sometimes break off after one sentence. The upper text length is bounded by the length of the trained context of GPT-2, which is 1024 tokens including the tokens of the prompt (here: the query). After weighing up runtime and performance, GPT-2-M with 345 million parameters is used throughout this work. The generated documents are then compared with the accumulated search results of the ChatNoir queries. This similarity is later used as the score for the final ranking of the results. As possible similarity metrics, the cosine similarity on tf-idf vectors as well as on S-BERT embeddings (bert-base-nli-stsb-mean-tokens) were evaluated (see Section 3.4); the final run used tf-idf vectors due to the limited capabilities of the TIRA submission system.

3.3 Word-based Approach

Another approach to using language models to assess the relevance of documents is based on the models' internal understanding of words and their contexts. GPT-2 and other models use large Transformers [17] to assign to each word in their vocabulary a probability of being the next token in the sequence. If we condition GPT-2 on the search query, we can compare the first word of a given document with the probability GPT-2 predicts for that word. We then condition GPT-2 on the search query with the first word of the document appended and compare the prediction with the second word, and so on. Assuming an appropriate normalization over the length of the text, the sum over all word probabilities could reflect the relevance of the document, since GPT-2 can remember the meaning of the search query even over longer passages [13]. GPT-2 thus assigns low probabilities to off-topic words that it does not expect given the search query. This can be observed when documents that approach the query from a different direction are scored.

3.4 Evaluation

In order to evaluate whether the generated texts fit the question at all, a pairwise comparison was done between each of them. The similarity of the pairs was computed by the cosine similarity on the tf-idf vectors as well as on semantic embeddings of a language model based on BERT. We also compared the pairwise similarity between generated texts of different topics. Finally, we considered the similarity of the generated texts to the first ten search results per topic.
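As an illustration of the similarity computation used for the re-ranking in Section 3.2 and the comparisons in this section, the following sketch shows one plausible implementation with scikit-learn's tf-idf vectorizer; it is an assumption, not the authors' exact code.

# Hypothetical sketch of the re-ranking step: score each retrieved document by its
# mean cosine similarity, on tf-idf vectors, to the GPT-2-generated mock answers.
from typing import List, Tuple
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rerank(results: List[str], generated: List[str]) -> List[Tuple[float, str]]:
    vectorizer = TfidfVectorizer(stop_words="english")
    # Fit one shared vocabulary over the retrieved documents and the generated texts.
    matrix = vectorizer.fit_transform(results + generated)
    doc_vecs, gen_vecs = matrix[: len(results)], matrix[len(results):]
    # Average each document's similarity over all generated mock answers.
    scores = cosine_similarity(doc_vecs, gen_vecs).mean(axis=1)
    return sorted(zip(scores, results), reverse=True)

The S-BERT variant mentioned in Section 3.2 would instead encode both lists with the bert-base-nli-stsb-mean-tokens model from the sentence-transformers package and compare those embeddings in the same way.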
In order to get the best texts regardless of the topic, different types of conditioning were tested. Texts were generated based on the question alone as well as on the question with a prefix and suffix, and were compared qualitatively and quantitatively.

Text Generation. In a manual, qualitative evaluation of text quality, it is noticeable that almost all texts deal with the right topic over their entire length. Sometimes this comes at the expense of a desirable variety in the text; repetitions of the same statement are frequent. Sometimes the model ends up in an infinite loop that repeats the same phrases alternately or continuously. Furthermore, some texts do not go beyond the length of a few sentences. Sample texts can be found in Appendix 5.1.

N  Suffix
1  (none)
2  \n\n
3  \n\n I am really wondering what the answer is.
4  \n\n It is obvious that
5  \n\n Lets get to the truth about what the correct answer really is

Table 1: A list of the tested suffixes.

When comparing texts generated with different suffixes, a difference in writing style becomes obvious. Texts conditioned on suffixes containing "I" usually yield answers written from a first-person perspective or personal testimonies. A quantitative analysis that corroborates this observation can be found in Figure 2a. One can see that without a suffix added to the query (No. 1 & 2), the perspective of the query correlates with whether the answer is given in the first person or not. When a suffix is used, this correlation diminishes. Moreover, it can be suspected that more objective suffixes push the generation in a less personal direction (No. 4 & 5).

[Figure 2: Left (a): Occurrence of "I" in GPT-2-generated texts in relation to the suffix used (see Table 1), split by whether the question itself contains "I". Right (b): Average readability of the texts (Coleman-Liau Index, Gunning Fog Index, occurrence of syllable-rich words) in relation to the suffix used (see Table 1).]

Since a complete manual check of the fit of all texts to the prompt is out of scope, both because of subjectivity and effort, the generated texts were instead compared with each other and with the search results of the same query and of other topics by means of cosine similarity (see Figure 3b). In each row (generated texts) and each column (ChatNoir documents), the highest value is found on the diagonal, i.e. for the matching topic.

If the generated texts are examined using sentence embeddings produced by [14], their semantic similarity to the search queries of the same topic as well as to the first documents returned by ChatNoir becomes visible. Figure 3a shows the embeddings of all sentences for five topics, where the crosses mark the embedding of the respective search query. In a further comparison, Figure 4 shows sentence embeddings of the documents together with the generated texts for five topics. The dark colors represent the sentences of the generated texts, the lighter colors the documents. One can see in both figures that almost every sentence in the generated texts has a semantic reference to the question.
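A minimal sketch of how such a projection can be produced follows; it is assumed for illustration and is not the paper's analysis code. Every sentence is embedded with the S-BERT model named in Section 3.2 and projected to two dimensions with UMAP [11]; the example sentences and UMAP parameters are placeholders.

# Hypothetical sketch of the embedding analysis behind Figures 3a and 4:
# embed sentences with S-BERT and reduce them to 2-D with UMAP for plotting.
import umap
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("bert-base-nli-stsb-mean-tokens")  # model named in Section 3.2

query = "Which is better, laptop or desktop?"
sentences = [  # in practice: all sentences of the generated texts and retrieved documents
    "A desktop usually offers more performance for the same price.",
    "I prefer a laptop because I can take it anywhere.",
    "Desktops are easier to upgrade and repair.",
    "Battery life is the main selling point of a laptop.",
    "For gaming, a desktop with a dedicated GPU is the better choice.",
]

embeddings = model.encode(sentences + [query])                  # one vector per sentence
points_2d = umap.UMAP(n_neighbors=3, random_state=42).fit_transform(embeddings)
# points_2d[:-1] are the sentence positions; points_2d[-1] is the query (the cross).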
[Figure 3: Left (a): S-BERT embeddings of all sentences of the generated documents, reduced in dimensionality by UMAP [11]; embeddings of the queries are marked with a cross. Right (b): The average pairwise cosine similarity of sentence embeddings between the search result documents and the generated texts, first ten topics only.]

GPT-2 Bias. A major problem with this approach, which has not yet been sufficiently evaluated, is the bias present in language models. Since these are usually trained on freely available texts from the Internet, they "learn" to adopt various implicit and explicit discriminations. If the output of such systems is then used to understand search queries, this bias is automatically passed on and influences the results. For example, if answers to the question "Do I clean tiles better with a mop or vacuum cleaner" are generated and all answers and recommendations come from women, existing stereotypes are consolidated. Furthermore, new perspectives on a topic may be structurally prevented if these biased texts are treated as the ideal that the delivered documents should match. Other perspectives are thus likely to find it harder to be heard. A separate evaluation of the biases of this model is outside the scope of this paper, but since an unchanged, pre-trained GPT-2 model was used, relevant literature can be found in [16].

[Figure 4: S-BERT embeddings of all sentences of the generated texts in light colors and search result documents in dark colors, reduced in dimensionality by UMAP [11], for the five topics "What is the difference between sex and love?", "Which is better, laptop or desktop?", "Which is better, Canon or Nikon?", "What are the best dish detergents?" and "What are the best cities to live?". Embeddings of the queries are marked with a cross.]

4 Conclusion

Models like GPT-2 already understand the intention of the user sufficiently well to generate suitable results. But using this knowledge to create a good ranking is not trivial. The approaches presented here show the basic possibility, but do not deliver the hoped-for results. While a semantic similarity to the question can be recognized, and documents with similar vocabulary can thus be singled out, a ranking based on this similarity does not achieve a good ordering. To recognize good arguments, other approaches seem more promising.

References

1. ClueWeb12 - dataset. https://lemurproject.org/clueweb12/index.php, accessed: 6.3.2020
2. Turing-NLG - a 17 billion parameter language model by Microsoft. https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/, accessed: 6.3.2020
3. Bevendorff, J., Stein, B., Hagen, M., Potthast, M.: Elastic ChatNoir: Search engine for the ClueWeb and the Common Crawl. In: European Conference on Information Retrieval. pp. 820-824. Springer (2018)
4. Bondarenko, A., Hagen, M., Potthast, M., Wachsmuth, H., Beloucif, M., Biemann, C., Panchenko, A., Stein, B.: Touché: First shared task on argument retrieval. In: Jose, J.M., Yilmaz, E., Magalhães, J., Castells, P., Ferro, N., Silva, M.J., Martins, F. (eds.) Advances in Information Retrieval. pp. 517-523. Springer International Publishing, Cham (2020)
5. Bowman, S.R., Angeli, G., Potts, C., Manning, C.D.: A large annotated corpus for learning natural language inference. ArXiv abs/1508.05326 (2015)
6. Chernodub, A., Oliynyk, O., Heidenreich, P., Bondarenko, A., Hagen, M., Biemann, C., Panchenko, A.: TARGER: Neural argument mining at your fingertips. In: Costa-jussà, M., Alfonseca, E. (eds.)
57th Annual Meeting of the Association for Computational Linguistics (ACL 2019). Association for Computational Linguistics (Jul 2019), https://www.aclweb.org/anthology/P19-3031
7. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: NAACL-HLT (2019)
8. Gomaa, W.H., Fahmy, A.: A survey of text similarity approaches (2013)
9. Greff, K., Srivastava, R.K., Koutník, J., Steunebrink, B., Schmidhuber, J.: LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems 28, 2222-2232 (2017)
10. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS (2012)
11. McInnes, L., Healy, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. ArXiv abs/1802.03426 (2018)
12. Pennington, J., Socher, R., Manning, C.D.: GloVe: Global vectors for word representation. In: Empirical Methods in Natural Language Processing (EMNLP). pp. 1532-1543 (2014), http://www.aclweb.org/anthology/D14-1162
13. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners (2019)
14. Reimers, N., Gurevych, I.: Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (2019), https://doi.org/10.18653/v1/d19-1410
15. Schildwächter, M., Bondarenko, A., Zenker, J., Hagen, M., Biemann, C., Panchenko, A.: Answering comparative questions: Better than ten-blue-links? CoRR abs/1901.05041 (2019), http://arxiv.org/abs/1901.05041
16. Tan, Y.C., Celis, L.E.: Assessing social and intersectional biases in contextualized word representations (2019)
17. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017)
18. Wachsmuth, H., Potthast, M., Khatib, K.A., Ajjour, Y., Puschmann, J., Qu, J., Dorsch, J., Morari, V., Bevendorff, J., Stein, B.: Building an argument search engine for the web. In: ArgMining@EMNLP (2017)
19. Williams, A., Nangia, N., Bowman, S.R.: A broad-coverage challenge corpus for sentence understanding through inference. ArXiv abs/1704.05426 (2017)

5 Appendix

5.1 Examples of generated text

What is the difference between love and sex? I would love to have a conversation about it. Is love a natural and universal emotion? Or is it something that we develop and learn over time? Is it a part of the human condition? Or is it something that we can develop and learn about through practice? Is it something that we are born with? Or does it come through our parents? I am curious about the nature of love and if it is a part of the human condition. My question is a matter of life and death and that is really what it boils down to. I think all of us will be alive in our lifetimes without knowing it. We will all be alive and we will all be affected by it and have it shape us for the rest of our lives. It may not be something we understand, but we all have it. I am wondering if love is a natural emotion, or if it is something that we develop and learn through practice? What is the difference between sex and love? Love, like sex, is an experience and an expression of our own desires. Is it something that we develop and learn over time? Yes.
I think the answer is that there is a huge range of sexual experiences that can be experienced. It really depends on your level of experience and your level of experience. I think it’s an experience that you are capable of. Do we develop the same sense of love as we develop sex? I think that sex is much more developed in your brain than love. We develop that same kind of sexual drive, the same sexual experience, that is necessary for our capacity to become human beings. That same level of experience and that same level of understanding is necessary for our capacity to have that same kind of love for one another. Is it really necessary to be a human being, to be a loving and affectionate human being, if we are not able to have sex? If we were born with a certain kind of sex drive, would that affect our capacity to be a good human being? If we were born with a certain kind of sexual drive, would that affect our capacity to have that same kind of love for one another? What is the difference between sex and love? I think sex is a natural and universal emotion. Are there certain things that we need to learn or experience before we can know whether we are a loving and affectionate human being?

What is better, laptop or desktop? What’s better? A desktop or a laptop? What’s better, laptop or desktop? I am really wondering what the answer is. Posted on Oct 18th 2017, 12:46 Reply #21 TheLionKnight ”Killer4Life said: If I had to choose, I’d choose the desk, since I don’t want to have to work with a mouse or keyboard in an environment where I could lose track of which direction to turn the mouse. The only reason I could even think of that desk is that the desk has a screen which looks really cool. If there’s no way to put a mouse on it, I might have to say no to my mouse. It’s the only reason I’m choosing a desk. The only reason I could even think of that desk is that the desk has a screen which looks really cool.If there’s no way to put a mouse on it, I might have to say no to my mouse. It’s the only reason I’m choosing a desk. Posted on Oct 18th 2017, 13:13 Reply 22 kyren ”TheLionKnight said: If I had to choose, I’d choose the desk, since I don’t want to have to work with a mouse or keyboard in an environment where I could lose track of which direction to turn the mouse. The only reason I could even think of that desk is that the desk has a screen which looks really cool. If there’s no way to put a mouse on it, I might have to say no to my mouse. It’s the only reason I’m choosing a desk. Yeah, it’s a good one, considering how the screen is so small, so small that you could be standing there while wearing it. Also, yes, I’d go with the desk because it has a nice shape, the screen is awesome and the monitor is great. I don’t like how the mouse buttons are made and the size is also small. You can always make it bigger, though. I think the best way is to just buy the mouse, which is much cheaper, and make a stand so that you can sit on it, because it is so small. Also, the only other thing I would change is the button layout on the side so that it isn’t all vertical. ”TheLionKnight said: Yeah, it’s a good one, considering how the screen is so

Which is better, Canon or Nikon? I have only used Canon lenses since I got my first camera back in 2004 and I never really cared about buying Nikon. I am using the D800E to try the new EOS 10D (and have shot some of the best shots with it so far) so I figured I would get some opinions on the differences.
I have used both cameras and I have not found any differences in my experiences. So I will start with the Canon EOS 30D and see if I can find any differences. Canon EOS 30D My first shot Canon EOS 30D at night Canon EOS 30D at sunrise Canon EOS 30D at sunset Canon EOS 30D at sunset with a flash Canon EOS 30D at night with a flash and with a tripod Canon EOS 30D at night with a flash and with a tripod with tripod and a flash Canon EOS 30D at sunrise with a flash and with a tripod Canon EOS 30D at sunset with a flash and with a tripod Canon EOS 30D at sunrise with a flash and with a tripod and a flash Canon EOS 30D at sunset with a flash and with a tripod and a flash Canon EOS 30D at sunrise with a flash and with a tripod and a flash Canon EOS 30D at sunset with a flash and with a flash Canon EOS 30D at sunrise with a flash and with a tripod and a flash Canon EOS 30D at sunset with a flash and with a flash Canon EOS 30D at sunrise with a flash and with a flash Canon EOS 30D at sunset with a flash and with a flash Canon EOS 30D at sunrise with a flash and with a flash Canon EOS 30D at sunset with a flash and with a flash Canon EOS 30D at sunrise with a flash and with a flash Canon EOS 30D at sunset with a flash and with a flash Canon EOS 30D at sunrise with a flash and with a flash Canon EOS 30D at sunset with a flash and with a flash Canon EOS 30D at sunrise with a flash and with a flash Canon EOS 30D a