=Paper=
{{Paper
|id=Vol-2936/paper-219
|storemode=property
|title=Thor at Touché 2021: Argument Retrieval for Comparative Questions
|pdfUrl=https://ceur-ws.org/Vol-2936/paper-219.pdf
|volume=Vol-2936
|authors=Ekaterina Shirshakova,Ahmad Wattar
|dblpUrl=https://dblp.org/rec/conf/clef/ShirshakovaW21
}}
==Thor at Touché 2021: Argument Retrieval for Comparative Questions==
Thor at Touché 2021: Argument Retrieval for Comparative Questions
Notebook for the Touché Lab on Argument Retrieval at CLEF 2021

Ekaterina Shirshakova, Martin Luther University Halle-Wittenberg, Germany, ekaterina.schirschakova@student.uni-halle.de
Ahmad Wattar, Martin Luther University Halle-Wittenberg, Germany, ahmad.wattar@student.uni-halle.de

Abstract

We report on our approach to Shared Task 2 of the second Touché lab [1] on argument retrieval at CLEF 2021. We retrieved and ranked documents from ClueWeb12 to answer comparative questions using Okapi BM25. We supplied the ranking model with two kinds of expansion: query expansion with synonyms from WordNet and index expansion with arguments from TARGER. An examination of different combinations of expansions and search fields showed that, although each expansion benefits one search field while worsening results for another, in the case of synonyms it is still more beneficial to use both affected fields together with the expansion. We further re-rank the initial set of document candidates using elasticsearch and its built-in BM25 implementation.

1. Introduction

People feel the need to compare things whenever they have to decide which one to choose. Such a comparison can reveal the advantages and disadvantages of certain objects, or simply highlight the differences between them. In everyday life this need is served by search engines such as Google, Yahoo, DuckDuckGo, Yandex and many others: one can simply type a question such as "What is better, Linux or Windows?" into a search field. This has become even more convenient with voice assistants, where one can simply ask such a question. However, comparative queries are a very specific type of query that demands specific solutions. Such queries can contain an explicit comparison between two or more objects (as in the example above) or implicitly compare all entities within a more general group (e.g. the question "Who is the best singer in the USA?" implies a comparison between all singers in the USA). Returning documents that merely contain information about one of the compared objects is not enough to satisfy the need for a comparison between them. As this problem attracts the attention of many researchers, events such as the Touché Lab on Argument Retrieval [1, 2] at CLEF 2021 are a good starting point for those who want to find a practical solution to it.

During the winter semester 2020/2021, we worked on Shared Task 2 of the Touché Lab on Argument Retrieval at CLEF 2021. The task consisted of retrieving documents from ClueWeb12 for 50 different topics and then indexing these documents to build a search engine that returns the most relevant results. The topics were provided by the organisers and formulated as comparative queries, for example "Which is better, a Mac or a PC?".
For retrieving the documents we used the ChatNoir API [3] together with the Python library requests. For each query we saved the first 10,000 documents returned by ChatNoir. However, after various experiments we decided to build an index for only the first 110 documents retrieved for each query, because ChatNoir already delivers mostly relevant documents within the first 110 results. For indexing we use elasticsearch, the same open-source search tool that was used to improve ChatNoir [4]. Before building the index, we removed punctuation marks and lemmatized verbs and nouns with nltk. We also experimented with TARGER [5] and WordNet [6]. We found that saving arguments retrieved with TARGER along with the document, or expanding the query with synonyms from WordNet, can slightly improve results for one search field, while at the same time less relevant documents are returned when another search field is used with the same extension. Finally, we evaluated our experiments with ndcg@10, recall@10, precision@10 and F1-score and recorded the results in a table.

2. Related Work

Interest in natural language processing, and especially in comparative question answering, is growing because of the ever-increasing use of natural language technologies in everyday life: many companies develop or use chatbots, interactive bots at call centers, programs that analyze documents, and many other tools that require computers to understand natural language. A number of researchers work on argument mining. Wachsmuth et al. [7] developed an argument search framework using a composition of approaches for acquiring, mining, assessing, indexing, querying, retrieving, ranking, and presenting arguments. Stab et al. [8] also introduced an argument search system that allows searching for arguments in heterogeneous texts. Another approach for argument retrieval within massive unstructured corpora was proposed by Levy et al. [9], where the potential coverage of documents retrieved for queries was increased by query relaxation. As part of the argument mining problem, claim and premise detection also attracts various researchers: Ajjour et al. [10] address unit segmentation in order to enhance the relevance of documents containing arguments for search engines, Levy et al. [11] propose an automatic claim detection solution that can be applied to large corpora, and Eger et al. [12] frame argument mining as token-based dependency parsing and token-based sequence tagging problems, using a multi-task learning setup to solve them. The neural argument mining framework TARGER, proposed by Chernodub et al. [5], was used in our approach, as it is an open-source solution that can be used in real time. Huck [13] also used TARGER in the elasticsearch-based search engine developed for Shared Task 2 of the Touché lab on argument retrieval at CLEF 2020. As his approach showed promising results, we decided to explore possibilities for further improving it in our work.

3. Approach

3.1. Document Retrieval

To build and test the search engine, we use the topics published for Shared Task 2 of the Touché lab on argument retrieval at CLEF 2020. We first convert the input data (queries with description and narrative) from XML to JSON using the xml.etree.ElementTree Python module. To retrieve documents from ChatNoir, we use the Python library requests, which allows us to work with the ChatNoir API, as sketched below.
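The following minimal sketch illustrates this retrieval step. It is a reconstruction rather than our exact code: the endpoint, parameter names and response fields are assumptions based on the public ChatNoir API documentation, and the API key is a placeholder.

```python
# Sketch of the retrieval step: query the ChatNoir API with `requests`
# and keep the returned results for one topic. Endpoint, parameter and
# field names are assumptions based on the public ChatNoir API docs.
import requests

API_URL = "https://www.chatnoir.eu/api/v1/_search"
API_KEY = "YOUR_CHATNOIR_API_KEY"  # placeholder, issued by ChatNoir on request


def retrieve_documents(query: str, size: int = 110) -> list:
    """Return up to `size` ChatNoir results for a query on the ClueWeb12 index."""
    response = requests.post(API_URL, json={
        "apikey": API_KEY,
        "query": query,
        "index": ["cw12"],  # ClueWeb12
        "size": size,
    })
    response.raise_for_status()
    # Each result carries, among others, a uuid, a target URI and a relevance score.
    return response.json().get("results", [])


if __name__ == "__main__":
    results = retrieve_documents("laptop AND desktop")
    print(len(results), "documents retrieved")
```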
We join the terms of each query with the operator "AND", as a visual inspection of the first 10 results showed that it tends to deliver more relevant and useful results than the operator "OR". We also remove punctuation, since punctuation marks are considered noise; this is common practice in information retrieval. We use the Python library boilerpy3 to extract the main content of each retrieved document. Finally, for each topic we save a dictionary containing the keys 'number', 'title', 'description', 'narrative', 'results' (the number of documents found by ChatNoir) and 'documents' (the list of documents retrieved from ChatNoir). We save each dictionary in JSON format to a separate file.

3.2. Indexing

First we rearrange the data to make it more convenient for indexing. We read the queries previously saved as dictionaries and iterate over the first 110 documents saved for each query. For each document we create a lemmatized version using nltk. We only lemmatize nouns and verbs, as comparative adjectives are important for comparison. We use the same technique to lemmatize the title. Lemmatization helps to find words regardless of their surface form in the text.

We wanted to test at least two different improvements (query expansion with synonyms and adding arguments to the search fields) in order to compare the performance of different approaches. Therefore, at this step we also retrieve arguments: we send the unlemmatized document to TARGER using the same requests library we used for retrieving documents from ChatNoir. We use the Combo model of TARGER because, as a combination of several other models, it tends to deliver the best results. After a visual inspection of the arguments retrieved for several random topics, we decided to keep only the tokens that received a "P" or "C" label (i.e. were detected as premise or claim) with a probability higher than 0.99; most tokens with a lower probability for these labels are not part of an argument. We lemmatize the arguments in the same way as the document and its title.

Having preprocessed all text fields to be indexed, we construct the bulk data that is then used to create an elasticsearch index. For the body of each index element we create a dictionary containing the query, the title, the lemmatized title, the topic number, the uuid of the document, its relevance score from ChatNoir, the document itself, the lemmatized document and the arguments retrieved from it. All of these keys can be used as search fields in the index, although for our purposes we only used the lemmatized versions of the document, the title and the arguments. The other fields are used either to present a document to the user in a convenient form (the original document and title) or for further evaluation. We create an elasticsearch index from the bulk data and adjust the BM25 parameters, as sketched below. Our final parameters are b=0.68 and k1=1.2, as they showed the best ndcg@10 score for the topics and relevance judgements from Touché Lab 2020.
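A condensed sketch of this indexing step follows, assuming a local elasticsearch instance. The index name "touche", the field names and the similarity name are illustrative, the TARGER call is omitted for brevity, and the snippet is written against the elasticsearch-py 7.x client (where the body= argument is accepted). The nltk resources (punkt, averaged_perceptron_tagger, wordnet) need to be downloaded beforehand.

```python
# Simplified reconstruction of the indexing step, not our exact code.
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk
from nltk import pos_tag, word_tokenize
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()


def lemmatize_nouns_and_verbs(text: str) -> str:
    """Lemmatize only nouns and verbs, leaving comparative adjectives untouched."""
    tokens = []
    for token, tag in pos_tag(word_tokenize(text)):
        if tag.startswith("NN"):
            tokens.append(lemmatizer.lemmatize(token, pos="n"))
        elif tag.startswith("VB"):
            tokens.append(lemmatizer.lemmatize(token, pos="v"))
        else:
            tokens.append(token)
    return " ".join(tokens)


es = Elasticsearch("http://localhost:9200")

# BM25 similarity with the parameters that worked best for the 2020 topics.
es.indices.create(index="touche", body={
    "settings": {"similarity": {"tuned_bm25": {"type": "BM25", "b": 0.68, "k1": 1.2}}},
    "mappings": {"properties": {
        "lemmatized_document": {"type": "text", "similarity": "tuned_bm25"},
        "lemmatized_title":    {"type": "text", "similarity": "tuned_bm25"},
        "arguments":           {"type": "text", "similarity": "tuned_bm25"},
    }},
})


def index_documents(docs: list) -> None:
    """Bulk-index preprocessed documents; each dict already holds the fields above."""
    actions = ({"_index": "touche", "_source": doc} for doc in docs)
    bulk(es, actions)
```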
3.3. Query Expansion

For query expansion we use synonyms retrieved from WordNet [6], because they can help to retrieve documents with relevant terms that the user implied but did not mention explicitly (queries in natural language usually do not contain synonyms of the words already used, as that would be unnecessary duplication). We save the lemmatized topics to a dataframe with two columns: 'query' and 'syn' for the synonyms.

We then collect synonyms for the words of each query from WordNet in order to cover a higher variety of possible formulations. We first tried to manually detect the comparison objects of each query and find synonyms only for them. However, since participating in the shared task requires submitting the approach to the TIRA platform [14], where it is tested on a virtual machine, manual selection was not possible. Hence, after experimenting with synonyms for the objects only, we selected the setup with the best results, and for this setup we collect synonyms for all words in the query. We remove duplicates from the synonyms found and save them in the 'syn' column, which is then used to expand the original query.

3.4. Ranking

Okapi BM25 is the default ranking function of elasticsearch, and since it already produced decent results with default parameters, we kept it. To adjust the ranking for the topics from 2020, we varied the parameter b in steps of 0.01 in the range from 0.77 down to 0.65; the best option is b=0.68. We attribute this to the fact that many articles are dedicated to a single specific topic, so longer documents should not be considered less relevant merely because of their length (the opposite would hold if a document covered many topics: a longer document would then probably contain more topics than needed and therefore be less relevant). We also tried different values for the k1 parameter, which controls term frequency saturation. The default k1=1.2 showed the best result, probably because most retrieved documents are articles from various websites: the corpus is not dedicated to a specific range of topics, and the average document length is neither very long (as would be the case for books) nor very short (as for a corpus of Twitter posts).

In order to get the best results, we explored combinations of different search fields with and without query expansion with synonyms. As search fields we took the lemmatized title, the lemmatized document and the arguments in all combinations: as single search fields, all possible pairs, and all three fields together. We also tried all of these combinations with synonyms added to each query. We achieved the best results with the lemmatized document and the lemmatized title as search fields together with the weighted query expansion with synonyms (we set the boost parameter to 5 for the main query, which is similar to a weight of 1 for the main query and 0.2 for the synonyms); a sketch of such a query follows below. The arguments search field used alone showed the worst results, because for many documents TARGER [5] did not return any arguments.
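The sketch below illustrates both the synonym collection from WordNet and the weighted query: the original query is boosted by a factor of 5 and matched, together with its synonyms, against the lemmatized document and the lemmatized title. Index and field names match the indexing sketch above and are illustrative, not our exact code; the snippet again assumes the elasticsearch-py 7.x client.

```python
# Sketch of query expansion with WordNet synonyms and the boosted search.
from elasticsearch import Elasticsearch
from nltk.corpus import wordnet

es = Elasticsearch("http://localhost:9200")


def collect_synonyms(query: str) -> list:
    """Collect WordNet synonyms for every word of the query, without duplicates."""
    synonyms = set()
    for word in query.split():
        for synset in wordnet.synsets(word):
            for lemma in synset.lemma_names():
                if lemma.lower() != word.lower():
                    synonyms.add(lemma.replace("_", " "))
    return sorted(synonyms)


def search(query: str, synonyms: list, size: int = 10) -> list:
    """Match query and synonyms against document and title; weight the query 5x."""
    body = {
        "size": size,
        "query": {"bool": {"should": [
            {"multi_match": {
                "query": query,
                "fields": ["lemmatized_document", "lemmatized_title"],
                "boost": 5,  # main query weighted higher than the synonyms
            }},
            {"multi_match": {
                "query": " ".join(synonyms),
                "fields": ["lemmatized_document", "lemmatized_title"],
            }},
        ]}},
    }
    return es.search(index="touche", body=body)["hits"]["hits"]


if __name__ == "__main__":
    topic = "which be better a mac or a pc"
    hits = search(topic, collect_synonyms(topic))
```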
4. Evaluation

The Touché Lab on Argument Retrieval at CLEF 2021 also includes an evaluation of the developed search engine. To adjust parameters and compare different options, we used the topics and relevance judgements from Touché 2020 [15]. We then deployed our search engine with the best-performing parameters to the TIRA evaluation platform [14]. In TIRA, each participant receives their own virtual machine and submits a retrieval model for the task.

4.1. Experimental setup

We used four evaluation metrics to evaluate our approach: ndcg@10, recall@10, precision@10 and F1-score. With these four metrics, we analyzed the results of all combinations mentioned above: the document, title and arguments search fields in all possible combinations, with and without query expansion. We achieved the best results when using the document and the title as search fields together with query expansion. We then adjusted the BM25 parameters b and k1 for the search in documents and titles. As Table 1 shows, we achieved the best ndcg@10 value with b=0.68 and k1=1.2.

Table 1: Adjustment of the BM25 parameters for documents and titles

  b     k1    ndcg@10
  0.75  1.2   0.4404
  0.74  1.2   0.4394
  0.72  1.2   0.4405
  0.71  1.2   0.4405
  0.70  1.2   0.4411
  0.69  1.2   0.4415
  0.68  1.2   0.4434
  0.68  1.21  0.4427
  0.68  1.19  0.4433
  0.8   1.3   0.4348
  0.7   1.1   0.4396
  0     5     0.3372

5. Results

Table 2 reports the evaluation scores for each experiment with search fields and query expansion on the topics of Shared Task 2 in 2020. Our approach achieves the best ndcg@10 value when we expand the queries with synonyms and use the document and the title as search fields. Using this configuration for Shared Task 2 in 2021, we achieved an ndcg@5 of 0.478 for relevance and 0.680 for quality.

Table 2: For each search field (doc, title, args) we evaluate all possible combinations (doc, title and args alone, the pairs doc+title, args+title and doc+args, and all three fields together, doc+title+args). For each option we evaluate searching without expansion (basic) and with synonyms added to the query (syn).

  Approach               ndcg@10   Recall@10   Precision@10   F1-score
  doc             syn    0.3236    0.1990      0.300          0.2393
                  basic  0.3182    0.1962      0.292          0.2347
  title           syn    0.3482    0.2187      0.340          0.2662
                  basic  0.3287    0.2056      0.312          0.2479
  args            syn    0.2704    0.1810      0.260          0.2134
                  basic  0.2646    0.1801      0.262          0.2135
  doc+title       syn    0.4450    0.2800      0.428          0.3385
                  basic  0.4434    0.2716      0.422          0.3305
  args+title      basic  0.3863    0.2531      0.376          0.3025
                  syn    0.3792    0.2503      0.370          0.2986
  doc+args        basic  0.2852    0.1898      0.276          0.2249
                  syn    0.2813    0.1861      0.274          0.2216
  doc+title+args  basic  0.3768    0.2344      0.352          0.2814
                  syn    0.3669    0.2298      0.348          0.2768
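For reference, the measures in Tables 1 and 2 can be computed per topic roughly as in the following simplified sketch. This is our own illustration, not the official Touché evaluation tooling; precision, recall and F1 here treat any positive relevance judgement as relevant.

```python
# Simplified per-topic computation of ndcg@10, recall@10, precision@10 and F1,
# given a ranked list of document ids and graded relevance judgements (qrels).
import math


def evaluate_topic(ranking: list, qrels: dict, k: int = 10) -> dict:
    top_k = ranking[:k]
    gains = [qrels.get(doc_id, 0) for doc_id in top_k]

    # nDCG@k: discounted cumulative gain normalised by the ideal ranking.
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    ideal = sorted(qrels.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    ndcg = dcg / idcg if idcg > 0 else 0.0

    # Precision@k, recall@k and F1 count any positive judgement as relevant.
    relevant_retrieved = sum(1 for g in gains if g > 0)
    relevant_total = sum(1 for g in qrels.values() if g > 0)
    precision = relevant_retrieved / k
    recall = relevant_retrieved / relevant_total if relevant_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    return {"ndcg@10": ndcg, "recall@10": recall, "precision@10": precision, "f1": f1}
```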
6. Conclusion

We found that using arguments extracted with TARGER in combination with the title search field can provide a slight improvement, while using them together with the document search field decreases ndcg@10 as well as the other evaluation metrics. A possible reason is the increased frequency of query terms in the document (the arguments are sentences taken from the document itself), which lowers the significance of the query terms in BM25. On the other hand, arguments increase relevance when the title alone is used as a search field. This observation could be used for further improvement of the search engine. We assume that keeping tokens with a lower probability score for premise and claim from TARGER, together with using title and arguments as search fields, could lead to better results for this pair of search fields; however, we still think that this approach would not outperform using the document and the title as search fields together with query expansion. It is nevertheless a hypothesis worth testing. As using both fields with adjusted BM25 parameters shows a significant improvement over single search fields, query expansion with synonyms was for this task preferable to adding arguments to the index. In general, using synonyms to expand comparative queries seems to be a promising direction.

References

[1] A. Bondarenko, L. Gienapp, M. Fröbe, M. Beloucif, Y. Ajjour, A. Panchenko, C. Biemann, B. Stein, H. Wachsmuth, M. Potthast, M. Hagen, Overview of Touché 2021: Argument Retrieval, in: D. Hiemstra, M.-F. Moens, J. Mothe, R. Perego, M. Potthast, F. Sebastiani (Eds.), Advances in Information Retrieval. 43rd European Conference on IR Research (ECIR 2021), volume 12036 of Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2021, pp. 574–582. URL: https://link.springer.com/chapter/10.1007/978-3-030-72240-1_67. doi:10.1007/978-3-030-72240-1_67.
[2] A. Bondarenko, L. Gienapp, M. Fröbe, M. Beloucif, Y. Ajjour, A. Panchenko, C. Biemann, B. Stein, H. Wachsmuth, M. Potthast, M. Hagen, Overview of Touché 2021: Argument Retrieval, in: G. Faggioli, N. Ferro, A. Joly, M. Maistro, F. Piroi (Eds.), Working Notes Papers of the CLEF 2021 Evaluation Labs, CEUR Workshop Proceedings, 2021.
[3] M. Potthast, M. Hagen, B. Stein, J. Graßegger, M. Michel, M. Tippmann, C. Welsch, ChatNoir: A Search Engine for the ClueWeb09 Corpus, in: B. Hersh, J. Callan, Y. Maarek, M. Sanderson (Eds.), 35th International ACM Conference on Research and Development in Information Retrieval (SIGIR 2012), ACM, 2012, p. 1004. doi:10.1145/2348283.2348429.
[4] J. Bevendorff, B. Stein, M. Hagen, M. Potthast, Elastic ChatNoir: Search Engine for the ClueWeb and the Common Crawl, in: L. Azzopardi, A. Hanbury, G. Pasi, B. Piwowarski (Eds.), Advances in Information Retrieval. 40th European Conference on IR Research (ECIR 2018), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2018.
[5] A. Chernodub, O. Oliynyk, P. Heidenreich, A. Bondarenko, M. Hagen, C. Biemann, A. Panchenko, TARGER: Neural argument mining at your fingertips, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Association for Computational Linguistics, Florence, Italy, 2019, pp. 195–200. URL: https://www.aclweb.org/anthology/P19-3031. doi:10.18653/v1/P19-3031.
[6] G. A. Miller, WordNet: A lexical database for English, Communications of the ACM 38 (1995) 39–41. URL: http://doi.acm.org/10.1145/219717.219748. doi:10.1145/219717.219748.
[7] H. Wachsmuth, M. Potthast, K. Al-Khatib, Y. Ajjour, J. Puschmann, J. Qu, J. Dorsch, V. Morari, J. Bevendorff, B. Stein, Building an argument search engine for the web, in: Proceedings of the 4th Workshop on Argument Mining, Association for Computational Linguistics, Copenhagen, Denmark, 2017, pp. 49–59. URL: https://www.aclweb.org/anthology/W17-5106. doi:10.18653/v1/W17-5106.
[8] C. Stab, J. Daxenberger, C. Stahlhut, T. Miller, B. Schiller, C. Tauchmann, S. Eger, I. Gurevych, ArgumenText: Searching for arguments in heterogeneous sources, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, Association for Computational Linguistics, New Orleans, Louisiana, 2018, pp. 21–25. URL: https://www.aclweb.org/anthology/N18-5005. doi:10.18653/v1/N18-5005.
[9] R. Levy, B. Bogin, S. Gretz, R. Aharonov, N. Slonim, Towards an argumentative content search engine using weak supervision, in: COLING, 2018.
[10] Y. Ajjour, W.-F. Chen, J. Kiesel, H. Wachsmuth, B. Stein, Unit segmentation of argumentative texts, in: Proceedings of the 4th Workshop on Argument Mining, Association for Computational Linguistics, Copenhagen, Denmark, 2017, pp. 118–128. URL: https://www.aclweb.org/anthology/W17-5115. doi:10.18653/v1/W17-5115.
[11] R. Levy, S. Gretz, B. Sznajder, S. Hummel, R. Aharonov, N. Slonim, Unsupervised corpus-wide claim detection, in: Proceedings of the 4th Workshop on Argument Mining, Association for Computational Linguistics, Copenhagen, Denmark, 2017, pp. 79–84. URL: https://www.aclweb.org/anthology/W17-5110. doi:10.18653/v1/W17-5110.
[12] S. Eger, J. Daxenberger, I. Gurevych, Neural end-to-end learning for computational argumentation mining, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Vancouver, Canada, 2017, pp. 11–22. URL: https://www.aclweb.org/anthology/P17-1002. doi:10.18653/v1/P17-1002.
[13] J. Huck, Development of a Search Engine to Answer Comparative Queries, in: Notebook for the Touché Lab on Argument Retrieval at CLEF 2020, volume 2696 of CEUR Workshop Proceedings, 2020. URL: http://ceur-ws.org/Vol-2696/.
[14] M. Potthast, T. Gollub, M. Wiegmann, B. Stein, TIRA Integrated Research Architecture, in: N. Ferro, C. Peters (Eds.), Information Retrieval Evaluation in a Changing World, The Information Retrieval Series, Springer, Berlin Heidelberg New York, 2019. doi:10.1007/978-3-030-22948-1_5.
[15] A. Bondarenko, M. Fröbe, M. Beloucif, L. Gienapp, Y. Ajjour, A. Panchenko, C. Biemann, B. Stein, H. Wachsmuth, M. Potthast, M. Hagen, Overview of Touché 2020: Argument Retrieval, in: L. Cappellato, C. Eickhoff, N. Ferro, A. Névéol (Eds.), Working Notes Papers of the CLEF 2020 Evaluation Labs, volume 2696 of CEUR Workshop Proceedings, 2020. URL: http://ceur-ws.org/Vol-2696/.