=Paper=
{{Paper
|id=Vol-3180/paper-258
|storemode=property
|title=Stacked Model based Argument Extraction and Stance Detection using Embedded LSTM model
|pdfUrl=https://ceur-ws.org/Vol-3180/paper-258.pdf
|volume=Vol-3180
|authors=Pavani Rajula,Chia-Chien Hung,Simone Paolo Ponzetto
|dblpUrl=https://dblp.org/rec/conf/clef/RajulaHP22
}}
==Stacked Model based Argument Extraction and Stance Detection using Embedded LSTM model==
Notebook for the Touché Lab on Argument Retrieval at CLEF 2022

Pavani Rajula, Chia-Chien Hung and Simone Paolo Ponzetto (Data and Web Science Group, University of Mannheim, Germany)

===Abstract===
In this paper, we present our submission to the third Touché lab at CLEF 2022 [1], shared task 2 on Argument Retrieval for Comparative Questions. The task is to retrieve argumentative text passages that help answer comparative questions in the scenario of personal decision making. The previous two Touché editions [2] [3] mostly focused on retrieving complete arguments and documents, while this edition asks whether argument retrieval can support decision making directly by extracting the argumentative gist from documents and classifying their stance with respect to the objects compared. In our approach, we first perform tokenization and named entity recognition using a RoBERTa classifier, and then generate Boolean queries by categorizing the words into must and should terms. The top 100 results are retrieved from an Elasticsearch index that contains the corpus provided for the shared task. Result documents are then stripped of code, advertisements, and other noise. Binary argument classification is performed at the sentence level with a stacked model [4] that combines an SVM, DistilBERT, and a learned meta-model. Extracted arguments are then scored based on a mix of BM25, the ratio of argument sentences in a document, and the similarity between the query and the sentences in the document. Finally, stance detection is performed on a per-document basis using a word-embedding LSTM model. Our system achieved a quality score of 0.492 mean nDCG@5 and a relevance score of 0.582 mean nDCG@5, a slight improvement over the baseline scores.
Keywords: Comparative Questions, Argument Identification, Natural Language Processing, Stance Detection

CLEF 2022: Conference and Labs of the Evaluation Forum, September 5–8, 2022, Bologna, Italy. Contact: prajula@mail.uni-mannheim.de (P. Rajula); chia-chien.hung@uni-mannheim.de (C. Hung); ponzetto@uni-mannheim.de (S. P. Ponzetto). https://github.com/rpavani1998 (P. Rajula). © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073.

===1. Introduction===
With alternatives available for almost everything these days, people tend to search for their options and compare them to find the best choice. The web contains a vast number of opinions and objective arguments that can facilitate this comparative decision-making process, but the many facets of a search topic make it difficult. This creates the need for an open-domain, general system that can process such information and generate insights that support the user in forming well-justified opinions. Such a system faces several challenges, including assessing an argument's relevance to a query, deciding what an argument's main gist is in terms of the take-away, and estimating how well an implied stance is justified. As this problem attracts the attention of many researchers, events such as the Touché Lab on Argument Retrieval at CLEF foster research on argument retrieval, encourage collaboration and the exchange of ideas and data sets among researchers, and help develop and share retrieval approaches that aim to support social and personal decisions. In this paper, we present the approach we adopted as Team Olivier Armstrong in the third Touché lab on argument retrieval at CLEF 2022, Shared Task 2.
The main objective of this task is to use argument retrieval to support decision making directly by extracting the argumentative gist from documents and classifying their stance with respect to the objects compared. The goal is to retrieve relevant documents from the given document collection, rank them using different approaches, and finally determine their stance. The remainder of this paper presents our approach in detail.

===2. Related Work===
The approaches submitted to the previous two Touché editions [2] [3], which received about 130 submissions from 44 participating teams, are the most relevant work. The proposed approaches involved initial document retrieval with the ChatNoir [5] search engine using the original query, query pre-processing, query expansion techniques such as synonyms from WordNet [6] and word embeddings from sense2vec [7] or word2vec [8], argument retrieval techniques, prediction of document relevance labels with a random forest classifier, XGBoost [9], or LightGBM [10], and multiple (re-)ranking algorithms. The overview papers of these editions summarize the implemented systems and ideas.

Argument mining, and in particular claim and premise detection, has drawn the attention of various researchers [11]. TARGER [12] is an open-source API for argument retrieval from text. DistilBERT-based argumentation retrieval [13] was proposed by one of the teams participating in Touché 2021. A stacked model has also been proposed for argument identification [4]; it tackles the task by combining two approaches, a classical machine learning approach based on a Support Vector Machine (SVM) model [14] and a DistilBERT-based approach [15].

Stance classification is an active research area that has been studied in different domains. Determining the stance of a text toward a concrete entity or an abstract idea is quite challenging.
Supervised learning is the most common approach to stance detection [16, 17]. Many studies have proposed supervised ML algorithms to detect stance, ranging from classical algorithms such as Naive Bayes (NB), SVM, and decision trees to deep learning models such as RNNs and LSTMs [18] [19] [17].

===3. Datasets===
The Touché organizers provided 50 comparative questions (topics), for which documents are to be retrieved from the given corpus, a collection of about 0.9 million text passages. A stance dataset [20] is also provided. It draws on a data dump from Stack Exchange (https://archive.org/details/stackexchange) and on L6 - Yahoo! Answers Comprehensive Questions and Answers version 1.0 (multi-part) (https://webscope.sandbox.yahoo.com/catalog.php?datatype=l), and contains questions from Yahoo and Stack Exchange together with the best answer and the answer stance. Additional resources include a subset of MS MARCO with comparative questions and a collection of text passages expanded with queries generated using DocT5Query. All these datasets are available on the task page (https://webis.de/events/touche-22/shared-task-2.html). To train the stacked model, we used the publicly available Student Essays [21] and Web Discourse [22] datasets.

===4. Methodology===
Our approach takes the paper of one of the previous year's teams [23] as a baseline. The pipeline has multiple components: query pre-processing, query expansion, document retrieval using Elasticsearch, and re-ranking of the retrieved documents through argument identification using a stacked model [4]. The documents are then sorted by multiple ranking scores, and the top relevant arguments are passed to the embedded LSTM model for stance detection. The architecture of our search engine for answering comparative questions is presented in Figure 1. Each component is explained in detail in the following subsections.

Figure 1: Architecture of submitted Approach

====4.1. Query Parsing, Token Classification, and Query Expansion====
Understanding the query and recognizing its attributes helps enrich the query for document retrieval. Following the approach suggested in Towards Understanding and Answering Comparative Questions [20], we used a token-based Named Entity Recognition (NER) model with a RoBERTa classifier [24] to tokenize the queries and label the objects, aspects, and predicates. Query expansion is then done by parsing the query, removing stop words, and, using the entities labeled in the previous step, classifying the words into should and must categories to build a Boolean query. Objects and their synonyms fetched from WordNet are treated as must words, while aspects, predicates, and their synonyms are treated as should words. At the end of this stage, a Boolean query is generated in which the terms are joined by 'AND' and 'OR' according to their category.

====4.2. Document Retrieval====
We used Elasticsearch (https://www.elastic.co/), an open-source search engine that allows us to store and search large volumes of data; the same engine is used in ChatNoir [25]. All documents of the given corpus are indexed using the contents field provided in the dataset, with document_id as the id. The body of the index consists of chatNoirUrl and the processed content, obtained by removing special characters and stop words and lemmatizing verbs and nouns with nltk [26]. For each query topic, the Boolean query built in the previous step is used to search the Elasticsearch index and retrieve the top 100 documents.

====4.3. Argument Extraction====
Argument identification is used to detect the comparative sentences in the document.
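The query construction and retrieval request of Sections 4.1 and 4.2 can be sketched as follows. This is a minimal illustration, not our exact implementation: the function name `build_bool_query`, the grouping of each object with its synonyms, and the example terms are assumptions made for this sketch. Only the `contents` field and the limit of 100 results come from the description above.

```python
from typing import Dict, List


def build_bool_query(must_groups: List[List[str]], should_terms: List[str],
                     field: str = "contents", size: int = 100) -> Dict:
    """Build an Elasticsearch-style bool query body (hypothetical helper).

    Assumption: each inner list of `must_groups` holds one object and its
    synonyms. At least one term per group has to match (OR inside a group,
    AND across groups). `should_terms` are optional and only raise the score.
    """
    must = [
        {
            "bool": {
                "should": [{"match": {field: term}} for term in group],
                "minimum_should_match": 1,
            }
        }
        for group in must_groups
        if group  # skip empty groups
    ]
    should = [{"match": {field: term}} for term in should_terms]
    return {"size": size, "query": {"bool": {"must": must, "should": should}}}


# Illustrative terms only; real terms come from the NER labels and WordNet.
body = build_bool_query(
    must_groups=[["python", "python language"], ["java"]],
    should_terms=["faster", "performance"],
)
```

The resulting dictionary follows the Elasticsearch Query DSL and could be sent as the body of a search request against the index. The client call itself is omitted because it depends on the client version in use.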
To obtain the full content, the URL returned with each Elasticsearch (https://www.elastic.co/) result is used to retrieve the HTML source page, which is cleaned with BoilerPy3 (https://github.com/jmriebold/BoilerPy3) and Trafilatura (https://github.com/adbar/trafilatura) by removing tags, advertisements, and other noise. The sentences of each document are then classified as argument or non-argument by the stacked model, which is trained on the Student Essays and Web Discourse datasets. The stacked model architecture consists of two main components: the base models, a trained SVM model and a trained transformer-based model (DistilBERT) running in parallel, and the meta-model, which learns from the outputs of the two base models to produce the final prediction for a sentence. Compared to DistilBERT alone, the stacked model achieved better performance when trained with the Student Essays and Web Discourse datasets [4].

====4.4. Scoring and Sorting====
In this step, a score is estimated from how well the arguments extracted from each document match the query, and the documents are sorted and ranked accordingly. Unlike the baseline approach, only three scores are considered to evaluate the argument quality:

* BM25 measure: calculated on the argumentative sentences of each document with respect to the original query. The retrieved documents are re-indexed as new documents that contain only argumentative sentences, and the arg-BM25 score of each document is then obtained by querying these new argumentative documents with the original topic.
* Argument support score: the ratio of argument sentences among all sentences in the document.
* Similarity score: the similarity of two sentences based on context and English language understanding, computed with the SentenceTransformer library [27]. The similarity between the original query and every argumentative sentence in the document is calculated, and the average is taken as the score.

All scores are normalized and summed using their respective weights. The documents are then sorted by descending score, and the top 25 most relevant documents are kept.

====4.5. Stance detection====
Once the top relevant documents are retrieved, the next step is to extract the relevant argument passages and detect their stance with respect to the query objects. We assume that the arguments with a good argument quality score for the query form the most relevant passage, and use them for stance detection. Argument quality [28] is measured by averaging the scores of argument and query pairs, obtained from the pre-trained neural network of the IBM Project Debater [29] API service. Given a list of sentences and the query, the API returns a score between 0, indicating that the sentence is of the lowest quality for the query, and 1, indicating that it is of very high quality.

Figure 2: Stance Detection Model Architecture

A number of studies [18] have shown that for conversation-based tasks, the LSTM approach outperforms other sequential classifiers and feature-based models. Long Short-Term Memory networks are a special kind of RNN capable of learning long-term dependencies. An LSTM recurrent unit tries to remember the past knowledge that the network has seen so far and to forget irrelevant data. This is done by introducing different activation function layers, called gates, for different purposes.
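For reference, the gates of a standard LSTM cell are commonly written as below. This is the usual textbook formulation rather than anything specific to our model. The notation is introduced here for illustration: $x_t$ is the input at step $t$, $h_t$ the hidden state, $c_t$ the cell state, $\sigma$ the logistic sigmoid, $\odot$ the element-wise product, and $W$, $U$, $b$ the learned parameters.

```latex
\begin{align}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{align}
```

The forget gate $f_t$ controls how much of the previous cell state is kept, and the input gate $i_t$ controls how much new information is written into it.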
Each recurrent unit also maintains a vector called the internal cell state, which conceptually describes the information that the previous LSTM recurrent unit chose to retain.

We experimented with the stance dataset provided by Touché, which is annotated with four predefined labels: No stance, Neutral, Pro first object, and Pro second object. Figure 2 shows the model architecture used in this approach. The dataset is split into a training set and a test set using an 80:20 split. First, the argument passage is concatenated with each of the two objects separately, with the same stance label for both records. Both splits are then pre-processed: the text is tokenized and each token is transformed into an index-based representation. The token indices of each sentence are passed sequentially through an embedding layer, which outputs an embedded representation of each token. These representations are passed through a spatial dropout layer [30] and then through an LSTM network followed by a dense layer. The following settings were chosen for the LSTM-based model: input layer size 500 (equal to the word embedding dimension), hidden layer size 60, training for at most 10 epochs with an initial learning rate of 1e-3 using ADAM [31] for optimization, and dropout 0.2. The model was trained using categorical cross-entropy loss. The use of a single, relatively small hidden layer and dropout helps to avoid over-fitting. The trained model reached an accuracy of 0.86 and a loss of 0.34 on the test set.

Figure 3: Results

To predict the stance, the gist of the retrieved arguments is concatenated with each object separately and given as input to the trained model, which yields the same stance label for both records.
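The input construction described above, pairing a passage with each object and turning tokens into indices, can be sketched as follows. This is a minimal sketch under stated assumptions: whitespace tokenization, placing the object before the passage, the reserved `PAD` and `UNK` indices, and padding to a fixed `max_len` are illustrative choices, and all function names are hypothetical.

```python
from typing import Dict, List, Tuple

PAD, UNK = 0, 1  # reserved indices (assumption of this sketch)


def pair_with_objects(passage: str, objects: Tuple[str, str],
                      label: int) -> List[Tuple[str, int]]:
    """Create one record per object; both records carry the same stance label."""
    return [(f"{obj} {passage}", label) for obj in objects]


def build_vocab(texts: List[str]) -> Dict[str, int]:
    """Map each distinct lower-cased token to an index, starting after PAD/UNK."""
    vocab: Dict[str, int] = {}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab) + 2)
    return vocab


def encode(text: str, vocab: Dict[str, int], max_len: int) -> List[int]:
    """Convert text to a fixed-length index sequence (truncate, then pad)."""
    ids = [vocab.get(tok, UNK) for tok in text.lower().split()][:max_len]
    return ids + [PAD] * (max_len - len(ids))


# Illustrative record: label 2 stands for "Pro first object".
records = pair_with_objects("python is easier to learn", ("python", "java"), 2)
vocab = build_vocab([text for text, _ in records])
encoded = [encode(text, vocab, max_len=8) for text, _ in records]
```

Index sequences of this form are what an embedding layer consumes. The embedding, spatial dropout, LSTM, and dense layers themselves are not shown here.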
The predicted labels are then converted to the labels specified by the task, and the final output is written to a text file in the format required by the Touché organizers.

===5. Evaluation===
The architecture shown in Figure 1 is our base approach, from which our submission to Touché 2022 task 2 was made. Table 1 presents the precision, recall, and F1 scores obtained for each label on the test dataset. A single run was submitted to the Touché organizers and evaluated through manual labeling of the documents with the help of a human assessor. Our results seem to point in the right direction, achieving a slight improvement over the baseline with a quality score of 0.492 mean nDCG@5 and a relevance score of 0.582 mean nDCG@5. Our stance detection model performed better than the baseline model, with a macro-averaged F1 score of 0.191. Figure 3 shows our system scores compared with the baseline. Our system would likely have performed better if more documents had been retrieved initially, if additional methods had been used for argument extraction, or if more data had been used to train the model. For stance detection, techniques other than taking the top argument sentences as the gist of the document might also have given better results.

{| class="wikitable"
|+ Table 1: Stance Detection Model Performance Results
! Label !! Precision !! Recall !! F1-score
|-
| 0 - No stance || 0.846 || 0.917 || 0.880
|-
| 1 - Neutral || 0.967 || 0.894 || 0.929
|-
| 2 - Pro first object || 0.746 || 0.909 || 0.820
|-
| 3 - Pro second object || 0.902 || 0.780 || 0.836
|}

===6. Conclusion===
In this paper, we, as Team Olivier Armstrong, presented our solution to shared task 2 of Touché 2022, argument retrieval for answering comparative questions. We proposed an approach for document and argument retrieval based on several parts of existing systems that have previously shown acceptable performance.
Our major contributions are a supervised embedded-LSTM model for stance detection and a stacked model for argument extraction that performs better than the previously proposed DistilBERT approach. Both trained models performed well and gave acceptable results. Our proposed approach outperforms the baseline. In future work, more advanced techniques can be used for better query enrichment, argument extraction, argument quality estimation, and stance detection.

===References===
[1] A. Bondarenko, M. Fröbe, J. Kiesel, S. Syed, T. Gurcke, M. Beloucif, A. Panchenko, C. Biemann, B. Stein, H. Wachsmuth, M. Potthast, M. Hagen, Overview of Touché 2022: Argument Retrieval, in: M. Hagen, S. Verberne, C. Macdonald, C. Seifert, K. Balog, K. Nørvåg, V. Setty (Eds.), Advances in Information Retrieval. 44th European Conference on IR Research (ECIR 2022), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2022. URL: https://webis.de/publications.html#bondarenko_2022c.

[2] A. Bondarenko, M. Fröbe, M. Beloucif, L. Gienapp, Y. Ajjour, A. Panchenko, C. Biemann, B. Stein, H. Wachsmuth, M. Potthast, M. Hagen, Overview of Touché 2020: Argument Retrieval, in: A. Arampatzis, E. Kanoulas, T. Tsikrika, S. Vrochidis, H. Joho, C. Lioma, C. Eickhoff, A. Névéol, L. Cappellato, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. 11th International Conference of the CLEF Association (CLEF 2020), volume 12260 of Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2020, pp. 384–395. URL: https://link.springer.com/chapter/10.1007/978-3-030-58219-7_26. doi:10.1007/978-3-030-58219-7_26.

[3] A. Bondarenko, L. Gienapp, M. Fröbe, M. Beloucif, Y. Ajjour, A. Panchenko, C. Biemann, B. Stein, H. Wachsmuth, M. Potthast, M. Hagen, Overview of Touché 2021: Argument Retrieval, in: K. Candan, B. Ionescu, L. Goeuriot, H. Müller, A. Joly, M. Maistro, F. Piroi, G. Faggioli, N.
Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. 12th International Conference of the CLEF Association (CLEF 2021), volume 12880 of Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2021, pp. 450–467. URL: https://link.springer.com/chapter/10.1007/978-3-030-85251-1_28. doi:10.1007/978-3-030-85251-1_28.

[4] A. Alhamzeh, M. Bouhaouel, E. Egyed-Zsigmond, J. Mitrović, L. Brunie, H. Kosch, A stacking approach for cross-domain argument identification, in: C. Strauss, G. Kotsis, A. M. Tjoa, I. Khalil (Eds.), Database and Expert Systems Applications, Springer International Publishing, Cham, 2021, pp. 361–373.

[5] M. Potthast, M. Hagen, B. Stein, J. Graßegger, M. Michel, M. Tippmann, C. Welsch, Chatnoir: A search engine for the clueweb09 corpus, in: Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’12, Association for Computing Machinery, New York, NY, USA, 2012, p. 1004. URL: https://doi.org/10.1145/2348283.2348429. doi:10.1145/2348283.2348429.

[6] G. A. Miller, Wordnet: A lexical database for english, Commun. ACM 38 (1995) 39–41. URL: https://doi.org/10.1145/219717.219748. doi:10.1145/219717.219748.

[7] A. Trask, P. Michalak, J. Liu, sense2vec - a fast and accurate method for word sense disambiguation in neural word embeddings (2015).

[8] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, 2013. arXiv:1301.3781.

[9] T. Chen, C. Guestrin, Xgboost: A scalable tree boosting system, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, Association for Computing Machinery, New York, NY, USA, 2016, p. 785–794. URL: https://doi.org/10.1145/2939672.2939785. doi:10.1145/2939672.2939785.

[10] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, T.-Y.
Liu, Lightgbm: A highly efficient gradient boosting decision tree, in: Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, Curran Associates Inc., Red Hook, NY, USA, 2017, p. 3149–3157.

[11] H. Wachsmuth, M. Potthast, K. Al-Khatib, Y. Ajjour, J. Puschmann, J. Qu, J. Dorsch, V. Morari, J. Bevendorff, B. Stein, Building an argument search engine for the web, in: Proceedings of the 4th Workshop on Argument Mining, Association for Computational Linguistics, Copenhagen, Denmark, 2017, pp. 49–59. URL: https://aclanthology.org/W17-5106. doi:10.18653/v1/W17-5106.

[12] A. Chernodub, O. Oliynyk, P. Heidenreich, A. Bondarenko, M. Hagen, C. Biemann, A. Panchenko, TARGER: Neural argument mining at your fingertips, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Association for Computational Linguistics, Florence, Italy, 2019, pp. 195–200. URL: https://aclanthology.org/P19-3031. doi:10.18653/v1/P19-3031.

[13] A. Alhamzeh, M. Bouhaouel, E. Egyed-Zsigmond, J. Mitrović, Distilbert-based argumentation retrieval for answering comparative questions, in: CLEF, 2021.

[14] C. Cortes, V. Vapnik, Support-vector networks, Machine learning 20 (1995) 273–297.

[15] V. Sanh, L. Debut, J. Chaumond, T. Wolf, Distilbert, a distilled version of BERT: smaller, faster, cheaper and lighter, CoRR abs/1910.01108 (2019). URL: http://arxiv.org/abs/1910.01108. arXiv:1910.01108.

[16] S. Mohammad, S. Kiritchenko, P. Sobhani, X. Zhu, C. Cherry, SemEval-2016 task 6: Detecting stance in tweets, in: Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), Association for Computational Linguistics, San Diego, California, 2016, pp. 31–41. URL: https://aclanthology.org/S16-1003. doi:10.18653/v1/S16-1003.

[17] S. Gottipati, M. Qiu, L. Yang, F. Zhu, J. Jiang, Predicting user’s political party using ideological stances, in: A. Jatowt, E.-P. Lim, Y. Ding, A. Miura, T.
Tezuka, G. Dias, K. Tanaka, A. Flanagin, B. T. Dai (Eds.), Social Informatics, Springer International Publishing, Cham, 2013, pp. 177–191.

[18] E. Kochkina, M. Liakata, I. Augenstein, Turing at semeval-2017 task 8: Sequential approach to rumour stance classification with branch-lstm, CoRR abs/1704.07221 (2017). URL: http://arxiv.org/abs/1704.07221. arXiv:1704.07221.

[19] K. Dey, R. Shrivastava, S. Kaushik, Topical stance detection for twitter: A two-phase LSTM model using attention, CoRR abs/1801.03032 (2018). URL: http://arxiv.org/abs/1801.03032. arXiv:1801.03032.

[20] A. Bondarenko, Y. Ajjour, V. Dittmar, N. Homann, P. Braslavski, M. Hagen, Towards Understanding and Answering Comparative Questions, in: K. S. Candan, H. Liu, L. Akoglu, X. L. Dong, J. Tang (Eds.), 15th ACM International Conference on Web Search and Data Mining (WSDM 2022), ACM, 2022, pp. 66–74. URL: https://dl.acm.org/doi/10.1145/3488560.3498534. doi:10.1145/3488560.3498534.

[21] C. Stab, I. Gurevych, Annotating argument components and relations in persuasive essays, in: Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, Dublin City University and Association for Computational Linguistics, Dublin, Ireland, 2014, pp. 1501–1510. URL: https://aclanthology.org/C14-1142.

[22] I. Habernal, I. Gurevych, Argumentation mining in user-generated web discourse, Computational Linguistics 43 (2017) 125–179. URL: https://aclanthology.org/J17-1004. doi:10.1162/COLI_a_00276.

[23] A. Alhamzeh, M. Bouhaouel, E. Egyed-Zsigmond, J. Mitrović, Distilbert-based argumentation retrieval for answering comparative questions, in: G. Faggioli, N. Ferro, A. Joly, M. Maistro, F. Piroi (Eds.), Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum (CLEF 2021), number 2936 in CEUR Workshop Proceedings, Aachen, 2021, pp. 2319–2330. URL: http://ceur-ws.org/Vol-2936/#paper-209.

[24] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L.
Zettlemoyer, V. Stoyanov, Roberta: A robustly optimized BERT pretraining approach, CoRR abs/1907.11692 (2019). URL: http://arxiv.org/abs/1907.11692. arXiv:1907.11692.

[25] J. Bevendorff, B. Stein, M. Hagen, M. Potthast, Elastic chatnoir: Search engine for the clueweb and the common crawl, in: ECIR, 2018.

[26] E. Loper, S. Bird, Nltk: The natural language toolkit, CoRR cs.CL/0205028 (2002). URL: http://dblp.uni-trier.de/db/journals/corr/corr0205.html#cs-CL-0205028.

[27] N. Reimers, I. Gurevych, Sentence-bert: Sentence embeddings using siamese bert-networks, CoRR abs/1908.10084 (2019). URL: http://arxiv.org/abs/1908.10084. arXiv:1908.10084.

[28] S. Gretz, R. Friedman, E. Cohen-Karlik, A. Toledo, D. Lahav, R. Aharonov, N. Slonim, A large-scale dataset for argument quality ranking: Construction and analysis, CoRR abs/1911.11408 (2019). URL: http://arxiv.org/abs/1911.11408. arXiv:1911.11408.

[29] R. Bar-Haim, Y. Kantor, E. Venezian, Y. Katz, N. Slonim, Project Debater APIs: Decomposing the AI grand challenge, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 2021, pp. 267–274. URL: https://aclanthology.org/2021.emnlp-demo.31. doi:10.18653/v1/2021.emnlp-demo.31.

[30] Y. Gal, Z. Ghahramani, A theoretically grounded application of dropout in recurrent neural networks, 2015. URL: https://arxiv.org/abs/1512.05287. doi:10.48550/ARXIV.1512.05287.

[31] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, 2014. URL: https://arxiv.org/abs/1412.6980. doi:10.48550/ARXIV.1412.6980.