A Unified Representation and Deep Learning Architecture for Argumentation Mining of Students’ Persuasive Essays

Muhammad Tawsif Sazid (msazid@uwo.ca) and Robert E. Mercer (mercer@csd.uwo.ca)
Department of Computer Science, The University of Western Ontario, London, Ontario, Canada

Workshop on Computational Models of Natural Argument (CMNA 22), 12 September 2022, Cardiff, Wales

Abstract

We develop a novel unified representation for the argumentation mining task that facilitates the extraction from text and the labelling of the non-argumentative units, the argumentation components (premises, claims, and major claims), and the argumentative relations (premise to claim or premise in a support or attack relation, and claim to major claim in a for or against relation) in an end-to-end machine learning pipeline. This tightly integrated representation combines the component and relation identification sub-problems and enables a unitary solution for detecting argumentation structures. This new representation, together with a new deep learning architecture composed of a mixed embedding method, a multi-head attention layer, two biLSTM layers, and a final linear layer, obtains state-of-the-art accuracy on the Persuasive Essays dataset. An augmentation of the corpus (paragraph version) that includes copies of major claims further increases the performance.

Keywords: deep learning model, unified representation, data augmentation, word embeddings, argumentation mining sub-tasks, natural language processing

1. Introduction

Arguments consist of claims and premises and the relationships among them. Argumentation mining, a research topic in the field of Natural Language Processing (NLP), aims to identify the arguments in a text document and the internal structure of each argument. The problem has four subtasks. Since we are using the Persuasive Essay (PE) dataset [1], these subtasks can be made more precise: 1) segmenting the argument components: separating the argumentative text from the non-argumentative text; 2) labelling each argument component: whether the argumentative text is a major claim, claim, or premise; 3) determining which argumentation components are in a relationship: this is represented as the text distance (the number of sentences before or after) between a premise and its related argument component (in the PE corpus, which major claim is related to a claim is not annotated); and 4) classifying the stance of the relations between argument components: "support" or "attack" between premises and claims or other premises, and "for" or "against" between claims and major claims.

Previous research has approached the development of a computational argumentation mining method from two distinct viewpoints. The first approach views the input as text and searches for a method to solve all four of the subtasks mentioned above. Stab and Gurevych [1] and Eger et al. [2] are noteworthy examples of this strategy. Eger et al. [2] produce the state-of-the-art method to which we compare our new method.
Both of these works approach argumentation mining as a sequence tagging problem: they first detect entities and then predict the argument structure on top of them. The second view of the argumentation mining problem assumes that the first subtask, segmenting the text into argumentative and non-argumentative components, has been done, and that the inputs to the method are the argumentative components. We compare some of our results with some of these works.

The method proposed here takes the first approach, solving all four subtasks. Previous argumentation mining works have decoupled the various subtasks, solved them separately, and then combined the solutions. The end-to-end learning method proposed here differentiates itself from these previous works by approaching the problem in a unified manner. Our research contributions are summarized as follows:

1. Argumentation mining is formulated as a single problem by integrating all of its subtasks: separating the non-argumentative tokens from the argumentative tokens, labelling the argument components, identifying the related components, and classifying the stance of the relation. We show that combining all the subtasks results in improved performance.

2. By constructing this novel dense representation of the problem, we are able to achieve better than previous performance using a stacked embedding model comprising two biLSTM layers, a multi-head attention layer, and a linear layer. (We will make our code publicly available once our paper is published.)

3. We have developed an augmentation technique, applied to the paragraph version of the PE corpus, based on the n-grams that indicate the start of a major claim; it further improves the results on the PE corpus.

With the new formulation of the problem, our model, Unified-AM, reaches state-of-the-art argument mining performance in detecting and labelling argument components and relations for the paragraph version of the PE corpus.

2. Related Work

Computational argumentation mining deals with finding argumentation structures in text. Palau and Moens [3] established that argument mining would need to detect claims and premises and their relationships. Stab and Gurevych [1, 4] provided the PE dataset, a corpus annotated with a scheme that includes claims, premises, and also attack or support relations. Stab and Gurevych [1] addressed the argumentation problem by training independent models for each of the subtasks and then combining them with an Integer Linear Programming (ILP) model for the end-to-end task. Eger et al. [2] achieved state-of-the-art performance on the PE corpus by addressing the problem as a sequence tagging problem. They obtain their best accuracy of 61.67% by using a modified version of the LSTM-ER model introduced by Miwa and Bansal [5]. The LSTM-ER model uses a stacked architecture of sequence and tree LSTMs. Persing and Ng [6] presented the first findings on end-to-end argument mining in student essays using a pipeline approach, performing joint inference with an ILP framework. Ferrara et al. [7] introduced an unsupervised approach, topic modeling, to detect claims and premises. Persing and Ng [8] have also developed an unsupervised machine learning method that provides all but the stance information for the relations.

A number of works have investigated approaches for subtasks 2, 3, and 4.
Early work is epitomized by Peldszus [9] and Peldszus and Stede [10]. Potash et al. [11], Kuribayashi et al. [12], and Bao et al. [13] are more recent. Niculae et al. [14] jointly approach unit type detection and relation prediction on their new CDCP dataset and on the PE dataset.

We investigated some neural architectures and how additional handcrafted features can help boost the accuracy on certain sequence tagging tasks. Ahmed et al. [15] achieved state-of-the-art performance on Part of Speech tagging, Named Entity Recognition, and Chunking tasks by combining different learned vectors with word-level embeddings. Kuribayashi et al. [12] and Persing and Ng [8] also noted the importance of discourse connectives in the argumentation mining task.

3. Research Methodology

Here we present our method for generating the argumentation structure for the PE dataset. First, the dataset is described. Then, we introduce the multi-label representation that treats argumentation mining as a single unified problem. Lastly, instead of presenting the final model with an ablation study, we present our method in a bottom-up style, starting with a base architecture to which we add components, providing in Table 1 the performance increase given by each addition, since we want to discuss the motivation for these additions. We compare the final model's performance with that achieved by Eger et al. [2].

3.1. Data Set Preparation

The PE dataset that we use in this paper was created by Stab and Gurevych [1] and was used in Eger et al. [2]. The essays are written on controversial topics so that the authors can state their opinions and take their stances. The corpus has been tagged with the BIO scheme. There are essay and paragraph versions of the dataset. We have worked with the paragraph version of the corpus. The dataset contains 1,587 paragraphs totaling 105,988 tokens in the train set and 449 paragraphs with 29,537 tokens in the test set. (This differs slightly from what is detailed in Eger et al. [2].) The development set has 12,657 tokens in 199 paragraphs.

The argumentation structure can be viewed as a forest with each tree rooted by one of the author's major claims. The claims are connected to all of the major claims with either 'for' or 'against' relations. Premises are related to exactly one claim or premise, which they either 'support' or 'attack'. One important piece of information is that the argumentation structure is completely contained within the paragraph, except for some relations from claims to major claims that are not in the same paragraph. We extracted the dataset at the paragraph level for this paper. The corpus is imbalanced, as Eger et al. [2] have mentioned.

3.2. New Problem Formulation

To integrate all of the sub-problems (argumentative and non-argumentative unit classification; major-claim, claim, and premise component classification; relation identification; and distance between two entities) into a single problem, we construct a binary vector of size 33 for our target labels. This novel unified representation has 33 indices representing the different components related to argumentative units. We address the argumentation mining problem as a sequence tagging problem, classifying each word or token as beginning argumentative/continuation argumentative/non-argumentative; premise/claim/major claim; support/attack; for/against; and the relative distance between the current component and the component it relates to. One of these distance indices represents the related component. The maximum distances from a premise to a claim suggested in Eger et al. [2] are +11 and -11 (the number of sentences after or before, respectively). Thus, we have constructed a dense unified representation of the argumentation mining problem. In this representation, the value '1' signifies that the index belongs to that particular (non-)argumentative unit or, in the case of argumentative components, that the index marks the continuation or beginning of that component; '0' indicates otherwise. By formulating the argumentation mining task as a multi-label problem, we do not need separate, decoupled solutions for each of the subtasks. As an example, the word "children" in the sentence "For instance, children immigrated to a new country will . . . " would be represented by the vector [0,1,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0], which indicates that this is the beginning of a premise and that it supports a claim three sentences after it. After getting the model's logit values during the training and evaluation phases, we choose, for each of the categories (components, stances, and distance), the index that has the highest logit value (see [16] for details).
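To make the layout of this representation concrete, the sketch below shows one way to encode and decode the 33-dimensional label vector. The index layout follows the label vector Y given in Section 5.1 ([Non-Argumentative, Beginning, Continuation, Major-Claim, Claim, Premise, Support, For, Attack, Against, distances -11 to +11]); the function names and the exact per-category grouping used for decoding are our own illustrative choices, not the names used in the released code.

```python
import numpy as np

# Index layout of the 33-dimensional unified label vector
# (cf. the label vector Y in Section 5.1).
CATEGORIES = ["Non-Arg", "Begin", "Cont", "MajorClaim", "Claim", "Premise",
              "Support", "For", "Attack", "Against"]
MAX_DIST = 11                      # distances range over -11 .. +11
DIST_OFFSET = len(CATEGORIES)      # distance block occupies indices 10..32

def encode_token(labels, distance=None):
    """Build the binary target vector for one token.

    labels   -- the subset of CATEGORIES that apply to this token
    distance -- sentence distance to the related component, or None
    """
    y = np.zeros(33, dtype=np.float32)
    for name in labels:
        y[CATEGORIES.index(name)] = 1.0
    if distance is not None:
        y[DIST_OFFSET + distance + MAX_DIST] = 1.0
    return y

def decode_logits(logits):
    """Pick the highest-logit index within each category group.

    For non-argumentative tokens only the first group is meaningful.
    """
    boundary = CATEGORIES[int(np.argmax(logits[0:3]))]        # Non-Arg/Begin/Cont
    component = CATEGORIES[3 + int(np.argmax(logits[3:6]))]   # MC/Claim/Premise
    stance = CATEGORIES[6 + int(np.argmax(logits[6:10]))]     # Sup/For/Att/Against
    distance = int(np.argmax(logits[10:33])) - MAX_DIST
    return boundary, component, stance, distance

# The word "children" beginning a premise that supports a claim
# three sentences after it (the paper's example vector):
y = encode_token(["Begin", "Premise", "Support"], distance=+3)
assert list(np.nonzero(y)[0]) == [1, 5, 6, 24]
```

Note that the '1' at index 24 in the paper's example vector is exactly the distance slot for +3 (10 + 3 + 11 = 24).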
4. Description of the Deep Learning Model and the Hyper-Parameters

Our deep learning model architecture (Unified-AM) includes: a stacked embedding, an axial positional embedding, a multi-head attention layer, a 2-layer biLSTM, and a final linear layer. The output of the model is optimized with BCEWithLogitsLoss.

Because of its capability to retain long-distance information in sequential text, we use a biLSTM at the paragraph level for the argumentation mining task. Before adding the axial positional embedding and the multi-head attention layer, our preliminary experimentation determined the number of biLSTM layers by trial and error, i.e., we tried two biLSTM layers with one linear layer, one biLSTM layer with one linear layer, and so on. We found that one linear layer and two biLSTM layers achieve the best accuracy. The final architecture includes a mixed embedding, but in the model design we first experimented with a plain embedding layer instead. Lample et al. [17] have shown that a combination of different embeddings may work better than using only one embedding class. For the pre-trained mixed embedding, we use the memory-efficient stacked embedding class that Akbik et al. [18] introduced in their Flair framework to combine the FastText and byte-pair embeddings. Because our test set contains unknown words and the whole corpus contains many suffix- and prefix-dependent words, we use these two types of embedding together. The final design decision was to include the multi-head attention [19] and the axial positional embedding for the positional information [20, 21]. To show the effects of each of these design decisions, we compare the number of wrong predictions among our non-pre-trained embedding model and the pre-trained stacked embedding model, both without multi-head attention, and the final Unified-AM model.
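A minimal PyTorch sketch of this architecture, reconstructed from the description above (this is not the authors' released code): the input is assumed to be precomputed 400-dimensional token vectors from Flair's stacked FastText and byte-pair embeddings, a plain learned positional embedding stands in for the axial positional embedding [20, 21], and the hidden size and maximum sequence length are illustrative assumptions.

```python
import torch
import torch.nn as nn

class UnifiedAM(nn.Module):
    """Sketch of the Unified-AM architecture described in Section 4."""

    def __init__(self, emb_dim=400, hidden=256, n_labels=33,
                 n_heads=4, max_len=512):
        super().__init__()
        # Stand-in for the axial positional embedding [20, 21]:
        # a plain learned positional embedding keeps the sketch simple.
        self.pos = nn.Embedding(max_len, emb_dim)
        self.attn = nn.MultiheadAttention(emb_dim, n_heads, batch_first=True)
        self.bilstm = nn.LSTM(emb_dim, hidden, num_layers=2,
                              bidirectional=True, batch_first=True,
                              dropout=0.65)      # biLSTM dropout from Section 4
        self.drop = nn.Dropout(0.5)              # linear-layer dropout
        self.out = nn.Linear(2 * hidden, n_labels)

    def forward(self, embedded):
        # embedded: (batch, seq_len, 400) token vectors from the
        # Flair stacked FastText + byte-pair embeddings.
        positions = torch.arange(embedded.size(1), device=embedded.device)
        x = embedded + self.pos(positions)
        # Self-attention: the token vectors serve as query, key, and value.
        x, _ = self.attn(x, x, x)
        x, _ = self.bilstm(x)
        return self.out(self.drop(x))            # per-token logits over 33 labels

model = UnifiedAM()
loss_fn = nn.BCEWithLogitsLoss()
tokens = torch.randn(8, 120, 400)                # a dummy batch of paragraphs
targets = torch.randint(0, 2, (8, 120, 33)).float()
loss = loss_fn(model(tokens), targets)
```

No activation function is placed between the layers, matching the design choice reported below.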
Table 1
Error Analysis and Comparison Between the Three Models (False Positives + False Negatives)

Model               Major Claim   Claim   Premise   Relations   Non-Argumentative
Trained Embedding   1306          4011    4787      10004       3255
Stacked Embedding   1176          3215    3122      7653        2120
Unified-AM          1111          3082    2953      7301        2202

Table 1 shows the error analysis for these three stages of architecture design on the non-argumentative units, argumentative components, and relations. For each of the mentioned argumentative units we present the total number of errors (false negatives + false positives). For the relations (support, attack, for, and against), we have combined the errors from each class and report this combined value. There are somewhat fewer wrong predictions when the stacked embedding is incorporated into the model. Without the stacked embedding, the total number of wrong predictions for all of the classes at the paragraph level is 23,363. With the addition of the stacked embedding, the total number of wrong predictions becomes 17,286; this pre-trained embedding reduces the error rate by 26.01%. The total number of errors for the Unified-AM model is 16,649, a further 3.69% reduction in the error rate.

We have formulated the argumentation problem in a unified way; as a result, it becomes a multi-class, multi-label problem. Because it is a multi-label problem, for each of the categories (components, stances, and distance) we choose the index that has the highest logit value. For this, we have created a function to interpret the multi-label outputs of the Unified-AM model.

After trying several hyperparameter values for each of the different components, we have chosen the final values. We use dropout values of 0.5 for the linear layer and 0.65 for the biLSTM layers of our architecture. We use the default dropout value (0.0) for the multi-head attention layer. We do not use any activation function between the layers. A learning rate of 0.001 has been used in all of the experimental design stages. The Adam optimizer is used throughout. During training, we have used random shuffling for all of the final experiments. We have trained our model for around 1000-1100 epochs in all of the experiments except the data augmentation experiment (see Table 4). To determine the default number of training epochs (1000-1100), we closely observed the development set accuracy after every 5 epochs. If after 1100 epochs the development set accuracy stops increasing or fluctuates within a small range of values, we stop the training procedure. We also observe the training loss and find that when it reaches a value of around 0.0005, the model has the highest development set accuracy; training further and decreasing the loss does not improve the development set accuracy. As we have also enlarged the original PE corpus in our augmentation experiments (see Section 5.2), we likewise increase the training epochs there to reach a training loss of around 0.0005, which yields improvements in the C-F1, R-F1, and F1 scores.
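This training regime can be summarized with the following sketch, continuing the architecture sketch above. It reflects the hyper-parameters just reported (Adam, learning rate 0.001, shuffled batches, development accuracy checked every 5 epochs, stopping near a training loss of 0.0005); `train_loader` and `dev_accuracy` are hypothetical helpers, not functions from the released code.

```python
import torch

# Hypothetical helpers: train_loader yields shuffled (tokens, targets)
# batches; dev_accuracy(model) scores token accuracy on the dev set.
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

for epoch in range(1, 1101):                 # roughly 1000-1100 epochs
    model.train()
    epoch_loss = 0.0
    for tokens, targets in train_loader:     # random shuffling each epoch
        optimizer.zero_grad()
        loss = loss_fn(model(tokens), targets)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    mean_loss = epoch_loss / len(train_loader)

    if epoch % 5 == 0:                       # check dev accuracy every 5 epochs
        print(f"epoch {epoch}: dev accuracy {dev_accuracy(model):.4f}")
        # Stop once the training loss is near 0.0005, where the
        # development accuracy was observed to plateau.
        if mean_loss <= 0.0005:
            break
```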
5. Experiments and Results

5.1. Experiments on the Original Version of the PE Corpus

We use the multi-head attention module and feed the stacked embedding representation of every token to the query, key, and value matrices used for solving the argumentation mining task. For our 400-dimensional embedding class, we use four heads in the multi-head attention layer for this experiment. This new approach achieves the highest token level accuracy, 66.79%, on our argumentation mining task. Table 2 summarizes the results for this experiment, including the F1 measures for the component and relation tasks and a global F1 score. The results from Eger et al. [2] are included for comparison.

Table 2
Experiment on the Paragraph Level with Unified-AM Compared to LSTM-ER [2]

Model              Token Accuracy  C-F1 (100%)  C-F1 (50%)  R-F1 (100%)  R-F1 (50%)  F1 (100%)  F1 (50%)
Unified-AM (Ours)  66.79%          68.88        78.22       51.14        56.41       60.00      67.32
LSTM-ER [2]        61.67%          70.83        77.19       45.52        50.05       55.42      60.72

In contrast to the decoupled relation identification of Eger et al. [2], this task in Unified-AM is coupled with the component identification task through the unified representation of the problem, which has led to the better performance. We use the distance values from -11 to +11 that were observed by Eger et al. [2] in the PE dataset. For this problem, our target label vector is: Y = [Non-Argumentative, Beginning, Continuation, Major-Claim, Claim, Premise, Support, For, Attack, Against, (-11 to +11)] (33 labels). For this task we use our final model architecture. Table 2 shows the results.

In Table 3, we present the individual precision, recall, and F1 scores for the 8-label representation of the components and relations available in the PE corpus. We observe low precision and recall scores for the claim tokens even though this class is not the least frequent one in the PE corpus. This parallels the human annotators, whose lowest agreement score is also for claims [1]. Unified-AM likewise finds it difficult to predict the claim tokens in the corpus.

Table 3
Precision, Recall and F1-score for the Argumentation Mining Classes for Unified-AM

Class              Precision  Recall  F1 Score  Token Percentage
Non-Argumentative  88.38      88.27   88.33     32.20
Major-Claim        73.87      74.18   74.02     7.41
Claim              65.37      58.05   61.48     15.41
Premise            88.01      90.87   89.42     44.99
Support            86.79      89.69   88.22     42.61
For                60.96      57.05   58.94     12.77
Attack             32.52      26.77   29.37     2.38
Against            60.81      29.97   40.15     2.64
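Scoring components (as opposed to tokens) requires grouping the per-token predictions into labelled spans. The sketch below shows one plausible way to do this from the decoded boundary and component labels; the majority-vote typing of a span is our own illustrative choice, since the paper does not specify how a span's type is aggregated from its tokens.

```python
from collections import Counter

def extract_spans(token_preds):
    """Group per-token predictions into labelled component spans.

    token_preds -- list of (boundary, component) pairs, where boundary is
    "Begin", "Cont", or "Non-Arg" and component is "MajorClaim", "Claim",
    or "Premise" (cf. decode_logits above). Returns (start, end, type)
    tuples; the span type is the majority vote over its tokens.
    """
    spans, start, types = [], None, []

    def close(end):
        if start is not None and types:
            spans.append((start, end, Counter(types).most_common(1)[0][0]))

    for i, (boundary, component) in enumerate(token_preds):
        if boundary == "Begin":                  # a new component starts
            close(i - 1)
            start, types = i, [component]
        elif boundary == "Cont" and start is not None:
            types.append(component)
        else:                                    # Non-Arg (or orphan Cont) ends any open span
            close(i - 1)
            start, types = None, []
    close(len(token_preds) - 1)
    return spans
```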
5.2. Data Augmentation Experiment on the Paragraph Version of the PE Corpus

We now turn to the final argumentation model performance improvement. Adding linguistic information to a model has been successful for low-level NLP tasks [15]. We have observed (as did Kuribayashi et al. [12] and Persing and Ng [8]) that many major claims are prefaced by a reasonably small set of n-grams. An n-gram is a contiguous sequence of n words. Some examples of the n-grams found in the PE corpus are: 'I firmly believe that', 'In conclusion ,', 'Hence ,', and 'Firstly ,'. We augment the corpus using these n-grams to increase the frequency of the major claim component type, the least frequent component in the PE corpus. In this experimental setup, we have augmented the PE training dataset (which consists of paragraphs). Below, we describe the augmentation technique that we used and compare the performance of Unified-AM on both the augmented and the original corpora.

We have augmented the paragraph-level corpus with new paragraphs. These new paragraphs are copies of those paragraphs that contain one of the 108 n-grams that occur immediately before major claim tokens, but with the n-gram randomly swapped for a same-sized n-gram from this set. This augmentation increases the number of major claim tokens in the corpus, but with different introductory n-grams. Our hypothesis is that increasing the number of root elements, i.e., the major claim components of the corpus, by swapping the frequently occurring n-grams that appear immediately before the component helps the model to accurately detect this type of component and to differentiate among the three types of components in the PE corpus. Below we show an example of an original paragraph and the augmented paragraph after applying the described augmentation method:

Original Paragraph: "It is always said that competition can effectively promote the development of economy . In order to survive in the competition , companies continue to improve their products and service , and as a result , the whole society prospers . However , when we discuss the issue of competition or cooperation , what we are concerned about is not the whole society , but the development of an individuals whole life . I firmly believe that we should attach more importance to cooperation during primary education."

Augmented Paragraph: "It is always said that competition can effectively promote the development of economy . In order to survive in the competition , companies continue to improve their products and service , and as a result , the whole society prospers . However , when we discuss the issue of competition or cooperation , what we are concerned about is not the whole society , but the development of an individuals whole life . I truly believe that we should attach more importance to cooperation during primary education."

Description of the Augmentation Process: In this particular example we have substituted the n-gram "I firmly believe that" with an equal-sized, randomly chosen n-gram, "I truly believe that", from our collected n-gram list. The words following it in that sentence are major claim tokens. Here, the n-grams consist of 4 words. A sketch of the procedure is given below.
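This sketch assumes the paragraphs are stored as token lists with aligned BIO-style labels and that the 108 introductory n-grams have already been collected into sets keyed by length; the variable and function names are our own, not those of the released code.

```python
import random

def augment_paragraph(tokens, labels, ngrams_by_len):
    """Copy a paragraph, swapping the n-gram that introduces a major
    claim for a randomly chosen n-gram of the same length.

    tokens        -- list of words in the paragraph
    labels        -- aligned labels; "B-MajorClaim" marks a major claim start
    ngrams_by_len -- dict mapping n to a set of introductory n-gram tuples
    """
    for i, label in enumerate(labels):
        if label != "B-MajorClaim":
            continue
        for n in sorted(ngrams_by_len, reverse=True):  # prefer longer n-grams
            if i < n:
                continue
            prefix = tuple(tokens[i - n:i])
            if prefix in ngrams_by_len[n]:
                choices = [g for g in ngrams_by_len[n] if g != prefix]
                if not choices:
                    continue
                new_tokens = list(tokens)
                new_tokens[i - n:i] = random.choice(choices)
                # Same-sized swap keeps the label alignment intact.
                return new_tokens, list(labels)
    return None  # no introductory n-gram found; nothing to augment
```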
By using the augmentation technique, we have increased the number of major claim tokens by approximately 4000. Also, because claim, premise, and non-argumentative components occur in these paragraphs, the numbers of claim, premise, and non-argumentative tokens have increased by around 2000, 1000, and 8000, respectively.

After creating the augmented corpus, we trained our Unified-AM model on it and achieved our highest token level accuracy on the paragraph-level argumentation corpus. Previously, without augmentation, we achieved 66.79% token level accuracy on the PE dataset (see Table 2); after applying the augmentation methodology the token level accuracy rises to 67.02%, and with further training (1500-1600 epochs) to the highest value, 68.03%. All other performance measures also improve significantly. Table 4 shows the results for the augmented datasets. Comparing Unified-AM's performance on the augmented corpus and the original corpus (see Table 4), the model has much higher token level accuracy, C-F1, R-F1, and F1 scores when the augmentation technique is applied to the training corpus. We reach the highest component C-F1 (100%) score of 71.35%, where Eger et al. [2] obtained 70.83%.

Table 4
Experiment on the Augmented Corpus with Unified-AM

Training corpus                               Token Accuracy  C-F1 (100%)  C-F1 (50%)  R-F1 (100%)  R-F1 (50%)  F1 (100%)  F1 (50%)
Original paragraph corpus                     66.79%          68.88        78.22       51.14        56.41       60.00      67.32
Augmented (addition of new paragraphs)        67.02%          71.03        79.82       52.50        58.25       61.77      69.04
Augmented, training epochs 1500-1600          68.03%          71.35        80.21       54.27        59.46       62.81      69.83

We present the token level improvements below and compare them with the original PE corpus results. In the test set, there are in total 2,134 major claim tokens, 4,238 claim tokens, 13,728 premise tokens, and 9,437 non-argumentative tokens. Our goal is to increase the major claim tokens, which can be considered the roots of the argumentation structure. The results below show the overall token level improvements over the original paragraph version of the PE corpus and indicate that the augmentation technique has significantly improved the predictions of major claim, claim, and premise tokens. For the original corpus (paragraph level) we see: correct major claim tokens: 1542; correct claim tokens (with stance: For, Against): 2057; correct premise tokens (with stance: Support, Attack and distance -11 to +11): 7329; and correct non-argumentative tokens: 8217. For the augmented corpus (addition of new paragraphs) these numbers increase to 1633, 2287, 7612, and 8266, respectively. Given extra training (1500-1600 epochs), the numbers for claim and premise tokens increase while those for major claim and non-argumentative tokens decrease: 1597, 2344, 7956, and 8196, respectively.

6. Error Analysis

We have measured the distance prediction accuracy of the Unified-AM model and compare it with that of Eger et al. [2]. We also compare our results with works [13] that do not consider subtask 1 while solving the other argumentation mining subtasks. We observe a higher accuracy than LSTM-ER when predicting longer distances in the paragraphs. One key strategy has been followed in all of these experimental setups: we ensure the model shares all of its learned parameters while solving any particular subtask (component detection and labelling, relation classification, or distance prediction) of the main argumentation mining problem. This denser representation of the whole argumentation task enables our neural models to share all of their parameters while making predictions for each of the subtasks, which has led to high performance. Eger et al. [2] showed that the LSTM-ER model's probability of correctness given the true distance is below 40%, and below 20% when the distance is larger than 3. In contrast, our analysis shows above 50% accuracy for Unified-AM at distances 1, 2, and 3. Our final model has higher accuracy at smaller distances, but its prediction accuracy declines for the larger distance values observed in the PE corpus.
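This distance analysis amounts to estimating the probability of a correct prediction given the true distance. A minimal sketch, assuming (true, predicted) distance pairs have been collected for every premise token and grouping by absolute distance, which is our own simplification:

```python
from collections import defaultdict

def distance_accuracy(pairs):
    """Probability of a correct distance prediction given the true distance.

    pairs -- iterable of (true_distance, predicted_distance) for every
    premise token, with distances in -11 .. +11.
    """
    totals, hits = defaultdict(int), defaultdict(int)
    for true_d, pred_d in pairs:
        totals[abs(true_d)] += 1
        if pred_d == true_d:
            hits[abs(true_d)] += 1
    return {d: hits[d] / totals[d] for d in sorted(totals)}
```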
Recalling that subtask 1 of the argumentation mining problem is the separation of the argumentative text from the non-argumentative text, we compare Unified-AM's performance with some recent works in which the output of subtask 1 is assumed to be given. We look for 100% accurate span detection (successful segmentation of the argumentative text from the non-argumentative text) by Unified-AM, and only on those spans do we measure the F1 values for individual component type identification, so as to compare our results with the other models (which assume subtask 1 is given). Lastly, we calculate the macro-F1 score. Table 5 contains the results. We obtain the highest macro-F1 score, 89.18%, when subtask 1 is not considered. We have the highest individual F1 score for the premise tokens (96.98%), which boosts the macro-F1 score considerably.

Table 5
Comparison of Unified-AM with other models (which do not consider subtask 1) on argument component type classification

Method                 Macro   MC     Claim   Premise
Joint-ILP [1]          82.6    89.1   68.2    90.3
St-SVM-full [14]       77.6    78.2   64.5    90.2
Joint-PN [11]          84.9    89.4   73.2    92.1
Span-LSTM [12]         87.3    -      -       -
Span-LSTM-Trans [13]   87.5    93.8   76.4    92.2
BERT-Trans [13]        88.4    93.2   78.8    93.1
Unified-AM (Ours)      89.18   92.30  78.25   96.98

7. Conclusions and Future Work

In this work, we show that rather than using a complex stacked architecture for a problem with different but related subtasks, we can build a compact, unified representation of all the sub-problems and tackle them as a single problem with a less complicated architecture. We obtain improved performance over Eger et al. [2] in recognizing the argument components and relations. We further improve this result by introducing the Flair stacked embedding [18] to represent the text input. We introduce a multi-head attention layer to the neural architecture, which leads to the highest accuracy on the PE corpus. Observing that the imbalanced corpus may make it difficult for the model to learn certain underrepresented features, we use the standard technique of data augmentation to achieve further gains in performance. We created an augmented version of the PE training corpus by using different combinations of the n-grams that occur immediately before approximately two-thirds of the major claim components (see Section 5.2) in the paragraph version of the corpus. This augmentation methodology further improves the Unified-AM model's performance on the test set. We obtain the highest token level accuracy, C-F1, R-F1, and global F1 score (the combination of the C-F1 and R-F1 scores) on the paragraph version of the PE corpus by applying the augmentation technique. Sharing parameter values across the different subtasks enhances the accuracy as well as the model's capability to accurately detect components, relations, and distances.

Future work includes applying Unified-AM at the essay level of the PE corpus, using contextual embeddings to enhance the representations of the argumentative texts, and testing an appropriately modified model on other datasets (e.g., the CDCP dataset [14]).

Acknowledgments

This research is partially funded by the Natural Sciences and Engineering Research Council of Canada (NSERC) through a Discovery Grant to Robert E. Mercer.
We also acknowledge the helpful comments provided by the reviewers.

References

[1] C. Stab, I. Gurevych, Parsing argumentation structures in persuasive essays, Computational Linguistics 43 (2017) 619-659.

[2] S. Eger, J. Daxenberger, I. Gurevych, Neural end-to-end learning for computational argumentation mining, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017, pp. 11-22.

[3] R. M. Palau, M.-F. Moens, Argumentation mining: The detection, classification and structure of arguments in text, in: Proceedings of the 12th International Conference on Artificial Intelligence and Law, 2009, pp. 98-107.

[4] C. Stab, I. Gurevych, Annotating argument components and relations in persuasive essays, in: Proceedings of the 25th International Conference on Computational Linguistics: Technical Papers, 2014, pp. 1501-1510.

[5] M. Miwa, M. Bansal, End-to-end relation extraction using LSTMs on sequences and tree structures, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, pp. 1105-1116.

[6] I. Persing, V. Ng, End-to-end argumentation mining in student essays, in: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 1384-1394.

[7] A. Ferrara, S. Montanelli, G. Petasis, Unsupervised detection of argumentative units though topic modeling techniques, in: Proceedings of the 4th Workshop on Argument Mining, 2017, pp. 97-107.

[8] I. Persing, V. Ng, Unsupervised argumentation mining in student essays, in: Proceedings of the 12th Language Resources and Evaluation Conference, 2020, pp. 6795-6803.

[9] A. Peldszus, Towards segment-based recognition of argumentation structure in short texts, in: Proceedings of the First Workshop on Argumentation Mining, 2014, pp. 88-97.

[10] A. Peldszus, M. Stede, Joint prediction in MST-style discourse parsing for argumentation mining, in: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 938-948.

[11] P. Potash, A. Romanov, A. Rumshisky, Here's my point: Joint pointer architecture for argument mining, in: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 1364-1373.

[12] T. Kuribayashi, H. Ouchi, N. Inoue, P. Reisert, T. Miyoshi, J. Suzuki, K. Inui, An empirical study of span representations in argumentation structure parsing, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 4691-4698.

[13] J. Bao, C. Fan, J. Wu, Y. Dang, J. Du, R. Xu, A neural transition-based model for argumentation mining, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021, pp. 6354-6364.

[14] V. Niculae, J. Park, C. Cardie, Argument mining with structured SVMs and RNNs, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017, pp. 985-995.

[15] M. Ahmed, M. R. Samee, R. E. Mercer, Improving neural sequence labelling using additional linguistic information, in: 2018 17th IEEE International Conference on Machine Learning and Applications, 2018, pp. 650-657.

[16] M. T. Sazid, A Unified Representation and Deep Learning Architecture for Persuasive Essays in English, MSc thesis, The University of Western Ontario, London, Ontario, Canada, 2022. Electronic Thesis and Dissertation Repository 8497. https://ir.lib.uwo.ca/etd/8497.
[17] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, C. Dyer, Neural architectures for named entity recognition, in: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 260-270.

[18] A. Akbik, T. Bergmann, D. Blythe, K. Rasul, S. Schweter, R. Vollgraf, Flair: An easy-to-use framework for state-of-the-art NLP, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), 2019, pp. 54-59.

[19] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems, 2017, pp. 5998-6008.

[20] J. Ho, N. Kalchbrenner, D. Weissenborn, T. Salimans, Axial attention in multidimensional transformers, arXiv preprint arXiv:1912.12180 (2019).

[21] N. Kitaev, L. Kaiser, A. Levskaya, Reformer: The efficient transformer, in: International Conference on Learning Representations, 2020.