Using Clustering Techniques to Identify Arguments in Legal Documents

Prakash Poudyal
Department of Computer Science & Engineering
Kathmandu University
Dhulikhel, Nepal
prakash@ku.edu.np

Teresa Gonçalves
Department of Informatics
University of Evora
Evora, Portugal
tcg@uevora.pt

Paulo Quaresma
Department of Informatics
University of Evora
Evora, Portugal
pq@uevora.pt

ABSTRACT
A proposal to automatically identify arguments in legal documents is presented. In this approach, clustering algorithms are applied to argumentative sentences in order to identify arguments. One potential problem with this process is that an argumentative sentence belonging to one specific argument can also simultaneously be part of another, distinct argument. To address this issue, a Fuzzy c-means (FCM) clustering algorithm was used, and the proposed approach was evaluated on a set of case-law decisions from the European Court of Human Rights (ECHR). An extensive evaluation of the most relevant and discriminant features for this task was performed and the obtained results are presented.

In the context of this work, two additional algorithms were developed: 1) the "Distribution of Sentence to the Cluster Algorithm" (DSCA), which transfers the fuzzy membership values (between 0 and 1) generated by the FCM to a set of clusters; and 2) the "Appropriate Cluster Identification Algorithm" (ACIA), which evaluates the proposed clusters against the gold-standard clusters defined by human experts.

The overall results are quite promising and may be the basis for further research work and extensions.

KEYWORDS
Machine Learning, Fuzzy clustering algorithm, argument mining, legal documents, Natural Language Processing

ACM Reference Format:
Prakash Poudyal, Teresa Gonçalves, and Paulo Quaresma. 2019. Using Clustering Techniques to Identify Arguments in Legal Documents. In Proceedings of the Third Workshop on Automated Semantic Analysis of Information in Legal Text (ASAIL 2019). Montreal, QC, Canada, 8 pages.

In: Proceedings of the Third Workshop on Automated Semantic Analysis of Information in Legal Text (ASAIL 2019), June 21, 2019, Montreal, QC, Canada.
© 2019 Copyright held by the owner/author(s). Copying permitted for private and academic purposes.
Published at http://ceur-ws.org

1 INTRODUCTION

Advances in communication technology, the accessibility of media devices and the mushrooming of social media have caused the number of individuals expressing opinions to grow exponentially. As a result, a massive amount of electronic documents is generated daily, including news editorials, discussion forums and judicial decisions containing legal arguments. In turn, the rapid development of current research into argument mining is raising new challenges for natural language processing in various fields.

In general, to automatically identify a legal argument within an unstructured text, three stages or modules are used in current practice. The first stage is to identify the argumentative and non-argumentative sentences, the second is to identify the boundaries of arguments, and the third is to distinguish the argument's components (premise and conclusion).

To date, second-stage processing has been performed by identifying the boundaries of arguments, an extensively explored method in the AI & Law literature. In this paper, we propose a clustering technique that groups argumentative sentences into clusters of potential arguments with associated probabilities. An overview of our approach to the task is shown in Figure 1. The task is complex, since components of one argument (premise or conclusion) can also be involved in other arguments. In the example shown in Figure 1 there are two distinct arguments (A and B), and sentence 2 belongs to both argument A and argument B. It is important to point out that, for instance, in the European Court of Human Rights (ECHR) corpus, situations similar to this one (and even more complex) appear. Figure 2 shows a real example, where one sentence (marked in yellow) belongs to three arguments (7, 8 and 9) and is followed by another sentence belonging to a different argument (6), which is in turn followed by another sentence belonging to arguments 7 and 8.

To cluster such sentences, we propose to use a Fuzzy c-means (FCM) clustering algorithm [3], which provides, for each sentence, a membership value ranging from 0 to 1 for each cluster. These membership values are key assets of the FCM, since they allow us to associate each sentence with more than one cluster/argument. The performance of the FCM depends heavily on the selection of the features that are used. In the context of our work we focused mainly on four kinds of features: N-gram, word2vec, sentence closeness and 'combined features'. Our aim is to identify the best performing set of features and techniques to cluster components into arguments.

After extracting the features associated with each text, the FCM is used to obtain a cluster membership value for every sentence. To determine the composition of each cluster, we developed a specific algorithm: the "Distribution of Sentence to the Cluster Algorithm" (DSCA). To evaluate the performance of the system, a second algorithm, the "Appropriate Cluster Identification Algorithm" (ACIA), was also developed to map each cluster of the system's output to the closest matching cluster in the gold-standard dataset.

The rest of the paper is organized as follows: Section 2 reviews the state of the art; Section 3 contains a brief introduction to the dataset and to the measures used for evaluating the performance of the system.
In Section 4 we describe the proposed architecture, including a description of the features, a discussion on determining the optimum number of clusters, and the newly developed DSCA and ACIA algorithms. Section 5 evaluates the performance of all of our experiments. Lastly, Section 6 addresses conclusions and prospects for future work.

Figure 1: Overview of the Architecture

Figure 2: Example from the ECHR corpus

2 STATE OF THE ART

In the argument mining field, there has been very limited research on using clustering techniques to identify and group argumentative sentences into arguments. One of the most closely related lines of research was pursued by Rachel Mochales-Palau and Marie-Francine Moens [13, 14, 16]. They used the ECHR corpus, which was manually annotated and revised, and were able to obtain around 80% accuracy in the detection of argumentative sentences using a statistical classifier. They then proposed to detect argument limits either by using context-free grammars (CFG), to take into account the structure of documents, or by using semantic distance measures to cluster related sentences. The CFG approach was applied to a limited subset of documents and obtained around 60% accuracy; however, they did not present any result for a semantic-based approach. From the example in Figure 2 it is clear that a CFG approach is not powerful enough to identify the argument structure correctly, as arguments can be interleaved and may not have a sequential structure. In this work we propose a clustering approach, which aims to overcome this restriction and is based on the relatedness of sentences.

In another related work, Sobhani et al. [24] applied argument mining techniques to user comments, aiming to perform stance classification and argument identification. Their work has quite different goals and they assume a predefined number of arguments, transforming the problem into a classification problem (tagging sentences with the most adequate argument). They were able to obtain an f-measure of 0.49 for the argument tagging procedure. Moreover, user comments are typically simple sentences and do not have an inner argumentative structure.

J. Savelka and K. Ashley [22] proposed to use machine learning techniques to predict the usefulness of sentences for the interpretation of the meaning of statutory terms. They explored the use of syntactic, semantic and structural features in the classification process and were able to obtain an accuracy higher than 0.69.

Regarding argument relations, Stab and Gurevych [25] proposed an annotation scheme to model arguments and their relations. Their approach was to identify the relation (i.e. 'support' or 'attack') between the components of arguments. Their technique indicates which premises belong to a claim and constitute the structure of arguments.

Lawrence et al. [10] performed both a manual and an automated analysis to detect the boundaries of an argument. To train and test the automatic analysis, the authors relied on help from experts to analyze the text manually. For the automatic analysis, they used two Naive Bayes classifiers: one to identify the first word of a proposition and the other to identify the last word. [8], [21], and [17] continued this boundary approach, using the Conditional Random Fields (CRF) algorithm to segment the argument's components.

Lippi and Torroni [11] published a survey on the use of machine learning technologies for argument mining. The paper analyses several approaches by different authors regarding argument boundary detection, and emphasizes that boundary detection problems depend upon the adopted argument models. However, as already noted, the non-sequential structure of arguments in the ECHR corpus creates new and complex problems, which cannot be handled by simple boundary detection approaches.

Conrad [7] applied a k-means clustering algorithm over plaintiff claims in 'Premises Liability', 'Medical Malpractice' and 'Racial Discrimination' suits. The authors applied their technique to distinguish more effective plaintiff claims from less effective ones, using an 'award_quotient' metric to segregate the claims. Besides award_quotient, the authors used features that help differentiate one cluster's properties from another. The authors mention that they also tried aggregative, partial and graphical features, but did not find anything that yielded a performance superior to k-means.

3 CORPUS SELECTION AND EVALUATION PROCEDURES

We selected case-law documents from the European Court of Human Rights (ECHR)¹ annotated by R. Mochales [14]. The corpus is composed of 20 documents of the Decision category and 22 of the Judgment category, released before 20 October 1999 by the European Commission on Human Rights. Both categories include similar information; however, 'Decision' documents present the information briefly, with an average length of 3,500 words, whereas 'Judgment' documents have an average length of 10,000 words. The corpus contains 9,257 sentences, of which 7,097 (77%) are non-argumentative and 2,160 (23%) are argumentative. Details about the ECHR corpus are available in [18].

Regarding evaluation, we used the standard Precision, Recall and F-measure [2, 20]. Furthermore, we also used cluster purity to evaluate the quality of the obtained clusters. We computed the cluster purity [23] by counting the number of correctly assigned entities and dividing by the total number of entities N. Formally,

    ClusterPurity(φ, C) = (1/N) Σ_{d=1..k} max_{e=1..k} |w_d ∩ c_e|    (1)

where N is the total number of elements in all clusters, φ = {w_1, w_2, ..., w_k} is the set of clusters and C = {c_1, c_2, ..., c_k} is the set of classes. In Equation 1, w_d is interpreted as the set of sentences in cluster w_d and c_e as the set of sentences in class c_e.

4 SYSTEM ARCHITECTURE

Our proposal is to cluster argumentative sentences and, thereby, to identify legal arguments. As shown in Figure 1 there are several phases: feature extraction; clustering; and argument building. In order to apply the fuzzy clustering algorithm we need first to identify the optimum number of clusters in the respective case-law file and, after running it, we need to convert the generated soft clustering values to hard clustering.

4.1 Feature Extraction

Typically, features are values that represent a sentence and are suitable for a machine learning algorithm to handle. It is essential to select the most appropriate and precise features to train a machine learning algorithm so that the model can be successfully applied to new data. Therefore, good discriminant features are needed to capture similarities between sentences and also to address the sequential nature of sentences (since the majority of the components of an argument are presented in order). To address this requirement, the following features were used: N-gram [5], word2vec, and sentence closeness (discussed below). Another feature set can also be obtained by combining these three features into what we called "Combined Features". Each kind of feature is discussed below.

Word2vec: The word2vec approach was proposed by [12] and can be implemented in two different ways: as a 'Continuous Bag of Words' (CBOW) or as a 'Skip-gram'. With Skip-grams, context words are predicted from selected words in the text, whereas with CBOW, a word vector is predicted from the context of adjacent words. A Wikipedia dump of 05-02-2016 was used as input to the word2vec implementation of Gensim [19], where 100-dimensional vectors were generated for each word. Each word of a sentence is looked up and its corresponding vector is retrieved from the generated word vectors. Then, the average of the vectors of all the words present in the sentence is taken and considered to be the 'sentence vector'.

Sentence Closeness: Sentence closeness is the reciprocal of the inter-sentence distance (i.e. the distance between sentences), counted in units of whole sentences. To capture the sequential nature of sentences, distance is a useful feature that helps to determine which sentences belong to which argument. The highest scoring sentence is considered to be the origin sentence (with a score of 1), from which all other distances are measured. With the exception of the origin sentence, 'closeness' scores decrease monotonically as sentences move away from the origin. Furthermore, as meaning and concepts flow from one sentence to another, sentences whose 'closeness' is high are good candidates for being clustered together, i.e. for belonging to the same argument. Equation 2 is used to calculate the 'closeness' for each pair of sentences:

    Closeness(s_1, s_2) = 1 / (1 + |n(s_1) − n(s_2)|)    (2)

where n(s) is the position of sentence s, counted in sentences from the beginning of the text.
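Equation 2 is straightforward to implement. The sketch below is a minimal version of our own (the function names are ours, not from the paper) that computes the closeness score for a pair of sentence positions and the pairwise closeness matrix for a document:

```python
def closeness(n1, n2):
    """Equation 2: reciprocal of the inter-sentence distance.

    n1, n2 are the positions of the two sentences, counted in whole
    sentences from the beginning of the text (the paper's n(s1), n(s2)).
    """
    return 1.0 / (1.0 + abs(n1 - n2))


def closeness_matrix(num_sentences):
    """Pairwise closeness scores for all sentences in a document
    (1-based positions, as sentences are counted from the text start)."""
    return [[closeness(i, j) for j in range(1, num_sentences + 1)]
            for i in range(1, num_sentences + 1)]
```

A sentence paired with itself scores 1, adjacent sentences score 0.5, and scores decay monotonically with distance, matching the behaviour described above.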
¹ http://hudoc.echr.coe.int/sites/eng

Combined Features: The previously presented features (N-gram + Sentence Closeness + Word2vec) were combined into a new feature in an attempt to improve the performance of the clustering algorithms.

4.2 Identification of the optimum number of clusters

An argument cluster is a set of sentences which together comprise a single, coherent legal argument; the process by which sentences are aggregated into arguments in this way is called clustering. To cluster sentences successfully into arguments, it is currently necessary to specify in advance how many clusters to expect within a corpus and, until recently, there has been no well-established approach to defining this number. Techniques for determining the optimum number of clusters in the FCM have been proposed by Xie and Beni [26]; Latent Dirichlet Allocation (LDA) [4] can be used for the same purpose.

Employing the Xie and Beni approach [26], we determined experimentally that an FCM with a fuzziness value m set to 1.3, in concert with the Word2vec, N-gram and Sentence Closeness features, yielded the best results with our particular set of case-law files; we use this methodology to predict the adequate number of clusters. The Xie and Beni technique [26] selects as the best candidate the number of clusters that obtains the minimum index value.

The Latent Dirichlet Allocation (LDA) technique estimates the number of 'topics' existing within a text, which means estimating the probabilities of groupings within the text. Inspired by this concept, we decided to look for such groups within our own corpora and to use the estimated number of topics as a proxy for the number of clusters. As a metric we selected the 'CaoJuan2009' method, which chooses the best LDA model based on density [15]. 'CaoJuan2009' was tested and agreed well with the number of topics (equivalent to our 'number of clusters') predicted by the minimum index value [6]. Improvements in the results are expected to be achieved in the future as more discriminant features are used.

4.3 Clustering Algorithm

After extracting the features, we used a standard Fuzzy c-means (FCM) clustering algorithm [3] to generate membership values ranging from 0 to 1 for each cluster. The number of clusters was defined based on the approach described in Section 4.2. We set the fuzziness value m ∈ {1.1, 1.3, 2.0}.

4.4 Distribution of Sentence to the Cluster Algorithm (DSCA)

The Distribution of Sentence to the Cluster Algorithm (DSCA) aims to transform the membership values generated by the FCM (between 0 and 1) into a set of clusters (the soft-to-hard clustering problem). The FCM output represents a membership probability indicating how likely it is that a sentence belongs to a particular cluster. DSCA is presented as Algorithm 1.

Membership values are represented by a matrix where each row represents a sentence and each column is labelled with a cluster number (C_i) ranging from 1 to C. To assign a sentence to its respective cluster(s), a threshold value t needs to be specified to help define the boundaries between clusters. A sentence i is assigned to a cluster j only if the difference between the sentence's maximum membership value and its membership value for cluster j is less than the threshold; otherwise the sentence is rejected for that cluster. The algorithm ends after iterating over all positions in the matrix. The concept of a threshold value is discussed by [1] as well as [9]; these authors argue that the appropriate threshold value should be determined by experimentation.
After applying the DSCA algorithm, we obtain a proposal for the legal arguments: the identified clusters and their sentences.

Algorithm 1: Distribution of Sentence to the Cluster Algorithm (DSCA)

  1. Denote the sentences × clusters matrix by (a_ij) ∈ [0, 1], with i = 1, 2, ..., S and j = 1, 2, ..., C, where i indexes sentences and j indexes clusters.
  2. Define the pre-selected threshold t.
  3. for each sentence i do
       imax = max_j (a_ij)
       for each cluster j do
         if (imax − a_ij) < t then
           select sentence i for cluster j
         else
           reject sentence i for cluster j
         end if
       end for
     end for

Figure 3: Argument numbers of the gold-standard vs. system prediction (proposed by Xie and Beni and by Cao et al.)

Figure 3 illustrates the results for each experiment: the first is the gold-standard, the second uses Xie and Beni's proposal, and the third is from LDA [6]. In the case of Xie and Beni, it can be observed that case-law files 02, 31, 32, 39 and 42 find the number of clusters closest to the gold-standard, whereas for the other case-law files the differences are greater. In the case of Cao et al.'s prediction, case-law files 40 and 41 find the correct number of clusters, whereas the other case-law files show a slight difference, though not as big as that observed for Xie and Beni. The exact-accuracy score achieved was an identical 8% for both LDA and Xie and Beni.

We also used Equation 3 to compare the number of clusters of the gold-standard (Cg) with the number predicted by our system (Cs). If they differ by at most one unit, we consider the prediction "almost correct"; otherwise it is an incorrect prediction:

    |Cs − Cg| ≤ 1    (3)

where Cs is the cluster number given by the prediction system and Cg is the cluster number of the gold-standard. Applying this filter, the accuracy increases to 58% for LDA and 42% for Xie and Beni, respectively. From this analysis, it is possible to conclude that LDA achieves a greater accuracy than Xie and Beni and is much closer to the gold-standard. As a consequence, we used this approach in our experiments.

5 RESULTS AND EVALUATION

In order to evaluate the performance of our system, we needed to find the best mapping between our system's clusters and the gold-standard clusters from the ECHR. Therefore, we proposed and developed a new algorithm, the "Appropriate Cluster Identification Algorithm" (ACIA), to solve this problem. This algorithm maps each argument predicted by our system to the closest matching argument in the gold-standard corpus. Here, we describe the details of the algorithm.

5.1 Appropriate Cluster Identification Algorithm (ACIA)

The ACIA algorithm aims to find the best mapping between the system's predicted clusters and the gold-standard dataset clusters. A formal description of the ACIA algorithm is presented in Appendix A, but the general idea is the following:

• Select the best pair mapping between the clusters
• Remove these nodes from the set of clusters
• Iterate until there is no available pair of clusters
• The final mapping is composed of the set of selected pair mappings

After identifying the best mapping, the f1 measure is calculated for each cluster and the overall average f-measure value is obtained.

Figure 4: F1 score before and after applying ACIA

In Figure 4 we can see the relevance of performing an optimized mapping between the system's predicted clusters and the gold-standard arguments. We can observe that the 'After ACIA' value (square symbol) is higher (above 0.3 for all files), whereas for a sequential mapping between the two cluster sets, 'Before ACIA' (diamond symbol), the maximum value never exceeds 0.3.

5.2 Performance Measurement

The experiments were conducted with the features described in Section 4.1, with fuzziness parameter m ∈ {1.1, 1.3, 2.0} and threshold value t ∈ {0.0001, 0.00001, 0.000001} used for the conversion from soft to hard clustering. For reasons of space, we present the results for the features and parameters that score the highest f1 value in most of the case-law files. Table 1 presents the performance results (precision, recall, f1 and cluster purity) for the N-gram, Sentence Closeness, Word2vec and Combined features, using a threshold value t = 0.00001 and FCM fuzziness m = 1.3. Along with this, we include the number of sentences (#S) of each case-law file. The highest f1 value of each case-law file is highlighted in bold and underlined.

Case  #S   N-gram                      Sentence Closeness          Word2vec                    Combined Feature
           Pre   Rec   f1    Purity    Pre   Rec   f1    Purity    Pre   Rec   f1    Purity    Pre   Rec   f1    Purity
02    15   0.698 0.485 0.573 0.625     0.342 0.221 0.268 0.412     0.656 0.367 0.470 0.563     0.625 0.450 0.523 0.600
03    15   0.619 0.429 0.506 0.563     0.405 0.333 0.366 0.412     0.714 0.429 0.536 0.600     0.524 0.381 0.441 0.533
13    20   0.508 0.628 0.561 0.500     0.413 0.344 0.375 0.400     0.602 0.581 0.591 0.550     0.342 0.344 0.343 0.400
16    33   0.125 1.000 0.222 0.125     0.437 0.481 0.458 0.424     0.449 0.449 0.449 0.424     0.125 1.000 0.222 0.125
30    25   0.265 1.000 0.419 0.263     0.252 0.275 0.263 0.320     0.351 0.363 0.357 0.360     0.272 1.000 0.428 0.275
31    15   0.317 0.714 0.439 0.313     0.524 0.571 0.547 0.533     0.595 0.524 0.557 0.533     0.429 0.500 0.462 0.400
32    17   0.335 0.785 0.470 0.326     0.481 0.393 0.433 0.474     0.648 0.485 0.555 0.529     0.500 0.396 0.442 0.474
35    13   0.429 0.414 0.421 0.467     0.619 0.414 0.496 0.571     0.667 0.414 0.511 0.615     0.845 0.636 0.726 0.769
39    17   0.352 0.588 0.440 0.346     0.400 0.431 0.415 0.421     0.362 0.525 0.429 0.368     0.310 0.613 0.412 0.250
40    14   0.400 0.370 0.384 0.467     0.587 0.530 0.557 0.533     0.519 0.520 0.520 0.533     0.400 0.420 0.410 0.467
41    12   0.517 0.563 0.539 0.500     0.625 0.625 0.625 0.583     0.438 0.438 0.438 0.417     0.683 0.625 0.653 0.583
42    18   0.464 0.440 0.452 0.389     0.433 0.414 0.424 0.389     0.643 0.486 0.553 0.500     0.431 0.598 0.501 0.414

Table 1: Precision, Recall, f1 and Cluster Purity values per case-law file and feature set, with the number of sentences (#S)

Case-law files 03, 13, 16, 31, 32 and 42 obtained their highest value using the Word2vec features. Case-law file 02 scored its highest f1 value with N-gram, and case-law files 30, 35 and 41 their highest f1 with the Combined features. Likewise, case-law file 40 scored its highest f1 value with the Sentence Closeness feature. From this analysis, we can conclude that Word2vec seems to be the best overall approach.

In comparison to Word2vec, N-gram did not perform as well. The main reason is that N-gram uses a bag-of-words approach, which is not effective in finding similarities between sentences; the results also show that the performance of N-gram depends upon the number of sentences: if the number of sentences in a case-law file is high, then N-gram performance is poor. Sentence Closeness is another important feature, helping to capture the sequential context of a sentence. The sentence following an argumentative sentence often has a strong impact on the argument, as the meaning/context of a sentence usually flows sequentially. The results in Table 1 show that the performance of Sentence Closeness is satisfactory, but still lags behind Word2vec. The Combined feature also has an impact, as it is a combination of Word2vec, N-gram and Sentence Closeness: it offered the highest f1 value for those case-law files for which Word2vec did not offer significant results, with the exception of case-law files 02, 39 and 40. Overall, 66% of the case-law files obtained their highest f1 using Word2vec or the Combined feature.

Furthermore, in the case of N-gram and the Combined feature, we found that recall can be elevated up to 1 while precision is very low for case-law files that have a large number of sentences. This is because the N-gram feature is inappropriate for such case-law files: if the feature is not sufficiently discriminating to distinguish among sentence categories, the FCM assigns equal (or very close to equal) membership probability values to every category, essentially providing no useful information. As a result, during the process of forming hard clusters, such sentences get distributed equally over all clusters.

Table 1 also presents the cluster purity value of each feature for each case-law file. Word2vec was found to play the leading role in case-law files 03, 13, 16, 30, 31, 32, 40 and 42. Sentence Closeness scored highest in four case-law files: 16, 31, 39 and 41; however, in case-law files 16 and 31 it tied with Word2vec. Overall, the purity values are satisfactory, except for case-law files 16 and 30 with the Combined and N-gram features: case-law file 16, which has 33 sentences, had the lowest value (0.125) with both, and case-law file 30, which has 25 sentences, obtained 0.275. On the other hand, case-law file 35, which has 13 sentences, scored 0.726 (the highest value) using the Combined feature. From this analysis, we can conclude that a greater number of sentences affects the clustering quality negatively and that Word2vec is the dominant feature for obtaining acceptable f1 and cluster purity values. It is apparent from the data in Table 1 that f1 and cluster purity are well correlated.

Overall, the results obtained from the proposed framework – average accuracy of 0.59, macro f-measure of 0.497 and cluster purity of 0.499 – are quite promising, even if they cannot easily be compared with other researchers' results. The most closely related work is that of Mochales and Moens [13, 16]; they obtained a 60% accuracy result in the argumentation structure detection task. It is important to note that they did not present precision and recall measures and that they addressed a much simpler problem, because they assumed sequential argumentative structures. Sobhani et al. [24] obtained a very similar f-measure value (0.49), but also on a much less complex task: the classification of sentences from a predefined argument list. Goudas et al. [8] obtained an accuracy of 42% when segmenting argumentative sentences using Conditional Random Fields (CRF). Lawrence et al. [10] reported precision and recall for identifying argument structure using automatically segmented propositions of 33.3% and 50.0%, respectively. Stab and Gurevych [25] also encountered problems dealing with 'support' and 'attack' relations; the main reason was that their approach was unable to identify the correct target of a relation, especially in paragraphs with multiple claims or reasoning chains.

6 CONCLUSION AND FUTURE WORK

We proposed a new clustering technique for grouping argumentative sentences in legal documents. We also proposed and implemented an evaluation procedure for the proposed system and an approach to identify the total number of arguments in a case-law document. Overall, the results that we achieved are satisfactory and quite promising. The macro f1 and average cluster purity scores for system predictions using the Word2vec feature, in case-law files that have 4 to 8 arguments, are 0.497 and 0.499, respectively.
For future work, we intend to add and evaluate more features, such as 'semantic similarity' features, aiming to improve these results. Moreover, as an extension of this work we are working on: a) the identification, in each cluster/argument, of sentences acting either as premises or as conclusions; and b) the creation of a graph representation of the argument structure of each document (attack, support, and rebuttal arguments).

ACKNOWLEDGEMENT

The authors would like to express deep gratitude to the EMMA-WEST Project in the framework of the EU Erasmus Mundus Action 2, the Agatha Project SI IDT number 18022 (Intelligent analysis system of open sources information for surveillance/crime control), and ALENTEJO 2020 for their invaluable support. Further, the authors would also like to extend sincere thanks to the reviewers for their constructive comments and suggestions.

REFERENCES
[1] Moh'd Belal Al-Zoubi, Amjad Hudaib, and Bashar Al-Shboul. 2007. A fast fuzzy clustering algorithm. In Proceedings of the 6th WSEAS Int. Conf. on Artificial Intelligence, Knowledge Engineering and Data Bases, Vol. 3. 28–32.
[2] Ricardo Baeza-Yates and Berthier Ribeiro-Neto. 1999. Modern Information Retrieval. Vol. 463. ACM Press, New York.
[3] James C. Bezdek, Robert Ehrlich, and William Full. 1984. FCM: The fuzzy c-means clustering algorithm. Computers & Geosciences 10, 2-3 (1984), 191–203.
[4] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3 (2003), 993–1022.
[5] Peter F. Brown, Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. 1992. Class-based n-gram models of natural language. Computational Linguistics 18, 4 (1992), 467–479.
[6] Juan Cao, Tian Xia, Jintao Li, Yongdong Zhang, and Sheng Tang. 2009. A density-based method for adaptive LDA model selection. Neurocomputing 72, 7-9 (2009), 1775–1781.
[7] Jack G. Conrad and Khalid Al-Kofahi. 2017. Scenario analytics: Analyzing jury verdicts to evaluate legal case outcomes. In Proceedings of the 16th International Conference on Artificial Intelligence and Law (ICAIL '17). ACM, New York, NY, USA, 29–37. https://doi.org/10.1145/3086512.3086516
[8] Theodosis Goudas, Christos Louizos, Georgios Petasis, and Vangelis Karkaletsis. 2014. Argument extraction from news, blogs, and social media. In Hellenic Conference on Artificial Intelligence. Springer, 287–299.
[9] A. K. Jain, M. N. Murty, and P. J. Flynn. 1999. Data clustering: A review. ACM Computing Surveys (CSUR) 31, 3 (1999), 264–323.
[10] John Lawrence, Chris Reed, Colin Allen, Simon McAlister, and Andrew Ravenscroft. 2014. Mining arguments from 19th century philosophical texts using topic based modelling. In Proceedings of the First Workshop on Argumentation Mining. Association for Computational Linguistics, Baltimore, Maryland, 79–87. https://doi.org/10.3115/v1/W14-2111
[11] Marco Lippi and Paolo Torroni. 2016. Argumentation mining: State of the art and emerging trends. ACM Transactions on Internet Technology (TOIT) 16, 2 (2016), 10.
[12] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. 3111–3119.
[13] Raquel Mochales and Marie-Francine Moens. 2011. Argumentation mining. Artificial Intelligence and Law 19, 1 (2011), 1–22.
[14] Raquel Mochales-Palau and Marie-Francine Moens. 2007. Study on sentence relations in the automatic detection of argumentation in legal cases. Frontiers in Artificial Intelligence and Applications 165 (2007), 89.
[15] Nidhi. 2017. Number of topics for LDA on poems from Elliston Poetry Archive. Available at http://www.rpubs.com/MNidhi/NumberoftopicsLDA (2017-03-31).
[16] Raquel Mochales Palau and Marie-Francine Moens. 2009. Argumentation mining: The detection, classification and structure of arguments in text. In Proceedings of the 12th International Conference on Artificial Intelligence and Law (ICAIL '09). ACM, New York, NY, USA, 98–107. https://doi.org/10.1145/1568234.1568246
[17] Joonsuk Park, Arzoo Katiyar, and Bishan Yang. 2015. Conditional random fields for identifying appropriate types of support for propositions in online user comments. In Proceedings of the 2nd Workshop on Argumentation Mining. 39–44.
[18] Prakash Poudyal, Teresa Gonçalves, and Paulo Quaresma. 2016. Experiments on identification of argumentative sentences. In 10th International Conference on Software, Knowledge, Information Management & Applications (SKIMA). 398–403. https://doi.org/10.1109/SKIMA.2016.7916254
[19] Radim Řehůřek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. ELRA, Valletta, Malta, 45–50. http://is.muni.cz/publication/884893/en
[20] Mendes M. E. S. Rodrigues and L. Sacks. 2004. A scalable hierarchical fuzzy clustering algorithm for text mining. In Proceedings of the 5th International Conference on Recent Advances in Soft Computing. 269–274.
[21] Christos Sardianos, Ioannis Manousos Katakis, Georgios Petasis, and Vangelis Karkaletsis. 2015. Argument extraction from news. In ArgMining@HLT-NAACL. 56–66.
[22] Jaromir Savelka and Kevin D. Ashley. 2016. Extracting case law sentences for argumentation about the meaning of statutory terms. In Proceedings of the Third Workshop on Argument Mining (ArgMining 2016). 50–59.
[23] Hinrich Schütze, Christopher D. Manning, and Prabhakar Raghavan. 2008. Introduction to Information Retrieval. Cambridge University Press.
[24] Parinaz Sobhani, Diana Inkpen, and Stan Matwin. 2015. From argumentation mining to stance classification. In Proceedings of the 2nd Workshop on Argumentation Mining. 67–77.
[25] Christian Stab and Iryna Gurevych. 2014. Annotating argument components and relations in persuasive essays. In Proceedings of the 25th International Conference on Computational Linguistics (COLING 2014). Dublin City University and Association for Computational Linguistics, 1501–1510.
[26] Xuanli Lisa Xie and Gerardo Beni. 1991. A validity measure for fuzzy clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence 13, 8 (1991), 841–847. https://doi.org/10.1109/34.85677

A APPENDIX

A.1 Appropriate Cluster Identification Algorithm (ACIA)

Let A be the system's cluster set and B the gold-standard cluster set, each of cardinality n: A = {a_1, ..., a_n} and B = {b_1, ..., b_n}. We define the matrix F = {f_ij}, where f_ij is the f-measure value calculated by comparing cluster a_i ∈ A with cluster b_j ∈ B. We denote by (F)_ij the matrix formed from F by removing the j-th column and the i-th row.

State 1: Initialization

    F^(0) = (f_ij)_{n×n}
    R^(0)(−1, −1) = ∅

i.e. the nodes are connected with cost value C = 0 to form a tree structure.

State 2: From k = 0 to n, iterate. At each step k we have F^(k)(i, j) and R^(k)(i, j). Find all maximum elements of F^(k)(i, j):

    M_k = {(i, j) | f_ij is the maximum element of F^(k)(i, j)}

i.e. the maximum f-measure value is selected and placed in the tree structure.

State 3: For each element (i, j) ∈ M_k, update the route

    R^(k+1)(i, j) = R^(k)(i, j) ∪ {(i, j)}

and the matrix

    F^(k+1)(i, j) = (F^(k))_ij

Do this for all elements (i, j) of M_k and stop when k = n, i.e. the procedure is repeated for the remaining values.

State 4: For each route, calculate the total cost

    TC(R^(k)(i, j)) = Σ_{(i,j) ∈ R^(k)(i,j)} f_ij

i.e. the total cost of each route is calculated.
State 5: Select the maximum total cost TC(R^o(i, j)) and its route

    R^o(i, j) = {(i_1, j_1), ..., (i_n, j_n)}

i.e., the route with the maximum score is selected.

After identifying the appropriate cluster (argument) with respect to the gold standard, an f-measure is calculated between the i-th cluster of the system, as recommended by the ACIA, and the j-th cluster of the gold standard. After that, the average f-measure value is calculated.
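The five states above amount to searching for the one-to-one alignment between system clusters and gold-standard clusters that maximizes the summed f-measure. The following Python sketch illustrates the idea under simplifying assumptions: clusters are represented as sets of sentence ids, the pairwise f-measure is computed from set overlap, and the best route is found by exhaustive permutation search rather than the tree-based route construction described above. The names `f_measure` and `acia` are illustrative, not the authors' implementation.

```python
from itertools import permutations

def f_measure(system_cluster, gold_cluster):
    """Pairwise f-measure f_ij between a system cluster a_i and a gold
    cluster b_j, with precision/recall taken from the overlap of the
    two sentence-id sets."""
    overlap = len(set(system_cluster) & set(gold_cluster))
    if overlap == 0:
        return 0.0
    precision = overlap / len(system_cluster)
    recall = overlap / len(gold_cluster)
    return 2 * precision * recall / (precision + recall)

def acia(system_clusters, gold_clusters):
    """Exhaustive sketch of ACIA for equal-sized cluster sets (|A| = |B| = n):
    build the f-measure matrix F, score every complete route (one-to-one
    alignment) by its total cost, and return the best route together with
    the average f-measure over its matched pairs."""
    n = len(system_clusters)
    F = [[f_measure(a, b) for b in gold_clusters] for a in system_clusters]
    best_route, best_cost = None, -1.0
    # Each permutation is one complete route; its total cost is the sum
    # of the selected f_ij values (State 4).
    for perm in permutations(range(n)):
        cost = sum(F[i][perm[i]] for i in range(n))
        if cost > best_cost:
            best_cost, best_route = cost, [(i, perm[i]) for i in range(n)]
    return best_route, best_cost / n

# Hypothetical usage: two system clusters vs. two gold clusters.
system = [{1, 2}, {3, 4, 5}]
gold = [{3, 4}, {1, 2}]
route, avg_f = acia(system, gold)  # route == [(0, 1), (1, 0)], avg_f == 0.9
```

The exhaustive search is exponential in n; the tree of routes built in States 1–3 prunes this space by expanding only the maximum-f-measure choices at each step. For larger cluster sets, a standard maximum-weight assignment solver would serve the same purpose.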