Using Clustering Techniques to Identify Arguments in Legal Documents

Prakash Poudyal
Department of Computer Science & Engineering
Kathmandu University
Dhulikhel, Nepal
prakash@ku.edu.np

Teresa Gonçalves
Department of Informatics
University of Evora
Evora, Portugal
tcg@uevora.pt

Paulo Quaresma
Department of Informatics
University of Evora
Evora, Portugal
pq@uevora.pt

ABSTRACT
A proposal to automatically identify arguments in legal documents is presented. In this approach, clustering algorithms are applied to argumentative sentences in order to identify arguments. One potential problem with this process is that an argumentative sentence belonging to one specific argument can also simultaneously be part of another, distinct argument. To address this issue, a Fuzzy c-means (FCM) clustering algorithm was used, and the proposed approach was evaluated on a set of case-law decisions from the European Court of Human Rights (ECHR). An extensive evaluation of the most relevant and discriminant features for this task was performed and the obtained results are presented.

In the context of this work, two additional algorithms were developed: 1) the "Distribution of Sentence to the Cluster Algorithm" (DSCA), which transfers the fuzzy membership values (between 0 and 1) generated by the FCM to a set of clusters; and 2) the "Appropriate Cluster Identification Algorithm" (ACIA), which evaluates the proposed clusters against the gold-standard clusters defined by human experts.

The overall results are quite promising and may be the basis for further research work and extensions.

KEYWORDS
Machine Learning, Fuzzy clustering algorithm, argument mining, legal documents, Natural Language Processing

ACM Reference Format:
Prakash Poudyal, Teresa Gonçalves, and Paulo Quaresma. 2019. Using Clustering Techniques to Identify Arguments in Legal Documents. In Proceedings of the Third Workshop on Automated Semantic Analysis of Information in Legal Text (ASAIL 2019). Montreal, QC, Canada, 8 pages.

In: Proceedings of the Third Workshop on Automated Semantic Analysis of Information in Legal Text (ASAIL 2019), June 21, 2019, Montreal, QC, Canada.
© 2019 Copyright held by the owner/author(s). Copying permitted for private and academic purposes.
Published at http://ceur-ws.org

1 INTRODUCTION

Advances in communication technology, the accessibility of media devices and the mushrooming of social media have caused the number of individuals expressing opinions to grow exponentially. As a result, a massive amount of electronic documents is generated daily, including news editorials, discussion forums and judicial decisions containing legal arguments. In turn, the rapid development of current research into argument mining is raising new challenges for natural language processing in various fields.

In general, to automatically identify a legal argument within an unstructured text, three stages or modules are used in current practice. The first stage is to identify the argumentative and non-argumentative sentences, the second is to identify the boundaries of arguments, and the third is to distinguish the argument's components (premise and conclusion).

To date, second-stage processing has been performed by identifying the boundaries of arguments, an extensively explored method in the AI & Law literature. In this paper, we propose a clustering technique that groups argumentative sentences into clusters of potential arguments with associated probabilities. An overview of our approach to the task is shown in Figure 1. The task is complex, since components of one argument (premise or conclusion) can also be involved in other arguments. In the example shown in Figure 1 there are two distinct arguments (A and B), and sentence 2 belongs to both argument A and argument B. It is important to point out that, for instance, in the European Court of Human Rights (ECHR) corpus, situations similar to this one (and even more complex) appear. Figure 2 shows a real example, where one sentence (marked in yellow) belongs to three arguments (7, 8 and 9) and is followed by another sentence belonging to a different argument (6), which is in turn followed by another sentence belonging to arguments 7 and 8.

To cluster such sentences, we propose to use a Fuzzy c-means (FCM) clustering algorithm [3], which provides, for each sentence, a membership value ranging from 0 to 1 for each cluster. These membership values are key assets of the FCM, since they allow us to associate each sentence with more than one cluster/argument. The performance of the FCM depends heavily on the selection of the features that are used. In the context of our work we focused mainly on four kinds of features: N-gram, word2vec, sentence closeness and 'combined features'. Our aim is to identify the best performing set of features and techniques to cluster components into arguments.

After extracting the features associated with each text, the FCM is used to obtain a cluster membership value for every sentence. To determine the composition of each cluster, we developed a specific algorithm: the "Distribution of Sentence to the Cluster Algorithm" (DSCA). To evaluate the performance of the system, a second algorithm, the "Appropriate Cluster Identification Algorithm" (ACIA), was also developed to map each cluster of the system's output to the closest matching cluster in the gold-standard dataset.

The rest of the paper is organized as follows: Section 2 reviews the state of the art; Section 3 contains a brief introduction to the dataset and to the measures used for evaluating the performance of the system.
In Section 4 we describe the proposed architecture, including a description of the features, a discussion on determining the optimum number of clusters, and the newly developed DSCA and ACIA algorithms. Section 5 evaluates the performance of all of our experiments. Lastly, Section 6 addresses conclusions and prospects for future work.

Figure 1: Overview of the Architecture

Figure 2: Example from the ECHR corpus

2 STATE OF THE ART

In the argument mining field, there has been very limited research on using clustering techniques to identify and group argumentative sentences into arguments. One of the most closely related lines of research was pursued by Rachel Mochales-Palau and Marie-Francine Moens [13, 14, 16]. They used the ECHR corpus, which was manually annotated and revised, and were able to obtain around 80% accuracy in the detection of argumentative sentences using a statistical classifier. They then proposed to detect argument limits either by using context-free grammars (CFG), to take into account the structure of documents, or by using semantic distance measures to cluster related sentences. The CFG approach was applied to a limited subset of documents and obtained around 60% accuracy; however, they did not present any result for a semantic-based approach. From the example in Figure 2 it is clear that a CFG approach is not powerful enough to identify the argument structure correctly, as arguments can be interleaved and may not have a sequential structure. In this work we propose a clustering approach, which aims to overcome this restriction and is based on the relatedness of sentences.

In another related work, Sobhani et al. [24] applied argument mining techniques to user comments, aiming to perform stance classification and argument identification. Their work has quite different goals and they assume a predefined number of arguments, transforming the problem into a classification problem (tagging sentences with the most adequate argument). They were able to obtain an f-measure of 0.49 for the argument tagging procedure. Moreover, user comments are typically simple sentences and do not have an inner argumentative structure.

J. Savelka and K. Ashley [22] proposed to use machine learning techniques to predict the usefulness of sentences for the interpretation of the meaning of statutory terms. They explored the use of syntactic, semantic and structural features in the classification process and were able to obtain an accuracy higher than 0.69.

Regarding argument relations, Stab and Gurevych [25] proposed an annotation scheme to model arguments and their relations. Their approach was to identify the relation (i.e. 'support' or 'attack') between the components of arguments. Their technique indicates which premises belong to a claim and constitute the structure of arguments.

Lawrence et al. [10] performed both a manual and an automated analysis to detect the boundaries of an argument. To train and test the automatic analysis, the authors relied on help from experts to analyze the text manually. For the automatic analysis, they used two Naive Bayes classifiers: one to identify the first word of a proposition and the other to identify the last word. [8], [21], and [17] continued this boundary approach, using the Conditional Random Fields (CRF) algorithm to segment the argument's components.

Lippi and Torroni [11] published a survey on the use of machine learning technologies for argument mining. The paper analyses several approaches by different authors regarding argument boundary detection, and emphasizes that boundary detection problems depend upon the adopted argument models. However, as already noted, the non-sequential structure of arguments in the ECHR corpus creates new and complex problems, which cannot be handled by simple boundary detection approaches.

Conrad [7] applied a k-means clustering algorithm over plaintiff claims in 'Premises Liability', 'Medical Malpractice' and 'Racial Discrimination' suits. The authors applied their technique to distinguish more effective plaintiff claims from less effective ones, using an 'award_quotient' metric to segregate the claims. Besides award_quotient, the authors used features that help differentiate one cluster's properties from another. The authors mention that they also tried aggregative, partial and graphical features, but did not find anything that yielded a performance superior to k-means.

3 CORPUS SELECTION AND EVALUATION PROCEDURES

We selected case-law documents from the European Court of Human Rights (ECHR)¹ annotated by R. Mochales [14]. The corpus is composed of 20 documents of the Decision category and 22 of the Judgment category, released before 20 October 1999 by the European Commission on Human Rights. Both categories include similar information; however, 'Decision' documents present the information briefly, with an average length of 3,500 words, whereas 'Judgment' documents have an average length of 10,000 words. The corpus contains 9,257 sentences, of which 7,097 (77%) are non-argumentative and 2,160 (23%) are argumentative. Details about the ECHR corpus are available in [18].

Regarding evaluation, we used the standard Precision, Recall and F-measure [2, 20]. Furthermore, we also used cluster purity to evaluate the quality of the obtained clusters. We computed the cluster purity [23] by counting the number of correctly assigned entities and dividing by the total number of entities N. Formally,

    ClusterPurity(φ, C) = (1/N) Σ_{d=1..k} max_{e=1..k} |w_d ∩ c_e|    (1)

where N is the total number of elements in all clusters, φ = {w_1, w_2, ..., w_k} is the set of clusters and C = {c_1, c_2, ..., c_k} is the set of classes. In Equation 1, w_d is interpreted as the set of sentences in cluster w_d and c_e as the set of sentences in class c_e.

4 SYSTEM ARCHITECTURE

Our proposal is to cluster argumentative sentences and, thereby, to identify legal arguments. As shown in Figure 1 there are several phases: feature extraction; clustering; and argument building. In order to apply the fuzzy clustering algorithm we need first to identify the optimum number of clusters in the respective case-law file and, after running it, we need to convert the generated soft clustering values to hard clustering.

4.1 Feature Extraction

Typically, features are values that represent a sentence and are suitable for a machine learning algorithm to handle. It is essential to select the most appropriate and precise features to train a machine learning algorithm so that the model can be successfully applied to new data. Therefore, good discriminant features are needed to capture similarities between sentences and also to address the sequential nature of sentences (since the majority of the components of an argument are presented in order). To address this requirement, the following features were used: N-gram [5], word2vec, and sentence closeness (discussed below). Another feature set can also be obtained by combining these three features into what we called "Combined Features". Each kind of feature is discussed below.

Word2vec: The word2vec approach was proposed by [12] and can be implemented in two different ways: as a 'Continuous Bag of Words' (CBOW) or as a 'Skip-gram'. With Skip-grams, context words are predicted from selected words in the text, whereas with CBOW, a word vector is predicted from the context of adjacent words. A Wikipedia dump of 05-02-2016 was used as input to the word2vec implementation of Gensim [19], where 100-dimensional vectors were generated for each word. Each word of a sentence is looked up and its corresponding vector is retrieved from the generated word vectors. Then, the average of the vectors of all the words present in the sentence is taken and considered to be the 'sentence vector'.

Sentence Closeness: Sentence closeness is the reciprocal of the inter-sentence distance (i.e. the distance between sentences), counted in units of whole sentences. To capture the sequential nature of sentences, distance is a useful feature that helps to determine which sentences belong to which argument. The highest scoring sentence is considered to be the origin sentence (with a score of 1), from which all other distances are measured. With the exception of the origin sentence, 'closeness' scores decrease monotonically as sentences move away from the origin. Furthermore, as meaning and concepts flow from one sentence to another, sentences whose 'closeness' is high are good candidates for being clustered together, i.e. for belonging to the same argument. Equation 2 is used to calculate the 'closeness' for each pair of sentences:

    Closeness(s_1, s_2) = 1 / (1 + |n(s_1) − n(s_2)|)    (2)

where n(s) is the position of sentence s, counted in sentences from the beginning of the text.
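Equation 2 is straightforward to implement. The sketch below is a minimal version of our own (the function names are ours, not from the paper) that computes the closeness score for a pair of sentence positions and the pairwise closeness matrix for a document:

```python
def closeness(n1, n2):
    """Equation 2: reciprocal of the inter-sentence distance.

    n1, n2 are the positions of the two sentences, counted in whole
    sentences from the beginning of the text (the paper's n(s1), n(s2)).
    """
    return 1.0 / (1.0 + abs(n1 - n2))


def closeness_matrix(num_sentences):
    """Pairwise closeness scores for all sentences in a document
    (1-based positions, as sentences are counted from the text start)."""
    return [[closeness(i, j) for j in range(1, num_sentences + 1)]
            for i in range(1, num_sentences + 1)]
```

A sentence paired with itself scores 1, adjacent sentences score 0.5, and scores decay monotonically with distance, matching the behaviour described above.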
¹ http://hudoc.echr.coe.int/sites/eng

Combined Features: The previously presented features (N-gram + Sentence Closeness + Word2vec) were combined into a new feature in an attempt to improve the performance of the clustering algorithms.

4.2 Identification of the optimum number of clusters

An argument cluster is a set of sentences which together comprise a single, coherent legal argument; the process by which sentences are aggregated into arguments in this way is called clustering. To cluster sentences successfully into arguments, it is currently necessary to specify in advance how many clusters to expect within a corpus and, until recently, there has been no well-established approach to defining this number. Techniques for determining the optimum number of clusters in the FCM have been proposed by Xie and Beni [26]; Latent Dirichlet Allocation (LDA) [4] can be used for the same purpose.

Employing the Xie and Beni approach [26], we determined experimentally that an FCM with a fuzziness value m set to 1.3, in concert with the Word2vec, N-gram and Sentence Closeness features, yielded the best results with our particular set of case-law files; we use this methodology to predict the adequate number of clusters. The Xie and Beni technique [26] selects as the best candidate the number of clusters that obtains the minimum index value.

The Latent Dirichlet Allocation (LDA) technique estimates the number of 'topics' existing within a text, which means estimating the probabilities of groupings within the text. Inspired by this concept, we decided to look for such groups within our own corpora and to use the estimated number of topics as a proxy for the number of clusters. As a metric we selected the 'CaoJuan2009' method, which chooses the best LDA model based on density [15]. 'CaoJuan2009' was tested and agreed well with the number of topics (equivalent to our 'number of clusters') predicted by the minimum index value [6]. Improvements in the results are expected to be achieved in the future as more discriminant features are used.

4.3 Clustering Algorithm

After extracting the features, we used a standard Fuzzy c-means (FCM) clustering algorithm [3] to generate membership values ranging from 0 to 1 for each cluster. The number of clusters was defined based on the approach described in Section 4.2. We set the fuzziness value m ∈ {1.1, 1.3, 2.0}.

4.4 Distribution of Sentence to the Cluster Algorithm (DSCA)

The Distribution of Sentence to the Cluster Algorithm (DSCA) aims to transform the membership values generated by the FCM (between 0 and 1) into a set of clusters (the soft-to-hard clustering problem). The FCM output represents a membership probability indicating how likely it is that a sentence belongs to a particular cluster. DSCA is presented as Algorithm 1.

Membership values are represented by a matrix where each row represents a sentence and each column is labelled with a cluster number (C_i) ranging from 1 to C. To assign a sentence to its respective cluster(s), a threshold value t needs to be specified to help define the boundaries between clusters. A sentence i is assigned to a cluster j only if the difference between the sentence's maximum membership value and its membership value for cluster j is less than the threshold; otherwise the sentence is rejected for that cluster. The algorithm ends after iterating over all positions in the matrix. The concept of a threshold value is discussed by [1] as well as [9]; these authors argue that the appropriate threshold value should be determined by experimentation.
After applying the DSCA algorithm, we obtain a proposal for the legal arguments: the identified clusters and their sentences.

Algorithm 1: Distribution of Sentence to the Cluster Algorithm (DSCA)

  1. Denote the sentences × clusters matrix by (a_ij) ∈ [0, 1], with i = 1, 2, ..., S and j = 1, 2, ..., C, where i indexes sentences and j indexes clusters.
  2. Define the pre-selected threshold t.
  3. for each sentence i do
       imax = max_j (a_ij)
       for each cluster j do
         if (imax − a_ij) < t then
           select sentence i for cluster j
         else
           reject sentence i for cluster j
         end if
       end for
     end for

Figure 3: Argument numbers of the gold-standard vs. system prediction (proposed by Xie and Beni and by Cao et al.)

Figure 3 illustrates the results for each experiment: the first is the gold-standard, the second uses Xie and Beni's proposal, and the third is from LDA [6]. In the case of Xie and Beni, it can be observed that case-law files 02, 31, 32, 39 and 42 find the number of clusters closest to the gold-standard, whereas for the other case-law files the differences are greater. In the case of Cao et al.'s prediction, case-law files 40 and 41 find the correct number of clusters, whereas the other case-law files show a slight difference, though not as big as that observed for Xie and Beni. The exact-accuracy score achieved was an identical 8% for both LDA and Xie and Beni.

We also used Equation 3 to compare the number of clusters of the gold-standard (Cg) with the number predicted by our system (Cs). If they differ by at most one unit, we consider the prediction "almost correct"; otherwise it is an incorrect prediction:

    |Cs − Cg| ≤ 1    (3)

where Cs is the cluster number given by the prediction system and Cg is the cluster number of the gold-standard. Applying this filter, the accuracy increases to 58% for LDA and 42% for Xie and Beni, respectively. From this analysis, it is possible to conclude that LDA achieves a greater accuracy than Xie and Beni and is much closer to the gold-standard. As a consequence, we used this approach in our experiments.

5 RESULTS AND EVALUATION

In order to evaluate the performance of our system, we needed to find the best mapping between our system's clusters and the gold-standard clusters from the ECHR. Therefore, we proposed and developed a new algorithm, the "Appropriate Cluster Identification Algorithm" (ACIA), to solve this problem. This algorithm maps each argument predicted by our system to the closest matching argument in the gold-standard corpus. Here, we describe the details of the algorithm.

5.1 Appropriate Cluster Identification Algorithm (ACIA)

The ACIA algorithm aims to find the best mapping between the system's predicted clusters and the gold-standard dataset clusters. A formal description of the ACIA algorithm is presented in Appendix A, but the general idea is the following:

• Select the best pair mapping between the clusters
• Remove these nodes from the set of clusters
• Iterate until there is no available pair of clusters
• The final mapping is composed of the set of selected pair mappings

After identifying the best mapping, the f1 measure is calculated for each cluster and the overall average f-measure value is obtained.

Figure 4: F1 score before and after applying ACIA

In Figure 4 we can see the relevance of performing an optimized mapping between the system's predicted clusters and the gold-standard arguments. We can observe that the 'After ACIA' value (square symbol) is higher (above 0.3 for all files), whereas for a sequential mapping between the two cluster sets, 'Before ACIA' (diamond symbol), the maximum value never exceeds 0.3.

5.2 Performance Measurement

The experiments were conducted with the features described in Section 4.1, with fuzziness parameter m ∈ {1.1, 1.3, 2.0} and threshold value t ∈ {0.0001, 0.00001, 0.000001} used for the conversion from soft to hard clustering. For reasons of space, we present the results for the features and parameters that score the highest f1 value in most of the case-law files. Table 1 presents the performance results (precision, recall, f1 and cluster purity) for the N-gram, Sentence Closeness, Word2vec and Combined features, using a threshold value t = 0.00001 and FCM fuzziness m = 1.3. Along with this, we include the number of sentences (#S) of each case-law file. The highest f1 value of each case-law file is highlighted in bold and underlined.

Case  #S   N-gram                      Sentence Closeness          Word2vec                    Combined Feature
           Pre   Rec   f1    Purity    Pre   Rec   f1    Purity    Pre   Rec   f1    Purity    Pre   Rec   f1    Purity
02    15   0.698 0.485 0.573 0.625     0.342 0.221 0.268 0.412     0.656 0.367 0.470 0.563     0.625 0.450 0.523 0.600
03    15   0.619 0.429 0.506 0.563     0.405 0.333 0.366 0.412     0.714 0.429 0.536 0.600     0.524 0.381 0.441 0.533
13    20   0.508 0.628 0.561 0.500     0.413 0.344 0.375 0.400     0.602 0.581 0.591 0.550     0.342 0.344 0.343 0.400
16    33   0.125 1.000 0.222 0.125     0.437 0.481 0.458 0.424     0.449 0.449 0.449 0.424     0.125 1.000 0.222 0.125
30    25   0.265 1.000 0.419 0.263     0.252 0.275 0.263 0.320     0.351 0.363 0.357 0.360     0.272 1.000 0.428 0.275
31    15   0.317 0.714 0.439 0.313     0.524 0.571 0.547 0.533     0.595 0.524 0.557 0.533     0.429 0.500 0.462 0.400
32    17   0.335 0.785 0.470 0.326     0.481 0.393 0.433 0.474     0.648 0.485 0.555 0.529     0.500 0.396 0.442 0.474
35    13   0.429 0.414 0.421 0.467     0.619 0.414 0.496 0.571     0.667 0.414 0.511 0.615     0.845 0.636 0.726 0.769
39    17   0.352 0.588 0.440 0.346     0.400 0.431 0.415 0.421     0.362 0.525 0.429 0.368     0.310 0.613 0.412 0.250
40    14   0.400 0.370 0.384 0.467     0.587 0.530 0.557 0.533     0.519 0.520 0.520 0.533     0.400 0.420 0.410 0.467
41    12   0.517 0.563 0.539 0.500     0.625 0.625 0.625 0.583     0.438 0.438 0.438 0.417     0.683 0.625 0.653 0.583
42    18   0.464 0.440 0.452 0.389     0.433 0.414 0.424 0.389     0.643 0.486 0.553 0.500     0.431 0.598 0.501 0.414

Table 1: Precision, Recall, f1 and Cluster Purity values per case-law file and feature set, with the number of sentences (#S)

Case-law files 03, 13, 16, 31, 32 and 42 obtained their highest value using the Word2vec features. Case-law file 02 scored its highest f1 value with N-gram, and case-law files 30, 35 and 41 their highest f1 with the Combined features. Likewise, case-law file 40 scored its highest f1 value with the Sentence Closeness feature. From this analysis, we can conclude that Word2vec seems to be the best overall approach.

In comparison to Word2vec, N-gram did not perform as well. The main reason is that N-gram uses a bag-of-words approach, which is not effective in finding similarities between sentences; the results also show that the performance of N-gram depends upon the number of sentences: if the number of sentences in a case-law file is high, then N-gram performance is poor. Sentence Closeness is another important feature, helping to capture the sequential context of a sentence. The sentence following an argumentative sentence often has a strong impact on the argument, as the meaning/context of a sentence usually flows sequentially. The results in Table 1 show that the performance of Sentence Closeness is satisfactory, but still lags behind Word2vec. The Combined feature also has an impact, as it is a combination of Word2vec, N-gram and Sentence Closeness: it offered the highest f1 value for those case-law files for which Word2vec did not offer significant results, with the exception of case-law files 02, 39 and 40. Overall, 66% of the case-law files obtained their highest f1 using Word2vec or the Combined feature.

Furthermore, in the case of N-gram and the Combined feature, we found that recall can be elevated up to 1 while precision is very low for case-law files that have a large number of sentences. This is because the N-gram feature is inappropriate for such case-law files: if the feature is not sufficiently discriminating to distinguish among sentence categories, the FCM assigns equal (or very close to equal) membership probability values to every category, essentially providing no useful information. As a result, during the process of forming hard clusters, such sentences get distributed equally over all clusters.

Table 1 also presents the cluster purity value of each feature for each case-law file. Word2vec was found to play the leading role in case-law files 03, 13, 16, 30, 31, 32, 40 and 42. Sentence Closeness scored highest in four case-law files: 16, 31, 39 and 41; however, in case-law files 16 and 31 it tied with Word2vec. Overall, the purity values are satisfactory, except for case-law files 16 and 30 with the Combined and N-gram features: case-law file 16, which has 33 sentences, had the lowest value (0.125) with both, and case-law file 30, which has 25 sentences, obtained 0.275. On the other hand, case-law file 35, which has 13 sentences, scored 0.726 (the highest value) using the Combined feature. From this analysis, we can conclude that a greater number of sentences affects the clustering quality negatively and that Word2vec is the dominant feature for obtaining acceptable f1 and cluster purity values. It is apparent from the data in Table 1 that f1 and cluster purity are well correlated.

Overall, the results obtained from the proposed framework – average accuracy of 0.59, macro f-measure of 0.497 and cluster purity of 0.499 – are quite promising, even if they cannot easily be compared with other researchers' results. The most closely related work is that of Mochales and Moens [13, 16]; they obtained a 60% accuracy result in the argumentation structure detection task. It is important to note that they did not present precision and recall measures and that they addressed a much simpler problem, because they assumed sequential argumentative structures. Sobhani et al. [24] obtained a very similar f-measure value (0.49), but also on a much less complex task: the classification of sentences from a predefined argument list. Goudas et al. [8] obtained an accuracy of 42% when segmenting argumentative sentences using Conditional Random Fields (CRF). Lawrence et al. [10] reported precision and recall for identifying argument structure using automatically segmented propositions of 33.3% and 50.0%, respectively. Stab and Gurevych [25] also encountered problems dealing with 'support' and 'attack' relations; the main reason was that their approach was unable to identify the correct target of a relation, especially in paragraphs with multiple claims or reasoning chains.

6 CONCLUSION AND FUTURE WORK

We proposed a new clustering technique for grouping argumentative sentences in legal documents. We also proposed and implemented an evaluation procedure for the proposed system and an approach to identify the total number of arguments in a case-law document. Overall, the results that we achieved are satisfactory and quite promising. The macro f1 and average cluster purity scores for system predictions using the Word2vec feature, in case-law files that have 4 to 8 arguments, are 0.497 and 0.499, respectively.
For future work, we intend to add and evaluate more features, such as 'semantic similarity' features, aiming to improve these results. Moreover, as an extension of this work we are working on: a) the identification, in each cluster/argument, of sentences acting either as premises or as conclusions; and b) the creation of a graph representation of the argument structure of each document (attack, support, and rebuttal arguments).

ACKNOWLEDGEMENT

The authors would like to express deep gratitude to the EMMA-WEST Project in the framework of the EU Erasmus Mundus Action 2, the Agatha Project SI IDT number 18022 (Intelligent analysis system of open sources information for surveillance/crime control), and ALENTEJO 2020 for their invaluable support. Further, the authors would also like to extend sincere thanks to the reviewers for their constructive comments and suggestions.

REFERENCES
[1] Moh'd Belal Al-Zoubi, Amjad Hudaib, and Bashar Al-Shboul. 2007. A fast fuzzy clustering algorithm. In Proceedings of the 6th WSEAS Int. Conf. on Artificial Intelligence, Knowledge Engineering and Data Bases, Vol. 3. 28–32.
[2] Ricardo Baeza-Yates and Berthier Ribeiro-Neto. 1999. Modern Information Retrieval. Vol. 463. ACM Press, New York.
[3] James C. Bezdek, Robert Ehrlich, and William Full. 1984. FCM: The fuzzy c-means clustering algorithm. Computers & Geosciences 10, 2-3 (1984), 191–203.
[4] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3 (2003), 993–1022.
[5] Peter F. Brown, Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. 1992. Class-based n-gram models of natural language. Computational Linguistics 18, 4 (1992), 467–479.
[6] Juan Cao, Tian Xia, Jintao Li, Yongdong Zhang, and Sheng Tang. 2009. A density-based method for adaptive LDA model selection. Neurocomputing 72, 7-9 (2009), 1775–1781.
[7] Jack G. Conrad and Khalid Al-Kofahi. 2017. Scenario analytics: Analyzing jury verdicts to evaluate legal case outcomes. In Proceedings of the 16th International Conference on Artificial Intelligence and Law (ICAIL '17). ACM, New York, NY, USA, 29–37. https://doi.org/10.1145/3086512.3086516
[8] Theodosis Goudas, Christos Louizos, Georgios Petasis, and Vangelis Karkaletsis. 2014. Argument extraction from news, blogs, and social media. In Hellenic Conference on Artificial Intelligence. Springer, 287–299.
[9] A. K. Jain, M. N. Murty, and P. J. Flynn. 1999. Data clustering: A review. ACM Computing Surveys (CSUR) 31, 3 (1999), 264–323.
[10] John Lawrence, Chris Reed, Colin Allen, Simon McAlister, and Andrew Ravenscroft. 2014. Mining arguments from 19th century philosophical texts using topic based modelling. In Proceedings of the First Workshop on Argumentation Mining. Association for Computational Linguistics, Baltimore, Maryland, 79–87. https://doi.org/10.3115/v1/W14-2111
[11] Marco Lippi and Paolo Torroni. 2016. Argumentation mining: State of the art and emerging trends. ACM Transactions on Internet Technology (TOIT) 16, 2 (2016), 10.
[12] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. 3111–3119.
[13] Raquel Mochales and Marie-Francine Moens. 2011. Argumentation mining. Artificial Intelligence and Law 19, 1 (2011), 1–22.
[14] Raquel Mochales-Palau and Marie-Francine Moens. 2007. Study on sentence relations in the automatic detection of argumentation in legal cases. Frontiers in Artificial Intelligence and Applications 165 (2007), 89.
[15] Nidhi. 2017. Number of topics for LDA on poems from Elliston Poetry Archive. Available at http://www.rpubs.com/MNidhi/NumberoftopicsLDA (2017-03-31).
[16] Raquel Mochales Palau and Marie-Francine Moens. 2009. Argumentation mining: The detection, classification and structure of arguments in text. In Proceedings of the 12th International Conference on Artificial Intelligence and Law (ICAIL '09). ACM, New York, NY, USA, 98–107. https://doi.org/10.1145/1568234.1568246
[17] Joonsuk Park, Arzoo Katiyar, and Bishan Yang. 2015. Conditional random fields for identifying appropriate types of support for propositions in online user comments. In Proceedings of the 2nd Workshop on Argumentation Mining. 39–44.
[18] Prakash Poudyal, Teresa Gonçalves, and Paulo Quaresma. 2016. Experiments on identification of argumentative sentences. In 10th International Conference on Software, Knowledge, Information Management & Applications (SKIMA). 398–403. https://doi.org/10.1109/SKIMA.2016.7916254
[19] Radim Řehůřek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. ELRA, Valletta, Malta, 45–50. http://is.muni.cz/publication/884893/en
[20] Mendes M. E. S. Rodrigues and L. Sacks. 2004. A scalable hierarchical fuzzy clustering algorithm for text mining. In Proceedings of the 5th International Conference on Recent Advances in Soft Computing. 269–274.
[21] Christos Sardianos, Ioannis Manousos Katakis, Georgios Petasis, and Vangelis Karkaletsis. 2015. Argument extraction from news. In ArgMining@HLT-NAACL. 56–66.
[22] Jaromir Savelka and Kevin D. Ashley. 2016. Extracting case law sentences for argumentation about the meaning of statutory terms. In Proceedings of the Third Workshop on Argument Mining (ArgMining 2016). 50–59.
[23] Hinrich Schütze, Christopher D. Manning, and Prabhakar Raghavan. 2008. Introduction to Information Retrieval. Cambridge University Press.
[24] Parinaz Sobhani, Diana Inkpen, and Stan Matwin. 2015. From argumentation mining to stance classification. In Proceedings of the 2nd Workshop on Argumentation Mining. 67–77.
[25] Christian Stab and Iryna Gurevych. 2014. Annotating argument components and relations in persuasive essays. In Proceedings of the 25th International Conference on Computational Linguistics (COLING 2014). Dublin City University and Association for Computational Linguistics, 1501–1510.
[26] Xuanli Lisa Xie and Gerardo Beni. 1991. A validity measure for fuzzy clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence 13, 8 (1991), 841–847. https://doi.org/10.1109/34.85677

A APPENDIX

A.1 Appropriate Cluster Identification Algorithm (ACIA)

Let A be the system's cluster set and B the gold-standard cluster set, each of cardinality n: A = {a_1, ..., a_n} and B = {b_1, ..., b_n}. We define the matrix F = {f_ij}, where f_ij is the f-measure value calculated by comparing cluster a_i ∈ A with cluster b_j ∈ B. We denote by (F)_ij the matrix formed from F by removing the j-th column and the i-th row.

State 1: Initialization

    F^(0) = (f_ij)_{n×n}
    R^(0)(−1, −1) = ∅

i.e. the nodes are connected with cost value C = 0 to form a tree structure.

State 2: From k = 0 to n, iterate. At each step k we have F^(k)(i, j) and R^(k)(i, j). Find all maximum elements of F^(k)(i, j):

    M_k = {(i, j) | f_ij is the maximum element of F^(k)(i, j)}

i.e. the maximum f-measure value is selected and placed in the tree structure.

State 3: For each element (i, j) ∈ M_k, update the route

    R^(k+1)(i, j) = R^(k)(i, j) ∪ {(i, j)}

and the matrix

    F^(k+1)(i, j) = (F^(k))_ij

Do this for all elements (i, j) of M_k and stop when k = n, i.e. the procedure is repeated for the remaining values.

State 4: For each route, calculate the total cost

    TC(R^(k)(i, j)) = Σ_{(i,j) ∈ R^(k)(i,j)} f_ij

i.e. the total cost of each route is calculated.
State 5: Select the maximum total cost TC(R^o(i, j)) and its route

    R^o(i, j) = {(i_1, j_1), ..., (i_n, j_n)}

i.e., the route with the maximum score is selected.

After identifying the appropriate cluster (argument) with respect to the gold standard, an f-measure is calculated between the i-th cluster of the system, as recommended by the ACIA, and the j-th cluster of the gold standard. After that, the average f-measure value is calculated.
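The five states above amount to searching for the one-to-one alignment between system clusters and gold-standard clusters that maximizes the summed f-measure. The following Python sketch illustrates the idea under simplifying assumptions: clusters are represented as sets of sentence ids, the pairwise f-measure is computed from set overlap, and the best route is found by exhaustive permutation search rather than the tree-based route construction described above. The names `f_measure` and `acia` are illustrative, not the authors' implementation.

```python
from itertools import permutations

def f_measure(system_cluster, gold_cluster):
    """Pairwise f-measure f_ij between a system cluster a_i and a gold
    cluster b_j, with precision/recall taken from the overlap of the
    two sentence-id sets."""
    overlap = len(set(system_cluster) & set(gold_cluster))
    if overlap == 0:
        return 0.0
    precision = overlap / len(system_cluster)
    recall = overlap / len(gold_cluster)
    return 2 * precision * recall / (precision + recall)

def acia(system_clusters, gold_clusters):
    """Exhaustive sketch of ACIA for equal-sized cluster sets (|A| = |B| = n):
    build the f-measure matrix F, score every complete route (one-to-one
    alignment) by its total cost, and return the best route together with
    the average f-measure over its matched pairs."""
    n = len(system_clusters)
    F = [[f_measure(a, b) for b in gold_clusters] for a in system_clusters]
    best_route, best_cost = None, -1.0
    # Each permutation is one complete route; its total cost is the sum
    # of the selected f_ij values (State 4).
    for perm in permutations(range(n)):
        cost = sum(F[i][perm[i]] for i in range(n))
        if cost > best_cost:
            best_cost, best_route = cost, [(i, perm[i]) for i in range(n)]
    return best_route, best_cost / n

# Hypothetical usage: two system clusters vs. two gold clusters.
system = [{1, 2}, {3, 4, 5}]
gold = [{3, 4}, {1, 2}]
route, avg_f = acia(system, gold)  # route == [(0, 1), (1, 0)], avg_f == 0.9
```

The exhaustive search is exponential in n; the tree of routes built in States 1–3 prunes this space by expanding only the maximum-f-measure choices at each step. For larger cluster sets, a standard maximum-weight assignment solver would serve the same purpose.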