=Paper= {{Paper |id=Vol-2385/paper2 |storemode=property |title=Using Clustering Techniques to Identify Arguments in Legal Documents |pdfUrl=https://ceur-ws.org/Vol-2385/paper2.pdf |volume=Vol-2385 |authors=Prakash Poudyal,Teresa Gonçalves,Paulo Quaresma |dblpUrl=https://dblp.org/rec/conf/icail/PoudyalGQ19 }} ==Using Clustering Techniques to Identify Arguments in Legal Documents== https://ceur-ws.org/Vol-2385/paper2.pdf
      Using Clustering Techniques to Identify Arguments in Legal
                             Documents
                Prakash Poudyal                                         Teresa Gonçalves                             Paulo Quaresma
     Department of Computer Science &                               Department of Informatics                    Department of Informatics
               Engineering                                             University of Evora                          University of Evora
           Kathmandu University                                         Evora, Portugal                              Evora, Portugal
             Dhulikhel, Nepal                                            tcg@uevora.pt                                pq@uevora.pt
            prakash@ku.edu.np

ABSTRACT
A proposal to automatically identify arguments in legal documents is presented. In this approach, clustering algorithms are applied to argumentative sentences in order to identify arguments. One potential problem with this process is that an argumentative sentence belonging to one specific argument can also simultaneously be part of another, distinct argument. To address this issue, a Fuzzy c-means (FCM) clustering algorithm was used and the proposed approach was evaluated with a set of case-law decisions from the European Court of Human Rights (ECHR). An extensive evaluation of the most relevant and discriminant features for this task was performed and the obtained results are presented.

In the context of this work two additional algorithms were developed: 1) the "Distribution of Sentence to the Cluster Algorithm" (DSCA), which transfers the fuzzy membership values (between 0 and 1) generated by the FCM to a set of clusters; 2) the "Appropriate Cluster Identification Algorithm" (ACIA), which evaluates the proposed clusters against the gold-standard clusters defined by human experts.

The overall results are quite promising and may be the basis for further research work and extensions.

KEYWORDS
Machine Learning, Fuzzy clustering algorithm, argument mining, Legal documents, Natural Language Processing.

ACM Reference Format:
Prakash Poudyal, Teresa Gonçalves, and Paulo Quaresma. 2019. Using Clustering Techniques to Identify Arguments in Legal Documents. In Proceedings of Third Workshop on Automated Semantic Analysis of Information in Legal Text (ASAIL 2019). Montreal, QC, Canada, 8 pages.

In: Proceedings of the Third Workshop on Automated Semantic Analysis of Information in Legal Text (ASAIL 2019), June 21, 2019, Montreal, QC, Canada.
© 2019 Copyright held by the owner/author(s). Copying permitted for private and academic purposes.
Published at http://ceur-ws.org

1 INTRODUCTION
Advances in communication technology, the accessibility of media devices and the mushrooming of social media have caused the number of individuals expressing opinions to grow exponentially. As a result, a massive amount of electronic documents is generated daily, including news editorials, discussion forums and judicial decisions containing legal arguments. In turn, the rapid development of current research into argument mining is raising new challenges for natural language processing in various fields.

In general, to automatically identify a legal argument within an unstructured text, three stages or modules are used in current practice. The first stage is to identify the argumentative and non-argumentative sentences, the second stage is to identify the boundaries of arguments, and the third stage is to distinguish the argument's components (premise and conclusion).

To date, second-stage processing has been performed by identifying the boundaries of arguments, an extensively explored method in the AI & Law literature. In this paper, we propose a clustering technique that groups argumentative sentences into clusters of potential arguments with associated probabilities. An overview of our approach to the task is shown in Figure 1. The task is complex, since components of one argument (premise or conclusion) can also be involved in other arguments. In the example shown in Figure 1 there are two distinct arguments (A and B); sentence 2 belongs to argument A and also to argument B. It is important to point out that, for instance, in the European Court of Human Rights (ECHR) corpus, situations similar to this one (and even more complex) appear. Figure 2 shows a real example, where one sentence (marked in yellow) belongs to three arguments (7, 8 and 9) and is followed by another sentence belonging to a different argument (6), which is in turn followed by another sentence belonging to arguments 7 and 8. To cluster such sentences, we propose to use a Fuzzy c-means (FCM) clustering algorithm [3] that provides, for each sentence, a membership value ranging from 0 to 1 for each cluster. These membership values are key assets of the FCM, since they allow us to associate each sentence with more than one cluster/argument. The performance of the FCM depends heavily on the selection of the features that are used. In the context of our work we focused mainly on four kinds of features: N-gram, word2vec, sentence closeness and 'combined features'. Our aim is to identify the best performing set of features and techniques to cluster components to form an argument.

After extracting the features associated with each text, the FCM is used to obtain a cluster membership value for every sentence. To determine the composition of each cluster, we developed a specific algorithm: the "Distribution of Sentence to the Cluster Algorithm" (DSCA).

To evaluate the performance of the system, a second algorithm, the "Appropriate Cluster Identification Algorithm" (ACIA), was also developed to map each cluster of the system's output to the closest matching cluster in the gold-standard dataset.

The rest of the paper is organized as follows: Section 3 contains a brief introduction to the datasets and to the measures used for evaluating the performance of the system. In Section 4 we describe




                                                 Figure 1: Overview of the Architecture




                                               Figure 2: Example from the ECHR corpus


the proposed architecture, including a description of the features, a discussion on determining the optimum number of clusters, and the newly developed DSCA and ACIA algorithms. Section 5 evaluates the performance of all of our experiments. Lastly, Section 6 addresses conclusions and prospects for future work.

2 STATE OF THE ART
In the argument mining field, there has been very limited research on using clustering techniques to identify and group argumentative sentences into arguments. Some of the most closely related work was done by Rachel Mochales-Palau and M. Francine Moens [13, 14, 16]. They used the ECHR corpus, which was manually annotated and revised, and were able to obtain around 80% accuracy in the detection of argumentative sentences using a statistical classifier. They then proposed to detect argument limits using context-free grammars (CFG) to take into account the structure of documents, or to use semantic distance measures to cluster related sentences. The CFG approach was applied to a limited subset of documents and obtained around 60% accuracy. However, they did not present any results for a semantic-based approach. From the example in Figure 2 it is clear that a CFG approach is not powerful enough to correctly identify the argument structure, as arguments can be interleaved and may not have a sequential structure. In this work we propose a clustering approach which aims to overcome this restriction and is based on the relatedness of sentences.

In another related work, Sobhani et al. [24] applied argument mining techniques to user comments, aiming to perform stance classification and argument identification. Their work has quite different goals and they assume a predefined number of arguments, transforming the problem into a classification problem (tag sentences with the most adequate argument). They were able to obtain an f-measure of 0.49 for the argument tagging procedure. Moreover, user comments are typically simple sentences and do not have an inner argumentative structure.

J. Savelka and K. Ashley [22] proposed to use machine learning techniques to predict the usefulness of sentences for the interpretation of the meaning of statutory terms. They explored the use of syntactic, semantic and structural features in the classification process and were able to obtain an accuracy higher than 0.69.

Regarding argument relations, Stab and Gurevych [25] proposed an annotation scheme to model arguments and their relations. Their approach was to identify the relation (i.e. 'support' or 'attack') between the components of arguments. Their technique indicates which premises belong to a claim and constitute the structure of arguments.


Lawrence et al. [10] performed a manual analysis as well as an automated analysis to detect the boundaries of an argument. To train and test the automatic analysis, the authors relied on help from experts to analyze the text manually. For the automatic analysis, they used two Naive Bayes classifiers: one to identify the first word of a proposition and the other to identify the last word. [8], [21] and [17] continued this boundary approach, using the Conditional Random Fields (CRF) algorithm to segment the argument's components.

Lippi and Torroni [11] wrote a survey paper about the use of machine learning technologies for argument mining. The paper analyses several approaches taken by different authors regarding argument boundary detection. This review article emphasizes that boundary detection problems depend upon the adopted argument models. However, as already noted, the non-sequential structure of arguments in the ECHR corpus creates new and complex problems, which cannot be handled by simple boundary detection approaches.

Conrad [7] applied a k-means clustering algorithm over plaintiff claims in 'Premises Liability', 'Medical Malpractice' and 'Racial Discrimination' suits. The authors applied their technique to distinguish more effective plaintiff claims from less effective ones, using an 'award_quotient' metric to segregate the claims. Besides award_quotient, the authors used features to help differentiate one cluster's properties from another. The authors mention that they also tried aggregative, partial and graphical features, but did not find anything that yielded a performance superior to k-means.

3 CORPUS SELECTION AND EVALUATION PROCEDURES
We selected case-law documents from the European Court of Human Rights (ECHR)¹ annotated by R. Mochales [14]. The corpus is composed of 20 documents in the Decision category and 22 in the Judgment category, released before 20 October 1999 by the European Commission on Human Rights. Both categories include similar information; however, 'Decisions' present the information briefly, with an average length of 3,500 words, whereas for 'Judgments' the average length is 10,000 words. The corpus contains 9,257 sentences, of which 7,097 (77%) are non-argumentative and 2,160 (23%) are argumentative. Details about the ECHR corpus are available in [18].

Regarding evaluation, we used the standard Precision, Recall and F-measure [2, 20] measures. Furthermore, we also used cluster purity to evaluate the quality of the obtained clusters. We computed the cluster purity [23] by counting the number of correctly assigned entities and dividing by the total number N. Formally,

    ClusterPurity(φ, C) = (1/N) Σ_{d=1..k} max_{e=1..k} |w_d ∩ c_e|    (1)

where N is the total number of elements in all clusters, φ = {w_1, w_2, ..., w_k} is the set of clusters and C = {c_1, c_2, ..., c_k} is the set of classes. In Equation 1, w_d is interpreted as the set of sentences in cluster w_d and c_e as the set of sentences in class c_e.

4 SYSTEM ARCHITECTURE
Our proposal is to cluster argumentative sentences and, thereby, to identify legal arguments. As shown in Figure 1, there are several phases: feature extraction, the clustering algorithm, and argument building. In order to apply the fuzzy clustering algorithm we first need to identify the optimum number of clusters in the respective case-law file and, after running it, we need to convert the generated soft clustering values to hard clustering.

4.1 Feature Extraction
Typically, features are values that represent a sentence and are suitable for a machine learning algorithm to handle. It is essential to select the most appropriate and precise features to train a machine learning algorithm so that the model can be successfully applied to new data. Therefore, good discriminant features are needed to capture similarities between sentences and also to address the sequential nature of sentences (since the majority of the components of an argument are presented in order). To address this requirement, the following features were used: N-gram [5], word2vec, and sentence closeness (discussed below). Another feature set can also be obtained by combining these three features into what we called "Combined Features". Each kind of feature is discussed below.

Word2vec: The word2vec approach was proposed by [12] and can be implemented in two different ways: as a 'Continuous Bag of Words' (CBW) or as a 'Skip-gram'. With Skip-grams, context words are predicted from selected words in the text, whereas with CBW, a word vector is predicted from the context of adjacent words. A Wikipedia dump of 05-02-2016 was used as input to the word2vec implementation of Gensim [19], where 100-dimension vectors were generated for each word. Each word of a sentence is looked up and its corresponding vector found among the generated word vectors. Then, the average of the vectors of all the words present in the sentence is taken and considered to be the 'sentence vector'.

Sentence Closeness: Sentence closeness is the reciprocal of the inter-sentence distance (i.e. the distance between sentences) counted in units of whole sentences. To capture the sequential nature of sentences, distance is a useful feature that helps to determine which sentences belong to which argument. The highest scoring sentence is considered to be the origin sentence (with a score of 1) from which all other distances are measured. With the exception of the origin sentence, 'closeness' scores should decrease monotonically as sentences move away from the origin. Furthermore, as meaning and concepts flow from one sentence to another, sentences whose 'closeness' is high are good candidates for being clustered together, i.e. they belong to the same argument. Equation 2 was used to calculate the 'closeness' for each pair of sentences:

    Closeness(s_1, s_2) = 1 / (1 + |n(s_1) − n(s_2)|)    (2)

where n(s) gives the position of sentence s, counted from the beginning of the text.

Combined Features: The previously presented features (N-gram + 'Sentence Closeness' + Word2vec) were combined into a new

¹ http://hudoc.echr.coe.int/sites/eng


feature in an attempt to improve the performance of the clustering algorithms.

4.2 Identification of the optimum number of clusters
An argument cluster is a set of sentences which together comprise a single, coherent legal argument. The process by which sentences are aggregated into arguments in this way is called clustering. To cluster sentences successfully into arguments, it is currently necessary to specify in advance how many clusters to expect within a corpus, and until recently there has been no well-established approach to defining this. Techniques that claim to be able to define the optimum number of clusters for the FCM have been proposed by Xie and Beni [26] and via Latent Dirichlet Allocation (LDA) [4].

Employing the Xie and Beni approach [26], we determined experimentally that an FCM with a fuzziness value m set to 1.3, in concert with the Word2vec, N-gram and Sentence Closeness features, yielded the best results with our particular set of case-law files. The technique of [26] selects as the best candidate the number of clusters that obtains the minimum index value.

The Latent Dirichlet Allocation (LDA) technique estimates the number of 'topics' existing within a text, which means estimating the probabilities of groupings within the text. Inspired by this concept, it was decided to look for such groups within our own corpora and to use the estimated number of topics as a proxy for the number of clusters. We selected the 'CaoJuan2009' method, a density-based metric for choosing the best LDA model [15]. 'CaoJuan2009' was tested and agreed well with the number of topics (equivalent to our 'number of clusters') predicted by the minimum index value [6].

Figure 3 illustrates the results for each experiment: the first is the gold-standard, the second uses Xie and Beni's proposal, and the third is from LDA [6]. In the case of Xie and Beni, it can be observed that case-law files 02, 31, 32, 39 and 42 find the closest number of clusters to the gold-standard, whereas for other case-law files the differences are greater.

In the case of Cao et al.'s prediction, case-law files 40 and 41 find the correct number of clusters, whereas the other case-law files present a slight difference, but not as big as that observed for Xie and Beni. The exact-accuracy score achieved was an identical 8% for both LDA and Xie and Beni.

We also used Equation 3 to calculate the difference between the number of clusters of the gold-standard (Cg) and the one predicted by our system (Cs). If they differ by at most one unit, we consider the prediction "almost correct"; otherwise it is an incorrect prediction:

    |Cs − Cg| ≤ 1    (3)

where Cs is the cluster number given by the prediction system and Cg is the cluster number of the gold-standard. The result of applying this filter shows that the accuracy increases to 58% for LDA and 42% for Xie and Beni.

From this analysis, it is possible to conclude that LDA achieves a greater accuracy than Xie and Beni and is much closer to the gold-standard. As a consequence, in our experiments we used this methodology to predict the adequate number of clusters. Improvements in the results are expected in the future as more discriminant features are used.

4.3 Clustering Algorithm
After extracting the features, we used a standard Fuzzy c-means (FCM) clustering algorithm [3] to generate membership values ranging from 0 to 1 for each cluster. The number of clusters was defined based on the approach described in Section 4.2. We set the fuzziness value m ∈ {1.1, 1.3, 2.0}.

4.4 Distribution of Sentence to the Cluster Algorithm (DSCA)
The Distribution of Sentence to the Cluster Algorithm (DSCA) aims to transform the membership values generated by the FCM (between 0 and 1) into a set of clusters (a soft-to-hard clustering problem). The FCM output represents a membership probability indicating how likely it is that a sentence belongs to a particular cluster. DSCA is presented as Algorithm 1.

Membership values are represented by a matrix where each row represents a sentence and each column is labelled with a cluster number (Ci) ranging from 1 to C. To assign a sentence to the respective cluster, a threshold value t needs to be specified to help define boundaries between the clusters. A sentence i is assigned to cluster j only if the difference between the row's maximum membership value and its membership value for cluster j is less than the threshold; otherwise the sentence is rejected for that cluster. The algorithm ends after iterating through all positions in the matrix. The concept of a threshold value is discussed by [1] as well as [9]; the authors claim that the appropriate threshold value should be determined by experimentation. After applying the DSCA algorithm, we were able to obtain a proposal for legal arguments: the identified clusters and their sentences.

5 RESULTS AND EVALUATION
In order to evaluate the performance of our system, we needed to find the best mapping between our system's clusters and the existing gold-standard clusters from the ECHR. Therefore, we proposed and developed a new algorithm, the "Appropriate Cluster Identification Algorithm" (ACIA), to solve this problem. This algorithm maps each argument predicted by our system to the closest matching argument in the gold-standard corpus. Here, we describe the details of the algorithm.

5.1 Appropriate Cluster Identification Algorithm (ACIA)
The ACIA algorithm aims to find the best mapping between the system's predicted clusters and the gold-standard dataset clusters. A formal description of the ACIA algorithm is presented in Appendix A, but the general idea is the following:
• Select the best pair mapping between the clusters
• Remove these nodes from the set of clusters
• Iterate until there is no available pair of clusters
• The final mapping is composed of the set of selected pair mappings




        Figure 3: Argument numbers of gold-standard vs. System Prediction (proposed by Xie and Beni and Cao et al.)
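The "almost correct" filter of Equation 3, used to score the cluster-number predictions plotted in Figure 3, amounts to a per-file tolerance check. A minimal sketch follows; the cluster counts in the example are illustrative only, not the paper's data.

```python
def almost_correct_accuracy(predicted_counts, gold_counts):
    # A prediction Cs is "almost correct" when |Cs - Cg| <= 1 (Equation 3);
    # the accuracy is the share of case-law files passing this check.
    hits = sum(1 for cs, cg in zip(predicted_counts, gold_counts)
               if abs(cs - cg) <= 1)
    return hits / len(gold_counts)

# Illustrative counts: three of four files are within one cluster of gold.
score = almost_correct_accuracy([5, 8, 3, 7], [4, 8, 6, 7])  # -> 0.75
```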


 Algorithm 1: Distribution of Sentence to the Cluster Algo-                 In figure 4 we can see the relevance of performing an optimized
 rithm (DSCA)                                                            mapping between the system’s predicted clusters and the gold-
                                                                         data arguments. We can observe that the value of ‘After ACIA’
  1. Denote the matrix of the sentences x cluster by (ai j ) ∈ [0, 1],
                                                                         (square symbol) is higher (above 0.3 for all files), whereas in the
   i = 1, 2, 3 · · · S and j = 1, 2, 3 · · · C such that i stands for
                                                                         case of a sequential mapping between the two clusters ‘Before
   sentence and j stands for cluster.
                                                                         ACIA’ (diamond symbol), the maximum value never exceeds is 0.3.
  2. Pre-selected threshold (t) is defined
  3. for each i do do                                                    5.2    Performance Measurement
      imax = max(ai j ) ∀i
      for each cluster j do
          if (a_i,max − a_ij) < t then
              select sentence i for cluster j
          else
              reject i
          end
      end
  end

After identifying the best mapping, the f1 measure is calculated for each cluster and the overall average f-measure value is obtained.

Figure 4: f1 score before and after applying ACIA

The experiment was conducted with the features described in section 4.1, with fuzziness parameters m ∈ {1.1, 1.3, 2.0} and threshold values t ∈ {0.0001, 0.00001, 0.000001} used for the conversion from a soft to a hard clustering. For reasons of space, we present the results for the features and parameters that score the highest f1 value in most of the case-law files. Table 1 presents the performance results (precision, recall, f1 and cluster purity) for the N-gram, Sentence Closeness, Word2vec and Combined features, using a threshold value t = 0.00001 and FCM fuzziness m = 1.3. Along with this, we include the number of sentences of each case-law file. The highest f1 value of each case-law file obtained from each feature is highlighted in bold and underlined. Case-law files 03, 13, 16, 31, 32 and 42 obtained the highest value using Word2vec features. Case-law file 02 scored the highest f1 value with N-gram, and case-law files 30, 35 and 41 the highest f1 with the Combined features. Likewise, case-law file 40 scored the highest f1 value with the Sentence Closeness feature. From this analysis, we can conclude that Word2vec seems to be the best overall approach.
   In comparison to Word2vec, N-gram did not perform as well. The main reason is that N-gram uses a bag-of-words approach, which is not effective in finding similarities between sentences; moreover, the results show that the performance of N-gram depends on the number of sentences: if the number of sentences in the case-law file is high, N-gram performance is poor.
   Sentence Closeness is another important feature that helps to capture the sequential context of a sentence. The sentence following an argumentative sentence often has a strong impact on the argument, as the meaning/context of a sentence usually flows sequentially.
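The soft-to-hard conversion described by the pseudocode above can be sketched directly on an FCM membership matrix. The function and variable names below are illustrative, not from the paper's implementation; `memberships[i, j]` stands for a_ij and `t` for the threshold:

```python
import numpy as np

def soft_to_hard(memberships: np.ndarray, t: float) -> list:
    """Convert an FCM membership matrix (sentences x clusters) to hard
    clusters: sentence i joins every cluster j whose membership a_ij is
    within threshold t of that sentence's maximum membership."""
    clusters = [[] for _ in range(memberships.shape[1])]
    for i, row in enumerate(memberships):
        row_max = row.max()                 # highest membership of sentence i
        for j, a_ij in enumerate(row):
            if (row_max - a_ij) < t:        # close enough to the maximum
                clusters[j].append(i)       # select sentence i for cluster j
            # otherwise sentence i is rejected for cluster j
    return clusters

m = np.array([[0.50, 0.499999, 0.000001],
              [0.10, 0.80,     0.10]])
print(soft_to_hard(m, t=0.00001))   # → [[0], [0, 1], []]
```

Note that a sentence whose memberships are within t of each other lands in several clusters at once, which matches the paper's premise that one argumentative sentence can belong to more than one argument.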
ASAIL 2019, June 21, 2019, Montreal, QC, Canada                                               Prakash Poudyal, Teresa Gonçalves, and Paulo Quaresma


 Case  #S |        N-gram               |     Sentence Closeness      |         Word2vec            |     Combined Features
          | Pre    Rec    f1     Purity | Pre    Rec    f1     Purity | Pre    Rec    f1     Purity | Pre    Rec    f1     Purity
  02   15 | 0.698  0.485  0.573  0.625  | 0.342  0.221  0.268  0.412  | 0.656  0.367  0.470  0.563  | 0.625  0.450  0.523  0.600
  03   15 | 0.619  0.429  0.506  0.563  | 0.405  0.333  0.366  0.412  | 0.714  0.429  0.536  0.600  | 0.524  0.381  0.441  0.533
  13   20 | 0.508  0.628  0.561  0.500  | 0.413  0.344  0.375  0.400  | 0.602  0.581  0.591  0.550  | 0.342  0.344  0.343  0.400
  16   33 | 0.125  1.000  0.222  0.125  | 0.437  0.481  0.458  0.424  | 0.449  0.449  0.449  0.424  | 0.125  1.000  0.222  0.125
  30   25 | 0.265  1.000  0.419  0.263  | 0.252  0.275  0.263  0.320  | 0.351  0.363  0.357  0.360  | 0.272  1.000  0.428  0.275
  31   15 | 0.317  0.714  0.439  0.313  | 0.524  0.571  0.547  0.533  | 0.595  0.524  0.557  0.533  | 0.429  0.500  0.462  0.400
  32   17 | 0.335  0.785  0.470  0.326  | 0.481  0.393  0.433  0.474  | 0.648  0.485  0.555  0.529  | 0.500  0.396  0.442  0.474
  35   13 | 0.429  0.414  0.421  0.467  | 0.619  0.414  0.496  0.571  | 0.667  0.414  0.511  0.615  | 0.845  0.636  0.726  0.769
  39   17 | 0.352  0.588  0.440  0.346  | 0.400  0.431  0.415  0.421  | 0.362  0.525  0.429  0.368  | 0.310  0.613  0.412  0.250
  40   14 | 0.400  0.370  0.384  0.467  | 0.587  0.530  0.557  0.533  | 0.519  0.520  0.520  0.533  | 0.400  0.420  0.410  0.467
  41   12 | 0.517  0.563  0.539  0.500  | 0.625  0.625  0.625  0.583  | 0.438  0.438  0.438  0.417  | 0.683  0.625  0.653  0.583
  42   18 | 0.464  0.440  0.452  0.389  | 0.433  0.414  0.424  0.389  | 0.643  0.486  0.553  0.500  | 0.431  0.598  0.501  0.414

Table 1: Precision, recall, f1 and cluster purity values for each feature, by case-law file and number of sentences
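The contrast drawn below between the N-gram and Word2vec features comes down to how sentence similarity is measured. A toy sketch: the three-dimensional vectors are invented for illustration (the paper uses pretrained Word2vec embeddings [12]), and averaging word vectors is one common way to build a Word2vec sentence feature, assumed here rather than taken from the paper's code:

```python
import numpy as np

# Toy word vectors; in practice these come from a trained Word2vec model.
vec = {"court":    np.array([0.90, 0.10, 0.00]),
       "tribunal": np.array([0.85, 0.15, 0.05]),
       "ruled":    np.array([0.10, 0.90, 0.00]),
       "decided":  np.array([0.20, 0.80, 0.10])}

def sentence_vector(tokens):
    """Average the word vectors of a sentence's tokens."""
    return np.mean([vec[t] for t in tokens], axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

s1, s2 = ["court", "ruled"], ["tribunal", "decided"]
# No shared tokens, so a bag-of-words / n-gram overlap score is zero ...
print(len(set(s1) & set(s2)))                                # → 0
# ... while averaged Word2vec vectors still mark the sentences as close.
print(cosine(sentence_vector(s1), sentence_vector(s2)))
```

This is exactly the failure mode blamed on N-gram below: paraphrased sentences with disjoint vocabulary look maximally dissimilar to a bag-of-words feature.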


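Cluster purity, reported alongside f1 in Table 1, measures how strongly each predicted cluster is dominated by a single gold argument. A minimal sketch under the standard definition (cf. Schütze et al. [23]); the list-of-indices input format is an assumption for illustration:

```python
from collections import Counter

def purity(system_clusters, gold_labels):
    """Purity = (1/N) * sum over clusters of the size of the cluster's
    majority gold label.  system_clusters holds lists of sentence indices;
    gold_labels maps each sentence index to its gold argument id."""
    n = sum(len(c) for c in system_clusters)
    majority = sum(Counter(gold_labels[i] for i in c).most_common(1)[0][1]
                   for c in system_clusters if c)
    return majority / n

clusters = [[0, 1, 2], [3, 4]]                       # system output
gold = {0: "A", 1: "A", 2: "B", 3: "B", 4: "B"}      # gold argument ids
print(purity(clusters, gold))                        # → (2 + 2) / 5 = 0.8
```

A purity of 1.0 means every cluster contains sentences of only one gold argument; values near 1/n indicate near-random assignment.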

The results in Table 1 show that the performance of Sentence Closeness is satisfactory, but still lacking in comparison to Word2vec. The Combined feature also has an impact, as it is a combination of Word2vec, N-gram and Sentence Closeness. The Combined feature offered the highest f1 value for those case-law files for which Word2vec did not offer significant results, with the exception of case-law files 02, 39 and 40. Overall, 66% of the case-law files obtained the highest f1 using the Word2vec or Combined features.
   Furthermore, in the case of N-gram and the Combined feature, we found that recall is elevated up to 1.0 while precision is very low for case-law files that have a large number of sentences. This is because the n-gram feature is inappropriate for such case-law files: if the feature is not sufficiently discriminating to distinguish among sentence categories, FCM assigns equal (or very close to equal) membership probability values for every category, essentially providing no useful information. As a result, during the process of forming hard clusters, such sentences get equally distributed over all clusters.
   Table 1 also presents the cluster purity value of each feature obtained for each case-law file. Word2vec was found to play the leading role in case-law files 03, 13, 16, 30, 31, 32, 40 and 42. Sentence Closeness scored highest in four case-law files (16, 31, 39 and 41), although for case-law files 16 and 31 it tied with Word2vec. Overall, the purity values are satisfactory, except for case-law files 16 and 30 with the Combined and N-gram features: case-law file 16, which has 33 sentences, had the lowest value (0.125), and case-law file 30, which has 25 sentences, obtained 0.275. On the other hand, case-law file 35, which has 13 sentences, scored 0.726 (the highest value) using the Combined feature. From this analysis, we can conclude that a greater number of sentences affects the clustering quality negatively and that Word2vec is the dominant feature for obtaining acceptable f1 and cluster purity values. It is also apparent from the data in Table 1 that f1 and cluster purity are well correlated.
   Overall, the results obtained from the proposed framework – average accuracy of 0.59, macro f-measure of 0.497 and cluster purity of 0.499 – are quite promising, even if they cannot be easily compared with other researchers' results. The most closely related work is that of Mochales and Moens [13, 16], who obtained 60% accuracy in the argumentation structure detection task. It is important to note that they did not report precision and recall measures and that they addressed a much simpler problem, since they assumed sequential argumentative structures.
   Sobhani et al. [24] obtained a very similar f-measure value (0.49), but also on a much less complex task: classification of sentences from a predefined argument list.
   Goudas et al. [8] obtained an accuracy of 42% when segmenting argumentative sentences using Conditional Random Fields (CRF). Lawrence et al. [10] report precision and recall of 33.3% and 50.0%, respectively, for identifying argument structure using automatically segmented propositions.
   Stab and Gurevych [25] also encountered problems dealing with 'support' and 'attack' relations. The main reason was that their approach was unable to identify the correct target of a relation, especially in paragraphs with multiple claims or reasoning chains.

6 CONCLUSION AND FUTURE WORK

We proposed a new clustering technique for grouping argumentative sentences in legal documents. We also proposed and implemented an evaluation procedure for the proposed system and an approach to identify the total number of arguments in a case-law document. Overall, the results that we achieved are satisfactory and quite promising: the macro f1 and average cluster purity scores for system predictions using the Word2vec feature, in case-law files that have 4 to 8 arguments, are 0.497 and 0.499, respectively.
   For future work, we intend to add and evaluate more features, such as 'semantic similarity' ones, aiming to improve these results. Moreover, as an extension of this work we are working on: a) the identification, in each cluster/argument, of the sentences acting either as premises or as conclusions; b) the creation of a graph representation of the argument structure of each document (attack, support, and rebuttal arguments).

ACKNOWLEDGEMENT

The authors would like to express deep gratitude to the EMMA-WEST Project, in the framework of the EU Erasmus Mundus Action 2, and to the Agatha Project, SI IDT number 18022 (Intelligent analysis system of open sources information for surveillance/crime control), ALENTEJO 2020, for their invaluable support. Further, the authors would also like to extend sincere thanks to the reviewers for their constructive comments and suggestions.

REFERENCES
 [1] Moh'd Belal Al-Zoubi, Amjad Hudaib, and Bashar Al-Shboul. 2007. A fast fuzzy clustering algorithm. In Proceedings of the 6th WSEAS Int. Conf. on Artificial Intelligence, Knowledge Engineering and Data Bases, Vol. 3. 28–32.
 [2] Ricardo Baeza-Yates and Berthier Ribeiro-Neto. 1999. Modern Information Retrieval. Vol. 463. ACM Press, New York.
 [3] James C. Bezdek, Robert Ehrlich, and William Full. 1984. FCM: The fuzzy c-means clustering algorithm. Computers & Geosciences 10, 2-3 (1984), 191–203.
 [4] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3, Jan (2003), 993–1022.
 [5] Peter F. Brown, Peter V. Desouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. 1992. Class-based n-gram models of natural language. Computational Linguistics 18, 4 (1992), 467–479.
 [6] Juan Cao, Tian Xia, Jintao Li, Yongdong Zhang, and Sheng Tang. 2009. A density-based method for adaptive LDA model selection. Neurocomputing 72, 7-9 (2009), 1775–1781.
 [7] Jack G. Conrad and Khalid Al-Kofahi. 2017. Scenario Analytics: Analyzing Jury Verdicts to Evaluate Legal Case Outcomes. In Proceedings of the 16th Edition of the International Conference on Artificial Intelligence and Law (ICAIL '17). ACM, New York, NY, USA, 29–37. https://doi.org/10.1145/3086512.3086516
 [8] Theodosis Goudas, Christos Louizos, Georgios Petasis, and Vangelis Karkaletsis. 2014. Argument extraction from news, blogs, and social media. In Hellenic Conference on Artificial Intelligence. Springer, 287–299.
 [9] A. K. Jain, M. N. Murty, and P. J. Flynn. 1999. Data clustering: a review. ACM Computing Surveys (CSUR) 31, 3 (1999), 264–323.
[10] John Lawrence, Chris Reed, Colin Allen, Simon McAlister, and Andrew Ravenscroft. 2014. Mining Arguments From 19th Century Philosophical Texts Using Topic Based Modelling. In Proceedings of the First Workshop on Argumentation Mining. Association for Computational Linguistics, Baltimore, Maryland, 79–87. https://doi.org/10.3115/v1/W14-2111
[11] Marco Lippi and Paolo Torroni. 2016. Argumentation mining: State of the art and emerging trends. ACM Transactions on Internet Technology (TOIT) 16, 2 (2016), 10.
[12] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. 3111–3119.
[13] Raquel Mochales and Marie-Francine Moens. 2011. Argumentation mining. Artificial Intelligence and Law 19, 1 (2011), 1–22.
[14] Raquel Mochales-Palau and Marie-Francine Moens. 2007. Study on sentence relations in the automatic detection of argumentation in legal cases. Frontiers in Artificial Intelligence and Applications 165 (2007), 89.
[15] Nidhi. [n. d.]. Number of Topics for LDA on poems from Elliston Poetry Archive. Available at http://www.rpubs.com/MNidhi/NumberoftopicsLDA (2017-03-31).
[16] Raquel Mochales Palau and Marie-Francine Moens. 2009. Argumentation Mining: The Detection, Classification and Structure of Arguments in Text. In Proceedings of the 12th International Conference on Artificial Intelligence and Law (ICAIL '09). ACM, New York, NY, USA, 98–107. https://doi.org/10.1145/1568234.1568246
[17] Joonsuk Park, Arzoo Katiyar, and Bishan Yang. 2015. Conditional random fields for identifying appropriate types of support for propositions in online user comments. In Proceedings of the 2nd Workshop on Argumentation Mining. 39–44.
[18] Prakash Poudyal, Teresa Gonçalves, and Paulo Quaresma. 2016. Experiments on identification of argumentative sentences. In 10th International Conference on Software, Knowledge, Information Management & Applications (SKIMA). 398–403. https://doi.org/10.1109/SKIMA.2016.7916254
[19] Radim Řehůřek and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. ELRA, Valletta, Malta, 45–50. http://is.muni.cz/publication/884893/en.
[20] M.E.S. Mendes Rodrigues and L. Sacks. 2004. A scalable hierarchical fuzzy clustering algorithm for text mining. In Proceedings of the 5th International Conference on Recent Advances in Soft Computing. 269–274.
[21] Christos Sardianos, Ioannis Manousos Katakis, Georgios Petasis, and Vangelis Karkaletsis. 2015. Argument Extraction from News. In ArgMining@HLT-NAACL. 56–66.
[22] Jaromir Savelka and Kevin D. Ashley. 2016. Extracting case law sentences for argumentation about the meaning of statutory terms. In Proceedings of the Third Workshop on Argument Mining (ArgMining2016). 50–59.
[23] Hinrich Schütze, Christopher D. Manning, and Prabhakar Raghavan. 2008. Introduction to Information Retrieval. Vol. 39. Cambridge University Press.
[24] Parinaz Sobhani, Diana Inkpen, and Stan Matwin. 2015. From argumentation mining to stance classification. In Proceedings of the 2nd Workshop on Argumentation Mining. 67–77.
[25] Christian Stab and Iryna Gurevych. 2014. Annotating Argument Components and Relations in Persuasive Essays. In Proceedings of the 25th International Conference on Computational Linguistics (COLING 2014). Dublin City University and Association for Computational Linguistics, 1501–1510.
[26] Xuanli Lisa Xie and Gerardo Beni. 1991. A Validity Measure for Fuzzy Clustering. IEEE Trans. Pattern Anal. Mach. Intell. 13, 8 (Aug. 1991), 841–847. https://doi.org/10.1109/34.85677

A APPENDIX

A.1 Appropriate Cluster Identification Algorithm (ACIA)

Let A be the system's cluster set and B the gold-standard cluster set, both with cardinality n: A = {a_1, ..., a_n} and B = {b_1, ..., b_n}. We define the matrix F = {f_ij}, with a_i ∈ A and b_j ∈ B, where f_ij is the f-measure value calculated by taking cluster i from A and cluster j from B. We denote by (F)_ij the matrix formed from the matrix F by removing the j-th column and the i-th row.

State 1: Initialization

    F^(0) = (f_ij)_{n×n}
    R^(0)(−1, −1) = ∅

i.e. nodes are connected with cost value C = 0 to form a tree structure.

State 2: Iterate from k = 0 to n. At each step k we have F^(k)(i, j) and R^(k)(i, j). Find all maximum elements of F^(k)(i, j):

    M_k = { (i, j) | f_ij^(k) is the maximum element of F^(k)(i, j) }

i.e. the maximum f-measure value is selected and placed in the tree structure.

State 3: For each element (i, j) ∈ M_k, update the route

    R^(k+1)(i, j) = R^(k)(i, j) ∪ {(i, j)}

and the matrix

    F^(k+1)(i, j) = (F^(k))_ij

Do this for all elements (i, j) of M_k.
Stop when k = n.

i.e. the procedure is repeated for the remaining values.

State 4: For each route, calculate the total cost

    TC(R^(k)(i, j)) = Σ_{(i, j) ∈ R^(k)(i, j)} f_ij

i.e. the total cost of each route is calculated.

State 5: Select the maximum value of TC(R^o(i, j)) and its route

    R^o(i, j) = {(i_1, j_1), ..., (i_n, j_n)}

i.e. the route with the maximum score is selected.

   After identifying the appropriate cluster (argument) with respect to the gold standard, an f-measure is calculated between the i-th cluster of the system, as recommended by the ACIA, and the j-th cluster of the gold standard. After that, the average f-measure value is calculated.
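The route construction above can be approximated with a simple greedy reading: repeatedly take the largest remaining f-measure cell and delete its row and column. This is a sketch of a single route only; ACIA itself keeps every tied maximum as a separate route and then selects the route with the highest total cost. Function and variable names are illustrative:

```python
import numpy as np

def greedy_mapping(F: np.ndarray):
    """Greedily map system clusters (rows) to gold clusters (columns):
    pick the maximum remaining f-measure entry, record it in the route,
    then remove its row and column.  Returns the route and total cost."""
    F = F.astype(float).copy()
    route, total = [], 0.0
    for _ in range(min(F.shape)):
        i, j = np.unravel_index(np.argmax(F), F.shape)  # max remaining cell
        route.append((int(i), int(j)))
        total += F[i, j]
        F[i, :] = -1.0      # remove row i  (cluster i is now mapped)
        F[:, j] = -1.0      # remove column j (gold cluster j is taken)
    return route, total

# Pairwise f-measure matrix between 3 system and 3 gold clusters (made up).
F = np.array([[0.9, 0.4, 0.1],
              [0.5, 0.8, 0.3],
              [0.2, 0.6, 0.7]])
route, cost = greedy_mapping(F)
print(route, round(cost, 2))   # → [(0, 0), (1, 1), (2, 2)] 2.4
```

An exact alternative for the maximum-total-cost assignment is the Hungarian algorithm (e.g. `scipy.optimize.linear_sum_assignment` with `maximize=True`), which avoids enumerating routes at all.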