=Paper=
{{Paper
|id=Vol-2844/ethics1
|storemode=property
|title=The Impact of Using Machine Learning for the Thematic Classification on Legal Documents
|pdfUrl=https://ceur-ws.org/Vol-2844/ethics1.pdf
|volume=Vol-2844
|authors=Aris Kosmopoulos,Stavroula Fikari,George Giannakopoulos
|dblpUrl=https://dblp.org/rec/conf/setn/KosmopoulosFG20
}}
==The Impact of Using Machine Learning for the Thematic Classification on Legal Documents==
<pdf width="1500px">https://ceur-ws.org/Vol-2844/ethics1.pdf</pdf>
<pre>
The Impact of Using Machine Learning for the Thematic Classification on Legal Documents

Aris Kosmopoulos
A.I. Researcher
SciFY PNPC and NCSR Demokritos
Athens, Greece
akosmo@scify.org

Stavroula Fikari
Attorney-at-law
Legal Informatics Consultant
Nomiki Bibliothiki
Athens, Greece
stavroula@nb.org

George Giannakopoulos
A.I. Researcher
SciFY PNPC and NCSR Demokritos
Athens, Greece
ggianna@iit.demokritos.gr

WAIEL2020, September 3, 2020, Athens, Greece
Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons
License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
The adoption of Artificial Intelligence (AI) in various domains is gradually becoming a
fact. Although the legal domain offers several such opportunities, the ethical dilemmas
that arise must be taken into serious consideration. In this work we demonstrate a real
case scenario where the infusion of AI into a preexisting procedure can empower the human
and facilitate the whole process of legal document annotation, as a supporting workflow
related to legal AI. Furthermore, we discuss the ethical aspects of AI adoption, pointing
out that the related ethical impact between different scenarios can vary greatly, offering
the presented use case as an example of an AI application on the borderline of legal AI.

CCS CONCEPTS
• Applied computing → Law; Annotation; • Computing methodologies → Supervised learning
by classification.

KEYWORDS
legal AI, document classification, multi-label classification, annotation

1    INTRODUCTION
An important aspect of Artificial Intelligence (AI) is the development of software that
behaves and works like humans do. On the other hand, AI does not always try to replace
humans. The facilitation of a human task is such a case, where AI can speed up the
completion of an undertaken task and increase productivity.
   AI can be applied in various domains and each domain naturally has certain
characteristics and limitations that must be taken into account. Legal AI can refer to
many different things, which can be grouped in two main categories:

     • Legal issues arising from the use of AI systems similar to those arising from other
       innovative products and solutions and concerning the statutory and regulatory
       framework (data protection, consumers’ rights, IP rights, competition).
     • Employment of AI techniques and methods to produce tools and solutions assisting
       the legal professionals in every-day practice.

   Although Legal AI offers several opportunities for AI applications, several ethical
dilemmas must also be taken into consideration. For example, allowing a computer program
to create human laws, or even to act as a judge, are indeed very sensitive scenarios. But
is this always the case?
   Facilitating the work of a human expert is a much less restrictive scenario in terms of
ethical dilemmas. In this paper we focus on presenting a real-world application of AI in a
legal setting. Nomiki Bibliothiki¹, a major legal content provider, has developed a
website (legal content platform²) providing legal professionals with easy access to a full
range of legal documents (legislation, case-law and other official legal documents, legal
doctrine, templates of legal acts), which can support legal decision-making. A main
concern is how the platform can arrange and classify this content in order to deliver
quick, accurate and valid search results.
   Among the legal documents to be processed and analyzed are the administrative acts
published in Issue B of the Official Government Gazette of Greece. A legal annotator must
assign one or more subject-matter categories and legal terms chosen out of a hierarchical
index (which is part of a thesaurus). The solution offered by AI, designed and implemented
by SciFY PNPC³ (an AI technology transfer and digital transformation not-for-profit
company), was an automated classification process that proposed such categories and legal
terms to the legal annotator. This automated process is highly beneficial, allowing the
annotator to perform the task much faster.
   The contributions of this work are the following:

     • The outline of a real-world use of AI for legal domain tasks.
     • A discussion on the benefits and the presence of ethical risks in this use case,
       but also a widening of the discussion to imply future, related concerns.

   The rest of the document is structured as follows. In Section 2 we present some related
work, while in Section 3 we describe the use case in more detail. In Section 4 we discuss
some ethical aspects of the task and we conclude the paper in Section 5.

¹ https://www.nb.org/
² https://www.qualex.gr
³ https://www.scify.gr/

2    RELATED WORK
One of the essential steps in the analysis of large document collections is the thematic
classification of these documents. As the volume of data increases significantly, manual
analysis requires
effort and time. For that reason, over the last decades one of the main concerns of data
scientists consists in designing processes of document analysis which tackle this
challenge. Thus, they started experimenting with the implementation of automatic methods
of classification. In this section, we refer to the most related applications of
classification, since the literature of text classification in general is immense
(cf. [1, 2, 7–9]).
    In [4] a semi-automatic method, based on keyword classification of documents, assigns
appropriate branches of knowledge to documents of Polish digital libraries by using
clustering algorithms. The experiment was conducted with the assistance of human
annotators. The method was evaluated to be applicable to the thematic classification of
documents in large digital collections.
    An experiment using machine learning (ML) techniques to classify sentences in Dutch
legislation is reported in [5]. The results are compared to those of a pattern-based
classifier and the conclusion was that the pattern-based approach is preferable.
    A domain specific approach regarding the classification of laws is presented in [3].
The system can compute similarities between small snippets of large heterogeneous laws.
Another approach of classification and labeling of European laws is described in [11].
The authors state that the segmentation of each legal document into several parts can
greatly improve the quality of labeling.
    Another important consideration regarding legal document classification is whether
linguistic information can help the classifiers. In [6] the authors evaluate the
usefulness of adding lemmatization and part-of-speech in the classification pipeline and
conclude that the results were in fact improved.
    The limitations and perspectives of AI application in predictive justice were studied
in [10]. The paper focuses on the Federal Court of Canada and examines the use of various
state of the art methods of natural language processing and machine learning algorithms.
Another case of application of AI in the legal domain is that of automatic summarization
of legal texts. Such a goal was that of the SALOMON project presented in [12] that was
applied to Belgian criminal cases.
    In this work, we do not focus on the classification itself, but rather on the use case
of classifying legal documents, as a support tool to efficient and effective legal content
delivery. We also discuss ethical implications, but also the value added through the use
of AI in this setting.

3   USE CASE DESCRIPTION
The so called “information crisis” in the legal domain is a general phenomenon, meaning
that the legal professionals need to access large volumes of legal information in order to
treat a case and solve a legal problem. This crisis is aggravated by the diversity of
legal sources to be consulted and, thus, the challenge mostly consists of locating and
evaluating information delivered by various sources so as to select and cite pertinent
documents.
   The content provided to professionals by Nomiki Bibliothiki to support legal
decision-making must always be indexed and classified in order to be delivered quickly and
accurately. To accomplish this goal, several techniques of multi-level legal analysis are
applied, like indexing and classification. But the rapidly increasing volume and
complexity of data requires effort and time.
   Until 2018, the classification of the administrative acts published in Issue B of the
Greek Official Government Gazette was being performed manually. The legal annotator was
searching and choosing the relevant legal terms in a dedicated software tool (Figure 1),
repeatedly for each term and for each separate legal act in two steps:

     • Assignment of one or more subject-matter categories chosen out of a drop-down list.
     • Assignment of legal terms chosen out of a hierarchical index (which is part of a
       thesaurus).

   An AI solution of multi-label classification was designed and implemented by SciFY
(Science For You). SciFY is a not-for-profit organization that implements digital
transformation initiatives in the fields of Artificial Intelligence, assistive
technologies, entrepreneurship, e-participation and education.
   As in every machine learning training process, the quality and quantity of data greatly
affect the expected performance. This process was greatly facilitated by the
excellent-quality, annotated data provided by Nomiki Bibliothiki. The provided legal
document data were well structured and consistent, qualities ascertained by appropriate
quality assurance processes. Another important factor for the success of the use case was
the quantity of training instances per class, which in most cases was sufficient (in the
order of tens of instances) to train a classification model.
   For each class (categories and legal terms), given that sufficient training instances
existed, we trained a binary classifier (approximately 1700 classifiers were used, one for
each category / term). A bag-of-words approach was used in order to extract features from
the legal documents (around 85 thousand documents were used in total as training
instances). A feature selection process was also applied to remove rare features and speed
up the training and prediction processes, without negatively affecting the performance.
During prediction, each instance (legal act) is evaluated by each classifier. When the
classifier predicts with sufficient confidence that the document should be assigned the
category label, the label is suggested to the human annotator as a plausible option
(cf. Figure 2).
   The performance of the suggestion is impressive: the internal tests on the actual
workflow of the annotators showed a success rate of 98% (perceived estimated accuracy of
the end user) in legal acts of standard and repetitive regulations. As a result, all the
annotator has to do now is accept all or part of the proposed terms in one move, instead
of searching the drop-down lists.
   We should note, though, that the semi-automatic process of classification still remains
a human-supervised method in order to avoid implied annotation risks (i.e. not using
scarce classes which are not proposed by the algorithm). The instruction given to
annotators is to consider the addition of terms that were not proposed but are assessed as
relevant, or even to reject proposed terms that are not pertinent.
   In any case, the time saved is significant, since for the standardized legal acts,
which are the majority (almost 70%), the annotation time was reduced by 50%. Given that,
the legal annotators can now focus on more complex tasks of legal analysis, such as the
consolidation of legal texts and the creation of links between related texts.
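
The classification steps described in Section 3 (one binary classifier per category or
term, bag-of-words features, removal of rare features, and suggestion of labels above a
confidence threshold) can be illustrated with a minimal Python sketch. The paper does not
name the learning algorithm, the tokenizer or the threshold: the multinomial naive Bayes
model, the whitespace tokenizer, the names build_vocabulary, BinaryNaiveBayes, train and
suggest, the parameters min_df, min_examples and threshold, and the toy documents are all
assumptions made for illustration, not the authors' implementation.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def build_vocabulary(docs, min_df=2):
    """Bag-of-words vocabulary; words seen in fewer than min_df documents are dropped."""
    doc_freq = Counter()
    for text in docs:
        doc_freq.update(set(tokenize(text)))
    return {word for word, n in doc_freq.items() if n >= min_df}


class BinaryNaiveBayes:
    """Stand-in binary classifier for one label (assumed; the paper names no algorithm)."""

    def __init__(self, vocab):
        self.vocab = vocab

    def fit(self, docs, is_positive):
        self.counts = {True: Counter(), False: Counter()}
        self.n_docs = {True: 0, False: 0}
        for text, y in zip(docs, is_positive):
            self.n_docs[y] += 1
            self.counts[y].update(w for w in tokenize(text) if w in self.vocab)
        self.totals = {y: sum(c.values()) for y, c in self.counts.items()}
        return self

    def predict_proba(self, text):
        """Estimated probability that the label applies to the text."""
        total_docs = self.n_docs[True] + self.n_docs[False]
        log_score = {}
        for y in (True, False):
            # Laplace-smoothed class prior and word likelihoods.
            score = math.log((self.n_docs[y] + 1) / (total_docs + 2))
            denom = self.totals[y] + len(self.vocab)
            for w in tokenize(text):
                if w in self.vocab:
                    score += math.log((self.counts[y][w] + 1) / denom)
            log_score[y] = score
        top = max(log_score.values())
        weight = {y: math.exp(s - top) for y, s in log_score.items()}
        return weight[True] / (weight[True] + weight[False])


def train(docs, labels, min_examples=2, min_df=2):
    """Train one binary classifier per label that has enough training instances."""
    vocab = build_vocabulary(docs, min_df)
    label_counts = Counter(label for doc_labels in labels for label in doc_labels)
    models = {}
    for label, n in label_counts.items():
        if n >= min_examples:
            targets = [label in doc_labels for doc_labels in labels]
            models[label] = BinaryNaiveBayes(vocab).fit(docs, targets)
    return models


def suggest(models, text, threshold=0.8):
    """Labels to propose to the annotator: those predicted with enough confidence."""
    scored = ((label, model.predict_proba(text)) for label, model in models.items())
    return sorted(label for label, p in scored if p >= threshold)


# Toy training data (hypothetical, not taken from the Government Gazette).
docs = [
    "approval of municipal budget for public works",
    "municipal budget amendment for road works",
    "appointment of hospital director",
    "appointment of hospital board members",
]
labels = [{"budget"}, {"budget"}, {"health"}, {"health"}]
models = train(docs, labels)
print(suggest(models, "amendment of the municipal budget", threshold=0.6))  # -> ['budget']
```

The suggestions are only proposals: as in the workflow above, the annotator remains free
to accept them, reject them, or add labels that no classifier proposed.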
Figure 1: Tool used by a legal annotator in order to assign subject-matter categories and
legal terms to a legal document.

Figure 2: Legal annotator is facilitated by categories and terms predictions provided by
AI.

   Based on the above, the gain from the integration of AI components in the workflows of
Nomiki Bibliothiki is clear. In the next section, we discuss ethical aspects of the system
under the prism of legal AI ethical risks.

4    DISCUSSION OF THE ETHICAL ASPECTS
In the legal setting, there exist a number of subtle dangers in using AI, most notably:

     • algorithmic bias, which describes the preference that an algorithm may contain
       towards a specific decision. This bias can be caused by inherent idiosyncrasies of
       an algorithm. This bias can be problematic in cases where the output of an
       algorithm implies or explicitly leads to a specific judicial outcome, e.g. a
       verdict.
     • data bias, which describes the implicit bias added to a machine learning algorithm
       through the selection of training data. There exist several cases of such bias,
       again leading to unjust outcomes for a given legal setting.
     • explainability, which describes the danger of not being able to explain a decision
       of a machine learning system, while the decision significantly impacts a human
       subject. The usual reason for this risk is related to the mathematical modeling of
       a problem in an AI system, which cannot provide a humanly-understandable answer to
       the question of why a decision was taken. The "explanation" is essentially a
       complex mathematical function, which may be impossible to interpret in meaningful
       terms.
     • default decisions, which refers to the danger of taking judicial decisions without
       offering the possibility of rebuttal to the impacted subject.
     • agency and accountability of a decision, which refers to the challenge of assigning
       accountability to a person for a given decision, in the case when the decision was
       made by an AI system.

   All the above challenges arise in cases where the legal process is directly affected by
an AI supporting system. In this paper, however, we claim that there exist borderline
applications of AI in the legal setting, where the above risks are mitigated. Essentially,
these borderline applications refer to functions of AI in the information
gathering process, where there is always a human in the loop, and there exist at least two
levels of validation for the AI outcomes.
   In our use case, the AI system works to help the annotation of content related to legal
settings. In other words, the AI is meant to help humans in increasing the indexability
and retrievability of documents related to a legal setting. The AI decision is, thus, a
suggestion to be validated by a human (the annotator) in the related quality assurance
(QA) process. The results of this process allow legal professionals (the end users) to
retrieve information related to their work, e.g. laws and decisions referring to similar
cases. At this level, again a human is to finally decide what is related and what is not.
Thus, the AI decisions are validated twice.
   A hidden risk in this process is the fact that, once the end users increase their
confidence towards the system, they may rely more and more on the documents that the
system retrieves. We consider the worst case scenario, where a critically related document
was mistakenly classified by the system and, thus, is not retrieved as relevant to the end
user query. It is possible that the outcome of the legal process is, thus, affected by the
lack of this documentation.
   Such a risk can be mitigated by two simple actions. The first relates to the validation
of suggested classification tags by more than one human, minimizing the risk of erroneous
tags. The second relates to the training of the end users, so that they utilize a minimum
number of different queries to retrieve documents related to their case.
   Cross-referencing the above discussion with the main identified risks of legal AI, we
can see that:

    • algorithmic and data bias is reduced through the quality assurance processes.
      Furthermore, even if there is bias, it does not directly affect the judicial
      processes, even though it may alter the flow of information towards the interested
      parties. In any case, the final decision still relies on humans.
    • explainability may not be of real value in this setting, since the classification
      decision has limited impact and is easy to change, if the human annotator has a
      different opinion.
    • the use of AI in our setting is not a part of the judicial processes themselves, but
      a supporting workflow for the gathering of related information.
    • the agency and accountability of any decisions remains tied to the end user, who has
      always been responsible for the search and verification of gathered information.

Based on the above analysis, we suggest that such ethical/impact checklists could be
useful to identify whether a given use case is a support process, as above, what the
related risks are and how these risks can be mitigated.
   In the following paragraphs, we go beyond the current use case we described,
highlighting possible future directions of legal technology in Greece.
   One possible future direction is that of Legal Research Solutions. Legal content
providers use AI techniques to optimize legal research and deliver accurate results. The
main features of such solutions are:

    • The support of natural language search.
    • The recognition of legal terminology.
    • The analysis of legal documents through powerful citators, which allow the history
      tracking of a legal text and its treatment by official factors.
    • The automatic summarization of documents.
    • The production of litigation analytics.

   Another direction is that of Predictive Analytics Solutions. AI tools utilize case law,
public records, dockets, and jury verdicts to identify patterns in past and current data
and then analyze the facts of a lawyer’s case to provide an intelligent prediction of the
outcome. Those tools can be extremely useful to legal practitioners and they are widely
used in the USA and Canada. In Europe, on the contrary, there is a reticence due to
ethical issues.⁴
   Predictive Analytics methods can be applied to develop more advanced tools for legal
risk assessment and legal risk management.
   Legal risk can be defined in general as the risk of loss incurred to an organization or
an individual due to factors related to legal issues. The various aspects of legal risk
can be classified into the following broad categories:

    • Litigation risk: potential legal disputes arising from business activities.
    • Contractual risk: failure to fulfill contractual obligations by a contractual party
      resulting in liabilities and damages.
    • Regulatory risk: modifications in legislation imposing new compliance practices and
      costs.
    • Compliance risk: failure to comply with laws and regulations resulting in sanctions
      and penalties.

   The legal uncertainty in the aspects mentioned above can affect a business or a market
significantly and cause serious financial or other losses. The solutions and products
offered use AI techniques which take into consideration and analyze legal data relevant to
the circumstances of the person or entity concerned and assist them in developing an
effective risk management strategy.

⁴ See the case of France: statutory prohibition of court decisions analysis based on the
judge profile.

5     CONCLUSION
In this work we described a real-world application of AI in a legal setting. We showed how
the infusion of AI into a pre-existing legal content generation process empowers the
human, and outlined the requirements for such an application. We highlighted aspects of
this empowerment in the use case and showed how a human-in-the-loop AI system can provide
multiplicative effects to everyday work. We also described ethical aspects and challenges
of the setting, but also of future prospects.

REFERENCES
 [1] Charu C Aggarwal and ChengXiang Zhai. 2012. A survey of text classification
     algorithms. In Mining text data. Springer, 163–222.
 [2] Berna Altınel and Murat Can Ganiz. 2018. Semantic text classification: A survey of
     past and recent advances. Information Processing & Management 54, 6 (2018),
     1129–1153.
 [3] Guido Boella, Luigi Di Caro, and Llio Humphreys. 2011. Using classification to
     support legal knowledge engineers in the eunomos legal document management system. In
     Fifth international workshop on Juris-informatics (JURISIN).
 [4] Łukasz Borchmann, Filip Gralinski, Rafał Jaworski, and Piotr Wierzchon. 2015. A
     semi-automatic method for thematic classification of documents in a large text
     corpus. Corpus-Based Research in the Humanities (CRH) (2015), 13.
 [5] Emile de Maat, Kai Krabben, Radboud Winkels, et al. 2010. Machine Learning versus
     Knowledge Based Classification of Legal Texts. In JURIX. 87–96.
 [6] Teresa Gonçalves and Paulo Quaresma. 2005. Is linguistic information relevant for
     the classification of legal texts? In Proceedings of the 10th international
     conference on Artificial intelligence and law. 168–176.
 [7] Vandana Korde and C Namrata Mahender. 2012. Text classification and classifiers: A
     survey. International Journal of Artificial Intelligence & Applications 3, 2 (2012),
     85.
 [8] Kamran Kowsari, Kiana Jafari Meimandi, Mojtaba Heidarysafa, Sanjana Mendu, Laura
     Barnes, and Donald Brown. 2019. Text classification algorithms: A survey. Information
     10, 4 (2019), 150.
 [9] Nikiforos Pittaras, George Giannakopoulos, George Papadakis, and Vangelis
     Karkaletsis. 2020. Text classification with semantically enriched word embeddings.
     Natural Language Engineering (2020), 1–35.
[10] Marc Queudot and Marie-Jean Meurs. 2018. Artificial intelligence and predictive
     justice: Limitations and perspectives. In International Conference on Industrial,
     Engineering and Other Applications of Applied Intelligent Systems. Springer, 889–897.
[11] Erich Schweighofer, Andreas Rauber, and Michael Dittenbach. 2001. Automatic text
     representation, classification and labeling in European law. In Proceedings of the
     8th international conference on Artificial intelligence and law. 78–87.
[12] Caroline Uyttendaele, Marie-Francine Moens, and Jos Dumortier. 1998. Salomon:
     automatic abstracting of legal cases for effective access to court decisions.
     Artificial Intelligence and Law 6, 1 (1998), 59–79.

</pre>