BO-ECLI Parser Engine: the Extensible European Solution for
             the Automatic Extraction of Legal Links
              Tommaso Agnoloni                                             Lorenzo Bacci                              Marc van Opijnen
    Institute of Legal Information Theory                   Institute of Legal Information Theory           Publications Office of the Netherlands
                and Techniques                                          and Techniques                                   UBR|KOOP
                  ITTIG-CNR                                               ITTIG-CNR                             The Hague, the Netherlands
                  Firenze, Italy                                          Firenze, Italy                      marc.opijnen@koop.overheid.nl
        tommaso.agnoloni@ittig.cnr.it                              lorenzo.bacci@ittig.cnr.it

ABSTRACT                                                                              relations of the current document with other legal (legislative or
This paper presents the BO-ECLI Parser Engine, an open source Java                    judicial) documents, formally expressed using uniform identifiers
framework for the automatic extraction of case-law and legislation                    (the aforementioned ECLI for case-law, ELI for legislation, national
references from case-law texts issued in the European context.                        identifiers, CELEX identifiers for European legal sources).
   Differences of languages and jurisdictions are tackled with an                        These relational metadata are at the same time among the most
extensible design that guides and facilitates the development of                      useful case-law metadata, in that they allow the enhancement of
pluggable national extensions, resulting in a considerably reduced                    legal information retrieval with relational search, and among the
effort with respect to the development of a full national legal link                  most difficult to populate, especially for legacy data and for less
extractor from scratch.                                                               resourced languages and jurisdictions.
   Thanks to a well-defined pipeline of services that synthesize the                     In the legal domain, citations are an integral part of a text and
whole extraction process and to an internal annotation system that                    their instrumental use for a variety of purposes (substantive, pro-
is used to convey the information along the pipeline, the software                    cedural, argumentative) is a familiar tool for legal professionals
ensures both overall efficiency and flexibility in carrying out language              performing their daily duties. Search by relationship [1] is
and jurisdiction dependent tasks.                                                     therefore popular among users as it conforms with the typical attitude
   Services can be provided either by the common part of the soft-                    of legal professionals confronted with the reconstruction of the
ware or by a national extension. For the implementation of services                   sources relevant for a legal issue at hand. Nonetheless it is poorly
performing rule-based textual analysis (like entity identification),                  supported by typical search engines relying on full text indexing
JFlex is used in the common part and recommended in the national                      and necessarily requires an explicit reference tagging to be dealt
extensions. Finally, through identifier generation services, the                      with by machines.
BO-ECLI Parser Engine can produce standard identifiers, like ECLI or                     Manual reference tagging is an extremely costly procedure, not
CELEX, for each recognized legal reference.                                           viable in the public domain and especially unable to cope with the
   Starting from a Template project, two different national exten-                    growing amount of data published in national case law databases.
sions have been successfully developed and tested in order to sup-                    Automatic legal reference extraction, on the other hand, has been
port the extraction of legal links from case-law texts written in the                 successfully applied in several national contexts [2], [3], [4], despite
Italian and Spanish languages.                                                        the complexity of coping with a diversity of styles, variants and
                                                                                      exceptions to existing drafting rules and citation guidelines.
KEYWORDS                                                                                 Due to the national specificities, national citation practices, and
                                                                                      language dependency of the task, the problem scales in complexity
Legal citations, Reference parsing, Multilinguality
                                                                                      when approaching it from a multilingual and multi-jurisdictional
                                                                                      perspective. A previous effort in this direction within the EU-funded
1    INTRODUCTION                                                                     EUCases project [5] was limited to the extraction of references from
Among the goals of the European Case Law Identifier (ECLI) es-                        national case law to EU legal sources.
tablished in 20101 is the publication of national case-law by courts                     Starting from an analysis of approaches and existing solutions to
of European member States via the ECLI Search Engine on the                           the “Linking data” problem [6] and based on the results of a survey
European e-Justice Portal. Besides being uniformly identified, deci-                  on citation practices within EU and national Member States’ courts
sions should be equipped with a minimal set of structured metadata                    [7], the BO-ECLI Parser Engine presented in this work and devel-
describing their main features. Among the (optional) metadata pre-                    oped within the EU funded project “Building on ECLI”2 , tackles
scribed by the ECLI Metadata Scheme, references metadata describe                     the problem from an EU-wide multilingual and multi-jurisdictional
                                                                                      perspective. The aim of the proposed framework is to lower the en-
1 Council conclusions inviting the introduction of the European Case Law Identifier
                                                                                      try barrier for national data providers willing to develop their own
(ECLI) and a minimum set of uniform metadata for case law (CELEX:52011XG0429(01)).
                                                                                      legal reference extraction solution by providing a proven method-
In: Proceedings of the Second Workshop on Automated Semantic Analysis of Informa-
                                                                                      ology, shared common knowledge and reusable and extendable
tion in Legal Text (ASAIL 2017), June 16, 2017, London, UK.                           components.
Copyright © 2017 held by the authors. Copying permitted for private and academic
purposes.
                                                                                      2 http://bo-ecli.eu
Published at http://ceur-ws.org
ASAIL 2017, June 16, 2017, London, UK                                              Tommaso Agnoloni, Lorenzo Bacci, and Marc van Opijnen


                                                                            3   A PIPELINE OF SERVICES
                                                                            One way to synthesize a generic process of legal link extraction
                                                                            from texts is, first, to divide it into three consecutive phases:
                                                                                (1) the entity identification phase, where the fragments of text
                                                                                    that can potentially represent a feature of a citation are
                                                                                    identified and normalized;
                                                                                (2) the reference recognition phase, where patterns of identi-
                                                                                    fied features are read in order to decide whether they form
                                                                                    a legal reference or not;
                                                                                (3) the identifier generation phase, where the recognized legal
                                                                                    references are analyzed so that standard identifiers, and
                                                                                    possibly URLs, can be assigned to them.
Figure 1: The overall architecture of the BO-ECLI Parser En-                   Secondly, within every single phase, a number of different ser-
gine: a common framework with pluggable extensions for                      vices can be placed, each specialized in performing one task. For
supporting the parsing process of a text written in a specific              example, within the entity identification phase, there could be a
language or issued within a specific jurisdiction in order to               service specialized in the identification of case numbers.
get a collection of legal references and the original text with                By modelling the process of legal link extraction with a sequence
inline annotations.                                                         of services that belong to these three distinct phases it is possible
                                                                            to concretely achieve a separation in design between a common
                                                                            part and an extension part. Specifically, the common part (i.e. the
2    PARSER ENGINE                                                          framework) defines the classes, the interfaces and the methods that
                                                                            guide the implementation of any specific task and provides the
The BO-ECLI Parser Engine is an extensible framework for the
                                                                            default implementations for services common to every jurisdiction,
extraction of legal links from case-law texts. It is written in Java and
                                                                            like the generation of a CELEX identifier for legal references to
distributed as open source software3 . It targets citations to both case-
                                                                            European legislation. On the other hand, the extension part must
law and legislation, expressed as lists of textual features (authority,
                                                                            provide the implementations for services like the identification of
type of document, document number, date, etc.) or as common
                                                                            national issuing authorities, a strictly language dependent task.
names (i.e. aliases). Multiple citations, intended either as citations to
                                                                               In the Java domain the described separation between the com-
more than one partition of a single document or as citations to more
                                                                            mon framework and the national extensions is realized through the
than one document issued by a single authority, are also covered
                                                                            Service Provider Interface paradigm, which is part of standard Java
and distinct legal references are generated in correspondence to
                                                                            and is straightforward to adopt. Following SPI, the integration
each partition and each document. A distinguishing characteristic
                                                                            among the framework and all the different national extensions is as
of the software consists in the capability to be extended in order
                                                                            simple as publishing them as standard Java jar libraries and making
to support the extraction process from texts written in different
                                                                            them visible in the classpath.
languages or issued within different jurisdictions.
In order to realize such design, two practical steps are required:
       • dividing the process of legal link extraction into a generic
         and customizable sequence of atomic services, following a
         pipeline pattern;
       • defining an annotation system able to convey the work
         done by each service along the pipeline.
   Distributed as Java Libraries with a standard Java API, the soft-
ware can either be integrated within an existing environment for an
automatic batch parsing over a large corpus or be wrapped into an
even more interoperable HTTP API and queried by a remote user-
interface. Especially for this use case, the overall efficiency of the
software guarantees a quick response (in terms of user experience)
even with large case-law texts as input.
   While the input of the software is as simple as text and additional
metadata, the result of the parsing process consists in a collection of
legal reference objects, possibly accompanied with legal identifiers,       Figure 2: A schematic representation of the execution of a
and in the original text with added inline annotations, possibly with       sequence of atomic tasks through a pipeline of service im-
hyperlinks, in correspondence with the recognized citations.                plementations belonging to the three different phases of the
                                                                            parsing process.
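
The phased pipeline and the Service Provider Interface integration described in Section 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the engine's actual API: the names Phase, ParserService and Pipeline are hypothetical, and only java.util.ServiceLoader is standard Java. With SPI, implementations would be discovered from any jar on the classpath that lists them under META-INF/services.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.ServiceLoader;

// Hypothetical names throughout; only ServiceLoader is part of standard Java.
enum Phase { ENTITY_IDENTIFICATION, REFERENCE_RECOGNITION, IDENTIFIER_GENERATION }

/** One atomic task: receives (possibly annotated) text, returns annotated text. */
interface ParserService {
    Phase phase();
    String run(String annotatedText);
}

final class Pipeline {
    private final List<ParserService> services = new ArrayList<>();

    /** Collects the implementations published by jars visible on the classpath. */
    static Pipeline fromClasspath() {
        Pipeline pipeline = new Pipeline();
        for (ParserService service : ServiceLoader.load(ParserService.class)) {
            pipeline.add(service);
        }
        return pipeline;
    }

    Pipeline add(ParserService service) {
        services.add(service);
        return this;
    }

    /** Runs the services phase by phase; the sort is stable, so the
     *  registration order is preserved within each phase. */
    String execute(String text) {
        List<ParserService> ordered = new ArrayList<>(services);
        ordered.sort(Comparator.comparing(ParserService::phase));
        String current = text;
        for (ParserService service : ordered) {
            current = service.run(current);
        }
        return current;
    }
}
```

In this sketch a national extension would only need to ship classes implementing ParserService; the common part never references them by name.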

3 http://gitlab.com/BO-ECLI/Engine


Figure 3: Empty interfaces representing annotation categories like authorities, aliases and types of document are defined in
the framework and are implemented by Java Enumerations provided by the national extensions or by the framework itself,
thus realizing extensible lists of normalized values for annotations that share a common Java type.
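
The pattern described in the caption of Figure 3 can be sketched in a few lines of Java. The type names (CaseLawAuthority, EuropeanAuthority, ItalianAuthority, AuthorityRegistry) and the constant EU_ECHR are illustrative assumptions; IT_COST and IT_TARLT are the values that appear in the annotation examples of this paper.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Empty marker interface for one annotation category (hypothetical name). */
interface CaseLawAuthority { }

/** Values a common framework could supply (the constant name is an assumption). */
enum EuropeanAuthority implements CaseLawAuthority { EU_ECHR }

/** Values contributed by a national extension. */
enum ItalianAuthority implements CaseLawAuthority { IT_COST, IT_TARLT }

/** Hypothetical registry showing that all values share one Java type. */
final class AuthorityRegistry {
    private final List<CaseLawAuthority> values = new ArrayList<>();

    /** Registers every constant of an enum implementing the category interface. */
    <E extends Enum<E> & CaseLawAuthority> void register(Class<E> enumType) {
        values.addAll(Arrays.asList(enumType.getEnumConstants()));
    }

    /** Looks up a normalized value by name across all registered enums. */
    CaseLawAuthority lookup(String name) {
        for (CaseLawAuthority value : values) {
            // Only enum constants are ever registered, so the cast is safe here.
            if (((Enum<?>) value).name().equals(name)) {
                return value;
            }
        }
        return null;
    }
}
```

Because every enum implements the same empty interface, code in the common part can handle national values without knowing which extension defined them.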


4     ANNOTATION SYSTEM                                                    the output the syntactically correct annotation for such fragment
The BO-ECLI Parser Engine framework defines an internal annota-            of text. By exploiting the auxiliary methods, the implementor is
tion system to allow every service implementation, especially the          completely guided in producing a syntactically correct annotation
ones belonging to entity identification and reference recognition, to      with the allowed values for each annotation category.
save the specific results of their execution directly in the text. Thus,      Reference recognition service implementations can also benefit
every service can be seen as a module that receives an annotated           from specific auxiliary methods that are responsible for converting
text as input and produces an annotated text as output, enabling           the annotations of several entities forming a valid pattern into a
the complete customizability of the pipeline: depending on the             unique legal reference annotation.
language, jurisdiction or other specific metadata of the input text,
modules can be replaced, enabled, disabled or arranged differently.        5     SERVICE IMPLEMENTATION
Within the BO-ECLI Parser Engine, annotations are used to assign           The implementation of an annotation service belonging to either
a category (hence, a meaning) to a fragment of text, while, through        the entity identification or the reference recognition phase simply
normalization, annotated fragments of text can acquire a language          consists in a piece of software that analyzes an input text, possibly
independent value. For example, the Italian fragment of text “sent.
                                                                           already enriched with annotations, and produces an equivalent
della Corte Costituzionale”, meaning a judgment issued by the Ital-
ian Constitutional Court, at a certain point along the pipeline, is        output text, possibly with altered annotations. Since the framework
annotated as follows:                                                      provides the methods for accessing the input text and for producing
                                                                           the output text, it is up to the national implementor to decide, for
[BOECLI:CASELAW_TYPE:JUDGMENT]sent.[/BOECLI] della
                                                                           each service implementation, the textual tools to be used in order
[BOECLI:CASELAW_AUTHORITY:IT_COST]Corte Costituzionale[/BOECLI]
                                                                           to perform the matching. For example, those operations could be
The controlled lists for the annotation categories and annotation          realized with the methods of the Java String class, a set of regular
values can be provided either by the common framework or by                expressions and the Java regex package, proper lexical scanners, i.e.
the national extensions through Java Enumerations. Thanks to the           automata.
annotation system, the work of each service is conveyed and shared            The last approach is not only the most powerful and efficient, but
along the pipeline in a language independent way.                          also the most fitting for the implementation of an annotation service.
                                                                           The default implementations of the annotation services provided
4.1    Auxiliary methods                                                   by the framework make use of JFlex4 , a well-known lexical scanner
One of the peculiarities of the framework is to provide the imple-         generator for Java. Since the JFlex syntax allows for the insertion
mentor with a number of auxiliary methods that are used to produce         of Java code, a jflex file can directly make use of the auxiliary
all the annotations in a transparent way.                                  methods and Java Enumerations supplied by the framework and
    For each annotation category during the entity identification          by the national libraries.
phase, by passing a fragment of the input text and optionally a
normalized value as argument, an auxiliary method appends to               4 http://jflex.de
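
The role of the auxiliary methods of Section 4.1 can be illustrated with a small sketch that emits the bracket syntax shown in Section 4. The class and method names are hypothetical, and the behaviour when no normalized value is given (the value part is simply omitted) is an assumption, since the paper does not show that case.

```java
/** Hypothetical helper producing annotations in the syntax shown in Section 4. */
final class AnnotationWriter {
    private final StringBuilder output = new StringBuilder();

    /** Appends unannotated text unchanged. */
    AnnotationWriter text(String fragment) {
        output.append(fragment);
        return this;
    }

    /** Appends a fragment annotated with a category and an optional normalized value. */
    AnnotationWriter annotate(String category, String value, String fragment) {
        output.append("[BOECLI:").append(category);
        if (value != null) {
            // Assumption: the value part is dropped when no value is supplied.
            output.append(':').append(value);
        }
        output.append(']').append(fragment).append("[/BOECLI]");
        return this;
    }

    String result() {
        return output.toString();
    }
}
```

Chaining annotate("CASELAW_TYPE", "JUDGMENT", "sent."), text(" della ") and annotate("CASELAW_AUTHORITY", "IT_COST", "Corte Costituzionale") reproduces the annotated Italian fragment given as an example in Section 4.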


5.1    Default service implementations                                     to the European primary legislation, like the Treaty on European
A number of implementations for services that belong to each phase         Union or the Treaty on the Functioning of the European Union, so
of the legal link extraction process are provided by the framework         that a national implementor, by reusing such normalized values,
by default. Typically, a default implementation is supplied when           is guided in the implementation of an identification service that
the task that the service is in charge of can be considered language       covers those references expressed in his specific language. More-
independent, pertains to the European jurisdiction or is common            over, within the framework, a default service implementation for
in the European context.                                                   identifier generation automatically assigns, for each registered alias
                                                                           value pertaining to the European legislation, the correct CELEX
   5.1.1 ECLI identification. An ECLI code can be used within a            identifier.
case-law text as a feature of a more complete citation or as a citation
by itself. Since the ECLI code has a standardized syntax that doesn’t
                                                                               5.1.8 HTML rendering. At the end of the pipeline of services,
depend on the language used in the rest of the case-law text, a
                                                                           all temporary annotations are discarded and the original input text
default service for ECLI identification is implemented and supplied
                                                                           is only annotated with the “legal reference” annotation category in
by the framework.
                                                                           correspondence of the recognized references. A national implemen-
   5.1.2 Partitions identification. A partition is a hierarchical branch   tor can develop a rendering service in order to convert the internal
of partition elements like articles, paragraphs, letters, etc. While       annotation of legal references to a specific format of his choice. By
the identification of each partition element is a language dependent       default, the framework provides an HTML style rendering service
task, the framework provides a service that converts sequences of          that transforms the “legal reference” annotation category into the
partition element annotations into unique partition annotations,           <a> tag and uses the optional URL assigned during the identifier
correctly composing the branches, especially in case of multiple           generation phase as the value of the href attribute.
citations.
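
The HTML rendering step of Section 5.1.8 might look like the sketch below. The paper does not show the syntax of the final "legal reference" annotation, so the form [BOECLI:LEGAL_REFERENCE:url]text[/BOECLI] used here is purely an assumption, as are the class and method names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical renderer turning legal reference annotations into <a> tags. */
final class HtmlRenderer {
    // Assumed annotation form: [BOECLI:LEGAL_REFERENCE:<optional url>]text[/BOECLI]
    private static final Pattern REFERENCE = Pattern.compile(
            "\\[BOECLI:LEGAL_REFERENCE(?::([^\\]]*))?\\](.*?)\\[/BOECLI\\]",
            Pattern.DOTALL);

    /** Replaces each annotation with an anchor; href is set only when a URL exists.
     *  HTML escaping of the surrounding text is deliberately left out of this sketch. */
    static String render(String annotated) {
        Matcher matcher = REFERENCE.matcher(annotated);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String url = matcher.group(1);
            String text = matcher.group(2);
            String html = (url == null || url.isEmpty())
                    ? "<a>" + text + "</a>"
                    : "<a href=\"" + url.replace("\"", "%22") + "\">" + text + "</a>";
            matcher.appendReplacement(out, Matcher.quoteReplacement(html));
        }
        matcher.appendTail(out);
        return out.toString();
    }
}
```

A renderer for another output format would follow the same shape and change only the string built for each match.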
   5.1.3 Parties identification. The identification of the names of        6     NATIONAL EXTENSIONS
the parties in a citation should be generally considered as a language     National extensions are used by the framework to allow the ex-
dependent task. Nonetheless, the framework provides a default              traction process from texts written in specific languages or issued
service implementation for the identification of applicants and            within specific jurisdictions. In order to develop a new extension,
defendants relying on heuristics based on positioning, upper and           the implementor has to:
lower casing, the versus entity and the geographic identification
of a country member of the Council of Europe (as a defendant in                  • extend the controlled lists of normalized values for the
European Court for Human Rights citations).                                        annotations with the values pertaining to the new jurisdic-
                                                                                   tion;
   5.1.4 Reference recognition. After the entity identification phase,
                                                                                 • provide a certain number of service implementations for
the textual features that can potentially be part of a legal reference
                                                                                   entity identification, reference recognition and identifier
are annotated and normalized, hence they can be treated as lan-
                                                                                   generation;
guage independent entities. Although citation practices change
                                                                                 • export the project as a Java jar library for compatibility
from one jurisdiction to another, the framework provides a number
                                                                                   with the Service Provider Interface.
of default service implementations for reference recognition that
are able to cover the most typical citation patterns and, also, to
support multiple citations.                                                6.1    Template
   5.1.5 ECLI generation for European Courts. In those cases where         Along with the framework project, a Template project has been
a standard identifier can be simply generated as a composition of          developed in order to facilitate and encourage the adoption of the
the features extracted from the textual citation, the framework            software for the extraction of legal links in new languages and
provides a default service implementation to automatically assign          jurisdictions. The Template project provides a national implementor
an identifier to a legal reference. This is the case for the generation    with:
of ECLI for legal references that have the European Court of Human
Rights as the issuing authority, when the type of document, the                  • a complete Java project with organized packaging;
case number and the date are known.                                              • a configuration file for setting general parameters like the
                                                                                   author, language and jurisdiction of the extension;
   5.1.6 CELEX generation for European legislation. Another ser-                 • plain files of reusable macros of regular expressions to facil-
vice implementation supplied by the framework for the automatic                    itate the parsing of the annotations and to set up common
composition of a standard identifier is used for legislation references            language dependent expressions;
to European directives and regulations. For these types of docu-                 • several Java Enumerations extending the controlled lists
ment, when the referred document number and year are known, a                      of normalized values for annotations with customizable
CELEX identifier as well as its ELI identifier can be assigned to the              constants;
legal reference.                                                                 • a full pipeline of services pertaining to the different phases
  5.1.7 CELEX generation for European aliases. The framework                       of the legal links extraction process with dummy imple-
provides a controlled list of values of the main aliases pertaining                mentations in jflex.
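
The composition of a CELEX identifier described in Section 5.1.6 can be illustrated as follows. For directives and regulations a CELEX number consists of the sector digit 3, the four-digit year, a type letter (L for directives, R for regulations) and the document number padded to four digits. The class and method names are hypothetical, and the sketch deliberately ignores other document types and the ELI identifier.

```java
/** Simplified CELEX composition for EU directives and regulations (hypothetical helper). */
final class CelexComposer {
    enum DocType {
        DIRECTIVE('L'), REGULATION('R');

        final char code;

        DocType(char code) {
            this.code = code;
        }
    }

    /** Composes sector 3 + year + type letter + zero-padded document number. */
    static String compose(DocType type, int year, int number) {
        if (year < 1000 || year > 9999 || number < 1 || number > 9999) {
            throw new IllegalArgumentException("year or number out of range");
        }
        return String.format("3%d%c%04d", year, type.code, number);
    }
}
```

For instance, Directive 95/46/EC (year 1995, number 46) yields 31995L0046, and Regulation (EU) 2016/679 yields 32016R0679.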


6.2    The Italian and Spanish extensions                               7     CONCLUSIONS
So far, following the Template project, two national extensions have    We presented the BO-ECLI Parser Engine, an open source frame-
been successfully developed for allowing the extraction of case-law     work for the automatic extraction of case-law and legislation refer-
and legislation references from case-law texts in the Italian and       ences from case-law texts issued in the European context.
Spanish languages. This section contains some brief considerations         Through a detailed description of its architecture and design,
concerning the development of such extensions.                          the paper showed how national extensions can be developed and
                                                                        plugged within the framework in order to add support for the
    6.2.1 Incremental annotations. The design of the BO-ECLI Parser     extraction process to different languages and jurisdictions.
Engine, composed of a sequence of entity identification services        Specifically, thanks to a decomposition of the whole process into atomic
in charge of atomic tasks and accompanied with an incremental           tasks and to an internal annotation system, the framework is able
annotation system, makes the identification or disambiguation of        to provide a number of common services and resources that can be
complex entities possible, while keeping the code separated, read-      reused and extended by the national implementors.
able and upgradable. For example, it has been possible to correctly        By defining and providing a complete stack for legal links ex-
annotate the Italian Administrative Regional Tribunals in all their     traction, the implementation of a national extension is guided and
textual variants as well as the Spanish Provincial Court of Appeals,    straightforward, and the effort needed for the development of a
through the execution of distinct services responsible for the iden-    fully functional national extractor is considerably reduced.
tification of geographic entities, followed by the identification of       Along with the common framework, a Template project was also
sections and then the identification of local courts:                   created in order to encourage and facilitate the development of new
1) T.A.R. Sezione distaccata di Latina                                  national extensions. Moreover, the paper presents two concrete
2) T.A.R. Sezione distaccata di [BOECLI:GEO:IT_LT]                      national extensions that have already been developed by different
   Latina[/BOECLI]                                                      teams, proving both the feasibility and the straightforwardness of
3) T.A.R. [BOECLI:SECTION:IT_LT]Sezione distaccata                      the whole approach.
   di Latina[/BOECLI]                                                      The BO-ECLI Parser Engine, as well as the Template and the
4) [BOECLI:CASELAW_AUTHORITY:IT_TARLT]T.A.R.                            Italian and Spanish extensions, are open source projects. Their code
   Sezione distaccata di Latina[/BOECLI]                                and documentation are currently hosted on the GitLab software
                                                                        development platform5 .
The decoupling between annotation tasks hugely reduces the effort
needed for covering all the possible linguistic variants of such
                                                                        ACKNOWLEDGEMENT
complex entities since regular expressions are based on previous
annotation values that come from controlled lists, rather than on       This publication has been produced with the financial support of
their textual variants.                                                 the Justice Programme of the European Union. The contents of this
   In the example above, the fragment of Italian text is correctly      publication are the sole responsibility of the authors and can in no
annotated as a “case-law authority” with the value IT_TARLT.            way be taken to reflect the views of the European Commission.

   6.2.2 Custom identifier generation. For both extensions it has
been possible to develop custom identifier generation services based
on a composition of the features of the legal references.
   Within the Italian extension, one service implementation is in
charge of the automatic generation of the ECLI for case-law references
of high courts (Constitutional Court, Supreme Court, Council
of State and Court of Auditors), and another one is responsible
for composing the national URN-NIR identifier for references to
legislation. In the Spanish extension, the ECLI can be automatically
produced when the ROJ (a Spanish identifier) is present in the
citation and hence among the features of the case-law reference.

   6.2.3 Vocabulary extension. For both extensions, the lists of
controlled values have been extended by customizing the
Java Enumerations provided by the Template project. National case-law
authority values are expressed following the ECLI convention
for court codes and include high courts as well as regional and
local courts, while alias values for national legislation include,
for example, the civil and penal codes.
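
As an illustration of Sections 6.2.2 and 6.2.3, the following Java sketch combines a national enumeration of case-law authorities with an identifier generation service that composes an ECLI from the features of a reference. All names are hypothetical and do not reproduce the Template project. The enumeration values other than IT_TARLT are assumed for the example, and the composition only follows the general ECLI layout (ECLI:country:court:year:ordinal), whereas real national rules for the ordinal part are more involved.

```java
import java.util.Optional;

/** Hypothetical sketch; names and values are assumptions, not the Template project API. */
final class EcliGenerationSketch {

    /** Assumed national vocabulary of case-law authorities (country code, underscore, court code). */
    enum CaseLawAuthority {
        IT_COST(true),    // assumed value for a high court
        IT_CASS(true),    // assumed value for a high court
        IT_TARLT(false);  // value from the annotation example, a regional court

        final boolean highCourt;

        CaseLawAuthority(boolean highCourt) {
            this.highCourt = highCourt;
        }

        String country() {
            return name().substring(0, name().indexOf('_'));
        }

        String courtCode() {
            return name().substring(name().indexOf('_') + 1);
        }
    }

    /**
     * Composes an ECLI from the features of a case-law reference.
     * Returns an empty Optional when a feature is missing or when the
     * authority is not a high court, mirroring the Italian extension's scope.
     */
    static Optional<String> ecli(CaseLawAuthority authority, Integer year, String ordinal) {
        if (authority == null || !authority.highCourt
                || year == null || ordinal == null || ordinal.trim().isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(String.join(":", "ECLI", authority.country(),
                authority.courtCode(), String.valueOf(year), ordinal.trim()));
    }
}
```

Keeping the court code inside the enumeration value means that adding a court to the vocabulary is enough for the generation service to handle it, which is one plausible reason for deriving identifiers from controlled values rather than from the citation text.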
   6.2.4 Pipeline. The pipeline included in the Template project,
which provides a number of suggested service implementations as
well as a default order of execution, has been used in both
extensions and has proven effective with only minor adjustments.

5 https://gitlab.com/BO-ECLI
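
The idea behind the pipeline of Section 6.2.4, a default ordered list of services that a national extension adjusts only where needed, can be sketched in Java as follows. The class, the step names and the placeholder services are assumptions made for this example and are not taken from the Template project.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

/** Hypothetical sketch of an ordered, adjustable pipeline of text services. */
final class PipelineSketch {

    /** A named step, so that an extension can locate and replace it. */
    static final class Step {
        final String name;
        final UnaryOperator<String> service;

        Step(String name, UnaryOperator<String> service) {
            this.name = name;
            this.service = service;
        }
    }

    private final List<Step> steps = new ArrayList<>();

    /** Appends a service at the end of the execution order. */
    PipelineSketch add(String name, UnaryOperator<String> service) {
        steps.add(new Step(name, service));
        return this;
    }

    /** Swaps the implementation of an existing step, keeping its position. */
    PipelineSketch replace(String name, UnaryOperator<String> service) {
        for (int i = 0; i < steps.size(); i++) {
            if (steps.get(i).name.equals(name)) {
                steps.set(i, new Step(name, service));
                return this;
            }
        }
        throw new IllegalArgumentException("No such step: " + name);
    }

    /** Applies every service in order, feeding each one the previous output. */
    String run(String text) {
        String current = text;
        for (Step step : steps) {
            current = step.service.apply(current);
        }
        return current;
    }

    /** Assumed default order: whitespace normalization, then a no-op geographic annotator. */
    static PipelineSketch defaultPipeline() {
        return new PipelineSketch()
            .add("normalize", t -> t.trim().replaceAll("\\s+", " "))
            .add("geo", t -> t);
    }
}
```

A national extension would then start from `PipelineSketch.defaultPipeline()` and call `replace("geo", ...)` with its own annotator, leaving the remaining steps and their order untouched.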