=Paper=
{{Paper
|id=Vol-2143/paper4
|storemode=property
|title=BO-ECLI Parser Engine: the Extensible European Solution for the Automatic Extraction of Legal Links
|pdfUrl=https://ceur-ws.org/Vol-2143/paper4.pdf
|volume=Vol-2143
|authors=Tommaso Agnoloni,Lorenzo Bacci,Marc van Opijnen
|dblpUrl=https://dblp.org/rec/conf/icail/AgnoloniBO17
}}
==BO-ECLI Parser Engine: the Extensible European Solution for the Automatic Extraction of Legal Links==
Tommaso Agnoloni, Institute of Legal Information Theory and Techniques (ITTIG-CNR), Firenze, Italy, tommaso.agnoloni@ittig.cnr.it

Lorenzo Bacci, Institute of Legal Information Theory and Techniques (ITTIG-CNR), Firenze, Italy, lorenzo.bacci@ittig.cnr.it

Marc van Opijnen, Publications Office of the Netherlands (UBR|KOOP), The Hague, the Netherlands, marc.opijnen@koop.overheid.nl
ABSTRACT
This paper presents the BO-ECLI Parser Engine, an open source Java framework for the automatic extraction of case-law and legislation references from case-law texts issued in the European context. Differences of languages and jurisdictions are tackled with an extensible design that guides and facilitates the development of pluggable national extensions, resulting in a considerably reduced effort with respect to the development of a full national legal link extractor from scratch.

Thanks to a well-defined pipeline of services that synthesize the whole extraction process and to an internal annotation system that is used to convey the information along the pipeline, the software ensures both overall efficiency and flexibility in carrying out language and jurisdiction dependent tasks.

Services can be provided either by the common part of the software or by a national extension. For the implementation of services performing rule-based textual analysis (like entity identification), JFlex is used in the common part and recommended in the national extensions. Finally, through identifier generation services, the BO-ECLI Parser Engine can produce standard identifiers, like ECLI or CELEX, for each recognized legal reference.

Starting from a Template project, two different national extensions have been successfully developed and tested in order to support the extraction of legal links from case-law texts written in the Italian and Spanish languages.

KEYWORDS
Legal citations, Reference parsing, Multilinguality

1 INTRODUCTION
Among the goals of the European Case Law Identifier (ECLI), established in 2010¹, is the publication of national case-law by courts of European Member States via the ECLI Search Engine on the European e-Justice Portal. Besides being uniformly identified, decisions should be equipped with a minimal set of structured metadata describing their main features. Among the (optional) metadata prescribed by the ECLI Metadata Scheme, references metadata describe relations of the current document with other legal (legislative or judicial) documents, formally expressed using uniform identifiers (the aforementioned ECLI for case-law, ELI for legislation, national identifiers, CELEX identifiers for European legal sources).

These relational metadata are at the same time among the most useful case-law metadata, in that they allow the enhancement of legal information retrieval with relational search, and among the most difficult to populate, especially for legacy data and for less resourced languages and jurisdictions.

In the legal domain, citations are an integral part of a text and their instrumental use for a variety of purposes (substantive, procedural, argumentative) is a familiar tool for legal professionals performing their daily duties. Search by relationship [1] is therefore popular among users, as it conforms with the typical attitude of legal professionals confronted with the reconstruction of the sources relevant for a legal issue at hand. Nonetheless, it is poorly supported by typical search engines relying on full text indexing and necessarily requires explicit reference tagging to be dealt with by machines.

Manual reference tagging is an extremely costly procedure, not viable in the public domain and especially unable to cope with the growing amount of data published in national case law databases. Automatic legal reference extraction, on the other hand, has been successfully applied in several national contexts [2], [3], [4], despite the complexity of coping with a diversity of styles, variants and exceptions to existing drafting rules and citation guidelines.

Due to the national specificities, national citation practices, and language dependency of the task, the problem scales in complexity when approaching it from a multilingual and multi-jurisdictional perspective. A previous effort in this direction, within the EU-funded EUCases project [5], was limited to the extraction of references from national case law to EU legal sources.

Starting from an analysis of approaches and existing solutions to the “Linking data” problem [6] and based on the results of a survey on citation practices within EU and national Member States’ courts [7], the BO-ECLI Parser Engine presented in this work, developed within the EU-funded project “Building on ECLI”², tackles the problem from an EU-wide multilingual and multi-jurisdictional perspective. The aim of the proposed framework is to lower the entry barrier for national data providers willing to develop their own legal reference extraction solution by providing a proven methodology, shared common knowledge and reusable and extendable components.

¹ Council conclusions inviting the introduction of the European Case Law Identifier (ECLI) and a minimum set of uniform metadata for case law (CELEX:52011XG0429(01)).

² http://bo-ecli.eu

In: Proceedings of the Second Workshop on Automated Semantic Analysis of Information in Legal Text (ASAIL 2017), June 16, 2017, London, UK.
Copyright © 2017 held by the authors. Copying permitted for private and academic purposes.
Published at http://ceur-ws.org
ASAIL 2017, June 16, 2017, London, UK Tommaso Agnoloni, Lorenzo Bacci, and Marc van Opijnen
Figure 1: The overall architecture of the BO-ECLI Parser Engine: a common framework with pluggable extensions for supporting the parsing process of a text written in a specific language or issued within a specific jurisdiction in order to get a collection of legal references and the original text with inline annotations.

2 PARSER ENGINE
The BO-ECLI Parser Engine is an extensible framework for the extraction of legal links from case-law texts. It is written in Java and distributed as open source software³. It targets citations to both case-law and legislation, expressed as lists of textual features (authority, type of document, document number, date, etc.) or as common names (i.e. aliases). Multiple citations, intended either as citations to more than one partition of a single document or as citations to more than one document issued by a single authority, are also covered, and distinct legal references are generated for each partition and each document. A distinguishing characteristic of the software is its capability to be extended in order to support the extraction process from texts written in different languages or issued within different jurisdictions.

In order to realize such a design, two practical steps are required:
• dividing the process of legal link extraction into a generic and customizable sequence of atomic services, following a pipeline pattern;
• defining an annotation system able to convey the work done by each service along the pipeline.

Distributed as Java libraries with a standard Java API, the software can either be integrated within an existing environment for automatic batch parsing over a large corpus or be wrapped into an even more interoperable HTTP API and queried by a remote user interface. Especially for this use case, the overall efficiency of the software guarantees a quick response (in terms of user experience) even with large case-law texts as input.

While the input of the software is as simple as text and additional metadata, the result of the parsing process consists of a collection of legal reference objects, possibly accompanied by legal identifiers, and of the original text with added inline annotations, possibly with hyperlinks, in correspondence with the recognized citations.

³ http://gitlab.com/BO-ECLI/Engine

3 A PIPELINE OF SERVICES
One way to synthesize a generic process of legal link extraction from texts is, first, to divide it into three consecutive phases:
(1) the entity identification phase, where the fragments of text that can potentially represent a feature of a citation are identified and normalized;
(2) the reference recognition phase, where patterns of identified features are read in order to decide whether they form a legal reference or not;
(3) the identifier generation phase, where the recognized legal references are analyzed so that standard identifiers, and possibly URLs, can be assigned to them.

Secondly, within every single phase, a number of different services can be placed, each specialized in performing one task. For example, within the entity identification phase, there could be a service specialized in the identification of case numbers.

By modelling the process of legal link extraction as a sequence of services that belong to these three distinct phases, it is possible to concretely achieve a separation in design between a common part and an extension part. Specifically, the common part (i.e. the framework) defines the classes, the interfaces and the methods that guide the implementation of any specific task and provides the default implementations for services common to every jurisdiction, like the generation of a CELEX identifier for legal references to European legislation. On the other hand, the extension part must provide the implementations for services like the identification of national issuing authorities, a strictly language dependent task.

In the Java domain, the described separation between the common framework and the national extensions is realized through the Service Provider Interface (SPI) paradigm, which is part of standard Java and straightforward to adopt. Following SPI, the integration between the framework and all the different national extensions is as simple as publishing them as standard Java jar libraries and making them visible in the classpath.

Figure 2: A schematic representation of the execution of a sequence of atomic tasks through a pipeline of service implementations belonging to the three different phases of the parsing process.
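The pipeline of services described in Section 3 can be illustrated with a minimal sketch. All names below (`PipelineSketch`, `Phase`, `Service`, `add`, `run`) are hypothetical and are not taken from the BO-ECLI code base, and the `[TYPE]` and `[REF]` markers are toy placeholders rather than the framework's annotation syntax. The sketch only assumes that each service maps an annotated text to an annotated text and that services run in phase order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical sketch: none of these names come from the BO-ECLI code base.
final class PipelineSketch {

    // The three consecutive phases named in Section 3, in execution order.
    enum Phase { ENTITY_IDENTIFICATION, REFERENCE_RECOGNITION, IDENTIFIER_GENERATION }

    // A service belongs to one phase and maps an annotated text to an annotated text.
    static final class Service {
        final Phase phase;
        final UnaryOperator<String> task;

        Service(Phase phase, UnaryOperator<String> task) {
            this.phase = phase;
            this.task = task;
        }
    }

    private final List<Service> services = new ArrayList<>();

    PipelineSketch add(Phase phase, UnaryOperator<String> task) {
        services.add(new Service(phase, task));
        return this;
    }

    // Runs all services ordered by phase. List.sort is stable, so services of
    // the same phase keep their registration order.
    String run(String text) {
        List<Service> ordered = new ArrayList<>(services);
        ordered.sort(Comparator.comparing((Service s) -> s.phase));
        String current = text;
        for (Service service : ordered) {
            current = service.task.apply(current);
        }
        return current;
    }

    public static void main(String[] args) {
        PipelineSketch pipeline = new PipelineSketch()
            // Registered out of order on purpose: run() re-orders by phase.
            .add(Phase.REFERENCE_RECOGNITION,
                 t -> t.replace("[TYPE]sent.[/TYPE]", "[REF]sent.[/REF]"))
            .add(Phase.ENTITY_IDENTIFICATION,
                 t -> t.replace("sent.", "[TYPE]sent.[/TYPE]"));
        System.out.println(pipeline.run("sent. n. 1 del 2010"));
    }
}
```

Because every service has the same text-in, text-out shape, replacing, disabling or reordering a module only changes the list of registered services, not the services themselves.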
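The Service Provider Interface integration mentioned in Section 3 relies on the standard `java.util.ServiceLoader` class. The sketch below shows how a framework could discover service implementations shipped in extension jars. The interface `EntityIdentificationService`, its methods and the class `SpiSketch` are assumptions made for illustration; the actual interfaces of the BO-ECLI framework may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

// Hypothetical sketch: interface and class names are illustrative only.
final class SpiSketch {

    // Contract that a common framework could publish for national extensions.
    public interface EntityIdentificationService {
        String language();                      // e.g. "it" or "es"
        String annotate(String annotatedText);  // annotated text in, annotated text out
    }

    // Collects every provider visible on the classpath for the given language.
    // An extension jar declares its implementation classes in a resource file
    // named after the binary name of the interface, here
    // META-INF/services/SpiSketch$EntityIdentificationService
    static List<EntityIdentificationService> loadFor(String language) {
        List<EntityIdentificationService> found = new ArrayList<>();
        for (EntityIdentificationService service
                : ServiceLoader.load(EntityIdentificationService.class)) {
            if (service.language().equals(language)) {
                found.add(service);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Without any extension jar on the classpath the list is simply empty.
        System.out.println("Italian services found: " + loadFor("it").size());
    }
}
```

With this mechanism the framework never names an extension class directly: adding or removing a jar on the classpath is enough to change which services are available.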
Figure 3: Empty interfaces representing annotation categories like authorities, aliases and types of document are defined in the framework and are implemented by Java Enumerations provided by the national extensions or by the framework itself, thus realizing extensible lists of normalized values for annotations that share a common Java type.

4 ANNOTATION SYSTEM
The BO-ECLI Parser Engine framework defines an internal annotation system to allow every service implementation, especially the ones belonging to entity identification and reference recognition, to save the specific results of their execution directly in the text. Thus, every service can be seen as a module that receives an annotated text as input and produces an annotated text as output, enabling the complete customizability of the pipeline: depending on the language, jurisdiction or other specific metadata of the input text, modules can be replaced, enabled, disabled or arranged differently.

Within the BO-ECLI Parser Engine, annotations are used to assign a category (hence, a meaning) to a fragment of text, while, through normalization, annotated fragments of text can acquire a language independent value. For example, the Italian fragment of text “sent. della Corte Costituzionale”, meaning a judgment issued by the Italian Constitutional Court, at a certain point along the pipeline, is annotated as follows:

[BOECLI:CASELAW_TYPE:JUDGMENT]sent.[/BOECLI] della [BOECLI:CASELAW_AUTHORITY:IT_COST]Corte Costituzionale[/BOECLI]

The controlled lists for the annotation categories and annotation values can be provided either by the common framework or by the national extensions through Java Enumerations. Thanks to the annotation system, the work of each service is conveyed and shared along the pipeline in a language independent way.

4.1 Auxiliary methods
One of the peculiarities of the framework is that it provides the implementor with a number of auxiliary methods that are used to produce all the annotations in a transparent way.

For each annotation category during the entity identification phase, by passing a fragment of the input text and optionally a normalized value as argument, an auxiliary method appends to the output the syntactically correct annotation for that fragment of text. By exploiting the auxiliary methods, the implementor is completely guided in producing a syntactically correct annotation with the allowed values for each annotation category.

Reference recognition service implementations can also benefit from specific auxiliary methods that are responsible for converting the annotations of several entities forming a valid pattern into a unique legal reference annotation.

5 SERVICE IMPLEMENTATION
The implementation of an annotation service belonging to either the entity identification or the reference recognition phase simply consists of a piece of software that analyzes an input text, possibly already enriched with annotations, and produces an equivalent output text, possibly with altered annotations. Since the framework provides the methods for accessing the input text and for producing the output text, it is up to the national implementor to decide, for each service implementation, the textual tools to be used in order to perform the matching. For example, those operations could be realized with the methods of the Java String class, with a set of regular expressions and the Java regex package, or with proper lexical scanners, i.e. automata.

The last approach is not only the most powerful and efficient, but also the most fitting for the implementation of an annotation service. The default implementations of the annotation services provided by the framework make use of JFlex⁴, a well-known lexical scanner generator for Java. Since the JFlex syntax allows for the insertion of Java code, a jflex file can directly make use of the auxiliary methods and Java Enumerations supplied by the framework and by the national libraries.

⁴ http://jflex.de
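The annotation syntax of Section 4, the extensible enumerations of Figure 3 and the auxiliary methods of Section 4.1 can be illustrated together with a short sketch. Only the inline syntax `[BOECLI:CATEGORY:VALUE]text[/BOECLI]` and the value `IT_COST` are taken from the paper. The class, interface, enumeration and method names, the constant `CE_ECHR` and the form of an annotation without a normalized value are assumptions made for illustration.

```java
// Hypothetical sketch: names are illustrative; only the inline annotation syntax
// [BOECLI:CATEGORY:VALUE]text[/BOECLI] follows the example given in Section 4.
final class AnnotationSketch {

    // Empty marker interface for one annotation category (cf. Figure 3).
    interface CaseLawAuthority { }

    // Values a common framework could contribute (constant name is assumed).
    enum EuropeanAuthority implements CaseLawAuthority { CE_ECHR }

    // Values a national extension could contribute; IT_COST appears in the paper.
    enum ItalianAuthority implements CaseLawAuthority { IT_COST }

    // Auxiliary method for one category: only values of the right Java type are accepted.
    static void appendAuthority(StringBuilder out, String fragment, CaseLawAuthority value) {
        append(out, "CASELAW_AUTHORITY", fragment, value.toString());
    }

    // Generic form: category, original fragment, optional normalized value.
    // Omitting the value part when it is null is an assumption of this sketch.
    static void append(StringBuilder out, String category, String fragment, String value) {
        out.append("[BOECLI:").append(category);
        if (value != null) {
            out.append(':').append(value);
        }
        out.append(']').append(fragment).append("[/BOECLI]");
    }

    public static void main(String[] args) {
        StringBuilder out = new StringBuilder();
        append(out, "CASELAW_TYPE", "sent.", "JUDGMENT");
        out.append(" della ");
        appendAuthority(out, "Corte Costituzionale", ItalianAuthority.IT_COST);
        System.out.println(out);
    }
}
```

Typing the value parameter with the marker interface is what lets enumerations from different libraries share one controlled list: both `EuropeanAuthority` and `ItalianAuthority` constants are accepted by `appendAuthority`, while arbitrary strings are not.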
5.1 Default service implementations
A number of implementations for services that belong to each phase of the legal link extraction process are provided by the framework by default. Typically, a default implementation is supplied when the task that the service is in charge of can be considered language independent, pertains to the European jurisdiction or is common in the European context.

5.1.1 ECLI identification. An ECLI code can be used within a case-law text as a feature of a more complete citation or as a citation by itself. Since the ECLI code has a standardized syntax that does not depend on the language used in the rest of the case-law text, a default service for ECLI identification is implemented and supplied by the framework.

5.1.2 Partitions identification. A partition is a hierarchical branch of partition elements like articles, paragraphs, letters, etc. While the identification of each partition element is a language dependent task, the framework provides a service that converts sequences of partition element annotations into unique partition annotations, correctly composing the branches, especially in case of multiple citations.

5.1.3 Parties identification. The identification of the names of the parties in a citation should generally be considered a language dependent task. Nonetheless, the framework provides a default service implementation for the identification of applicants and defendants relying on heuristics based on positioning, upper and lower casing, the versus entity and the geographic identification of a member country of the Council of Europe (as a defendant in European Court of Human Rights citations).

5.1.4 Reference recognition. After the entity identification phase, the textual features that can potentially be part of a legal reference are annotated and normalized, hence they can be treated as language independent entities. Although citation practices change from one jurisdiction to another, the framework provides a number of default service implementations for reference recognition that are able to cover the most typical citation patterns and, also, to support multiple citations.

5.1.5 ECLI generation for European Courts. In those cases where a standard identifier can be simply generated as a composition of the features extracted from the textual citation, the framework provides a default service implementation to automatically assign an identifier to a legal reference. This is the case for the generation of ECLI for legal references that have the European Court of Human Rights as the issuing authority, when the type of document, the case number and the date are known.

5.1.6 CELEX generation for European legislation. Another service implementation supplied by the framework for the automatic composition of a standard identifier is used for legislation references to European directives and regulations. For these types of document, when the referred document number and year are known, a CELEX identifier as well as its ELI identifier can be assigned to the legal reference.

5.1.7 CELEX generation for European aliases. The framework provides a controlled list of values of the main aliases pertaining to the European primary legislation, like the Treaty on European Union or the Treaty on the Functioning of the European Union, so that a national implementor, by reusing such normalized values, is guided in the implementation of an identification service that covers those references expressed in their specific language. Moreover, within the framework, a default service implementation for identifier generation automatically assigns, for each registered alias value pertaining to the European legislation, the correct CELEX identifier.

5.1.8 HTML rendering. At the end of the pipeline of services, all temporary annotations are discarded and the original input text is only annotated with the “legal reference” annotation category in correspondence with the recognized references. A national implementor can develop a rendering service in order to convert the internal annotation of legal references to a specific format of their choice. By default, the framework provides an HTML style rendering service that transforms the “legal reference” annotation category into the HTML anchor (a) tag and uses the optional URL assigned during the identifier generation phase as the value of the href attribute.

6 NATIONAL EXTENSIONS
National extensions are used by the framework to support the extraction process for texts written in specific languages or issued within specific jurisdictions. In order to develop a new extension, the implementor has to:
• extend the controlled lists of normalized values for the annotations with the values pertaining to the new jurisdiction;
• provide a certain number of service implementations for entity identification, reference recognition and identifier generation;
• export the project as a Java jar library for compatibility with the Service Provider Interface.

6.1 Template
Along with the framework project, a Template project has been developed in order to facilitate and encourage the adoption of the software for the extraction of legal links in new languages and jurisdictions. The Template project provides a national implementor with:
• a complete Java project with organized packaging;
• a configuration file for setting general parameters like the author, language and jurisdiction of the extension;
• plain files of reusable macros of regular expressions to facilitate the parsing of the annotations and to set up common language dependent expressions;
• several Java Enumerations extending the controlled lists of normalized values for annotations with customizable constants;
• a full pipeline of services pertaining to the different phases of the legal links extraction process with dummy implementations in jflex.
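The identifier composition described in Section 5.1.6 can be sketched as follows. The class and method names are hypothetical and do not reflect the framework's API. The sketch relies on the usual CELEX pattern for secondary legislation (sector 3, four-digit year, document type letter, four-digit number) and on the ELI URI pattern of the EU Publications Office, and it assumes that the year is already known as four digits, so citations with two-digit years would need to be expanded beforehand.

```java
import java.util.Locale;

// Hypothetical sketch: class and method names are illustrative only.
final class CelexSketch {

    enum DocType {
        DIRECTIVE('L', "dir"),
        REGULATION('R', "reg");

        final char celexLetter; // document type letter used in CELEX numbers
        final String eliType;   // type segment used in ELI URIs

        DocType(char celexLetter, String eliType) {
            this.celexLetter = celexLetter;
            this.eliType = eliType;
        }
    }

    // Sector 3 (legal acts) + four-digit year + type letter + zero-padded number.
    static String celex(DocType type, int year, int number) {
        return String.format(Locale.ROOT, "3%04d%c%04d", year, type.celexLetter, number);
    }

    // ELI URI of the Official Journal version of the act.
    static String eli(DocType type, int year, int number) {
        return "http://data.europa.eu/eli/" + type.eliType + "/" + year + "/" + number + "/oj";
    }

    public static void main(String[] args) {
        // Directive 2011/83/EU and Regulation (EU) 2016/679 as sample inputs.
        System.out.println(celex(DocType.DIRECTIVE, 2011, 83));
        System.out.println(eli(DocType.DIRECTIVE, 2011, 83));
        System.out.println(celex(DocType.REGULATION, 2016, 679));
    }
}
```

Since the inputs are normalized features (type, year, number) rather than text, such a service is language independent, which is why it can live in the common part of the software.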
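The language independence of ECLI identification noted in Section 5.1.1 follows from the fixed five-part syntax of the identifier (the literal ECLI, a two-letter country code, a court code of up to seven characters, a four-digit year and an ordinal of up to 25 alphanumeric characters or dots). The sketch below uses the standard `java.util.regex` package rather than JFlex, and its class and method names are hypothetical. It only matches the upper-case form and is not the framework's own implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: a regex-based stand-in for an ECLI identification service.
final class EcliSketch {

    // ECLI : country : court code : year : ordinal.
    // The ordinal must end with a letter or digit, so that a full stop closing
    // the sentence is not swallowed into the identifier.
    static final Pattern ECLI = Pattern.compile(
        "\\bECLI:[A-Z]{2}:[A-Z][A-Z0-9]{0,6}:[0-9]{4}:[A-Z0-9](?:[A-Z0-9.]{0,23}[A-Z0-9])?");

    // Returns every ECLI found in the text, in order of appearance.
    static List<String> findAll(String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = ECLI.matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findAll("Si veda la sentenza ECLI:EU:C:2014:317."));
    }
}
```

A real identification service would wrap each match in an annotation instead of returning a list, but the matching logic stays the same whatever the language of the surrounding text.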
6.2 The Italian and Spanish extensions
So far, following the Template project, two national extensions have been successfully developed to support the extraction of case-law and legislation references from case-law texts in the Italian and Spanish languages. This section contains some brief considerations concerning the development of such extensions.

6.2.1 Incremental annotations. The design of the BO-ECLI Parser Engine, composed of a sequence of entity identification services in charge of atomic tasks and accompanied by an incremental annotation system, makes the identification or disambiguation of complex entities possible, while keeping the code separated, readable and upgradable. For example, it has been possible to correctly annotate the Italian Administrative Regional Tribunals in all their textual variants as well as the Spanish Provincial Court of Appeals, through the execution of distinct services responsible for the identification of geographic entities, followed by the identification of sections and then the identification of local courts:

1) T.A.R. Sezione distaccata di Latina
2) T.A.R. Sezione distaccata di [BOECLI:GEO:IT_LT]Latina[/BOECLI]
3) T.A.R. [BOECLI:SECTION:IT_LT]Sezione distaccata di Latina[/BOECLI]
4) [BOECLI:CASELAW_AUTHORITY:IT_TARLT]T.A.R. Sezione distaccata di Latina[/BOECLI]

The decoupling between annotation tasks greatly reduces the effort needed for covering all the possible linguistic variants of such complex entities, since regular expressions are based on previous annotation values that come from controlled lists, rather than on their textual variants.

In the example above, the fragment of Italian text is correctly annotated as a “case-law authority” with the value IT_TARLT.

6.2.2 Custom identifier generation. For both extensions it has been possible to develop custom identifier generation services based on a composition of the features of the legal references.

Within the Italian extension, a service implementation is in charge of the automatic generation of the ECLI for case-law references of high courts (Constitutional Court, Supreme Court, Council of State and Court of Auditors) and another one is responsible for composing the national URN-NIR identifier for references to legislation, while in the Spanish extension the ECLI can be automatically produced when the ROJ (a Spanish identifier) is present in the citation and hence among the features of the case-law reference.

6.2.3 Vocabulary extension. For both extensions, the lists of controlled values have been effectively extended by customizing the Java Enumerations provided by the Template project. National case-law authority values are expressed following the ECLI convention for court codes and include high courts as well as regional and local courts, while alias values for national legislation include, for example, civil and penal codes.

6.2.4 Pipeline. The pipeline included in the Template project, which provides a number of suggested service implementations as well as a default order of execution, has been used in both extensions and has proven effective with only minor adjustments.

7 CONCLUSIONS
We presented the BO-ECLI Parser Engine, an open source framework for the automatic extraction of case-law and legislation references from case-law texts issued in the European context.

Through a detailed description of its architecture and design, the paper showed how national extensions can be developed and plugged into the framework in order to add support for the extraction process to different languages and jurisdictions. Specifically, thanks to a decomposition of the whole process into atomic tasks and to an internal annotation system, the framework is able to provide a number of common services and resources that can be reused and extended by the national implementors.

By defining and providing a complete stack for legal links extraction, the implementation of a national extension is guided and straightforward, and the effort needed for the development of a fully functional national extractor is considerably reduced.

Along with the common framework, a Template project was also created in order to encourage and facilitate the development of new national extensions. Moreover, the paper presents two concrete national extensions that have already been developed by different teams, supporting both the feasibility and the straightforwardness of the whole approach.

The BO-ECLI Parser Engine, as well as the Template and the Italian and Spanish extensions, are open source projects. Their code and documentation are currently hosted on the GitLab software development platform⁵.

⁵ https://gitlab.com/BO-ECLI

ACKNOWLEDGEMENT
This publication has been produced with the financial support of the Justice Programme of the European Union. The contents of this publication are the sole responsibility of the authors and can in no way be taken to reflect the views of the European Commission.

REFERENCES
[1] Marc van Opijnen and Cristiana Santos. On the concept of relevance in legal information retrieval. Artificial Intelligence and Law, 25(1):65–87, 2017.

[2] Lorenzo Bacci, Enrico Francesconi, and MariaTeresa Sagri. A proposal for introducing the ECLI standard in the Italian judicial documentary system. In Proceedings of the 2013 Conference on Legal Knowledge and Information Systems: JURIX 2013: The Twenty-sixth Annual Conference, pages 49–58. IOS Press, Amsterdam (NL), 2013.

[3] Marc van Opijnen, Nico Verwer, and Jan Meijer. Beyond the experiment: The extendable legal link extractor. In Workshop on Automated Detection, Extraction and Analysis of Semantic Information in Legal Texts, June 8-12, 2015, held in conjunction with the 2015 International Conference on Artificial Intelligence and Law (ICAIL), San Diego, CA, USA. Available at SSRN: https://ssrn.com/abstract=2626521.

[4] A. Mowbray, P. Chung, and G. Greenleaf. A free access, automated law citator with international scope: the LawCite project. European Journal of Law and Technology, 7(3), 2016. Available at: http://ejlt.org/article/view/496/691.

[5] Pavel Popov, Alexander Konstantinov, Hristo Konstantinov, and Livio Robaldo. EUCases project Deliverable D3.6, Report on Linking Tools. 2014. Available at: http://eucases.eu/fileadmin/eucases/documents/eucases_d3.6_linkingtools_report_revised.pdf.

[6] Tommaso Agnoloni and Lorenzo Bacci. BO-ECLI project deliverable D2.1 Linking Data - analysis and existing solutions. 2016. Available at: http://bo-ecli.eu/uploads/deliverables/DeliverableWS2-D1.pdf.

[7] Marc van Opijnen, Ginevra Peruginelli, Eleni Kefali, and Monica Palmirani. On-line Publication of Court Decisions in the EU - Report of the Policy Group of the Project ‘Building on the European Case Law Identifier’. 2017. Available at: http://bo-ecli.eu/uploads/deliverables/Deliverable%20WS0-D1.pdf.