BO-ECLI Parser Engine: the Extensible European Solution for the Automatic Extraction of Legal Links

Tommaso Agnoloni, Institute of Legal Information Theory and Techniques, ITTIG-CNR, Firenze, Italy, tommaso.agnoloni@ittig.cnr.it
Lorenzo Bacci, Institute of Legal Information Theory and Techniques, ITTIG-CNR, Firenze, Italy, lorenzo.bacci@ittig.cnr.it
Marc van Opijnen, Publications Office of the Netherlands, UBR|KOOP, The Hague, the Netherlands, marc.opijnen@koop.overheid.nl

ABSTRACT

This paper presents the BO-ECLI Parser Engine, an open source Java framework for the automatic extraction of case-law and legislation references from case-law texts issued in the European context. Differences of languages and jurisdictions are tackled with an extensible design that guides and facilitates the development of pluggable national extensions, resulting in a considerably reduced effort with respect to the development of a full national legal link extractor from scratch.

Thanks to a well-defined pipeline of services that synthesize the whole extraction process and to an internal annotation system that is used to convey the information along the pipeline, the software ensures both overall efficiency and flexibility in carrying out language and jurisdiction dependent tasks.

Services can be provided either by the common part of the software or by a national extension. For the implementation of services performing rule-based textual analysis (like entity identification), JFlex is used in the common part and recommended in the national extensions. Finally, through identifier generation services, the BO-ECLI Parser Engine can produce standard identifiers, like ECLI or CELEX, for each recognized legal reference.

Starting from a Template project, two different national extensions have been successfully developed and tested in order to support the extraction of legal links from case-law texts written in the Italian and Spanish languages.

KEYWORDS

Legal citations, Reference parsing, Multilinguality

1 INTRODUCTION

Among the goals of the European Case Law Identifier (ECLI) established in 2010¹ is the publication of national case-law by courts of European member States via the ECLI Search Engine on the European e-Justice Portal. Besides being uniformly identified, decisions should be equipped with a minimal set of structured metadata describing their main features. Among the (optional) metadata prescribed by the ECLI Metadata Scheme, references metadata describe the relations of the current document with other legal (legislative or judicial) documents, formally expressed using uniform identifiers (the aforementioned ECLI for case-law, ELI for legislation, national identifiers, CELEX identifiers for European legal sources).

¹ Council conclusions inviting the introduction of the European Case Law Identifier (ECLI) and a minimum set of uniform metadata for case law (CELEX:52011XG0429(01)).

These relational metadata are at the same time among the most useful case-law metadata, in that they allow the enhancement of legal information retrieval with relational search, and among the most difficult to populate, especially for legacy data and for less resourced languages and jurisdictions.

In the legal domain, citations are an integral part of a text and their instrumental use for a variety of purposes (substantive, procedural, argumentative) is a familiar tool for legal professionals performing their daily duties. Search by relationship [1] is therefore popular among users as it conforms with the typical attitude of legal professionals confronted with the reconstruction of the sources relevant for a legal issue at hand. Nonetheless it is poorly supported by typical search engines relying on full text indexing and necessarily requires explicit reference tagging to be dealt with by machines.

Manual reference tagging is an extremely costly procedure, not viable in the public domain and especially unable to cope with the growing amount of data published in national case law databases. Automatic legal reference extraction, on the other hand, has been successfully applied in several national contexts [2], [3], [4], despite the complexity of coping with a diversity of styles, variants and exceptions to existing drafting rules and citation guidelines.

Due to the national specificities, national citation practices, and language dependency of the task, the problem scales in complexity when approaching it from a multilingual and multi-jurisdictional perspective. Previous efforts in this direction within the EU-funded EUCases project [5] were limited to the extraction of references from national case law to EU legal sources.

Starting from an analysis of approaches and existing solutions to the “Linking data” problem [6] and based on the results of a survey on citation practices within EU and national Member States’ courts [7], the BO-ECLI Parser Engine presented in this work and developed within the EU funded project “Building on ECLI”² tackles the problem from an EU-wide multilingual and multi-jurisdictional perspective. The aim of the proposed framework is to lower the entry barrier for national data providers willing to develop their own
legal reference extraction solution by providing a proven methodology, shared common knowledge and reusable and extendable components.

² http://bo-ecli.eu

In: Proceedings of the Second Workshop on Automated Semantic Analysis of Information in Legal Text (ASAIL 2017), June 16, 2017, London, UK. Copyright © 2017 held by the authors. Copying permitted for private and academic purposes. Published at http://ceur-ws.org

2 PARSER ENGINE

The BO-ECLI Parser Engine is an extensible framework for the extraction of legal links from case-law texts. It is written in Java and distributed as open source software³. It targets citations to both case-law and legislation, expressed as lists of textual features (authority, type of document, document number, date, etc.) or as common names (i.e. aliases). Multiple citations, intended either as citations to more than one partition of a single document or as citations to more than one document issued by a single authority, are also covered, and distinct legal references are generated for each partition and each document. A distinguishing characteristic of the software is its capability to be extended in order to support the extraction process from texts written in different languages or issued within different jurisdictions.

³ http://gitlab.com/BO-ECLI/Engine

Figure 1: The overall architecture of the BO-ECLI Parser Engine: a common framework with pluggable extensions for supporting the parsing process of a text written in a specific language or issued within a specific jurisdiction in order to get a collection of legal references and the original text with inline annotations.

In order to realize such a design, two practical steps are required:

• dividing the process of legal link extraction into a generic and customizable sequence of atomic services, following a pipeline pattern;
• defining an annotation system able to convey the work done by each service along the pipeline.

Distributed as Java libraries with a standard Java API, the software can either be integrated within an existing environment for automatic batch parsing over a large corpus or be wrapped into an even more interoperable HTTP API and queried by a remote user interface. Especially for this use case, the overall efficiency of the software guarantees a quick response (in terms of user experience) even with large case-law texts as input.

While the input of the software is as simple as text and additional metadata, the result of the parsing process consists of a collection of legal reference objects, possibly accompanied by legal identifiers, and of the original text with added inline annotations, possibly with hyperlinks, in correspondence with the recognized citations.

3 A PIPELINE OF SERVICES

One way to synthesize a generic process of legal link extraction from texts is, first, to divide it into three consecutive phases:

(1) the entity identification phase, where the fragments of text that can potentially represent a feature of a citation are identified and normalized;
(2) the reference recognition phase, where patterns of identified features are read in order to decide whether they form a legal reference or not;
(3) the identifier generation phase, where the recognized legal references are analyzed so that standard identifiers, and possibly URLs, can be assigned to them.

Secondly, within every single phase, a number of different services can be placed, each specialized in performing one task. For example, within the entity identification phase, there could be a service specialized in the identification of case numbers.

Figure 2: A schematic representation of the execution of a sequence of atomic tasks through a pipeline of service implementations belonging to the three different phases of the parsing process.

By modelling the process of legal link extraction with a sequence of services that belong to these three distinct phases it is possible to concretely achieve a separation in design between a common part and an extension part. Specifically, the common part (i.e. the framework) defines the classes, the interfaces and the methods that guide the implementation of any specific task and provides the default implementations for services common to every jurisdiction, like the generation of a CELEX identifier for legal references to European legislation. On the other hand, the extension part must provide the implementations for services like the identification of national issuing authorities, a strictly language dependent task.

In the Java domain the described separation between the common framework and the national extensions is realized through the Service Provider Interface (SPI) paradigm, which is part of standard Java and straightforward to adopt. Following SPI, the integration of the framework and all the different national extensions is as simple as publishing them as standard Java jar libraries and making them visible in the classpath.

4 ANNOTATION SYSTEM

The BO-ECLI Parser Engine framework defines an internal annotation system to allow every service implementation, especially the ones belonging to entity identification and reference recognition, to save the specific results of their execution directly in the text. Thus, every service can be seen as a module that receives an annotated text as input and produces an annotated text as output, enabling the complete customizability of the pipeline: depending on the language, jurisdiction or other specific metadata of the input text, modules can be replaced, enabled, disabled or arranged differently.

Within the BO-ECLI Parser Engine, annotations are used to assign a category (hence, a meaning) to a fragment of text, while, through normalization, annotated fragments of text can acquire a language independent value. For example, the Italian fragment of text “sent. della Corte Costituzionale”, meaning a judgment issued by the Italian Constitutional Court, at a certain point along the pipeline, is annotated as follows:

    [BOECLI:CASELAW_TYPE:JUDGMENT]sent.[/BOECLI] della [BOECLI:CASELAW_AUTHORITY:IT_COST]Corte Costituzionale[/BOECLI]

The controlled lists for the annotation categories and annotation values can be provided either by the common framework or by the national extensions through Java Enumerations. Thanks to the annotation system, the work of each service is conveyed and shared along the pipeline in a language independent way.

Figure 3: Empty interfaces representing annotation categories like authorities, aliases and types of document are defined in the framework and are implemented by Java Enumerations provided by the national extensions or by the framework itself, thus realizing extensible lists of normalized values for annotations that share a common Java type.

4.1 Auxiliary methods

One of the peculiarities of the framework is that it provides the implementor with a number of auxiliary methods that are used to produce all the annotations in a transparent way.

For each annotation category during the entity identification phase, by passing a fragment of the input text and optionally a normalized value as argument, an auxiliary method appends to the output the syntactically correct annotation for such fragment of text. By exploiting the auxiliary methods, the implementor is completely guided in producing a syntactically correct annotation with the allowed values for each annotation category.

Reference recognition service implementations can also benefit from specific auxiliary methods that are responsible for converting the annotations of several entities forming a valid pattern into a single legal reference annotation.

5 SERVICE IMPLEMENTATION

The implementation of an annotation service belonging to either the entity identification or the reference recognition phase simply consists of a piece of software that analyzes an input text, possibly already enriched with annotations, and produces an equivalent output text, possibly with altered annotations. Since the framework provides the methods for accessing the input text and for producing the output text, it is up to the national implementor to decide, for each service implementation, the textual tools to be used in order to perform the matching. For example, those operations could be realized with the methods of the Java String class, with a set of regular expressions and the Java regex package, or with proper lexical scanners, i.e. automata.

The last approach is not only the most powerful and efficient, but also the most fitting for the implementation of an annotation service. The default implementations of the annotation services provided by the framework make use of JFlex⁴, a well-known lexical scanner generator for Java. Since the JFlex syntax allows for the insertion of Java code, a jflex file can directly make use of the auxiliary methods and Java Enumerations supplied by the framework and by the national libraries.

⁴ http://jflex.de

5.1 Default service implementations

A number of implementations for services that belong to each phase of the legal link extraction process are provided by the framework by default. Typically, a default implementation is supplied when the task that the service is in charge of can be considered language independent, pertains to the European jurisdiction or is common in the European context.

5.1.1 ECLI identification. An ECLI code can be used within a case-law text as a feature of a more complete citation or as a citation by itself. Since the ECLI code has a standardized syntax that doesn’t depend on the language used in the rest of the case-law text, a default service for ECLI identification is implemented and supplied by the framework.

5.1.2 Partitions identification. A partition is a hierarchical branch of partition elements like articles, paragraphs, letters, etc. While the identification of each partition element is a language dependent task, the framework provides a service that converts sequences of partition element annotations into single partition annotations, correctly composing the branches, especially in case of multiple citations.

5.1.3 Parties identification. The identification of the names of the parties in a citation should be generally considered as a language dependent task. Nonetheless, the framework provides a default service implementation for the identification of applicants and defendants relying on heuristics based on positioning, upper and lower casing, the versus entity and the geographic identification of a member country of the Council of Europe (as a defendant in European Court of Human Rights citations).

5.1.4 Reference recognition. After the entity identification phase, the textual features that can potentially be part of a legal reference are annotated and normalized, hence they can be treated as language independent entities. Although citation practices change from one jurisdiction to another, the framework provides a number of default service implementations for reference recognition that are able to cover the most typical citation patterns and, also, to support multiple citations.

5.1.5 ECLI generation for European Courts. In those cases where a standard identifier can be simply generated as a composition of the features extracted from the textual citation, the framework provides a default service implementation to automatically assign an identifier to a legal reference. This is the case for the generation of ECLI for legal references that have the European Court of Human Rights as the issuing authority, when the type of document, the case number and the date are known.

5.1.6 CELEX generation for European legislation. Another service implementation supplied by the framework for the automatic composition of a standard identifier is used for legislation references to European directives and regulations. For these types of document, when the referred document number and year are known, a CELEX identifier as well as its ELI identifier can be assigned to the legal reference.

5.1.7 CELEX generation for European aliases. The framework provides a controlled list of values of the main aliases pertaining to the European primary legislation, like the Treaty on European Union or the Treaty on the Functioning of the European Union, so that a national implementor, by reusing such normalized values, is guided in the implementation of an identification service that covers those references expressed in his specific language. Moreover, within the framework, a default service implementation for identifier generation automatically assigns, for each registered alias value pertaining to the European legislation, the correct CELEX identifier.

5.1.8 HTML rendering. At the end of the pipeline of services, all temporary annotations are discarded and the original input text is only annotated with the “legal reference” annotation category in correspondence with the recognized references. A national implementor can develop a rendering service in order to convert the internal annotation of legal references to a specific format of his choice. By default, the framework provides an HTML style rendering service that transforms the “legal reference” annotation category into the <a> tag and uses the optional URL assigned during the identifier generation phase as the value of the href attribute.

6 NATIONAL EXTENSIONS

National extensions are used by the framework to allow the extraction process from texts written in specific languages or issued within specific jurisdictions. In order to develop a new extension, the implementor has to:

• extend the controlled lists of normalized values for the annotations with the values pertaining to the new jurisdiction;
• provide a certain number of service implementations for entity identification, reference recognition and identifier generation;
• export the project as a Java jar library for compatibility with the Service Provider Interface.

6.1 Template

Along with the framework project, a Template project has been developed in order to facilitate and encourage the adoption of the software for the extraction of legal links in new languages and jurisdictions. The Template project provides a national implementor with:

• a complete Java project with organized packaging;
• a configuration file for setting general parameters like the author, language and jurisdiction of the extension;
• plain files of reusable macros of regular expressions to facilitate the parsing of the annotations and to set up common language dependent expressions;
• several Java Enumerations extending the controlled lists of normalized values for annotations with customizable constants;
• a full pipeline of services pertaining to the different phases of the legal links extraction process with dummy implementations in jflex.

6.2 The Italian and Spanish extensions

So far, following the Template project, two national extensions have been successfully developed for allowing the extraction of case-law and legislation references from case-law texts in the Italian and Spanish languages. This section contains some brief considerations concerning the development of such extensions.

6.2.1 Incremental annotations. The design of the BO-ECLI Parser Engine, composed of a sequence of entity identification services in charge of atomic tasks and accompanied by an incremental annotation system, makes the identification or disambiguation of complex entities possible, while keeping the code separated, readable and upgradable. For example, it has been possible to correctly annotate the Italian Administrative Regional Tribunals in all their textual variants as well as the Spanish Provincial Court of Appeals, through the execution of distinct services responsible for the identification of geographic entities, followed by the identification of sections and then the identification of local courts:

    1) T.A.R. Sezione distaccata di Latina
    2) T.A.R. Sezione distaccata di [BOECLI:GEO:IT_LT]Latina[/BOECLI]
    3) T.A.R. [BOECLI:SECTION:IT_LT]Sezione distaccata di Latina[/BOECLI]
    4) [BOECLI:CASELAW_AUTHORITY:IT_TARLT]T.A.R. Sezione distaccata di Latina[/BOECLI]

The decoupling between annotation tasks greatly reduces the effort needed for covering all the possible linguistic variants of such complex entities, since regular expressions are based on previous annotation values that come from controlled lists, rather than on their textual variants.

In the example above, the fragment of Italian text is correctly annotated as a “case-law authority” with the value IT_TARLT.

6.2.2 Custom identifier generation. For both extensions it has been possible to develop custom identifier generation services based on a composition of the features of the legal references.

Within the Italian extension, a service implementation is in charge of the automatic generation of the ECLI for case-law references of high courts (Constitutional Court, Supreme Court, Council of State and Court of Auditors) and another one is responsible for composing the national URN-NIR identifier for references to legislation, while in the Spanish extension the ECLI can be automatically produced when the ROJ (a Spanish identifier) is present in the citation and hence among the features of the case-law reference.

6.2.3 Vocabulary extension. For both extensions, the lists of controlled values have been effectively extended by customizing the Java Enumerations provided by the Template project. National case-law authority values are expressed following the ECLI convention for Court codes and include high courts as well as regional and local courts, while alias values for national legislation include, for example, civil and penal codes.

6.2.4 Pipeline. The pipeline included in the Template project, which provides a number of suggested service implementations as well as a default order of execution, has been used in both extensions and has proven effective with only minor adjustments.

7 CONCLUSIONS

We presented the BO-ECLI Parser Engine, an open source framework for the automatic extraction of case-law and legislation references from case-law texts issued in the European context.

Through a detailed description of its architecture and design, the paper showed how national extensions can be developed and plugged within the framework in order to add support for the extraction process to different languages and jurisdictions. Specifically, thanks to a decomposition of the whole process into atomic tasks and to an internal annotation system, the framework is able to provide a number of common services and resources that can be reused and extended by the national implementors.

By defining and providing a complete stack for legal links extraction, the implementation of a national extension is guided and straightforward, and the effort needed for the development of a fully functional national extractor is considerably reduced.

Along with the common framework, a Template project was also created in order to encourage and facilitate the development of new national extensions. Moreover, the paper presents two concrete national extensions that have already been developed by different teams, proving both the feasibility and the straightforwardness of the whole approach.

The BO-ECLI Parser Engine, as well as the Template and the Italian and Spanish extensions, are open source projects. Their code and documentation are currently hosted on the GitLab software development platform⁵.

⁵ https://gitlab.com/BO-ECLI

ACKNOWLEDGEMENT

This publication has been produced with the financial support of the Justice Programme of the European Union. The contents of this publication are the sole responsibility of the authors and can in no way be taken to reflect the views of the European Commission.

REFERENCES

[1] Marc van Opijnen and Cristiana Santos. On the concept of relevance in legal information retrieval. Artificial Intelligence and Law, 25(1):65–87, 2017.
[2] Lorenzo Bacci, Enrico Francesconi, and MariaTeresa Sagri. A proposal for introducing the ECLI standard in the Italian judicial documentary system. In Proceedings of the 2013 Conference on Legal Knowledge and Information Systems: JURIX 2013: The Twenty-sixth Annual Conference, pages 49–58. IOS Press, Amsterdam (NL), 2013.
[3] Marc van Opijnen, Nico Verwer, and Jan Meijer. Beyond the experiment: The extendable legal link extractor. In Workshop on Automated Detection, Extraction and Analysis of Semantic Information in Legal Texts, June 8-12, 2015, held in conjunction with the 2015 International Conference on Artificial Intelligence and Law (ICAIL), San Diego, CA, USA. Available at SSRN: https://ssrn.com/abstract=2626521.
[4] A. Mowbray, P. Chung, and G. Greenleaf. A free access, automated law citator with international scope: the LawCite project. European Journal of Law and Technology, 7(3), 2016. Available at: http://ejlt.org/article/view/496/691.
[5] Pavel Popov, Alexander Konstantinov, Hristo Konstantinov, and Livio Robaldo. EUCases project Deliverable D3.6, Report on Linking Tools. 2014. Available at: http://eucases.eu/fileadmin/eucases/documents/eucases_d3.6_linkingtools_report_revised.pdf.
[6] Tommaso Agnoloni and Lorenzo Bacci. BO-ECLI project deliverable D2.1 Linking Data - analysis and existing solutions. 2016. Available at: http://bo-ecli.eu/uploads/deliverables/DeliverableWS2-D1.pdf.
[7] Marc van Opijnen, Ginevra Peruginelli, Eleni Kefali, and Monica Palmirani. Online Publication of Court Decisions in the EU - Report of the Policy Group of the Project 'Building on the European Case Law Identifier'. 2017. Available at: http://bo-ecli.eu/uploads/deliverables/Deliverable%20WS0-D1.pdf.
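
Illustrative sketch for Sections 3 and 4 (pipeline and annotation system). The following is a minimal, self-contained approximation of the idea that each service maps an annotated text to an annotated text, and that annotation values come from Java Enumerations implementing a category interface. It is not the framework's code: the class name, the interface names, the `annotate` helper and the two toy services are all hypothetical, and the matching is done with plain `String.replace` rather than JFlex. Only the inline annotation syntax and the values JUDGMENT and IT_COST are taken from the paper.

```java
import java.util.List;
import java.util.function.UnaryOperator;

final class AnnotationPipelineSketch {

    /** Category interface for issuing authorities (hypothetical name). */
    interface CaseLawAuthority { String name(); }

    /** Category interface for document types (hypothetical name). */
    interface CaseLawType { String name(); }

    /** Values a national extension could contribute; IT_COST is the paper's example. */
    enum ItalianAuthority implements CaseLawAuthority { IT_COST }

    /** Values the common part could ship; JUDGMENT is the paper's example. */
    enum CommonType implements CaseLawType { JUDGMENT }

    /** Hypothetical auxiliary method: wraps a fragment in the inline annotation syntax. */
    static String annotate(String category, String value, String fragment) {
        return "[BOECLI:" + category + ":" + value + "]" + fragment + "[/BOECLI]";
    }

    /** Toy entity identification service: marks the Italian abbreviation "sent.". */
    static final UnaryOperator<String> TYPE_SERVICE = text ->
            text.replace("sent.",
                    annotate("CASELAW_TYPE", CommonType.JUDGMENT.name(), "sent."));

    /** Toy entity identification service: marks one authority name. */
    static final UnaryOperator<String> AUTHORITY_SERVICE = text ->
            text.replace("Corte Costituzionale",
                    annotate("CASELAW_AUTHORITY", ItalianAuthority.IT_COST.name(),
                            "Corte Costituzionale"));

    /** Runs the services in order; each one receives the output of the previous one. */
    static String run(String text, List<UnaryOperator<String>> pipeline) {
        String current = text;
        for (UnaryOperator<String> service : pipeline) {
            current = service.apply(current);
        }
        return current;
    }

    public static void main(String[] args) {
        System.out.println(run("sent. della Corte Costituzionale",
                List.of(TYPE_SERVICE, AUTHORITY_SERVICE)));
    }
}
```

Because every service has the same text-in, text-out shape, reordering, disabling or replacing a service amounts to editing the list passed to `run`, which is the customizability property Section 4 describes.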
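
Illustrative sketch for Section 5.1.1 (ECLI identification). Since the ECLI syntax is language independent, a single pattern can locate ECLI codes in any text. The sketch below uses the Java regex package instead of JFlex and encodes the syntax as we understand it from the Council conclusions: the prefix ECLI, a two-letter country code, a court code of 1 to 7 characters starting with a letter, a four-digit year and an ordinal of up to 25 alphanumeric characters that may contain dots. The class and method names are hypothetical, and the sample strings in `main` only exercise the pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EcliFinderSketch {

    // The ordinal must start and end with a letter or digit, so that a
    // sentence-final full stop is not swallowed into the identifier.
    private static final Pattern ECLI = Pattern.compile(
            "\\bECLI:[A-Z]{2}:[A-Z][A-Z0-9]{0,6}:\\d{4}:"
                    + "[A-Z0-9](?:[A-Z0-9.]{0,23}[A-Z0-9])?",
            Pattern.CASE_INSENSITIVE);

    /** Returns every substring of the text that looks like an ECLI code. */
    static List<String> findAll(String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = ECLI.matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findAll(
                "as held in ECLI:NL:HR:2014:952. See also ECLI:EU:C:2014:317"));
    }
}
```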
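
Illustrative sketch for Section 5.1.6 (CELEX generation for European legislation). When the type, year and number of a directive or regulation are known, both identifiers can be composed mechanically. The sketch assumes the common CELEX pattern for secondary legislation (sector 3, four-digit year, type letter L or R, number padded to four digits) and the ELI URI template used on data.europa.eu. The class, enum and method names are hypothetical, and special cases of act numbering are not handled.

```java
final class CelexSketch {

    /** The two act types covered by the default service described in the paper. */
    enum EuActType {
        DIRECTIVE('L', "dir"),
        REGULATION('R', "reg");

        final char celexCode;   // document type letter used in CELEX numbers
        final String eliCode;   // type segment used in ELI URIs

        EuActType(char celexCode, String eliCode) {
            this.celexCode = celexCode;
            this.eliCode = eliCode;
        }
    }

    /** Composes a CELEX number: sector 3 + year + type letter + 4-digit number. */
    static String celex(EuActType type, int year, int number) {
        return String.format("3%04d%c%04d", year, type.celexCode, number);
    }

    /** Composes the corresponding ELI URI of the Official Journal version. */
    static String eli(EuActType type, int year, int number) {
        return "http://data.europa.eu/eli/" + type.eliCode + "/" + year + "/" + number + "/oj";
    }

    public static void main(String[] args) {
        // Regulation (EU) 2016/679
        System.out.println(celex(EuActType.REGULATION, 2016, 679));
        System.out.println(eli(EuActType.REGULATION, 2016, 679));
    }
}
```

For Regulation (EU) 2016/679 the two calls yield 32016R0679 and http://data.europa.eu/eli/reg/2016/679/oj.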
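
Illustrative sketch for Section 5.1.8 (HTML rendering). The paper states that the final “legal reference” annotations are turned into anchor tags whose href is the optional URL from the identifier generation phase, but it does not show how the URL is attached to the annotation. The sketch therefore assumes a hypothetical form `[BOECLI:LEGAL_REFERENCE:<id>]...[/BOECLI]` and a hypothetical map from reference id to URL; class and method names are likewise invented for illustration.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HtmlRenderingSketch {

    // Assumed annotation form: [BOECLI:LEGAL_REFERENCE:<id>]cited text[/BOECLI]
    private static final Pattern LEGAL_REFERENCE = Pattern.compile(
            "\\[BOECLI:LEGAL_REFERENCE:([^\\]]*)\\](.*?)\\[/BOECLI\\]", Pattern.DOTALL);

    /** Minimal HTML escaping for text nodes and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /** Replaces each legal reference annotation with an anchor element. */
    static String render(String annotated, Map<String, String> urlsByReferenceId) {
        Matcher matcher = LEGAL_REFERENCE.matcher(annotated);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (matcher.find()) {
            out.append(escape(annotated.substring(last, matcher.start())));
            String url = urlsByReferenceId.get(matcher.group(1));
            // The URL is optional: without one, the anchor carries no href.
            out.append(url == null ? "<a>" : "<a href=\"" + escape(url) + "\">");
            out.append(escape(matcher.group(2))).append("</a>");
            last = matcher.end();
        }
        out.append(escape(annotated.substring(last)));
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(render(
                "See [BOECLI:LEGAL_REFERENCE:REF1]Regulation (EU) 2016/679[/BOECLI].",
                Map.of("REF1", "http://data.europa.eu/eli/reg/2016/679/oj")));
    }
}
```

A national implementor wanting a different output format would replace only this last step, leaving the rest of the pipeline untouched.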