Proceedings of the 1st Workshop on Technologies for Regulatory Compliance LawORDate: a Service for Distinguishing Legal References from Temporal Expressions María Navas-Loro1 Ontology Engineering Group Universidad Politécnica de Madrid, Madrid, España, mnavas@fi.upm.es, http://marianavas.oeg-upm.net/ Abstract. References to documents in the legal domain usually follow patterns containing temporal information in different forms (e.g. ’Di- rective 2001/29’). These references mislead algorithms detecting pure temporal references, and false positives occur in named entity recogni- tion algorithms searching dates or intervals. This paper presents meth- ods and techniques to identify these references, applied to two different domains. The first domain is that of news, where the temporal infor- mation plays a crucial role for their understanding and automatically building timelines can be hampered by the errors induced from these le- gal references. The second domain is that dataset descriptions. Dataset descriptions sometimes contain temporal information, not only in their dedicated metadata fields (e.g. dataset creation) but also within the text of their description. LawORDate, the system presented in this paper, is a web service able to detect legal references with temporal information in Spanish texts. The service identifies these references, avoiding their annotation by temporal taggers and enabling a further step of linking the references to the original sources and building co-reference graphs. Keywords: legal references, temporal expressions, news, dataset de- scription 1 Introduction Temporal expressions detection, mainly focused on news, is a emerging field gaining more and more importance in NLP. Efforts such as the NewsReader project1 and the TempEval [1, 2] initiatives in SemEval, along with subsequent more specific temporal tasks [3, 4] show the interest in processing the temporal dimension on all kind of texts. Usually processing of temporal expressions is done regarding the concrete type of text being faced, both depending on its field (such as news, clinical domain or historical texts) or extension (free texts or length-limited tweets). Due to this specialization, systems do not usually react well when they find expressions from other fields, such as is the case of legal references in news or dataset description. 1 http://www.newsreader-project.eu/results/data/wikinews/ 25 Proceedings of the 1st Workshop on Technologies for Regulatory Compliance The boom of open data portals also present this kind of mixed information. Thousands of datasets become publicly available everyday, sometimes presenting just basic scarce metadata such as title and description. Being able to extract additional information and new search parameters from them, such as named en- tities or temporal references, would facilitate managing them, along with linking them resources or queries. To this end, a system2 was built to extract temporal coverage from both news and related datasets in Spanish, some of them in the legal domain, and be able to link them based in the temporal dimension. This system calls an existing temporal tagger, HeidelTime [5], able to detect temporal expressions in texts in Spanish and tag them following the TIMEX3 annotation standard. Nevertheless, this tagger happened to tag as temporal expressions references to Spanish laws and legal documents that led to false positives, such as shown in the example exposed in Fig. 1, extracted from a real article3 . The result of the tagging by HeidelTime can be found in Fig. 2. Estas actividades están reguladas por Real Decreto 1341/2007, de 11 de octubre sobre la gestión de la calidad de las aguas de baño, incorporando al derecho español la Directiva 2006/7/CE del Parlamento Europeo y del Consejo de 15 de febrero de 2006 relativa a la gestión de la calidad de las aguas de baño. Fig. 1. For English: ’These activities are regulated by Royal Decree 1341/2007, of 11th October on the management of bathing water quality, incorporating into Spanish law Directive 2006/7/ EC of the European Parliament and of the Council of 15th February 2006 on to the management of the quality of bathing waters.’ Estas actividades están reguladas por Real Decreto 1341/2007, de 11 de octubre sobre la gestión de la calidad de las aguas de baño, incorporando al derecho español la Directiva 2006/7/CE del Parlamento Europeo y del Consejo de 15 de febrero de 2006 relativa a la gestión de la calidad de las aguas de baño. Fig. 2. In blue, result of HeidelTime tagging on the text in Fig. 1. We also find this problem in the description of datasets, being specially prob- lematic when obtaining obviously inconsistent dates such as happens in the ex- 2 https://github.com/mnavasloro/AportaCuando 3 http://www.castillalamancha.es/actualidad/notasdeprensa/castilla-la-mancha- cuenta-con-35-zonas-de-ba%C3%B1o-autorizadas-donde-disfrutar-de-la-naturaleza 26 Proceedings of the 1st Workshop on Technologies for Regulatory Compliance ample in Fig.3, extracted from the description of a real dataset4 . Here the tagged dates without a legal-focused preprocessing were ’2093’, ’2008’ and ’2008-12- 19’. While the latest can at least be used as a lower temporal bound (since there is no additional temporal information on the coverage in the description), the year 2093 is obviously inconsistent. Base de datos que proporciona información sobre los Centros Tecnológicos y Cen- tros de apoyo a la Innovacin inscritos en el registro creado mediante el Real De- creto 2093/2008, de 19 de diciembre. Permite la consulta por Modalidad, rea Tec- nológica, Sector, Comunidad Autónoma y/o Provincia. Además, posibilita la descarga de la versión completa en PDF. Fig. 3. For English: ’Database that provides information on Technology Centers and Innovation Support Centers registered in the registry created by the Royal Decree 2093/2008, of December 19. It allows consultation by Modality, Technological Area, Sector, Autonomous Community and/or Province. In addition, it allows to download the full version in PDF.’ The aim of the web service LawORDate5 introduced in this paper is to detect common legal expressions appearing in non-legal texts that tend to mislead temporal taggers and replace them in the text, in order to obtain a clean version of it where temporal taggers are able to detect just temporal expressions. The remainder will expose a brief state-of-the-art and an analysis on usual legal expressions with patterns similar to temporal expressions in Spanish, along with examples of regular expressions able to detect most of them (tested in a case of use on descriptions of datasets from the Spanish Open Data portal). Finally, conclusions derived from this analysis and future work on this topic will be exposed. 2 State of the Art Processing the temporal dimension of legal text has been previously tackled in literature [6–8], and the confusion between legal and temporal references has been previously exposed [9]. Nevertheless, to the best of her knowledge, the author is not aware to any previous dedicated approach to detect legal references specifically for ulterior temporal processing. Identification of legal cross-references has been widely studied in literature [10], being targeted in different languages (such as French [11, 12], Dutch [13], Italian [14] or Japanese [15]) and with different levels of deepness. We find for instance the approach of Adedjouma et al. [11] for the Luxembourg’s Legisla- tion (later expanded to a Canadian legal corpus [12]), where a complete schema 4 http://datos.gob.es/catalogo/e04990501-registro-de-centros-tecnologicos-y-centros- de-apoyo-a-la-innovacion-tecnologica 5 https://github.com/mnavasloro/LawORDate (with information on how to use the web service) 27 Proceedings of the 1st Workshop on Technologies for Regulatory Compliance identifying different parts that can be included in a reference in this context (such as Part, Book or Article), along with the different information in them (dates, names, headers...) are built. The authors also make a distinction be- tween simple and complex cross reference patterns; this had been previously exposed also in [13], including also some special cases, where a grammar allowed identification of just in-collection legal references in documents from the Dutch Tax and Customs Administration. Finally, the work by Tran et al. [15] focused on references to sub-document targets, proposing machine-learning based ap- proaches. Also more generic-aimed frameworks for managing legal documents, such as NORMA-system [16], include services for marking-up legal references, called by further works [17]. Differently from the temporal aim presented in this paper, the use of this techniques for legal references identification go from mark-up and linking [16] to normalization [14]. Most of these approaches are based in patterns; the only work in Spanish the author is aware of also follows this pattern-based approach [18]. 3 Analysis of the problem In the frame of news and dataset description processing, namely trying to locate them into a temporal instant or interval, several legal references happened to be tagged as temporal expressions by a state-of-the-art temporal tagger. Some examples are the following expressions, that refer to different official Spanish documents or laws: – Ley Orgánica 10/1995 (Organic Law). – Ley 22/2011, de 28 de julio (Law). – BOE: 29/07/2011 or BOE de 22 de julio or BOE núm. 306, de 23 de diciembre (BOE: Boletı́n Oficial del Estado Official State Gazette). – Real Decreto 1341/2007 (sometimes also expressed as RD 1463/2007, Royal Decree) – Directiva 2012/27/UE. These references are often also surrounded by a date referred to their creation (being therefore important to detect them as well). These legal expressions can also include additional words such as in ’Real Decreto Legislativo’ (Legislative Royal Decree) or be combined such as in ’Real Decreto Legislativo 1/2004 de 5 de enero BOE de 8 de marzo’. Also exceptions where dates near to references to legal documents can be found, such as happens when the dataset contains information about the proper legal document, such as in the example6 depicted below, where the dates refer indeed to temporal coverage: 6 http://datos.gob.es/catalogo/l01280148-publicaciones-en-boletin-oficial-del-estado- boe-2013-2017 28 Proceedings of the 1st Workshop on Technologies for Regulatory Compliance Publicaciones en Boletn Oficial del Estado (BOE): 2013-2017. (for English: ’Pub- lications in the Official State Gazette (BOE): 2013-2017.’ ) The problem of detecting these references is therefore not straightforward. In the following section some patterns for references found in a concrete application case (dataset descriptions and news) are introduced. 4 Hands on and first patterns The corpus we have worked with is a dump of metadata from the Spanish open data portal datos.gob.es7 , consisting of almost 16k datasets. Some of them con- tained temporal coverage information expressed as dcat:temporal property8 , but most of them had their upload and creation date as only temporal informa- tion, along with information on the publisher, the title and the description. A first analysis performed on metadata from these datasets showed that most appearances followed constrained patterns, and that texts that presented this kind of references misled the temporal tagger. Besides detecting temporal expressions that are not actually from the text timeline, another major prob- lem derives from this misleading: temporal normalization9 is also affected, since some dates can be wrongly normalized because of the misidentification of legal references nearby as temporal expressions. Some of the used patterns for detecting these problematic legal references are the exposed in Fig. 4. (([D|d](irectiva|IRECTIVA)) (\d*)\/(\d*)\/(\w*))(,? de (\d*) de ([E|e]nero|[F|f]ebrero|[M|m] arzo|[A|a]bril|[M|m]ayo|[J|j]unio|[J|j]ulio|[A|a]gosto|[S|s]emptiembre|[O|o]ctubre|[N|n]ovie mbre|[D|d]iciembre)( de (\d\d\d\d))?)? (([R|r](eal|EAL) [D|d](ecreto|ECRETO)) (\d*)\/(\d*))(,? de (\d*) de ([E|e]nero|[F|f]ebrero|[ M|m]arzo|[A|a]bril|[M|m]ayo|[J|j]unio|[J|j]ulio|[A|a]gosto|[S|s]emptiembre|[O|o]ctubre|[N|n] oviembre|[D|d]iciembre)( de (\d\d\d\d))?)? ((([L|l]ey [O|o]rg[|a]nica)|(LEY ORG[|A]NICA)) (\d*)\/(\d*))(,? de (\d*) de ([E|e]nero|[F|f] ebrero|[M|m]arzo|[A|a]bril|[M|m]ayo|[J|j]unio|[J|j]ulio|[A|a]gosto|[S|s]emptiembre|[O|o]ctub re|[N|n]oviembre|[D|d]iciembre)( de (\d\d\d\d))?)? Fig. 4. Some patterns for detecting references to Spanish legal documents, as well as surrounding dates referring to them. 7 http://datos.gob.es/ 8 https://www.w3.org/TR/vocab-dcat/#Property:dataset temporal 9 Temporal normalization can be described as ”to assign the same value to all expres- sions carrying the same semantics or referring to the same point in time” [5]; this is, the temporal anchoring (often derived from context) for an incomplete temporal expression. An example can be how from the sentence ’The 4th of October of 1991 he came here. The 6th he left.’, the date ’06/10/1991’ can be derived for ’6th’. 29 Proceedings of the 1st Workshop on Technologies for Regulatory Compliance Once these patterns are detected, they are replaced in the text by strings containing information of the legal references detected, but in a format that does not mislead the temporal tagger. This new version of the text maintains all the original genuine temporal expressions, being therefore the ones remaining those that must be detected by the tagger. Once the text is correctly tagged, old legal references can be recovered. Beside facilitating single-use temporal processing of isolate documents, this service also allows to generate correctly temporally tagged texts with legal references that can be used for training machine-learning based temporal taggers in order to adapt them to the legal domain. 5 Conclusions and Future Work The work presented shows how just a basic preprocessing for detecting legal expressions to prevent temporal taggers from tagging them can improve temporal tagging on all kind of legal related texts, being for instance able to solve similar cases to the examples exposed in the introduction. Also other languages or kinds of texts could benefit from this preprocessing: the work made for Spanish and general but legal related texts (news and datasets in our case) can be adopted also for other languages and kinds of texts, such as genuine legal documents. Future work include asking experts in the field for more ways in which legal references can be written, along with increasing the amount of references detected and the languages covered by the web service. Acknowledgments This project has received funding from the European Union’s Horizon 2020 re- search and innovation programme under grant agreement No 780602. References 1. Marc Verhagen, Robert Gaizauskas, Frank Schilder, Mark Hepple, Graham Katz, and James Pustejovsky. Semeval-2007 task 15: Tempeval temporal relation identifi- cation. In Proceedings of the 4th International Workshop on Semantic Evaluations, pages 75–80. Association for Computational Linguistics, 2007. 2. Marc Verhagen, Roser Sauri, Tommaso Caselli, and James Pustejovsky. Semeval- 2010 task 13: Tempeval-2. In Proceedings of the 5th international workshop on semantic evaluation, pages 57–62. Association for Computational Linguistics, 2010. 3. Hector Llorens, Nathanael Chambers, Naushad UzZaman, Nasrin Mostafazadeh, James F Allen, and James Pustejovsky. Semeval-2015 task 5: Qa tempeval- evaluating temporal information understanding with question answering. In Se- mEval@ NAACL-HLT, pages 792–800, 2015. 4. Steven Bethard, Guergana Savova, Wei-Te Chen, Leon Derczynski, James Puste- jovsky, and Marc Verhagen. Semeval-2016 task 12: Clinical tempeval. Proceedings of SemEval, pages 1052–1062, 2016. 5. Jannik Strötgen and Michael Gertz. Multilingual and cross-domain temporal tag- ging. Language Resources and Evaluation, 47(2):269–298, 2013. 30 Proceedings of the 1st Workshop on Technologies for Regulatory Compliance 6. Frank Schilder. Event Extraction and Temporal Reasoning in Legal Documents. In Annotating, Extracting and Reasoning about Time and Events, International Seminar, Dagstuhl Castle, Germany, April 10-15, 2005. Revised Papers, pages 59– 71, 2005. 7. Venkateswrlu Naik, Guda Vanitha, and Srujana Inturi. Reasoning in legal text documents with extracted event information. International Journal of Computer Applications, 28:8—13, 08 2011. 8. Kolikipogu Ramakrishna, Vanitha Guda, B.Padmaja Rani, and Vinaya Ch. A novel model for timed event extraction and temporal reasoning in legal text documents. International Journal of Computer Science and Engineering Survey, 2:39–48, 02 2011. 9. Daniel Isemann, Khurshid Ahmad, Tim Fernando, and Carl Vogel. Temporal De- pendence in Legal Documents, pages 497–504. Springer Berlin Heidelberg, Berlin, Heidelberg, 2013. 10. Monica Palmirani, Raffaella Brighi, and Matteo Massini. Automated extraction of normative references in legal texts. In Proceedings of the 9th international confer- ence on Artificial intelligence and law, pages 105–106. ACM, 2003. 11. M. Adedjouma, M. Sabetzadeh, and L. C. Briand. Automated detection and resolu- tion of legal cross references: Approach and a study of luxembourg’s legislation. In 2014 IEEE 22nd International Requirements Engineering Conference (RE), pages 63–72, Aug 2014. 12. Nicolas Sannier, Morayo Adedjouma, Mehrdad Sabetzadeh, and Lionel Briand. An automated framework for detection and resolution of cross references in legal texts. Requirements Engineering, 22(2):215–237, Jun 2017. 13. Emile de Maat, Radboud Winkels, and Tom van Engers. Automated detection of reference structures in law. Frontiers in Artificial Intelligence and Applications, 152:41–50, 2006. 14. M. Palmirani, R. Brighi, and M. Massini. Processing normative references on the basis of natural language questions. In Proceedings. 15th International Workshop on Database and Expert Systems Applications, 2004., pages 9–12, Aug 2004. 15. Oanh Thi Tran, Bach Xuan Ngo, Minh Le Nguyen, and Akira Shimazu. Automated reference resolution in legal texts. Artificial Intelligence and Law, 22(1):29–60, Mar 2014. 16. Monica Palmirani and Federica Benigni. Norma-system: A legal information sys- tem for managing time. In Proceedings of the V legislative XML workshop, pages 205–223, 2007. 17. Leonardo Lesmo, Alessandro Mazzei, Monica Palmirani, and Daniele P. Radicioni. Tulsi: an nlp system for extracting legal modificatory provisions. Artificial Intelli- gence and Law, 21(2):139–172, May 2013. 18. Mercedes Martı́nez-González, Pablo de la Fuente, and Dámaso-Javier Vicente. Ref- erence extraction and resolution for legal texts. In International Conference on Pattern Recognition and Machine Intelligence, pages 218–221. Springer, 2005. 31