Lexisla: a Legislative Information Retrieval System∗

Ismael Hasan (IRLab, ICT Centre, Campus Elviña s/n, A Coruña), ihasan@udc.es
Javier Parapar (IRLab, A Coruña Univ., Campus Elviña s/n, A Coruña), javierparapar@udc.es
Álvaro Barreiro (IRLab, A Coruña Univ., Campus Elviña s/n, A Coruña), barreiro@udc.es

∗ This work was funded by FEDER, Ministerio de Ciencia e Innovación and Xunta de Galicia under projects TIN2008-06566-C04-04 and 07SIN005206PR.

Abstract: New legislative documents are published every day on the Internet, comprising changes in the legislation: laws, decisions, resolutions, etc. Lexisla intends to offer access to this information through a single search application which crawls, analyses and segments the daily published legislative documents.
Keywords: Legislation search system, document segmentation, text extraction

1. Introduction

In recent years, the growth of the Internet has favoured the use of electronic documents. Public administrations now also offer the printed documentation they generate in electronic form, PDF being the most used format. Legislative publications are a particular case: this kind of document is produced on a daily basis and is supplied from the publishers' web pages.

The information they cover is useful for a wide variety of Internet users, the most representative being the lawyers' community. Enterprises can also use the information in legal documents: a new regulation may affect their business model, for instance. Finally, the third target group of legal information is the whole citizenship of a country: the documents can contain notifications to specific people, important official dates, etc.

However, despite the fact that this information is very valuable, searching over it is a hard task: the official publishers offer search engines to access the information, but each of those search engines gives access to only one source of documents. There are also commercial applications offering searches over several bulletins, but the results they return are not fully satisfactory (for instance, some applications search only over the summaries of the documents). In this work we present Lexisla¹, a system that offers searches over several different legislative publications; moreover, the information is processed and analysed, so that a document (which can contain hundreds of pages) is segmented into the legislative units (resolutions, notifications, etc.) it comprises. The information in these units is also analysed, so that the final users of the system can make complex searches over it, including titles, publisher organisms, etc.

¹ An operative version for registered users is available from www.irlab.org/lexisla. An evaluation account can be requested at irlab@udc.es.
2. System Overview

Lexisla is a web application for accessing the legislation periodically published in the online official sources. It is divided into two subsystems: the user application, which offers searches over the information maintained by the system, and the management application, used to manage that information. At the present moment, the information processed and maintained by the system comprises European and Spanish official bulletins and several Spanish regional bulletins. The system allows the addition of new sources of official bulletins through the management application. Also, the administrator of the system can schedule when the system automatically crawls new information from a source, and can create and assign search profiles to the users.

The documents automatically downloaded by the application are processed in the following way: first, the text of each document is extracted in reading order (a task that is especially difficult with PDF documents); next, the text is segmented into the legislative units it contains; finally, these units are analysed and segmented. For each of them the following fields are stored: body, title, publisher organism, date, document and source, page numbers of the unit, and type of the legislative unit (resolution, notification, etc.).

This processing of the documents provides the users with the following features: display of results in the web browser, download of the pages of a document containing an information unit, document browsing ("Which are the resolutions of this document?") and advanced search features, such as searching only in a few sources, searching in certain specific fields, and the use of regular expressions, so Lexisla can offer a wide range of available searches.
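The per-unit record described above could be modelled, for illustration, as a simple data class; the field names below are our own assumption, not Lexisla's actual schema:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class LegislativeUnit:
    """One legislative unit segmented out of a bulletin document.

    Field names are illustrative, not Lexisla's actual schema.
    """
    body: str                # full text of the unit
    title: str
    publisher_organism: str
    published: date
    document_id: str         # source document the unit belongs to
    source: str              # bulletin it was crawled from, e.g. "BOE"
    first_page: int          # page numbers of the unit inside the document
    last_page: int
    unit_type: str           # "resolution", "notification", ...
```

Storing the page range alongside the body is what makes it possible to let users download exactly the pages of the original document that contain a given unit.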
3. System Architecture

Lexisla was designed as a Model-View-Controller web application, following a component-based architecture. The most relevant components of the application are explained next, followed by an explanation of the data storage of Lexisla.

Figure 1: Lexisla Model Architecture

3.1. Crawling module

This module accesses the web pages of the publisher organisms and administrations (defined by the administrator of the application) and downloads all of the relevant documents.

3.2. Text Extraction module

PDF is the most usual format for distributing electronic documents. Currently, there are several tools to extract the text from this type of document (Apache Software Foundation, 2010; Phelps, 2010), but it is very common for the extracted text to contain errors: for instance, the paragraphs of a page may be returned out of order. This issue penalises the analysis of the information, so Lexisla contains its own text extraction module, specially designed to bypass this problem.
3.3. Document Analysis module

This component processes the texts obtained by the Text Extraction module and extracts all of the information contained inside each document. It also analyses each legislative unit to obtain its fields (title, publisher organism, start and end page, body, etc.).

3.4. Indexing and Search modules

These components use incremental indexes to store the legislative units. They apply IR algorithms to process queries against inverted indexes, ensuring efficient and effective search.
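As a rough illustration of this search side (not Lexisla's actual implementation), a conjunctive query against an incrementally updated inverted index can look like this:

```python
from collections import defaultdict

class InvertedIndex:
    """Toy inverted index over legislative units; illustrative only."""

    def __init__(self):
        self.postings = defaultdict(set)   # term -> set of unit ids
        self.units = {}                    # unit id -> original text

    def add(self, unit_id, text):
        # Incremental update: new units can be indexed at any time.
        self.units[unit_id] = text
        for term in text.lower().split():
            self.postings[term].add(unit_id)

    def search(self, query):
        # Conjunctive (AND) query: intersect the postings lists.
        terms = query.lower().split()
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for t in terms[1:]:
            result &= self.postings.get(t, set())
        return result
```

A production system would add stemming, ranking and field-restricted queries on top of the same postings-list idea.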
3.5. Information storage

To satisfy the users' information needs, the information is stored in three different systems, and referential integrity is maintained between them:

Search index: contains information about the legislative units.
Database: contains information about users, configuration, search profiles, documents, etc.
File system: stores the original documents.

4. Research Issues

Lexisla is an IR system that uses state-of-the-art algorithms and techniques for crawling, text extraction, segmentation of information, and search. In this section we explain some of the most relevant research issues.

4.1. Extraction of Ordered Text

As explained earlier, one of the challenges is the extraction of correctly ordered text from PDF documents. For this purpose, we developed a method which simulates the human reading order to obtain the text from documents. For each page, it works as follows:

1. Detection of the rectangular text regions present in the page.
2. Retrieval of the list of images and creation of regions using the image coordinates.
3. Splitting of the text regions which are crossed by image regions.
4. Sorting of the regions of the page in the following way:
   a) The region comprising the header of the page.
   b) The top left region.
   c) The regions to the right of the region obtained in (b).
   d) The leftmost region of the page which is below the previously found regions.
   e) The regions to the right of the region obtained in (d).
   f) Steps (d) and (e) are repeated until no more regions are found.
5. Extraction of the text of each region, in the order established in step 4.
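The ordering in step 4 can be sketched as follows. This is a simplified reconstruction under our own assumptions, not the LRE implementation: regions are axis-aligned boxes (x0, y0, x1, y1) with y growing downwards, and the page header is simply the topmost band, so it naturally comes out first.

```python
def overlaps_vertically(a, b):
    """True if two boxes share part of their vertical extent."""
    return a[1] < b[3] and b[1] < a[3]

def reading_order(regions):
    """Order page regions as in step 4: take the leftmost of the topmost
    remaining regions (steps b/d), then the regions to its right, left to
    right (steps c/e), and repeat (step f) until no regions remain."""
    remaining = list(regions)
    ordered = []
    while remaining:
        # leftmost of the topmost remaining regions
        top = min(remaining, key=lambda r: (r[1], r[0]))
        # that region plus the regions on its right, left to right
        band = sorted((r for r in remaining if overlaps_vertically(r, top)),
                      key=lambda r: r[0])
        ordered.extend(band)
        for r in band:
            remaining.remove(r)
    return ordered
```

On a two-column page this yields header, then left and right regions of the first band, then the next band, and so on, which matches the human reading order the method simulates.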
It is worth mentioning that Lexisla also deals with language identification issues. A legislative document can contain text in different languages: for instance, the "Boletín Oficial del Territorio Histórico de Álava" contains sections in which the text of the left columns is written in Basque and the same text appears translated into Spanish in the right ones. Lexisla identifies the language of each region, so the result of the text extraction contains the text in only one language.

To evaluate our method, its results were compared against the results obtained with PDFBox and with an implementation of the XY-cuts algorithm (Mao and Kanungo, 2001); our method is coined "LRE" (Left Regions Expansion). The metric used was the ratio of pages correctly extracted. The dataset comprises documents from the European Union (BEU, OJEU), America (FR), the United Kingdom (UK), France (JO) and Spain (BOE, DOG, BOCYL). Our algorithm greatly outperformed XY-cuts, although it is fair to say that XY-cuts was not designed for this task. Our algorithm also outperformed PDFBox: the overall ratio of pages correctly extracted with LRE was 96%, while the overall ratio with PDFBox was 87%. The difference between the LRE and PDFBox means is statistically significant according to the Wilcoxon test (p < 0.05).

4.2. Documents Segmentation

Legal documents can contain many resolutions, communications, etc. A user searching through a LegalIR system does not expect to receive complete documents as the response to a query; instead, he expects the results to be single information units, so these full documents need to be segmented. A brief description of the document analysis process in Lexisla, which uses a specialised lexicon, follows; an extended version of this summary can be found in the work of Hasan, Parapar, and Blanco (2008).

Text pre-processing. The PDF format was originally created to look good to the users. Because of this, when an application builds a PDF containing text, this text is not exactly the same as in the original version. For instance, the "fi" sequence (numeric codes \102\105) is coded as the single ligature character "fi" (numeric code \64257). This issue must be taken into account when extracting the text from a PDF document.

Identification of the titles contained in the index of the document. The main characteristic of these titles is that they always start with a special word ("Resolution", "Notification", etc.). Lexisla looks for phrases inside the index which begin with these specific words, or with variations of them. This step returns the titles of the legislative units of the document.

Identification of resolutions and other legislative units using the titles. First, the lexicon terms are searched for over the whole text. With this list of terms, a list of title candidates is built. Then, this list is compared against the list obtained from the index of the document. Those titles from the content which exactly match a title in the index, and which are found in the same order as in the index, are stored. When some of the titles in the index are not matched, the comparison is softened by using an n-gram comparison instead of an exact match.
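The softened comparison could, for illustration, use character n-gram overlap; the n-gram size and threshold below are our assumptions, not the values used in Lexisla:

```python
def ngrams(text, n=3):
    """Set of character n-grams of a whitespace-normalised, lowercased string."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def ngram_similarity(a, b, n=3):
    """Jaccard overlap between the character n-gram sets of two titles."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb)

def soft_match(candidate, index_titles, threshold=0.7):
    """Return the index title closest to the candidate, if close enough."""
    best = max(index_titles, key=lambda t: ngram_similarity(candidate, t))
    return best if ngram_similarity(candidate, best) >= threshold else None
```

Unlike exact matching, this tolerates the small punctuation and hyphenation differences that PDF extraction introduces between a title in the index and the same title in the body.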
Identification of full legislative units. The full content of each unit is the text between its title and the title of the next unit.

To evaluate our segmentation algorithm we built an evaluation set composed of 20 documents from heterogeneous sources, providing more than 1400 legislative units. The metrics used to evaluate the segmentation algorithm were recall (the number of units correctly extracted divided by the total number of units) and precision (the number of units correctly extracted divided by the total number of extracted units). The results are very good, with a mean precision of 97.85% and a mean recall of 95.99%. Also, for every source both values stand above 93%. Regarding the computing time, the algorithm needs 0.13 seconds per legislative unit².

² Pentium 4, 3 GHz, 1 GB of RAM.
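The two metrics above amount to the following (a sketch in which a unit counts as correct only when it exactly matches a gold-standard unit):

```python
def precision_recall(extracted, gold):
    """Precision and recall of extracted units against gold-standard units."""
    correct = len(set(extracted) & set(gold))
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall
```

With these definitions, extracting spurious units lowers precision, while missing real units lowers recall.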
5. Conclusions and Future Work

In this work we presented a LegalIR system, Lexisla. The implementation of the application posed special research challenges, like the extraction of text from PDF documents or the segmentation of the documents into the legislative units they comprise, which were successfully faced, as shown in the evaluation of the results. To accomplish this, the system makes use of several NLP techniques, like string similarity comparisons, stemming, searches using regular expressions, the use of a specific lexicon to segment the documents, and language identification features.

As for the future, there are several tasks to be considered in the domain of this application:

Entities detection. It can be very interesting to infer which people or enterprises are affected by a concrete notification, resolution, etc.

Crossed references. It is very usual for a legislative unit to make a reference to another one, and the automatic detection of these references can improve the users' experience. The work of Yang et al. (2009) can provide a good starting point for this feature. In that work, the authors face the problem of using an entire document as a query for a search; one of the main steps proposed is the extraction of phrases from the document to be used as queries, and similar methods can be used to identify the crossed references inside a legislative text.

Generalisation of the segmentation algorithm to be used in different domains.

References

Apache Software Foundation. 2010. PDFBox. http://pdfbox.apache.org/.

Hasan, Ismael, Javier Parapar, and Roi Blanco. 2008. Segmentation of legislative documents using a domain-specific lexicon. In DEXA '08: Proceedings of the 19th International Conference on Database and Expert Systems Application, pages 665–669, Washington, DC, USA. IEEE Computer Society.

Mao, Song and Tapas Kanungo. 2001. Empirical performance evaluation methodology and its application to page segmentation algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(3):242–256.

Phelps, Tom. 2010. Multivalent. http://multivalent.sourceforge.net/.

Yang, Yin, Nilesh Bansal, Wisam Dakka, Panagiotis Ipeirotis, Nick Koudas, and Dimitris Papadias. 2009. Query by document. In WSDM '09: Proceedings of the Second ACM International Conference on Web Search and Data Mining, pages 34–43, New York, NY, USA. ACM.