Lexisla: a Legislative Information Retrieval System∗

Ismael Hasan (IRLab, ICT Centre, Campus Elviña s/n, A Coruña), ihasan@udc.es
Javier Parapar (IRLab, A Coruña Univ., Campus Elviña s/n, A Coruña), javierparapar@udc.es
Álvaro Barreiro (IRLab, A Coruña Univ., Campus Elviña s/n, A Coruña), barreiro@udc.es

∗ This work was funded by FEDER, Ministerio de Ciencia e Innovación and Xunta de Galicia under projects TIN2008-06566-C04-04 and 07SIN005206PR.

Abstract: New legislative documents are published every day on the Internet, comprising changes in the legislation: laws, decisions, resolutions, etc. Lexisla intends to offer access to this information through a single search application which crawls, analyses and segments the daily published legislative documents.
Keywords: Legislation search system, document segmentation, text extraction

1. Introduction

In recent years, the growth of the Internet has favoured the use of electronic documents. Public administrations now also offer the printed documentation they generate in electronic form, PDF being the most used format. Legislative publications are a particular case: this kind of document is produced on a daily basis and is supplied from the publishers' web pages.

The information they cover is useful for a wide variety of Internet users, the most representative being the lawyers' community. Enterprises can also use the information in legal documents: a new regulation may affect their business model, for instance. Finally, the third target group of legal information is the whole citizenship of a country: the documents can contain notifications to specific people, important official dates, etc.

However, despite the fact that this information is very valuable, searching over it is a hard task: the official publishers offer search engines to access the information, but each of those search engines gives access to only one source of documents. There are also commercial applications offering searches over several bulletins, but the results they return are not fully satisfactory (for instance, some applications search only over the summaries of the documents). In this work we present Lexisla¹, a system that offers searches over several different legislative publications; moreover, the information is processed and analysed, so that a document (which can contain hundreds of pages) is segmented into the legislative units (resolutions, notifications, etc.) it comprises. The information in these units is also analysed, so that the final users of the system can make complex searches over it, including titles, publisher organisms, etc.

¹ An operative version for registered users is available from www.irlab.org/lexisla. An evaluation account can be requested at irlab@udc.es.
2. System Overview

Lexisla is a web application for accessing the legislation periodically published in the online official sources. It is divided into two subsystems: the user application, which offers searches over the information maintained by the system, and the management application, used to manage that information. At the present moment, the information processed and maintained by the system comprises European and Spanish official bulletins and several Spanish regional bulletins. The system allows the addition of new sources of official bulletins through the management application. Also, the administrator of the system can schedule when the system automatically crawls new information from a source, and can create and assign search profiles to the users.

The documents automatically downloaded by the application are processed in the following way: first, the text of each document is extracted in reading order (a task that is especially difficult with PDF documents); next, the text is segmented into the legislative units it contains; finally, these units are analysed and segmented. For each of them the following fields are stored: body, title, publisher organism, date, document and source, page numbers of the unit, and type of the legislative unit (resolution, notification, etc.).

This processing of the documents provides the users with the following features: display of results in the web browser, download of the pages of a document containing an information unit, document browsing ("Which are the resolutions of this document?") and advanced search features, such as searching only in a few sources, searching in certain specific fields, and the use of regular expressions, so Lexisla can offer a wide range of available searches.
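The per-unit record described above could be modelled, for illustration, as a simple data class; the field names below are our own assumption, not Lexisla's actual schema:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class LegislativeUnit:
    """One legislative unit segmented out of a bulletin document.

    Field names are illustrative, not Lexisla's actual schema.
    """
    body: str                # full text of the unit
    title: str
    publisher_organism: str
    published: date
    document_id: str         # source document the unit belongs to
    source: str              # bulletin it was crawled from, e.g. "BOE"
    first_page: int          # page numbers of the unit inside the document
    last_page: int
    unit_type: str           # "resolution", "notification", ...
```

Storing the page range alongside the body is what makes it possible to let users download exactly the pages of the original document that contain a given unit.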
3. System Architecture

Lexisla was designed as a Model-View-Controller web application, following a component-based architecture. The most relevant components of the application are explained next, followed by an explanation of the data storage of Lexisla.

Figure 1: Lexisla Model Architecture

3.1. Crawling module

This module accesses the web pages of the publisher organisms and administrations (defined by the administrator of the application) and downloads all of the relevant documents.

3.2. Text Extraction module

PDF is the most usual format for distributing electronic documents. Currently, there are several tools to extract the text from this type of document (Apache Software Foundation, 2010; Phelps, 2010), but it is very common for the extracted text to contain errors: for instance, the paragraphs of a page may be returned out of order. This issue penalises the analysis of the information, so Lexisla contains its own text extraction module, specially designed to bypass this problem.
3.3. Document Analysis module

This component processes the texts obtained by the Text Extraction module and extracts all of the information contained inside each document. It also analyses each legislative unit to obtain its fields (title, publisher organism, start and end page, body, etc.).

3.4. Indexing and Search modules

These components use incremental indexes to store the legislative units. They apply IR algorithms to process queries against inverted indexes, ensuring efficient and effective search.
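As a rough illustration of this search side (not Lexisla's actual implementation), a conjunctive query against an incrementally updated inverted index can look like this:

```python
from collections import defaultdict

class InvertedIndex:
    """Toy inverted index over legislative units; illustrative only."""

    def __init__(self):
        self.postings = defaultdict(set)   # term -> set of unit ids
        self.units = {}                    # unit id -> original text

    def add(self, unit_id, text):
        # Incremental update: new units can be indexed at any time.
        self.units[unit_id] = text
        for term in text.lower().split():
            self.postings[term].add(unit_id)

    def search(self, query):
        # Conjunctive (AND) query: intersect the postings lists.
        terms = query.lower().split()
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for t in terms[1:]:
            result &= self.postings.get(t, set())
        return result
```

A production system would add stemming, ranking and field-restricted queries on top of the same postings-list idea.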
3.5. Information storage

To satisfy the users' information needs, the information is stored in three different systems, and referential integrity is maintained between them:

Search index: contains information about the legislative units.
Database: contains information about users, configuration, search profiles, documents, etc.
File system: stores the original documents.

4. Research Issues

Lexisla is an IR system that uses state-of-the-art algorithms and techniques for crawling, text extraction, segmentation of information, and search. In this section we explain some of the most relevant research issues.

4.1. Extraction of Ordered Text

As explained earlier, one of the challenges is the extraction of correctly ordered text from PDF documents. For this purpose, we developed a method which simulates the human reading order to obtain the text from documents. For each page, it works as follows:

1. Detection of the rectangular text regions present in the page.
2. Retrieval of the list of images and creation of regions using the image coordinates.
3. Splitting of the text regions which are crossed by image regions.
4. Sorting of the regions of the page in the following way:
   a) The region comprising the header of the page.
   b) The top left region.
   c) The regions to the right of the region obtained in (b).
   d) The leftmost region of the page which is below the previously found regions.
   e) The regions to the right of the region obtained in (d).
   f) Steps (d) and (e) are repeated until no more regions are found.
5. Extraction of the text of each region, in the order established in step 4.
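The ordering in step 4 can be sketched as follows. This is a simplified reconstruction under our own assumptions, not the LRE implementation: regions are axis-aligned boxes (x0, y0, x1, y1) with y growing downwards, and the page header is simply the topmost band, so it naturally comes out first.

```python
def overlaps_vertically(a, b):
    """True if two boxes share part of their vertical extent."""
    return a[1] < b[3] and b[1] < a[3]

def reading_order(regions):
    """Order page regions as in step 4: take the leftmost of the topmost
    remaining regions (steps b/d), then the regions to its right, left to
    right (steps c/e), and repeat (step f) until no regions remain."""
    remaining = list(regions)
    ordered = []
    while remaining:
        # leftmost of the topmost remaining regions
        top = min(remaining, key=lambda r: (r[1], r[0]))
        # that region plus the regions on its right, left to right
        band = sorted((r for r in remaining if overlaps_vertically(r, top)),
                      key=lambda r: r[0])
        ordered.extend(band)
        for r in band:
            remaining.remove(r)
    return ordered
```

On a two-column page this yields header, then left and right regions of the first band, then the next band, and so on, which matches the human reading order the method simulates.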
It is worth mentioning that Lexisla also deals with language identification issues. A legislative document can contain text in different languages: for instance, the "Boletín Oficial del Territorio Histórico de Álava" contains sections in which the text of the left columns is written in Basque and the same text appears translated into Spanish in the right ones. Lexisla identifies the language of each region, so the result of the text extraction contains the text in only one language.

To evaluate our method, its results were compared against the results obtained with PDFBox and with an implementation of the XY-cuts algorithm (Mao and Kanungo, 2001); our method is coined "LRE" (Left Regions Expansion). The metric used was the ratio of pages correctly extracted. The dataset comprises documents from the European Union (BEU, OJEU), America (FR), the United Kingdom (UK), France (JO) and Spain (BOE, DOG, BOCYL). Our algorithm greatly outperformed XY-cuts, although it is fair to say that XY-cuts was not designed for this task. Our algorithm also outperformed PDFBox: the overall ratio of pages correctly extracted with LRE was 96%, while the overall ratio with PDFBox was 87%. The difference between the LRE and PDFBox means is statistically significant according to the Wilcoxon test (p < 0.05).

4.2. Documents Segmentation

Legal documents can contain many resolutions, communications, etc. A user searching through a LegalIR system does not expect to receive complete documents as the response to a query; instead, he expects the results to be single information units, so these full documents need to be segmented. A brief description of the document analysis process in Lexisla, which uses a specialised lexicon, follows; an extended version of this summary can be found in the work of Hasan, Parapar, and Blanco (2008).

Text pre-processing. The PDF format was originally created to look good to the users. Because of this, when an application builds a PDF containing text, this text is not exactly the same as in the original version. For instance, the "fi" sequence (numeric codes \102\105) is coded as the single ligature character "fi" (numeric code \64257). This issue must be taken into account when extracting the text from a PDF document.

Identification of the titles contained in the index of the document. The main characteristic of these titles is that they always start with a special word ("Resolution", "Notification", etc.). Lexisla looks for phrases inside the index which begin with these specific words, or with variations of them. This step returns the titles of the legislative units of the document.

Identification of resolutions and other legislative units using the titles. First, the lexicon terms are searched for over the whole text. With this list of terms, a list of title candidates is built. Then, this list is compared against the list obtained from the index of the document. Those titles from the content which exactly match a title in the index, and which are found in the same order as in the index, are stored. When some of the titles in the index are not matched, the comparison is softened by using an n-gram comparison instead of an exact match.
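The softened comparison could, for illustration, use character n-gram overlap; the n-gram size and threshold below are our assumptions, not the values used in Lexisla:

```python
def ngrams(text, n=3):
    """Set of character n-grams of a whitespace-normalised, lowercased string."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def ngram_similarity(a, b, n=3):
    """Jaccard overlap between the character n-gram sets of two titles."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb)

def soft_match(candidate, index_titles, threshold=0.7):
    """Return the index title closest to the candidate, if close enough."""
    best = max(index_titles, key=lambda t: ngram_similarity(candidate, t))
    return best if ngram_similarity(candidate, best) >= threshold else None
```

Unlike exact matching, this tolerates the small punctuation and hyphenation differences that PDF extraction introduces between a title in the index and the same title in the body.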
Identification of full legislative units. The full content of each unit is the text between its title and the title of the next unit.

To evaluate our segmentation algorithm we built an evaluation set composed of 20 documents from heterogeneous sources, providing more than 1400 legislative units. The metrics used to evaluate the segmentation algorithm were recall (the number of units correctly extracted divided by the total number of units) and precision (the number of units correctly extracted divided by the total number of extracted units). The results are very good, with a mean precision of 97.85% and a mean recall of 95.99%. Also, for every source both values stand above 93%. Regarding the computing time, the algorithm needs 0.13 seconds per legislative unit².

² Pentium 4, 3 GHz, 1 GB of RAM.
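The two metrics above amount to the following (a sketch in which a unit counts as correct only when it exactly matches a gold-standard unit):

```python
def precision_recall(extracted, gold):
    """Precision and recall of extracted units against gold-standard units."""
    correct = len(set(extracted) & set(gold))
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall
```

With these definitions, extracting spurious units lowers precision, while missing real units lowers recall.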
5. Conclusions and Future Work

In this work we presented a LegalIR system, Lexisla. The implementation of the application posed special research challenges, like the extraction of text from PDF documents or the segmentation of the documents into the legislative units they comprise, which were successfully faced, as shown in the evaluation of the results. To accomplish this, the system makes use of several NLP techniques, like string similarity comparisons, stemming, searches using regular expressions, the use of a specific lexicon to segment the documents, and language identification features.

As for the future, there are several tasks to be considered in the domain of this application:

Entities detection. It can be very interesting to infer which people or enterprises are affected by a concrete notification, resolution, etc.

Crossed references. It is very usual for a legislative unit to make a reference to another one, and the automatic detection of these references can improve the users' experience. The work of Yang et al. (2009) can provide a good starting point for this feature. In that work, the authors face the problem of using an entire document as a query for a search; one of the main steps proposed is the extraction of phrases from the document to be used as queries, and similar methods can be used to identify the crossed references inside a legislative text.

Generalisation of the segmentation algorithm to be used in different domains.

References

Apache Software Foundation. 2010. PDFBox. http://pdfbox.apache.org/.

Hasan, Ismael, Javier Parapar, and Roi Blanco. 2008. Segmentation of legislative documents using a domain-specific lexicon. In DEXA '08: Proceedings of the 19th International Conference on Database and Expert Systems Application, pages 665–669, Washington, DC, USA. IEEE Computer Society.

Mao, Song and Tapas Kanungo. 2001. Empirical performance evaluation methodology and its application to page segmentation algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(3):242–256.

Phelps, Tom. 2010. Multivalent. http://multivalent.sourceforge.net/.

Yang, Yin, Nilesh Bansal, Wisam Dakka, Panagiotis Ipeirotis, Nick Koudas, and Dimitris Papadias. 2009. Query by document. In WSDM '09: Proceedings of the Second ACM International Conference on Web Search and Data Mining, pages 34–43, New York, NY, USA. ACM.