<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Lexisla: a Legislative Information Retrieval System</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Javier Parapar</string-name>
          <email>javierparapar@udc.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ismael Hasan</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>IRLab</institution>, <addr-line>A Coruña Univ., Campus Elviña s/n, A Coruña</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>IRLab, ICT Centre</institution>, <addr-line>Campus Elviña s/n, A Coruña</addr-line>
        </aff>
      </contrib-group>
      <fpage>04</fpage>
      <lpage>04</lpage>
      <abstract>
        <p>New legislative documents are published every day on the Internet, recording changes in legislation: laws, decisions, resolutions, etc. Lexisla aims to offer access to this information through a single search application that crawls, analyses and segments the daily published legislative documents.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>In recent years, the growth of the Internet has favoured the use of electronic documents. Public administrations now offer the printed documentation they generate in electronic form as well, PDF being the most widely used format. Legislative publications are a particular case: these documents are produced on a daily basis and supplied through the publishers' web pages. The information they cover is useful for a wide variety of Internet users, the legal community being the most representative. Enterprises can also use the information in legal documents: a new regulation may affect their business model, for instance. Finally, the third target group of legal information is the citizenry of a country as a whole: the documents can contain notifications addressed to specific people, important official dates, etc.</p>
      <p>However, despite the fact that this information is very valuable, searching over it is a hard task: the official publishers offer search engines to access the information, but each of those search engines gives access to only one source of documents. There are also commercial applications offering searches over several bulletins, but the results they return are not fully satisfactory (for instance, some applications search only over the summaries of the documents). In this work we present Lexisla (an operational version for registered users is available at www.irlab.org/lexisla; an evaluation account can be requested at irlab@udc.es), a system that offers searches over several different legislative publications. Moreover, the information is processed and analysed, so a document (which can contain hundreds of pages) is segmented into the legislative units (resolutions, notifications, etc.) it comprises. The information in these units is also analysed, so the final users of the system can run complex searches over it, including titles, publisher organisms, etc.</p>
    </sec>
    <sec id="sec-2">
      <title>System Overview</title>
      <p>Lexisla is a web application for accessing the legislation periodically published by the official online sources. It is divided into two subsystems: the users' application, offering searches over the information maintained by the system, and the management application, used to manage that information. At present, the information processed and maintained by the system comprises the European and Spanish official bulletins and several Spanish regional bulletins. The system allows the addition of new sources of official bulletins through the management application. Also, the administrator of the system can schedule when the system automatically crawls new information from a source, and can create and assign search profiles to the users.</p>
      <p>The documents automatically downloaded by the application are processed in the following way: first, the text of each document is obtained in reading order (a task that is especially difficult with PDF documents); next, the text is segmented into the legislative units it contains; finally, these units are analysed. For each unit the following fields are stored: body, title, publisher organism, date, document and source, page numbers of the unit, and type of legislative unit (resolution, notification, etc.).</p>
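      <p>As a minimal sketch, the per-unit record could look as follows in Python; the field names here are illustrative, since the paper does not specify the actual schema.</p>
      <preformat>
from dataclasses import dataclass

@dataclass
class LegislativeUnit:
    """One segmented unit of a bulletin, holding the fields listed above."""
    body: str          # full text of the legislative unit
    title: str         # title, as it appears in the bulletin index
    organism: str      # publisher organism
    date: str          # publication date
    document_id: str   # identifier of the source document
    source: str        # bulletin the document came from
    first_page: int    # page of the document where the unit starts
    last_page: int     # page where the unit ends
    unit_type: str     # "resolution", "notification", etc.
      </preformat>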
      <p>This processing of the documents provides the users with the following features: display of results in the web browser, download of the pages of a document containing an information unit, document browsing (“Which are the resolutions of this document?”) and advanced search features, such as searching only within a few sources, searching in specific fields and searching with regular expressions, so Lexisla can offer a wide range of searches, as sketched below.</p>
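      <p>For illustration, restricting a search to a few sources and matching a regular expression on one field could be expressed roughly as below; this is a sketch over the hypothetical record above, not Lexisla's actual query interface.</p>
      <preformat>
import re

def search(units, sources=None, field="body", pattern=r"."):
    """Yield units from the given sources whose field matches the pattern."""
    regex = re.compile(pattern, re.IGNORECASE)
    for unit in units:
        if sources is not None and unit.source not in sources:
            continue
        if regex.search(getattr(unit, field)):
            yield unit

# e.g. units from two sources whose title mentions "Resolución":
# hits = list(search(all_units, sources={"BOE", "DOG"},
#                    field="title", pattern=r"Resoluci[oó]n"))
      </preformat>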
    </sec>
    <sec id="sec-3">
      <title>System Architecture</title>
      <p>Lexisla was designed as a Model-View-Controller web application, following a component-based architecture. The most relevant components of the application are explained next, followed by a description of Lexisla's data storage.</p>
      <sec id="sec-3-0">
        <title>Crawling module</title>
        <p>This component accesses the web pages of the publisher organisms and administrations (defined by the administrator of the application) and downloads all of the relevant documents.</p>
      </sec>
      <sec id="sec-3-1">
        <title>Text Extraction module</title>
        <p>
          PDF is the most common format for distributing electronic documents. There are currently several tools to extract the text from this type of document
          <xref ref-type="bibr" rid="ref1 ref4">(Apache Software Foundation, 2010; Phelps, 2010)</xref>
          , but the extracted text very often contains errors: for instance, the paragraphs of a page may be returned out of order. This issue penalises the analysis of the information, so Lexisla contains its own text extraction module, specially designed to bypass this problem.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Document Analysis module</title>
        <p>This component processes the texts obtained by the Text Extraction module and extracts all of the information contained in each document. It also analyses each legislative unit to obtain its fields (title, publisher organism, start and end page, body, etc.).</p>
      </sec>
      <sec id="sec-3-3">
        <title>Indexing and Search modules</title>
        <p>These components use incremental indexes to store the legislative units. They use IR algorithms to process queries against inverted indexes, ensuring efficient and effective search.</p>
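        <p>A toy version of an incremental inverted index, to make the idea concrete; the paper does not name the actual indexing library or ranking algorithms used.</p>
        <preformat>
from collections import defaultdict

class IncrementalIndex:
    """Inverted index that accepts new units as bulletins are crawled."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of unit ids
        self.units = {}                   # unit id -> LegislativeUnit

    def add(self, unit_id, unit):
        """Index one new unit without rebuilding the whole index."""
        self.units[unit_id] = unit
        for term in unit.body.lower().split():
            self.postings[term].add(unit_id)

    def query(self, terms):
        """Conjunctive query: ids of units containing every term."""
        sets = [self.postings[t.lower()] for t in terms]
        return set.intersection(*sets) if sets else set()
        </preformat>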
      </sec>
      <sec id="sec-3-4">
        <title>Information storage</title>
        <p>To satisfy the users' information needs, the information is stored in three different systems, and integrity of references is maintained between them.</p>
        <p>Search index: contains the information about the legislative units.</p>
        <p>Database: contains information about users, configuration, search profiles, documents, etc.</p>
        <p>File system: stores the original documents.</p>
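        <p>One reading of the referential-integrity requirement is that a single unit identifier ties the three systems together; the keys below are assumptions for illustration.</p>
        <preformat>
def locate(unit_id, index, db, doc_root):
    """Resolve one unit id across the three storage systems."""
    unit = index.units[unit_id]          # search index: the unit's fields
    meta = db[unit.document_id]          # database: document/user metadata
    path = f"{doc_root}/{unit.document_id}.pdf"  # file system: original PDF
    return unit, meta, path
        </preformat>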
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Research Issues</title>
      <p>Lexisla is an IR system that uses state-of-the-art algorithms and techniques for crawling, text extraction, information segmentation and search. In this section we explain some of the most relevant research issues.</p>
      <sec id="sec-4-1">
        <title>Extraction of Ordered Text</title>
        <p>As explained earlier, one of the challenges is the extraction of correctly ordered text from PDF documents. For this purpose, we developed a method which simulates the human reading order to obtain the text from documents. For each page, it works as follows (a simplified sketch of the ordering step is given after the list):</p>
        <list list-type="order">
          <list-item><p>Detection of the rectangular text regions present in the page.</p></list-item>
          <list-item><p>Retrieval of the list of images and creation of regions using the images' coordinates.</p></list-item>
          <list-item><p>Splitting of the text regions which are crossed by image regions.</p></list-item>
          <list-item><p>Sorting of the regions of the page in the following way: a) the region comprising the header of the page; b) the top left region; c) the regions to the right of the one obtained in (b); d) the region on the left of the page which is below the previously found regions; e) the regions to the right of the one obtained in (d); f) steps (d) and (e) are repeated until no more regions are found.</p></list-item>
          <list-item><p>Extraction of the text of each region, in the order stated in step 4.</p></list-item>
        </list>
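        <p>A simplified sketch of the sorting step (4), assuming regions are (x0, y0, x1, y1) rectangles with y growing downwards; region and image detection (steps 1-3) are omitted.</p>
        <preformat>
def order_regions(header, regions):
    """Order the text regions of a page in human reading order."""
    ordered = [header] if header is not None else []       # step (a)
    remaining = list(regions)
    while remaining:
        # steps (b)/(d): topmost remaining region, leftmost on ties
        left = min(remaining, key=lambda r: (r[1], r[0]))
        ordered.append(left)
        remaining.remove(left)
        # steps (c)/(e): regions starting to the right of `left`
        # whose top edge falls within its vertical span, left to right
        row = sorted((r for r in remaining
                      if r[0] >= left[2]
                      and r[1] >= left[1] and left[3] > r[1]),
                     key=lambda r: r[0])
        for region in row:                                  # step (f): repeat
            ordered.append(region)
            remaining.remove(region)
    return ordered
        </preformat>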
        <p>It is worth mentioning that Lexisla also deals with language identification issues. A legislative document can contain text in different languages: for instance, the “Boletín Oficial del Territorio Histórico de Álava” contains sections in which the text in the left columns is written in Basque and the same text appears translated into Spanish in the right ones. Lexisla identifies the language of each region, so the result of the text extraction contains the text in only one language.</p>
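        <p>A sketch of the per-region language filter; detect_language() below is a hypothetical stand-in, since the paper does not name the identifier Lexisla uses.</p>
        <preformat>
def keep_one_language(region_texts, detect_language, wanted="es"):
    """Keep only the text regions written in the wanted language."""
    return [text for text in region_texts
            if detect_language(text) == wanted]

# e.g. keep only the Spanish columns of a bilingual Basque/Spanish page:
# spanish = "\n".join(keep_one_language(page_regions, my_detector, "es"))
        </preformat>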
        <p>
          To evaluate our method, its results were compared against the results obtained with PDFBox and an implementation of the XY-cuts algorithm
          <xref ref-type="bibr" rid="ref3">(Mao and Kanungo, 2001)</xref>
          ; our method is dubbed “LRE” (Left Regions Expansion). The metric used was the ratio of pages correctly extracted. The dataset comprises documents from the European Union (BEU, OJEU), the United States (FR), the United Kingdom (UK), France (JO) and Spain (BOE, DOG, BOCYL). Our algorithm greatly outperformed XY-cuts, although it is fair to say that XY-cuts was not designed for this task. Our algorithm also outperformed PDFBox: the overall ratio of pages correctly extracted with LRE was 96%, against 87% for PDFBox. The difference between the LRE and PDFBox means is statistically significant according to the Wilcoxon test (p &lt; 0.05).
        </p>
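        <p>For reference, a paired Wilcoxon signed-rank test over per-source extraction ratios can be run with SciPy; the numbers below are placeholders, not the paper's data.</p>
        <preformat>
from scipy.stats import wilcoxon

# ratio of correctly extracted pages per source (illustrative values only)
lre    = [0.99, 0.97, 0.95, 0.96, 0.98, 0.94, 0.97, 0.96]
pdfbox = [0.90, 0.85, 0.88, 0.86, 0.89, 0.84, 0.87, 0.88]

stat, p_value = wilcoxon(lre, pdfbox)
print(f"W={stat}, p={p_value:.4f}")  # significant when p is below 0.05
        </preformat>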
      </sec>
      <sec id="sec-4-2">
        <title>Document Segmentation</title>
        <p>
          Legal documents can contain many resolutions, communications, etc. A user searching through a LegalIR system does not expect to receive complete documents in response to a query; instead, they expect the results to be single information units, hence the need to segment these full documents. A brief description of the analysis process of documents in Lexisla, which uses a specialised lexicon, follows. An extended version of this summary can be found in the work of
          <xref ref-type="bibr" rid="ref2">Hasan, Parapar, and Blanco (2008)</xref>
          .
        </p>
        <p>Text pre-processing. The PDF format was originally created for visual presentation. Because of this, when an application builds a PDF containing text, that text is not exactly the same as in the original version: for instance, the “fi” sequence (character codes 102 and 105) may be encoded as the single ligature character “ﬁ” (code 64257, U+FB01). When extracting the text from a PDF document, this issue must be taken into account.</p>
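        <p>In Python, Unicode compatibility normalisation undoes exactly this kind of ligature encoding; a minimal illustration of the issue:</p>
        <preformat>
import unicodedata

raw = "noti\ufb01cación"        # contains the single ligature "fi" (U+FB01)
clean = unicodedata.normalize("NFKC", raw)
assert clean == "notificación"  # ligature expanded back to "f" + "i"
        </preformat>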
        <p>Identification of the titles contained in the index of the document. The main characteristic of these titles is that they always start with a special word (“Resolution”, “Notification”, etc.). Lexisla looks for phrases inside the index which begin with these specific words, or with variations of them. This step returns the titles of the legislative units of the document.</p>
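        <p>A sketch of the index scan described above; the trigger words shown are a small illustrative subset of the domain lexicon.</p>
        <preformat>
import re

# illustrative subset; the real lexicon is domain-specific and larger
TRIGGERS = ("Resolution", "Notification", "Order", "Decree")

def index_titles(index_text):
    """Return lines of the document index starting with a trigger word."""
    pattern = re.compile(r"^(?:%s)\w*\b.*$" % "|".join(TRIGGERS),
                         re.MULTILINE)
    return [match.group(0).strip() for match in pattern.finditer(index_text)]
        </preformat>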
        <p>Identification of resolutions and other legislative units using the titles. First, the lexicon terms are searched for throughout the text. From this list of terms, a list of title candidates is built. This list is then compared against the list obtained from the index of the document. Those titles from the content which exactly match a title in the index, and are found in the same order as in the index, are stored. If some of the titles in the index were not matched, the comparison is softened by using an n-gram comparison instead of an exact match.</p>
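        <p>The softened comparison could be a character n-gram similarity such as the one below; the value of n and the threshold are assumptions.</p>
        <preformat>
def ngrams(text, n=3):
    """Set of character n-grams of a lowercased string."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def soft_match(title_a, title_b, n=3, threshold=0.8):
    """True if two titles share enough n-grams (Jaccard similarity)."""
    a, b = ngrams(title_a, n), ngrams(title_b, n)
    if not a or not b:
        return False
    return len(a.intersection(b)) / len(a.union(b)) >= threshold
        </preformat>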
        <p>Identification of full legislative units. The full content of each unit is the text between its title and the title of the next unit.</p>
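        <p>Given the character offsets of the matched titles, recovering the full units is a matter of slicing between consecutive titles, as in this sketch.</p>
        <preformat>
def split_units(text, title_offsets):
    """Slice the document text between consecutive title positions."""
    offsets = sorted(title_offsets) + [len(text)]
    return [text[offsets[i]:offsets[i + 1]]
            for i in range(len(offsets) - 1)]
        </preformat>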
        <p>To evaluate our segmentation algorithm we built an evaluation set composed of 20 documents from heterogeneous sources, providing more than 1400 legislative units. The metrics used to evaluate the segmentation algorithm were recall (number of units correctly extracted divided by the total number of units) and precision (number of units correctly extracted divided by the total number of extracted units). The results are very good, with a mean precision of 97.85% and a mean recall of 95.99%. Also, for every source both values stand above 93%. Regarding computing time, the algorithm needs 0.13 seconds per legislative unit.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions and Future Work</title>
      <p>In this work we presented Lexisla, a LegalIR system. The implementation of the application posed special research challenges, such as the extraction of text from PDF documents or the segmentation of the documents into the legislative units they comprise, which were successfully addressed, as shown by the evaluation results. To accomplish this, the system makes use of several NLP techniques, such as string similarity comparisons, stemming, searches using regular expressions, the use of a specific lexicon to segment the documents, and language identification.</p>
      <p>As for the future, there are several tasks to be considered in the domain of this application:</p>
      <p>Entity detection. It can be very interesting to infer which people or enterprises are affected by a given notification, resolution, etc.</p>
      <p>
        Cross references. It is very common for a legislative unit to reference another one. The automatic detection of these references can improve the users' experience. The work of
        <xref ref-type="bibr" rid="ref5">Yang et al. (2009)</xref>
        can provide a good starting point for this feature. In that work, the authors face the problem of using an entire document as a query for a search. One of the main steps proposed is the extraction of phrases from the document to be used as queries; similar methods can be used to identify the cross references inside a legislative text.
      </p>
      <p>Generalisation of the segmentation
algorithm to be used in different domains.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <collab>Apache Software Foundation</collab>.
          <year>2010</year>.
          PDFBox. http://pdfbox.apache.org/.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name><surname>Hasan</surname>, <given-names>Ismael</given-names></string-name>,
          <string-name><given-names>Javier</given-names> <surname>Parapar</surname></string-name>, and
          <string-name><given-names>Roi</given-names> <surname>Blanco</surname></string-name>.
          <year>2008</year>.
          <article-title>Segmentation of legislative documents using a domain-specific lexicon</article-title>.
          In
          <source>DEXA '08: Proceedings of the 2008 19th International Conference on Database and Expert Systems Application</source>
          , pages
          <fpage>665</fpage>-<lpage>669</lpage>
          , Washington, DC, USA. IEEE Computer Society.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name><surname>Mao</surname>, <given-names>Song</given-names></string-name> and
          <string-name><given-names>Tapas</given-names> <surname>Kanungo</surname></string-name>.
          <year>2001</year>.
          <article-title>Empirical performance evaluation methodology and its application to page segmentation algorithms</article-title>.
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>,
          <volume>23</volume>(<issue>3</issue>):<fpage>242</fpage>-<lpage>256</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name><surname>Phelps</surname>, <given-names>Tom</given-names></string-name>.
          <year>2010</year>.
          Multivalent. http://multivalent.sourceforge.net/.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name><surname>Yang</surname>, <given-names>Yin</given-names></string-name>,
          <string-name><given-names>Nilesh</given-names> <surname>Bansal</surname></string-name>,
          <string-name><given-names>Wisam</given-names> <surname>Dakka</surname></string-name>,
          <string-name><given-names>Panagiotis</given-names> <surname>Ipeirotis</surname></string-name>,
          <string-name><given-names>Nick</given-names> <surname>Koudas</surname></string-name>, and
          <string-name><given-names>Dimitris</given-names> <surname>Papadias</surname></string-name>.
          <year>2009</year>.
          <article-title>Query by document</article-title>.
          In
          <source>WSDM '09: Proceedings of the Second ACM International Conference on Web Search and Data Mining</source>
          , pages
          <fpage>34</fpage>-<lpage>43</lpage>
          , New York, NY, USA. ACM.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>