Automatic mark-up of legislative documents and
   its application to parallel text generation

Lorenzo Bacci (1), Pierluigi Spinosa (1), Carlo Marchetti (2,3), and Roberto Battistoni (3)

(1) Institute of Legal Information Theory and Techniques, Florence, Italy
    bacci@ittig.cnr.it, spinosa@ittig.cnr.it
(2) Dipartimento di Informatica e Sistemistica dell’Università di Roma “La Sapienza”
    carlo.marchetti@senato.it
(3) Italian Senate, Rome, Italy
    r.battistoni@senato.it


        Abstract. In the legal domain, a huge number of plain-text legislative
        acts were produced both before and after the advent of computers and
        word processors. Converting these legacy and plain documents into a
        standard XML format brings many benefits, and several automatic and
        semi-automatic tools have been developed for this task over the last
        ten years. This paper presents xmLegesMarker, developed at Ittig
        (footnote 4), the state-of-the-art parser for Italian legislative
        documents. The tool complies with the NIR standard, is embedded in
        xmLegesEditor, and was recently adopted for evaluation by the Italian
        Senate in order to automatically generate the parallel text (testo a
        fronte), i.e. the document used to highlight the modifications
        introduced during the debate of a bill.


1     Introduction
xmLegesMarker is a structure parser for legislative documents. Its development
started in 2003, within the Norme In Rete (NIR) project [1], in order to provide
a detailed mark-up of legacy documents and, more generally, of plain legislative
acts. Although it was built as stand-alone software, it has mainly been used
inside xmLegesEditor [2], an XML-based legislative editor that grew out of the
NIR project, to implement the import function, namely the migration of plain
texts into the XML environment. The NIR project also defined the DTD and XML
Schema on which both xmLegesEditor and xmLegesMarker are based and which
describe Italian legislative acts in great detail. Thanks to its independent
development and ongoing improvements, xmLegesMarker has also been adopted in
the last few years by several public administrations and regional governments,
usually to migrate their plain-text databases of local, regional and national
laws to the NIR-XML mark-up.
(4) Institute of Legal Information Theory and Techniques (CNR)

   Several benefits derive from this process: structured documents improve
information retrieval and the maintenance of the normative system, and can
serve as the basis for a further semantic description of the text in terms of
provisions [3]. More importantly, they can also be exploited for operations
that are specific to the legislative domain. As an example, TafWeb, described
in section 5, is a smart legislative text comparison tool, developed by the
Computer Science Department of the University of Bologna under the supervision
of the Italian Senate, which exploits the XML mark-up of the versions of a bill
under debate in order to generate the parallel text document (testo a fronte).
The Italian Senate is currently evaluating xmLegesMarker within the TafWeb
suite in order to automate the production of the first version of a parallel
text, starting from the plain Chamber and Senate versions of the bill.


2     Approaching plain documents

Automatically structuring a plain document means building software that, given
the plain document as input, is able to assign an identity to every piece of
text. In the XML world, this is done by enclosing the text in tags. In this
case, the marker uses the tags defined by NIR in order to produce output that
is well formed and valid with respect to the NIR schema.
    Generally speaking, the information that can be obtained from a plain
document consists of data and meta-data. In legislative acts, the enacting
terms section, which contains articles and paragraphs, is the data, while
entities such as the title, number and type of the document, the signatories
and so on are considered explicit meta-data. Other meta-data defined in the
NIR schema, although not explicitly present in the input, are computed or added
by xmLegesMarker, usually from the values of the explicit meta-data (e.g. the
automatic generation of the URN [4]). The split of information into data and
meta-data follows the physical structure of the document: explicit meta-data
are typically located in the header or in the footer, while the body of the
legislative act holds the enacting terms, namely the data.
    Besides the physical position, there is another important difference
between the body and the header (or the footer) of a legislative act: the
former is composed of partitions that are strictly organized and ordered
hierarchically, in practice a tree of partitions, while the latter is fuzzy,
made up of expected and unexpected elements that often appear in no fixed
order. This is why xmLegesMarker adopts two different strategies: one for the
header and the footer, and one for the body of a document.


2.1   Header and footer

The fuzziness of both the header and the footer of a legislative act calls for
a statistical, machine learning approach to meta-data extraction. For this
purpose Hidden Markov Models (HMMs) [5] were chosen, and a model able to
recognize the most typical information found in the header and footer of a
generic legislative act was developed [6].
   An important source of information, exploited to improve the accuracy of
the header analysis, is the subtype of the legislative document (act, bill,
decree, regional act, etc.). Depending on the subtype, more precise patterns
stand out. Therefore, xmLegesMarker applies a specific HMM if the input
subtype is known and supported, and the generic model otherwise. Using
subtype-specific models brings two benefits:
 – a better formalization of the most important subtypes, which reaches higher
   (in some cases, such as bills, far higher) accuracy;
 – a guess about the subtype when this information isn’t given, obtained by
   applying all the specific models and checking which one fits best.
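
    The subtype guess can be illustrated with a minimal sketch. It assumes
that the header has already been reduced to a sequence of coarse symbols and
that each subtype has its own HMM; the symbols, states and probabilities below
are invented for illustration and are not the models used by xmLegesMarker.
Each model scores the sequence with the forward algorithm and the best-scoring
subtype is returned.

```python
import math

def log_likelihood(obs, model):
    """Scaled forward algorithm: return log P(obs | model)."""
    start, trans, emit = model["start"], model["trans"], model["emit"]
    floor = 1e-6  # small probability for symbols a state never emits
    alpha = {s: start[s] * emit[s].get(obs[0], floor) for s in start}
    total = 0.0
    for symbol in obs[1:]:
        scale = sum(alpha.values())
        total += math.log(scale)
        prev = {s: a / scale for s, a in alpha.items()}
        alpha = {
            s: sum(prev[p] * trans[p][s] for p in prev) * emit[s].get(symbol, floor)
            for s in start
        }
    return total + math.log(sum(alpha.values()))

# Toy subtype models (hypothetical states, symbols and probabilities).
MODELS = {
    "act": {
        "start": {"heading": 0.9, "title": 0.1},
        "trans": {"heading": {"heading": 0.6, "title": 0.4},
                  "title": {"heading": 0.1, "title": 0.9}},
        "emit": {"heading": {"kw_law": 0.4, "date": 0.3, "num": 0.25, "word": 0.05},
                 "title": {"word": 0.9, "num": 0.1}},
    },
    "bill": {
        "start": {"heading": 0.9, "proposers": 0.1},
        "trans": {"heading": {"heading": 0.5, "proposers": 0.5},
                  "proposers": {"heading": 0.1, "proposers": 0.9}},
        "emit": {"heading": {"kw_bill": 0.5, "num": 0.3, "word": 0.2},
                 "proposers": {"name": 0.8, "word": 0.2}},
    },
}

def guess_subtype(obs, models=MODELS):
    """Apply every subtype model and return the name of the best-fitting one."""
    return max(models, key=lambda name: log_likelihood(obs, models[name]))
```

    In this sketch the comparison is made on raw log-likelihoods; a real
system would probably also need to handle sequences of very different lengths
and a fallback to the generic model.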

2.2     Body
The body of a legislative act coincides with the enacting terms section, which
is typically well organized in known partitions, hierarchically arranged in a
tree structure. On the other hand, the tree representing the enacting terms
can be complex, long and deeply nested. An automaton-based approach is needed
to parse this kind of structure efficiently. The Flex scanner generator
(footnote 5), which has been used for several tasks in body analysis and
beyond, allows the creation of very powerful text scanners based on finite
state automata [6].
     The automaton that handles the enacting terms strictly follows the
constraints imposed by the NIR schema: the parsing process obeys rules that
depend on the automaton states (start conditions), which match all the
partitions defined in the NIR hierarchy. For example, an alphabetical list can
be read only if the automaton is in the paragraph state, because, according to
NIR and to legislative drafting rules, a list may appear only inside a
paragraph.
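
    The role of start conditions can be sketched in a few lines of Python.
This is only an illustration of the idea, not the Flex rules of the marker:
the three line patterns and the four state names are assumptions, and the real
scanner uses many more states. A rule fires only when the current state allows
it, so a line that looks like a list item is treated as plain text unless a
paragraph is open.

```python
import re

# Hypothetical, heavily simplified patterns (the real rules live in a Flex file).
ARTICLE = re.compile(r"^Art\.\s*(\d+)")
PARAGRAPH = re.compile(r"^(\d+)\.\s+(.*)")
LETTER = re.compile(r"^([a-z])\)\s+(.*)")

def scan(lines):
    """Return (partition, label) events; 'state' plays the role of a start condition."""
    state = "body"
    events = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        article = ARTICLE.match(line)
        paragraph = PARAGRAPH.match(line)
        letter = LETTER.match(line)
        if article:
            state = "article"
            events.append(("article", article.group(1)))
        elif paragraph and state in ("article", "paragraph", "list"):
            state = "paragraph"
            events.append(("paragraph", paragraph.group(1)))
        elif letter and state in ("paragraph", "list"):
            # A list item is accepted only inside a paragraph (or an open list).
            state = "list"
            events.append(("letter", letter.group(1)))
        else:
            events.append(("text", line))
    return events
```

    For instance, given an article whose first line is “a) stray item”
followed by a numbered paragraph with two lettered items, the sketch reports
the stray line as text and only the later two lines as list items.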

2.3     Annexes
We conclude this survey of plain legislative documents by describing the way
the marker handles annexes. In the legislative domain, annexes are very common:
they belong to the legislative document itself and simply follow the main act.
They can be simple tables reporting details, prices or fees, or they can be
legislative documents themselves. Sometimes they are even more important than
the main act: for example, there are bills containing just one or two
paragraphs, followed by a legislative decree containing dozens of articles.
    The xmLegesMarker strategy consists of a preliminary segmentation of the
input into main act and annexes, followed by the application of the header,
body and footer parsing functions, described in the previous sections, to the
main act as well as to every single annex. In this way, any legislative
documents following the main act receive the same treatment and come out
completely marked up.
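
    A minimal sketch of this two-step strategy follows. It assumes, purely for
illustration, that an annex starts on a line beginning with “Allegato” or
“Annex”; the actual segmentation criteria of the marker are not described
here, and the three parsing functions are passed in as hypothetical
parameters.

```python
import re

# Assumption: an annex starts on a line beginning with "Allegato" or "Annex".
ANNEX_HEADING = re.compile(r"^\s*(allegato|annex)\b", re.IGNORECASE)

def split_main_act_and_annexes(text):
    """Return a list of segments: the main act first, then one per annex."""
    segments = [[]]
    for line in text.splitlines():
        if ANNEX_HEADING.match(line):
            segments.append([])  # a new annex begins here
        segments[-1].append(line)
    return ["\n".join(segment) for segment in segments]

def parse_document(text, parse_header, parse_body, parse_footer):
    """Apply the same three parsing functions to the main act and to each annex."""
    return [
        {
            "header": parse_header(segment),
            "body": parse_body(segment),
            "footer": parse_footer(segment),
        }
        for segment in split_main_act_and_annexes(text)
    ]
```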
(5) http://dinosaur.compilertools.net/flex/

3      A glance in depth

To give a better understanding of how the marker works, of its capabilities
and of its potential, this section discusses its most interesting features and
issues in more depth. After a brief overview of the input formats, we focus on
how the marker handles partitions and amendments, on the factors that make
body analysis more complex, and on how syntactical errors in the source have
been tackled.


3.1     Input

The input of xmLegesMarker is a plain legislative act in txt, doc, html or pdf
format. The marker is able to manage different subtypes of plain Italian
legislative documents, such as act, bill, decree, regional act, etc. As
discussed in section 2.1, different weights and models are used for the header
and footer analysis depending on the subtype.


3.2     Handling partitions

Partitions are used to structure the fragments of the legislative document
body. They are arranged hierarchically. The paragraph is the partition that
actually contains the text of the law, while the higher-order partitions
(article, chapter, section, etc.) can be seen as containers. Paragraphs can in
turn have sub-partitions: lists, which can be alphabetical, numerical or
bulleted. Thanks to the use of regular expressions, layout reasoning and the
application of the NIR constraints described in section 2.2, the marker is
able to identify all of these partitions.
    Furthermore, xmLegesMarker checks the numbering of partitions, which
turns out to be quite useful:

    – as a means to better identify the next partition, often avoiding
      ambiguity;
    – to assign a unique value to the “id” attribute, defined by NIR, so that
      every single partition can be referenced.
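
    Both uses of the numbering check can be sketched as follows. The id
format “art3-com2” is a hypothetical convention chosen for the example, and
the function only handles plain integer paragraph numbers; suffixes such as
“bis” or “ter” are ignored. Note also that this sketch merely reports a gap in
the numbering, which is a simplification.

```python
def is_next_paragraph(candidate, last_seen):
    """A candidate number starts a new paragraph only if it follows the last one."""
    return candidate == last_seen + 1

def assign_paragraph_ids(article_number, paragraph_numbers):
    """Check that paragraph numbers are consecutive and build an id for each.

    Returns (ids, warnings); the id format is an assumption for illustration.
    """
    ids, warnings = [], []
    last_seen = 0
    for found in paragraph_numbers:
        if not is_next_paragraph(found, last_seen):
            warnings.append(
                f"art. {article_number}: expected paragraph {last_seen + 1}, found {found}"
            )
        ids.append(f"art{article_number}-com{found}")
        last_seen = found
    return ids, warnings
```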


3.3     Dealing with textual amendments

Amendments are a widely used mechanism by which one legislative document
modifies another. Textual amendments can be broadly categorized as repeal,
insertion and modification, and they can act on words as well as on whole
partitions (structural amendments). Insertion and modification amendments are
typically expressed using quotes. So, for example, an amendment can express
the substitution of a single noun or the insertion of a brand-new article at
a precise position in a regional act.
     The marker is able to identify and handle both word-level and structural
amendments and, in the case of a structural amendment, it enters the amendment
and precisely marks up all the partitions and data it contains.
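
    The distinction between word-level and structural amendments can be
sketched as follows. The quote characters and the pattern used to recognize
the start of a partition are assumptions made for the example, and nested
quotes are deliberately not handled: quoted text that opens like a partition
is classified as structural, anything else as a word-level amendment.

```python
import re

# Assumption: amendments are delimited by guillemets or curly double quotes.
QUOTED = re.compile(r"[«“](.*?)[»”]", re.DOTALL)
# Assumption: a partition starts like "Art. 5", "1. " or "a) ".
PARTITION_START = re.compile(r"^\s*(Art\.\s*\d+|\d+\.\s|[a-z]\)\s)")

def find_amendments(paragraph_text):
    """Return a list of (kind, quoted_text) pairs found in a paragraph."""
    found = []
    for match in QUOTED.finditer(paragraph_text):
        content = match.group(1)
        kind = "structural" if PARTITION_START.match(content) else "words"
        found.append((kind, content.strip()))
    return found
```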

3.4   Increasing complexity

Even though the body of legislative documents generally follows the same
syntactical rules, there are several variables that increase the complexity of
the automaton:

 – paragraphs are sometimes not numbered, especially in legacy documents;
 – almost every partition can have a partition title, the rubrica, which is
   sometimes placed right after the declaration of the new partition and
   sometimes on the line below, and which is sometimes enclosed between
   particular separator characters and sometimes not;
 – within a paragraph three kinds of lists are allowed: alphabetical,
   numerical and bulleted; however, good drafting rules state that bulleted
   lists should be avoided, because they can’t be directly referenced (the
   relative position has to be specified), that only alphabetical lists
   should appear immediately inside paragraphs, and that numerical lists
   should appear only within alphabetical lists;
 – parsing the text of amendments is sometimes a hard task: between quotes
   there’s a no man’s land where most of the rules that usually guide the
   automaton no longer apply, while the NIR schema allows every kind of
   partition there, so xmLegesMarker sometimes has to deal with very complex
   cases (Fig. 1).


Fig. 1. A deeply nested example visualized in xmLegesEditor: the first paragraph of
article 3 contains a chapter substitution amendment.

3.5     Tackling syntax errors
One of the main problems when working with legacy documents and, more
generally, with documents edited manually without drafting support, is the
presence of syntactical errors. They can be:

    – numbering errors;
    – incorrect use of punctuation marks;
    – errors in the layout of the document;
    – unbalanced quotes;
    – other drafting errors.

    Some of these errors in the plain document have a limited effect on the
XML output of xmLegesMarker, while others may completely disrupt the behavior
of the automaton used for body parsing. For example, if quotes aren’t
balanced, the automaton gets stuck in the amendment states and forces all the
remaining text into the amendment tags. Another example, with a less
catastrophic effect, is an erroneous or missing punctuation mark where one is
needed to separate different paragraphs inside an article. In that case the
next paragraph isn’t recognized because of the erroneous separator and,
because of the checks on paragraph numbering, none of the remaining paragraphs
in the article is recognized either, until the next article is read, which
forces a reset of the paragraph counter.
    Thus, small errors in the input source often cause serious problems in
the resulting XML, while a small correction of the input saves the user from a
painful correction of the output in an XML editor. For this reason, a
messaging system was implemented inside xmLegesMarker which allows the user to
work directly on the source, correct it and process it again.
    The new version of the marker is able to identify the most typical
problematic situations caused by errors in the source, and embeds a warning
message in the output. The message is formatted as a processing instruction in
the resulting XML, so it doesn’t affect the validity of the document.
Moreover, the message contains a warning code that refers to a warning table,
where typical problems are classified and described and a guideline for
solving each of them is provided.
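
    The mechanism can be sketched with the Python standard library. The
warning code “W01”, the target name “warning” and the element name
“paragraph” are hypothetical, since the actual warning table and the NIR
element names are not reproduced here. The sketch checks whether quotes are
balanced and, if not, appends a processing instruction that an XML parser can
skip without affecting the element content.

```python
import xml.etree.ElementTree as ET

# Hypothetical warning table; the real codes and texts are not given here.
WARNINGS = {"W01": "Unbalanced quotes: check the amendment delimiters in the source."}

def quotes_balanced(text, opening="“", closing="”"):
    """Return True if every opening quote has a matching closing quote."""
    depth = 0
    for char in text:
        if char == opening:
            depth += 1
        elif char == closing:
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

def mark_paragraph(text):
    """Wrap text in a toy <paragraph> element, adding a warning PI if needed."""
    element = ET.Element("paragraph")
    element.text = text
    if not quotes_balanced(text):
        element.append(ET.ProcessingInstruction("warning", "code=W01"))
    return element

xml_output = ET.tostring(
    mark_paragraph("The word “region is replaced."), encoding="unicode"
)
```

    A user reading the output would look up the code in the warning table,
fix the quotes in the plain source and run the marker again.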


4      Qualitative evaluation
The main automaton, dedicated to the analysis of the enacting terms section
(the body of the document), consists of 83 regular expressions, 22 states (or
start conditions) and more than 70 rules based on start conditions and stacks
of start conditions. It handles ten different kinds of partitions (book, part,
chapter, title, section, article, paragraph, alphabetical, numerical and
bulleted list), and various other entities defined in the NIR schema, such as
partition title, decoration, amendment, and so on. The lex file that defines
the Flex scanner comprises more than one thousand lines of code.

4.1    Shape of legislative acts
Legislative documents may be particularly complex from a structural point of
view. Let’s have a look at a couple of numerical examples.
    In Italy, the largest legislative documents are probably the Budget laws.
For example, bill 1183 (footnote 6) of 2007, the 2007 Budget bill, contains in
the main act, including the amendments, 1122 paragraphs, 46 articles, 410
alphabetical list partitions and 75 numerical list partitions: 1653
partitions altogether!
    The nesting depth of the body varies as well, but it isn’t uncommon to run
into bills with almost ten nested partitions. For example, bill 3328
(footnote 7) of 2005 has, take a deep breath, a four-item numerical list
inside letter “a” in paragraph “3” of article “165-ter” within section
“VI-bis” inside an amendment in paragraph “1” within article “6” of chapter
“III” in title “I”!

4.2    On-road test
The Italian Senate provided a large data-set of bills on which several tests
have been carried out, with the aim of progressively refining the software.
Because the complexity of the input makes a precise statistical analysis of
the accuracy on the whole data-set a hard task, in this section we try to give
at least a qualitative idea of the capabilities of xmLegesMarker, by reporting
the outcome of the mark-up process on the two, quite representative, bills
discussed above.


                        Table 1. Bill 3328 mark-up details

                      Partition     Total  Amendment  Missing
                      Title             6          0        -
                      Chapter          10          1        -
                      Section           4          4        -
                      Article          81         39        -
                      Paragraph       249        155        -
                      Alphabetical    178         73        -
                      Numerical        50          8        -


(6) http://www.senato.it/leg/15/BGT/Schede/Ddliter/27212.htm
(7) http://www.senato.it/leg/14/BGT/Schede/Ddliter/22640.htm

Table 2. Bill 1183 mark-up details, before and after correction of the original input

                      Before correction           After correction
     Partition     Total  Amendment  Missing   Total  Amendment  Missing
     Article          35         17       11      46         28        -
     Paragraph      1072        867       50    1122        152        -
     Alphabetical    398        324       12     410         98        -
     Numerical        66         59        9      66         28        9


   Although bill 3328 is quite complex, the mark-up of its body was perfect
and the marker didn’t miss any partition. On the other hand, the same process
on the huge bill 1183 triggered two warning messages:
    – unbalanced quotes in art. 18, paragraph 45;
    – an unnumbered paragraph inside an amendment in art. 18.
    As discussed, the first one is a disruptive problem, which indeed causes
a rather bad result. After correcting the two problems in the original plain
input, the mark-up process was run again and the outcome was excellent: the
only imperfection is due to two non-standard numerical lists (“1.1”, “1.2”,
etc.), a format that is not included in the rules for legislative drafting
issued by the Parliament and, consequently, is not included in the NIR
drafting standards either.
    Tables 1 and 2 report, for each type of partition, the total number of
occurrences found by the marker, how many of them were found inside amendments
and how many were missed.


5      An Italian Senate application
This section shows how the promising results of xmLegesMarker are being
evaluated within the TafWeb application, supported by the Italian Senate.

5.1     Scenario
The article-by-article discussion of a newly proposed bill takes place, under
the so-called “ordinary legislative procedure”, in one of the two chambers of
the Italian Parliament. During the discussion, amendments are voted on and
applied to the bill. Once agreement is reached, the amended bill moves to the
other chamber, which is entitled to apply further modifications and send it
back to the previous chamber, until a chamber approves the bill without
introducing new modifications, which terminates the process.
    During the process, the effects of approved amendments, i.e., the differences
between the two versions of a bill under debate, are represented using a TAF
document, which stands for Testo A Fronte, a parallel text organized in two
columns, with the original text on the left and the modified text on the right
(Fig. 2). The TAF document is very useful for two reasons:

– it highlights the effects of amendments by using specific textual represen-
  tations for each kind of modification, making it easier to understand where
  and how a bill has been modified;
– it can be used to limit the analysis and the discussion of the bill only to the
  changed parts.


                    Fig. 2. An excerpt of a TAF document

5.2     TafWeb
TafWeb (footnote 8) is an experimental web service developed by the Department
of Computer Science of the University of Bologna under the supervision of the
Italian Senate. It aims at automatically generating the first version of a TAF
document, in order to reduce the amount of work needed to produce these
documents from scratch. The overall system is currently under development for
evaluation purposes. The core of TafWeb is JNdiff (footnote 9), derived from
Ndiff [7][8], a highly configurable algorithm for smartly comparing XML
documents.
   Thanks to the integration of xmLegesMarker into the TafWeb environment, it
is possible to implement a service which, starting from the plain Chamber and
Senate versions of a bill, automatically produces a document representing the
TAF. The main steps involved are:
 – the conversion of the plain Senate and Chamber versions of a bill into XML
   through xmLegesMarker;
 – the computation of the “difference document” through JNdiff;
 – the application of a style sheet to the original and to the difference
   document, generating the TAF in the official formats: XHTML, Office Open
   XML, PDF.

6      Conclusions
In this paper we presented xmLegesMarker, a powerful parser for Italian
legislative documents. Its main features and capabilities were analyzed in
depth and a qualitative evaluation of its mark-up accuracy on plain bills was
provided.
    Besides the benefits in the information retrieval field, the conversion of
a legislative corpus into XML permits the development of useful software
specific to the legislative domain. The scientific collaboration between the
Italian Senate and Ittig, currently focusing on the promising outcomes of the
marker, enabled the realization of a process that automatically produces the
first version of a parallel text from the plain Chamber and Senate versions of
a bill under debate. Within the TafWeb environment, JNdiff, a comparison
algorithm for XML documents, exploits the very detailed mark-up generated by
xmLegesMarker in order to capture amendments and structural differences.
    The automatic production of the parallel text, a process manually performed
to date, reduces the burden of communication between the two chambers of the
Italian Parliament and speeds up the legislative amending process.


7      Acknowledgements
This work is supported by a grant from the Office for the development of the
information systems of the Italian Senate.
(8) http://sourceforge.net/projects/tafweb/
(9) http://sourceforge.net/projects/jndiff/

References
1. Biagioli, C., Francesconi, E., Spinosa, P., Taddei, M.: The NIR project: Standards
   and tools for legislative drafting and legal document web publication. In Proceed-
   ings of ICAIL Workshop on e-Government: Modeling Norms and Concepts as Key
   Issues, pp. 69-78 (2003).
2. Agnoloni, T., Francesconi, E., Spinosa, P.: xmLegesEditor: an opensource visual
   XML editor for supporting legal national standards. In Proceedings of the V Leg-
   islative XML Workshop, European Press Academic Publishing, pp. 239-251 (2007).
3. Biagioli, C.: Ipotesi di modello descrittivo del testo legislativo per l’accesso in rete
   a informazioni giuridiche. Informatica e Diritto 2:90 (2000).
4. Spinosa, P.: Identification of legal documents through URNs (uniform resource
   names). In Proceedings of the EuroWeb 2001, The Web in Public Administration
   (2001).
5. Rabiner, L.R.: A Tutorial on Hidden Markov Models and Selected Applications in
   Speech Recognition. Proceedings of the IEEE 77 (2): pp. 257-286 (1989).
6. Francesconi, E.: The “Norme in Rete”- project: Standards and tools for Italian
   legislation. International Journal of Legal Information, vol. 34, no. 2, pp. 358-376
   (2006).
7. Schirinzi, M., Vitali, F., Di Iorio, A.: Ndiff, un approccio naturale al confronto di
   documenti XML (2007).
8. Di Iorio, A., Marchetti, C., Schirinzi, M., Vitali, F.: A Natural and Multi-layered
   Approach to Detect Changes in Tree-Based Textual Documents. (to appear) Pro-
   ceedings of the 11th International Conference on Enterprise Information Systems
   (ICEIS), (2009).