<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CLiC-it</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>A preliminary release of the Italian Parliamentary Corpus</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Valentino Frasnelli</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessio Palmero Aprosio</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fondazione Bruno Kessler</institution>
          ,
          <addr-line>Via Sommarive 18, I-38121 Trento</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Università di Trento</institution>
          ,
          <addr-line>Via Giuseppe Verdi 26, I-38122 Trento</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>9</volume>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>Political debates have long been used in political and social studies of languages and their cultures. In this paper, we release a preliminary version of the Italian Parliamentary Corpus, a dataset of 1.2 billion words that includes the political debates in the Italian Parliament from 1848 to 2018. The data was collected by applying Optical Character Recognition (OCR) software to the original documents, available in PDF format on the websites of the Camera dei Deputati and the Senato della Repubblica.</p>
      </abstract>
      <kwd-group>
        <kwd>Parliamentary Corpus</kwd>
        <kwd>Political debates</kwd>
        <kwd>OCR post-correction</kwd>
        <kwd>Italian Parliament</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The analysis of parliamentary debates is very important from many research perspectives. Apart from political science, this kind of data can be used to understand how a language and its culture evolve over time. In particular, over the last two centuries Italian society has changed in many respects. Since the transition from absolute monarchy to parliamentary monarchy, which took place in 1848, Italy has gone through historical events such as two world wars, the fascist dictatorship, the exile of the royal family, the introduction of universal suffrage, the accession to the European Union, and much more. Such important milestones, along with the rest of Italian political and social life, are traced in the parliamentary reports.</p>
      <p>Many research groups around the world have already collected and released corpora of political debates in various languages, used in diverse fields such as religion [<xref ref-type="bibr" rid="ref1">1</xref>] and gender [<xref ref-type="bibr" rid="ref2">2</xref>] studies, multilinguality [<xref ref-type="bibr" rid="ref3">3</xref>], and so on. GerParCor [<xref ref-type="bibr" rid="ref4">4</xref>] is a dataset containing the German-language parliamentary protocols from three centuries and four countries. Similarly, siParl [<xref ref-type="bibr" rid="ref5">5</xref>], DutchParl [<xref ref-type="bibr" rid="ref6">6</xref>], and the Polish Parliamentary Corpus [<xref ref-type="bibr" rid="ref7">7</xref>] are collections of political debates in Slovenian, Dutch, and Polish, respectively. In addition, since the creation of the European Union, the political debates of the European Parliament have been made available in multiple languages, becoming a precious resource for machine translation [<xref ref-type="bibr" rid="ref8">8</xref>].</p>
      <p>In this paper, we present the preliminary version of the Italian Parliamentary Corpus, a collection of documents covering 200 years and containing all the documents produced by the two houses of the bicameral Italian Parliament (Camera dei Deputati, the lower house, and Senato della Repubblica, previously Senato del Regno, the upper house).</p>
      <p>The rest of this article is structured as follows. In Section 2 we describe how the raw data has been collected. In Section 3 we show the steps performed to get the clean texts. Section 4 contains some statistics of the dataset. Finally, both the source code and the dataset are available for download, as described in Section 5.</p>
    </sec>
    <sec>
      <title>2. Data collection</title>
      <p>[...] time interval have already been digitalized, but not yet published at the time of writing, we could obtain them thanks to the precious help from the Servizio dei Resoconti e della Comunicazione istituzionale del Senato della Repubblica.</p>
      <p>In both cases, documents dated before 1996 were not produced natively in a digital format and are therefore available only as scanned PDFs. Starting from 1996 (Republic Legislature number XIII), debates have also been published in text format on the web.</p>
    </sec>
    <sec id="sec-2">
      <title>3. Processing</title>
      <p>To convert scanned PDF documents to text, we used Optical Character Recognition (OCR), in particular Tesseract [<xref ref-type="bibr" rid="ref9">9</xref>], software originally developed by Hewlett-Packard and subsequently released as open source. Tesseract is free to use and supports more than 100 languages out of the box (among them, Italian).</p>
      <p>After the conversion, the data is cleaned using some
rule-based heuristics: headers, footers and indexes are
removed, hyphenated words are joined, and pages are
merged.</p>
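      <p>As an illustration of two of these heuristics, the sketch below joins words hyphenated at a line break and merges consecutive pages. It is a minimal example, not the released scripts: the function names <monospace>join_hyphenated</monospace> and <monospace>merge_pages</monospace> and the sample sentence are hypothetical, and header, footer and index removal is not covered.</p>
      <preformat><![CDATA[
```python
import re
from typing import List


def join_hyphenated(text: str) -> str:
    """Join words split by a hyphen at a line break ('docu-\\nments' -> 'documents')."""
    # A word character, a hyphen, a newline (with optional spaces), a word character.
    return re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)


def merge_pages(pages: List[str]) -> str:
    """Concatenate page texts, also repairing words hyphenated across page breaks."""
    merged = "\n".join(page.strip() for page in pages)
    return join_hyphenated(merged)


# Hypothetical two-page sample with one in-page and one cross-page hyphenation.
pages = [
    "La seduta è aper-\nta alle ore 14 e il presi-",
    "dente legge il verbale.",
]
print(merge_pages(pages))
```
]]></preformat>
      <p>This naive rule also joins genuine compounds that happen to break at a line end, so a real pipeline would likely need a lexicon check before removing the hyphen.</p>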
      <p>Finally, we needed to test the OCR output quality. To do this, we compiled a gold standard consisting of 30 manually transcribed pages, taken from different legislatures spanning from 1848 to 1996.</p>
      <p>To evaluate the accuracy of the extraction, we use two metrics: word error rate (WER) and character error rate (CER). The error rates are derived from the Levenshtein distance [<xref ref-type="bibr" rid="ref10">10</xref>] and quantify the number of operations (insertions, deletions and substitutions) needed to transform one string into the other. They are common metrics for evaluating the performance of speech recognition and machine translation systems, but are often used for OCR as well [<xref ref-type="bibr" rid="ref11">11</xref>].</p>
      <p>They are computed as follows:</p>
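      <disp-formula><tex-math><![CDATA[\mathrm{WER/CER} = \frac{I + S + D}{N}]]></tex-math></disp-formula>
      <p>where <italic>I</italic>, <italic>S</italic>, and <italic>D</italic> represent the number of insertions, substitutions, and deletions respectively, and <italic>N</italic> is the total number of instances (words or characters, depending on which metric is considered). The lower the value, the higher the accuracy.</p>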
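      <p>The two metrics can be sketched in a few lines of Python. This is an illustrative implementation rather than the evaluation script used for the paper; it assumes that the total number of instances is the length of the gold-standard reference and that words are obtained by whitespace splitting. All function names are hypothetical.</p>
      <preformat><![CDATA[
```python
from typing import Sequence


def levenshtein(ref: Sequence, hyp: Sequence) -> int:
    """Minimum number of insertions, deletions and substitutions turning ref into hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref: Sequence, hyp: Sequence) -> float:
    """Edit operations divided by the reference length (assumed denominator)."""
    if not ref:
        raise ValueError("reference must not be empty")
    return levenshtein(ref, hyp) / len(ref)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edit distance over whitespace-separated tokens."""
    return error_rate(reference.split(), hypothesis.split())


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance over characters."""
    return error_rate(reference, hypothesis)


gold = "la seduta è aperta"
ocr = "la sedula è aperta"   # one character misread
print(wer(gold, ocr), cer(gold, ocr))
```
]]></preformat>
      <p>In the sample, a single misread character makes one word out of four wrong, but only one character out of eighteen, which is why WER is typically much higher than CER on the same text.</p>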
      <p>As a baseline, we first evaluated the accuracy of the extraction on the output of Tesseract. Then, we applied the spell-checker software SymSpell (https://github.com/wolfgarbe/SymSpell). Since SymSpell only works on words (or word-like strings), we removed all the punctuation marks from the text. We also ignore case and consider every word as lowercase.</p>
      <p>SymSpell corrects documents using dictionaries in the format &lt;word&gt; &lt;frequency&gt;, with one entry for each word one wants to insert in the dictionary. Since the default SymSpell Italian dictionary is built on top of recent, general-purpose texts, we attempted to create dictionaries using the lexicon present in the documents themselves, trying to filter out words containing errors.</p>
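      <p>The normalization and the dictionary format described above can be sketched as follows. This is a minimal example under stated assumptions: punctuation is replaced by spaces (so elided forms are split at the apostrophe), tokens are whitespace-separated, and the helper names <monospace>normalize</monospace> and <monospace>dictionary_lines</monospace> are hypothetical.</p>
      <preformat><![CDATA[
```python
import re
from collections import Counter
from typing import Iterable, List


def normalize(text: str) -> List[str]:
    """Lowercase the text, replace punctuation with spaces, and split into words."""
    no_punct = re.sub(r"[^\w\s]", " ", text.lower())
    return no_punct.split()


def dictionary_lines(tokens: Iterable[str]) -> List[str]:
    """Format word counts as '<word> <frequency>' lines, most frequent first."""
    counts = Counter(tokens)
    return [f"{word} {freq}" for word, freq in counts.most_common()]


tokens = normalize("Il Presidente: la seduta, la seduta è aperta.")
for line in dictionary_lines(tokens):
    print(line)
```
]]></preformat>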
      <p>The idea is to create custom dictionaries for each
legislature, containing only words coming from the time period
of that legislature, in order to better capture the
historical nuances for each legislature. To avoid as much as
possible inserting words with spelling errors into the
dictionaries, only words with a Tesseract confidence score
over a user-set threshold (meaning that their
recognition is likely accurate) were inserted in the dictionary.</p>
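      <p>A possible shape for this filtering step is sketched below. The input format (pairs of word and confidence), the 0 to 100 confidence scale, the threshold value of 90 and the function name <monospace>build_dictionary</monospace> are assumptions made for illustration; they are not taken from the released scripts.</p>
      <preformat><![CDATA[
```python
from collections import Counter
from typing import Iterable, Tuple


def build_dictionary(ocr_words: Iterable[Tuple[str, float]],
                     threshold: float = 90.0) -> Counter:
    """Count only the words whose OCR confidence is strictly above the threshold."""
    counts: Counter = Counter()
    for word, confidence in ocr_words:
        if confidence > threshold:
            counts[word.lower()] += 1
    return counts


# Hypothetical OCR output: (word, confidence) pairs for one legislature.
ocr_words = [
    ("Camera", 96.0),
    ("deputati", 93.5),
    ("deputatl", 41.0),   # low confidence, likely a misrecognition: skipped
    ("camera", 91.2),
]
print(build_dictionary(ocr_words))
```
]]></preformat>
      <p>A higher threshold keeps the dictionary cleaner but smaller, which is one reason for merging dictionaries across neighbouring legislatures.</p>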
      <p>Furthermore, to make its creation more robust, the dictionary for a specific legislature is merged with those of chronologically adjacent legislatures, so that each dictionary contains words from both its legislature of origin and a user-selected window of adjacent legislatures (for instance, a window of 7 legislatures gives the dictionary an average span of around 35 years). Figure 1 shows how the windowed dictionary system works.</p>
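      <p>The windowed merge can be sketched as follows. The example assumes that the window is centred on the target legislature and truncated at the beginning and end of the sequence; the text does not state how the window is positioned, so this is only one plausible reading. The function name <monospace>windowed_dictionary</monospace> and the toy counts are hypothetical.</p>
      <preformat><![CDATA[
```python
from collections import Counter
from typing import List


def windowed_dictionary(per_legislature: List[Counter],
                        index: int,
                        window: int = 7) -> Counter:
    """Merge the dictionary at `index` with its neighbours inside a centred window."""
    half = window // 2
    start = max(0, index - half)
    end = min(len(per_legislature), index + half + 1)
    merged: Counter = Counter()
    for counts in per_legislature[start:end]:
        merged.update(counts)   # Counter.update adds frequencies together
    return merged


# Toy frequency dictionaries for four consecutive legislatures.
per_legislature = [
    Counter({"statuto": 1}),
    Counter({"regno": 2}),
    Counter({"statuto": 3, "senato": 1}),
    Counter({"repubblica": 4}),
]
print(windowed_dictionary(per_legislature, index=1, window=3))
```
]]></preformat>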
      <p>In theory, this gives SymSpell access to a more domain-specific and historically realistic lexicon than the Italian dictionary that ships with the software.</p>
      <p>Looking at the errors made by SymSpell, most of the problems involve proper names (such as persons and geographical entities), which are often not included in the dictionary and are replaced by existing words very close to the apparently misspelled term.</p>
      <p>We then compare four different approaches: plain OCR output from Tesseract, SymSpell with the original dictionary, SymSpell with the windowed dictionary, and SymSpell with the windowed dictionary applied only to lowercased words.</p>
      <p>Table 1: Mean CER and WER against the test set (the lower, the better).</p>
      <p>Table 1 shows the results of the four configurations. The CER and WER values calculated without applying SymSpell are lower than the others, indicating a more accurate extraction. However, the use of the custom frequency list and the removal of proper nouns seem promising when compared to SymSpell applied with the original model.</p>
      <p>By looking at the data, we can infer some useful insights. First of all, the raw text returned by Tesseract is already very precise: the Italian documents are printed in a very clear font, and the digitization has been done at a good level. The errors show that SymSpell replaced right words with wrong ones in the case of proper names and very technical words, as expected.</p>
      <p>In this first release, then, we do not use any spelling correction software, and provide the raw text extracted by Tesseract.</p>
    </sec>
    <sec id="sec-3">
      <title>4. Dataset statistics</title>
      <p>Table 2 shows some statistics of the dataset. In particular, for each legislature, one can see the number of words, pages and documents. For recent legislatures (since 1996) the data is published in HTML format on the web, therefore the number of pages is not available.</p>
      <table-wrap>
        <label>Table 2</label>
        <caption>
          <p>Time span of each legislature covered by the dataset.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Period</th></tr>
          </thead>
          <tbody>
            <tr><td>8 May 1848 - 30 Dec 1848</td></tr>
            <tr><td>1 Feb 1849 - 30 Mar 1849</td></tr>
            <tr><td>30 Jul 1849 - 20 Nov 1849</td></tr>
            <tr><td>20 Dec 1849 - 20 Nov 1853</td></tr>
            <tr><td>19 Dec 1853 - 25 Oct 1857</td></tr>
            <tr><td>14 Dec 1857 - 21 Jan 1860</td></tr>
            <tr><td>2 Apr 1860 - 17 Dec 1860</td></tr>
            <tr><td>18 Feb 1861 - 7 Sep 1865</td></tr>
            <tr><td>18 Nov 1865 - 13 Feb 1867</td></tr>
            <tr><td>22 Mar 1867 - 2 Nov 1870</td></tr>
            <tr><td>5 Dec 1870 - 20 Sep 1874</td></tr>
            <tr><td>23 Nov 1874 - 3 Oct 1876</td></tr>
            <tr><td>20 Nov 1876 - 2 May 1880</td></tr>
            <tr><td>26 May 1880 - 2 Oct 1882</td></tr>
            <tr><td>22 Nov 1882 - 27 Apr 1886</td></tr>
            <tr><td>10 Jun 1886 - 22 Oct 1890</td></tr>
            <tr><td>10 Dec 1890 - 27 Sep 1892</td></tr>
            <tr><td>23 Nov 1892 - 8 May 1895</td></tr>
            <tr><td>10 Jun 1895 - 2 Mar 1897</td></tr>
            <tr><td>5 Apr 1897 - 17 May 1900</td></tr>
            <tr><td>16 Jun 1900 - 18 Oct 1904</td></tr>
            <tr><td>30 Nov 1904 - 8 Feb 1909</td></tr>
            <tr><td>24 Mar 1909 - 29 Sep 1913</td></tr>
            <tr><td>27 Nov 1913 - 29 Sep 1919</td></tr>
            <tr><td>1 Dec 1919 - 7 Apr 1921</td></tr>
            <tr><td>11 Jun 1921 - 25 Jan 1924</td></tr>
            <tr><td>24 May 1924 - 21 Jan 1929</td></tr>
            <tr><td>20 Apr 1929 - 19 Jan 1934</td></tr>
            <tr><td>28 Apr 1934 - 2 Mar 1939</td></tr>
            <tr><td>23 Mar 1939 - 5 Aug 1943</td></tr>
            <tr><td>25 Sep 1945 - 1 Jun 1946</td></tr>
            <tr><td>25 Jun 1946 - 31 Jan 1948</td></tr>
            <tr><td>8 May 1948 - 24 Jun 1953</td></tr>
            <tr><td>25 Jun 1953 - 11 Jun 1958</td></tr>
            <tr><td>12 Jun 1958 - 15 May 1963</td></tr>
            <tr><td>16 May 1963 - 4 Jun 1968</td></tr>
            <tr><td>5 Jun 1968 - 24 May 1972</td></tr>
            <tr><td>25 May 1972 - 4 Jul 1976</td></tr>
            <tr><td>5 Jul 1976 - 19 Jun 1979</td></tr>
            <tr><td>20 Jun 1979 - 11 Jul 1983</td></tr>
            <tr><td>12 Jul 1983 - 1 Jul 1987</td></tr>
            <tr><td>2 Jul 1987 - 22 Apr 1992</td></tr>
            <tr><td>23 Apr 1992 - 14 Apr 1994</td></tr>
            <tr><td>15 Apr 1994 - 8 May 1996</td></tr>
            <tr><td>9 May 1996 - 29 May 2001</td></tr>
            <tr><td>30 May 2001 - 27 Apr 2006</td></tr>
            <tr><td>28 Apr 2006 - 28 Apr 2008</td></tr>
            <tr><td>29 Apr 2008 - 14 Mar 2013</td></tr>
            <tr><td>15 Mar 2013 - 22 Mar 2018</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-4">
      <title>5. Release</title>
      <p>Both the data and the scripts (written in Python) are free to use and released on GitHub (https://github.com/valefras/Italian_Parliament_Symspell).</p>
      <p>The data contained in the Camera dei Deputati and Senato della Repubblica websites is released under the Creative Commons Attribution 3.0 license (https://creativecommons.org/licenses/by/3.0/). We use the same policy and distribute the text data under the same license.</p>
    </sec>
    <sec>
      <title>6. Conclusions</title>
      <p>In this paper we describe a preliminary version of the Italian Parliamentary Corpus, containing the Italian Parliament debates since 1848. In total, around 1.2 billion words have been collected.</p>
      <p>In the future, we will further investigate OCR post-correction solutions to get cleaner data. We will also complete the data collection by downloading and processing attachments to the parliamentary sessions, bulletins, law proposals, and reports of the Standing Committees, already available on the websites of the two houses of the Italian Parliament.</p>
      <p>We are also planning to assign each speech to the corresponding politician, and release the dataset so that anyone can use the tagging to make comparative and social studies.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <article-title>Islamophobia, muslimophobia or racism? Parliamentary discourses on Islam and Muslims in debates on the minaret ban in Switzerland</article-title>
          ,
          <source>Discourse &amp; Society</source>
          <volume>26</volume>
          (
          <year>2015</year>
          )
          <fpage>562</fpage>
          -
          <lpage>586</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Paoletti</surname>
          </string-name>
          ,
          <article-title>La presenza femminile nelle assemblee parlamentari: Per un'analisi comparata</article-title>
          ,
          <source>Il Politico</source>
          <volume>56</volume>
          (
          <year>1991</year>
          )
          <fpage>77</fpage>
          -
          <lpage>96</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Bayley</surname>
          </string-name>
          ,
          <article-title>Cross-cultural perspectives on parliamentary discourse</article-title>
          ,
          <source>Cross-Cultural Perspectives on Parliamentary Discourse</source>
          (
          <year>2004</year>
          )
          <fpage>1</fpage>
          -
          <lpage>390</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>G.</given-names>
            <surname>Abrami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bagci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hammerla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mehler</surname>
          </string-name>
          ,
          <article-title>German parliamentary corpus (gerparcor)</article-title>
          ,
          <source>in: Proceedings of the Language Resources and Evaluation Conference</source>
          , European Language Resources Association, Marseille, France,
          <year>2022</year>
          , pp.
          <fpage>1900</fpage>
          -
          <lpage>1906</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Pancur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Erjavec</surname>
          </string-name>
          ,
          <article-title>The siParl corpus of Slovene parliamentary proceedings</article-title>
          ,
          <source>in: Proceedings of the Second ParlaCLARIN Workshop</source>
          , European Language Resources Association, Marseille, France,
          <year>2020</year>
          , pp.
          <fpage>28</fpage>
          -
          <lpage>34</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Marx</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Schuth</surname>
          </string-name>
          ,
          <article-title>DutchParl. The parliamentary documents in Dutch</article-title>
          ,
          <source>in: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)</source>
          , European Language Resources Association (ELRA)
          , Valletta, Malta,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ogrodniczuk</surname>
          </string-name>
          ,
          <article-title>Polish Parliamentary Corpus</article-title>
          , in:
          <string-name>
            <given-names>D.</given-names>
            <surname>Fišer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Eskevich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>de Jong</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the LREC 2018 Workshop ParlaCLARIN: Creating and Using Parliamentary Corpora</source>
          , European Language Resources Association (ELRA)
          , Paris, France,
          <year>2018</year>
          , pp.
          <fpage>15</fpage>
          -
          <lpage>19</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P.</given-names>
            <surname>Koehn</surname>
          </string-name>
          ,
          <article-title>Europarl: A parallel corpus for statistical machine translation</article-title>
          ,
          <source>in: Proceedings of Machine Translation Summit X: Papers</source>
          , Phuket, Thailand,
          <year>2005</year>
          , pp.
          <fpage>79</fpage>
          -
          <lpage>86</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kay</surname>
          </string-name>
          ,
          <article-title>Tesseract: An open-source optical character recognition engine</article-title>
          ,
          <source>Linux J.</source>
          <volume>2007</volume>
          (
          <year>2007</year>
          )
          <fpage>2</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>V. I.</given-names>
            <surname>Levenshtein</surname>
          </string-name>
          ,
          <article-title>Binary codes capable of correcting deletions, insertions and reversals</article-title>
          ,
          <source>Soviet Physics Doklady</source>
          <volume>10</volume>
          (
          <year>1966</year>
          )
          <fpage>707</fpage>
          -
          <lpage>710</lpage>
          . Doklady Akademii Nauk SSSR, V163 No4 845-848 1965.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Schulz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kuhn</surname>
          </string-name>
          ,
          <article-title>Multi-modular domain-tailored OCR post-correction</article-title>
          ,
          <source>in: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Copenhagen, Denmark,
          <year>2017</year>
          , pp.
          <fpage>2716</fpage>
          -
          <lpage>2726</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.18653/v1/D17-1288</pub-id>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>