CLiC-it

A preliminary release of the Italian Parliamentary Corpus

Valentino Frasnelli

Alessio Palmero Aprosio

Fondazione Bruno Kessler, Via Sommarive 18, I-38121 Trento, Italy
Università di Trento, Via Giuseppe Verdi 26, I-38122 Trento, Italy

2023

Political debates have been used for years in political and social studies on languages and their cultures. In this paper, we release a preliminary version of the Italian Parliamentary Corpus, a dataset containing 1.2 billion words that includes the political debates in the Italian Parliament from 1848 to 2018. The data has been collected by applying Optical Character Recognition (OCR) software to the original documents, available in PDF format on the websites of the Camera dei Deputati and the Senato della Repubblica.

Keywords: Parliamentary Corpus, Political debates, OCR post-correction, Italian Parliament

1. Introduction

The analysis of parliamentary debates is important from many research perspectives. Beyond political science, this kind of data can be used to understand how a language and its culture evolve over time. In particular, Italian society has changed in many respects over the last two centuries. Since the transition from absolute to parliamentary monarchy in 1848, Italy has gone through historical events such as two world wars, the fascist dictatorship, the exile of the royal family, the introduction of universal suffrage, the accession to the European Union, and much more. These milestones, along with the rest of Italian political and social life, are traced in the parliamentary reports.

Many research groups around the world have already collected and released corpora of political debates in various languages, used in diverse fields such as religion [1] and gender [2] studies, multilinguality [3], and so on. GerParCor [4] is a dataset containing German-language parliamentary protocols from three centuries and four countries. Similarly, siParl [5], DutchParl [6], and the Polish Parliamentary Corpus [7] are collections of political debates in Slovenian, Dutch, and Polish respectively. In addition, since the creation of the European Union, the debates of the European Parliament have been made available in multiple languages, becoming a precious resource for machine translation [8].

In this paper, we present the preliminary version of the Italian Parliamentary Corpus, a collection of documents covering 170 years (1848 to 2018) and containing all the documents produced by the two houses of the bicameral Italian Parliament (Camera dei Deputati, the lower house, and Senato della Repubblica, previously Senato del Regno, the upper house).

The rest of this article is structured as follows. In Section 2 we describe how the raw data has been collected. In Section 3 we show the steps performed to obtain the clean texts. Section 4 contains some statistics of the dataset. Finally, both the source code and the dataset are available for download, as described in Section 5.

2. Data collection

[…] time interval have already been digitized but were not yet published at the time of writing; we could obtain them thanks to the precious help of the Servizio dei Resoconti e della Comunicazione istituzionale del Senato della Repubblica.

In both cases, documents dated before 1996 were not produced natively in a digital format and are therefore available only as scanned PDFs. Starting from 1996 (Republic Legislature number XIII), debates have also been published in text format on the web.

3. Processing

To convert the scanned PDF documents to text, we used Optical Character Recognition (OCR), in particular Tesseract [9], software originally developed by Hewlett-Packard and subsequently released as open source. Tesseract is free to use and supports more than 100 languages out of the box, Italian among them.
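As an illustration of this step, the sketch below builds and runs a Tesseract command-line call for a single page image. It assumes that the `tesseract` executable and its Italian model (`ita`) are installed and that each PDF page has already been rendered to an image; the function names and file names are hypothetical and are not taken from the released scripts.

```python
import subprocess
from pathlib import Path


def tesseract_command(image_path, output_base, lang="ita", tsv=False):
    """Build the argument list for one Tesseract CLI call (nothing is executed here)."""
    cmd = ["tesseract", str(image_path), str(output_base), "-l", lang]
    if tsv:
        # The "tsv" config makes Tesseract write <output_base>.tsv,
        # which includes a per-word confidence column.
        cmd.append("tsv")
    return cmd


def run_ocr(image_path, output_base, lang="ita"):
    """Run Tesseract on one page image and return the recognised text."""
    subprocess.run(tesseract_command(image_path, output_base, lang), check=True)
    # Without extra configs, Tesseract writes plain text to <output_base>.txt.
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")
```

Calling `run_ocr("page_001.png", "page_001")` would then return the text of that page, provided Tesseract is available on the system.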

After the conversion, the data is cleaned using some rule-based heuristics: headers, footers and indexes are removed, hyphenated words are joined, and pages are merged.
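A minimal sketch of such heuristics is given below. The exact rules used for the corpus are not listed in the text, so the header/footer pattern (a bare page number or a line starting with "Atti Parlamentari") is only an assumption, and index removal is not covered.

```python
import re

# Hypothetical pattern for running headers/footers: a bare page number,
# or a line starting with "Atti Parlamentari". The real rules may differ.
HEADER_FOOTER = re.compile(r"^\s*(?:\d+|atti parlamentari\b.*)\s*$", re.IGNORECASE)

# A word split at a line end: letters, a hyphen, a line break, then more letters.
HYPHEN_BREAK = re.compile(r"(\w+)-\s*\n\s*(\w+)")


def clean_pages(pages):
    """Drop header/footer lines, merge pages, and rejoin hyphenated words."""
    kept = []
    for page in pages:
        for line in page.splitlines():
            # Skip empty lines and lines that look like headers or footers.
            if line.strip() and not HEADER_FOOTER.match(line):
                kept.append(line.strip())
    # Merging pages before de-hyphenation also repairs words split across pages.
    text = "\n".join(kept)
    return HYPHEN_BREAK.sub(r"\1\2", text)
```

Note that this simple rule also joins genuine hyphenated compounds that happen to fall at a line end, and drops any text line consisting only of a number; a production pipeline would need additional checks.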

Finally, we needed to test the OCR output quality. To do this, we compiled a gold standard consisting of 30 manually transcribed pages, taken from different legislatures spanning from 1848 to 1996.

To evaluate the accuracy of the extraction, we use two metrics: word error rate (WER) and character error rate (CER). Both error rates are derived from the Levenshtein distance [10] and quantify the number of operations (insertions, deletions and substitutions) needed to transform one string into the other. They are common metrics for evaluating the performance of speech recognition and machine translation systems, but are also often used for OCR [11].

They are computed as follows:

$$\mathrm{WER/CER} = \frac{I + S + D}{N}$$

where $I$, $S$, and $D$ represent the number of insertions, substitutions, and deletions respectively, and $N$ is the total number of instances (words or characters, depending on which metric is considered). The lower the value, the higher the accuracy.
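The two metrics can be sketched in a few lines of Python. This is an illustrative implementation, not the evaluation script used for the paper; it assumes that the total number of instances is counted on the gold-standard (reference) text, and that words are obtained by splitting on whitespace.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit operations divided by the number of reference units (assumed N)."""
    if not ref_units:
        raise ValueError("the reference must not be empty")
    return edit_distance(ref_units, hyp_units) / len(ref_units)


def wer(reference, hypothesis):
    """Word error rate: units are whitespace-separated words."""
    return error_rate(reference.split(), hypothesis.split())


def cer(reference, hypothesis):
    """Character error rate: units are single characters, spaces included."""
    return error_rate(list(reference), list(hypothesis))
```

For example, comparing the reference "la seduta è aperta" with the OCR output "la sedata è aperta" gives one wrong word out of four (WER 0.25) and one wrong character out of eighteen.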

As a baseline, we first evaluated the accuracy of the extraction on the output of Tesseract. Then, we applied the spell-checking software SymSpell (https://github.com/wolfgarbe/SymSpell). Since SymSpell only works on words (or word-like strings), we removed all punctuation marks from the text. We also ignored case and treated every word as lowercase.

SymSpell corrects documents using dictionaries in the format <word> <frequency>, with one entry for every word one wants to insert in the dictionary. Since SymSpell's default Italian dictionary is built from recent, general-purpose texts, we attempted to create dictionaries using the lexicon present in the documents themselves, trying to filter out words containing errors.
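The normalisation applied before spell checking (punctuation removal and lowercasing) could look like the sketch below. Replacing punctuation with a space, so that an elided form such as "L'onorevole" becomes two tokens, is an assumption made here for illustration; the text does not specify how apostrophes were handled.

```python
import re

# Anything that is neither a word character nor whitespace is treated as punctuation.
PUNCTUATION = re.compile(r"[^\w\s]")


def normalize(text):
    """Replace punctuation with spaces, lowercase, and collapse whitespace."""
    without_punctuation = PUNCTUATION.sub(" ", text)
    return " ".join(without_punctuation.lower().split())
```

With this choice, `normalize("L'onorevole Cavour, disse: «Sì!»")` yields `"l onorevole cavour disse sì"`.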

The idea is to create a custom dictionary for each legislature, containing only words from the time period of that legislature, in order to better capture its historical nuances. To limit the number of misspelled words entering the dictionaries, only words with a Tesseract confidence score above a user-set threshold (meaning that their recognition is likely accurate) were inserted.
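A possible way to build such a dictionary is sketched below. It assumes the OCR output is available as (word, confidence) pairs, with confidence on the 0-100 scale reported by Tesseract; the threshold value of 90 and the restriction to purely alphabetic tokens are hypothetical choices, not values taken from the paper.

```python
from collections import Counter


def build_dictionary(words_with_confidence, min_confidence=90.0):
    """Count lowercased words whose OCR confidence is above the threshold."""
    counts = Counter()
    for word, confidence in words_with_confidence:
        # Assumption: keep only alphabetic tokens recognised with high confidence.
        if confidence > min_confidence and word.isalpha():
            counts[word.lower()] += 1
    return counts


def to_symspell_lines(counts):
    """Render counts as '<word> <frequency>' lines, most frequent first."""
    return [f"{word} {frequency}" for word, frequency in counts.most_common()]
```

For instance, given the pairs `("Senato", 96.0)`, `("senato", 93.5)`, `("Scnato", 41.0)` and `("regno", 91.2)`, the low-confidence misreading is discarded and the resulting lines are `senato 2` and `regno 1`.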

Furthermore, to make the dictionaries more robust, the dictionary for a specific legislature is merged with the chronologically adjacent ones, so that each dictionary contains words from both its own legislature and a user-selected window of adjacent legislatures (for instance, a window of 7 legislatures gives a dictionary spanning around 35 years on average). Figure 1 shows how the windowed dictionary system works.
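The windowed merge can be sketched as follows. The sketch assumes that the window is centred on the target legislature and is simply truncated at the beginning and end of the timeline; both are assumptions, since the exact windowing scheme is only shown in Figure 1.

```python
from collections import Counter


def windowed_dictionary(per_legislature, index, window=7):
    """Merge the dictionary at `index` with its chronological neighbours.

    `per_legislature` is a chronologically ordered list of Counters, and
    `window` is the total number of legislatures merged (assumed centred).
    """
    half = window // 2
    start = max(0, index - half)
    end = min(len(per_legislature), index + half + 1)
    merged = Counter()
    for counts in per_legislature[start:end]:
        merged.update(counts)  # word frequencies are summed across legislatures
    return merged
```

With a window of 3, the dictionary for the second legislature in a list is the sum of the dictionaries of the first, second and third legislatures.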

In theory, this gives SymSpell access to a lexicon that is both more domain-specific and more historically realistic than the Italian dictionary that comes out of the box with the software.

Looking at the errors made by SymSpell, most of the problems appear to involve proper names (such as persons and geographical entities), which are often not included in the dictionary and are replaced by existing words very close to the apparently misspelled term.

We then compare four different approaches: plain OCR output from Tesseract, SymSpell with the original dictionary, SymSpell with the windowed dictionary, and SymSpell with the windowed dictionary applied only to lowercased words.

Table 1: Mean CER and WER against the test set (the lower, the better).

Table 1 shows the results of the four configurations. The CER and WER values calculated without applying SymSpell are lower than the others, that is, the extraction is more accurate. However, the use of the custom frequency list and the removal of proper nouns seem promising when compared to SymSpell applied with the original model.

Looking at the data, we can draw some useful insights. First of all, the raw text returned by Tesseract is already very precise: the Italian documents are printed in a very clear font, and the digitization is of good quality. The errors show that SymSpell replaced correct words with wrong ones in the case of proper names and very technical words, as expected.

In this first release, therefore, we do not use any spelling correction software, and we provide the raw text extracted by Tesseract.

4. Dataset statistics

Table 2 shows some statistics of the dataset. In particular, for each legislature, one can see the number of words, pages and documents. For recent legislatures (since 1996), data is published in HTML format on the web, therefore the number of pages is not available.

Table 2, time span of each legislature:
8 May 1848 - 30 Dec 1848
1 Feb 1849 - 30 Mar 1849
30 Jul 1849 - 20 Nov 1849
20 Dec 1849 - 20 Nov 1853
19 Dec 1853 - 25 Oct 1857
14 Dec 1857 - 21 Jan 1860
2 Apr 1860 - 17 Dec 1860
18 Feb 1861 - 7 Sep 1865
18 Nov 1865 - 13 Feb 1867
22 Mar 1867 - 2 Nov 1870
5 Dec 1870 - 20 Sep 1874
23 Nov 1874 - 3 Oct 1876
20 Nov 1876 - 2 May 1880
26 May 1880 - 2 Oct 1882
22 Nov 1882 - 27 Apr 1886
10 Jun 1886 - 22 Oct 1890
10 Dec 1890 - 27 Sep 1892
23 Nov 1892 - 8 May 1895
10 Jun 1895 - 2 Mar 1897
5 Apr 1897 - 17 May 1900
16 Jun 1900 - 18 Oct 1904
30 Nov 1904 - 8 Feb 1909
24 Mar 1909 - 29 Sep 1913
27 Nov 1913 - 29 Sep 1919
1 Dec 1919 - 7 Apr 1921
11 Jun 1921 - 25 Jan 1924
24 May 1924 - 21 Jan 1929
20 Apr 1929 - 19 Jan 1934
28 Apr 1934 - 2 Mar 1939
23 Mar 1939 - 5 Aug 1943
25 Sep 1945 - 1 Jun 1946
25 Jun 1946 - 31 Jan 1948
8 May 1948 - 24 Jun 1953
25 Jun 1953 - 11 Jun 1958
12 Jun 1958 - 15 May 1963
16 May 1963 - 4 Jun 1968
5 Jun 1968 - 24 May 1972
25 May 1972 - 4 Jul 1976
5 Jul 1976 - 19 Jun 1979
20 Jun 1979 - 11 Jul 1983
12 Jul 1983 - 1 Jul 1987
2 Jul 1987 - 22 Apr 1992
23 Apr 1992 - 14 Apr 1994
15 Apr 1994 - 8 May 1996
9 May 1996 - 29 May 2001
30 May 2001 - 27 Apr 2006
28 Apr 2006 - 28 Apr 2008
29 Apr 2008 - 14 Mar 2013
15 Mar 2013 - 22 Mar 2018

5. Release

Both the data and the scripts (written in Python) are free to use and released on GitHub (https://github.com/valefras/Italian_Parliament_Symspell).

The data contained in the Camera dei Deputati and Senato della Repubblica websites is released under the Creative Commons Attribution 3.0 license (https://creativecommons.org/licenses/by/3.0/). We adopt the same policy and distribute the text data under the same license.

6. Conclusions

In this paper we described a preliminary version of the Italian Parliamentary Corpus, containing the Italian Parliament debates since 1848. In total, around 1.2 billion words have been collected.

In the future, we will further investigate OCR post-correction solutions to get cleaner data. We will also complete the data collection by downloading and processing attachments to the parliamentary sessions, bulletins, law proposals, and reports of the Standing Committees, already available on the websites of the two houses of the Italian Parliament.

We are also planning to assign each speech to the corresponding politician, and to release the dataset so that anyone can use the tagging for comparative and social studies.

References

[1] J. E. Cheng, Islamophobia, muslimophobia or racism? Parliamentary discourses on Islam and Muslims in debates on the minaret ban in Switzerland, Discourse & Society 26 (2015) 562-586.

[2] Paoletti, La presenza femminile nelle assemblee parlamentari: Per un'analisi comparata, Il Politico 56 (1991) 77-96.

[3] Bayley, Cross-cultural perspectives on parliamentary discourse, Cross-Cultural Perspectives on Parliamentary Discourse (2004) 1-390.

[4] Abrami, Bagci, Hammerla, Mehler, German parliamentary corpus (GerParCor), in: Proceedings of the Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France, 2022, pp. 1900-1906.

[5] Pancur, T. Erjavec, The siParl corpus of Slovene parliamentary proceedings, in: Proceedings of the Second ParlaCLARIN Workshop, European Language Resources Association, Marseille, France, 2020, pp. 28-34.

[6] Marx, A. Schuth, DutchParl. The parliamentary documents in Dutch, in: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), European Language Resources Association (ELRA), Valletta, Malta, 2010.

[7] Ogrodniczuk, Polish Parliamentary Corpus, in: D. Fišer, M. Eskevich, F. de Jong (Eds.), Proceedings of the LREC 2018 Workshop ParlaCLARIN: Creating and Using Parliamentary Corpora, European Language Resources Association (ELRA), Paris, France, 2018, pp. 15-19.

[8] Koehn, Europarl: A parallel corpus for statistical machine translation, in: Proceedings of Machine Translation Summit X: Papers, Phuket, Thailand, 2005, pp. 79-86.

[9] Kay, Tesseract: An open-source optical character recognition engine, Linux J. 2007 (2007) 2.

[10] V. I. Levenshtein, Binary codes capable of correcting deletions, insertions and reversals, Soviet Physics Doklady 10 (1966) 707-710. Doklady Akademii Nauk SSSR, V163 No4, 845-848, 1965.

[11] Schulz, J. Kuhn, Multi-modular domain-tailored OCR post-correction, in: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Copenhagen, Denmark, 2017, pp. 2716-2726. doi:10.18653/v1/D17-1288.