<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">A Quantitative/Qualitative Approach to OCR Error Detection and Correction in Old Newspapers for Corpus-assisted Discourse Studies</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Dario</forename><forename type="middle">Del</forename><surname>Fante</surname></persName>
							<email>dario.delfante@phd.unipd.it</email>
							<affiliation key="aff0">
								<orgName type="department">Dept. of Linguistic and Literary Studies</orgName>
								<orgName type="institution">University of Padua</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Giorgio</forename><forename type="middle">Maria</forename><surname>Di</surname></persName>
							<affiliation key="aff1">
								<orgName type="department">Dept. of Information Engineering</orgName>
								<orgName type="institution">University of Padua</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="department">Dept. of Mathematics</orgName>
								<orgName type="institution">University of Padua</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">A Quantitative/Qualitative Approach to OCR Error Detection and Correction in Old Newspapers for Corpus-assisted Discourse Studies</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">7BC99E5AA5EE7357D1864D09CA4E05B5</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T16:12+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Corpus-assisted Discourse Studies</term>
					<term>OCR detection</term>
					<term>OCR correction</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The use of OCR software to convert printed characters to digital text is a fundamental tool within diachronic approaches to Corpus-assisted Discourse Studies because it allows researchers to expand their scope by making many texts available and analysable through a computer. However, OCR software is not completely accurate, and the resulting error rate compromises its effectiveness. This paper proposes a mixed qualitative-quantitative approach to OCR error detection and correction in order to develop a methodology for compiling historical corpora. The proposed approach consists of three main steps: corpus creation, OCR error detection and correction, and application of the automatic rules. The rules are implemented in R using a "tidyverse" approach for better reproducibility of the experiments.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>In Corpus-assisted Discourse Studies (CADS) <ref type="bibr" target="#b13">[14]</ref>, the processes of corpus design and corpus compilation have a marked impact on the entire research and, depending on them, the results may shift dramatically. Especially for diachronic studies, there is a scarcity of digitized versions of paper documents; for this reason, it is often necessary to manually transcribe the texts under analysis or to use Optical Character Recognition (OCR) software, which plays a fundamental role in the study of digitized manuscripts <ref type="bibr" target="#b9">[10]</ref>. Figure <ref type="figure">1</ref> shows the three steps of the proposed procedure for the qualitative and quantitative detection and correction of OCR errors.</p><p>However, OCR technologies do not always achieve satisfactory results because of several aspects that affect the quality of the original scan: the quality of the camera with which the image was taken, the image compression algorithm, and the quality of the paper, especially when working with ancient or easily perishable texts such as old newspapers. These errors may crucially affect the results of a search for documents, which may compromise the compilation of a corpus in CADS <ref type="bibr" target="#b1">[2]</ref>.</p><p>In this paper, we propose a procedure for collecting and creating corpora for discourse analysis from old paper documents and we present a semi-automatic method for the detection and correction of OCR errors. The outcome of this work consists of a set of rules which are, eventually, valid for different contexts and applicable to different datasets. 
The proposed procedure, in terms of computational readability, aims to make the vast array of historical text corpora more readable and searchable, since these corpora are, at the moment, only partially usable given the high error rate introduced by OCR software.</p><p>In Figure <ref type="figure">1</ref>, we show an overview of the proposed approach, which consists of three main steps: corpus creation, OCR error detection and correction, and application of the automatic rules. The details of each step are described in the following sections.</p><p>The remainder of the paper is organized as follows: in Section 2, we describe our case study and the choice of the corpora; in Section 3, we present a brief overview of the state of the art in OCR correction, describe our proposal for error detection and correction, and analyze the preliminary results; in Section 4, we give our final remarks and discuss future work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Case Study: Searching for Metaphors to Represent Immigrants</head><p>Our case study focuses on the analysis of the metaphors used in newspapers to represent migration to/from the United States of America and Italy from a diachronic perspective, between the beginning of the XX century and the beginning of the XXI century. Given the vast amount of documents available, we needed to define a criterion in order to select a representative sample of documents that allows for the comparison required by the type of discourse analysis which is the object of our work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Choice of the Corpus</head><p>In order to reduce the amount of data to collect, we selected two moments in history which represent two sampling points in time with a significant value in relation to migratory movements: 1900-1914 and 2000-2014. The decision to focus on these time periods lies in the fact that they represent important moments for migratory movements, for both the USA and Italy:</p><p>-As for the USA:</p><p>• 1900-1914: intense migration movements to the USA, particularly from Europe <ref type="bibr" target="#b4">[5]</ref>; • 2000-2010: the decade of highest immigration to the USA.</p><p>-As for Italy:</p><p>• 1900-1914: an intense period of emigration and internal migration <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b5">6]</ref>;</p><p>• 2000-2014: a dramatic increase in immigration, which led to the "2015 migration crisis" <ref type="bibr" target="#b6">[7]</ref>.</p><p>The availability of data, the newspapers' political leaning, and the registration fees were additional constraints that narrowed the range of options down to three newspapers.</p><p>For the USA, we selected the New York Herald<ref type="foot" target="#foot_0">4</ref>, for the time period 1900-1914, and the New York Times<ref type="foot" target="#foot_1">5</ref>, for the time period 2000-2014, because they are both examples of quality press published in New York.<ref type="foot" target="#foot_2">6</ref> Even though analysing the same newspaper would have been preferable for reasons of homogeneity and integrity, we could not find an American newspaper available for both time periods.</p><p>Regarding Italy, we selected La Stampa,<ref type="foot" target="#foot_3">7</ref> a newspaper belonging to the category of quality press and published in Turin, a crossroads for many migration routes, both internal and from foreign countries. 
Fortunately, La Stampa provides a digital archive of all of its daily editions from 1867 to the present day.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Selecting Search Terms</head><p>Having chosen the newspapers and the historical periods, we needed to select the keywords to filter the articles useful for our study. We decided to use search terms to sort and collect the newspaper articles. On the one hand, not using search terms would have produced more versatile corpora that could be used for other research purposes. On the other hand, a corpus collected by narrowing down the amount of texts retrieved using search terms is more manageable. In addition, as shown by <ref type="bibr" target="#b7">[8]</ref>, the idiosyncrasies of the online database from which texts are retrieved sometimes pose limitations.</p><p>When compiling a specialised corpus using keywords, there is a trade-off between precision and recall. That is, there is a tension between, on the one hand, creating a corpus in which all the texts are relevant, but which does not contain all relevant texts available in the database, and, on the other, creating a corpus which does contain all available relevant texts, albeit at the expense of irrelevant texts also being included. Seen from a different perspective, the trade-off is between a corpus that can be deemed incomplete, and one which contains noise (i.e. irrelevant texts). 
Therefore, considering our research purposes, which essentially consisted in identifying metaphors of migration, we decided to define a set of search terms to retrieve texts, in order to create a specialised corpus.</p><p>In particular, given the task of identifying metaphors of migration, we needed to identify a set of search terms which would fit both time periods, be comparable, and denote the same meaning <ref type="bibr" target="#b15">[16]</ref>.</p><p>English keywords As for English, the starting point was the set of words identified by <ref type="bibr" target="#b7">[8]</ref> under the acronym RASIM: refugee*,<ref type="foot" target="#foot_4">8</ref> asylum seeker*, immigrant*, and migrant*. We added a fifth word to this list: emigrant*. This set of words has, in fact, received great attention within corpus and discourse studies and is generally recognized as fully accounting for migration. In order to study the use of these words, we consulted two diachronic corpora: the Corpus of Historical American English (COHA)<ref type="foot" target="#foot_5">9</ref> and the US Supreme Court Opinions corpus.<ref type="foot" target="#foot_6">10</ref> The former consists of a collection of 400 million words from a balanced set of sources. The latter contains approximately 130 million words in 32,000 Supreme Court decisions from the 1790s to the present. We also used the Corpus of Contemporary American English (COCA)<ref type="foot" target="#foot_7">11</ref>, which contains more than one billion words of text (25 million words per year, 1990-2019) of different genres, and the Sibol Corpus,<ref type="foot" target="#foot_8">12</ref> which contains newspaper data from 1993 to 2013. 
The comparison of the relative frequencies of the selected terms (the full table is not displayed for space reasons) shows that, in particular, the two terms emigrant and immigrant changed in relative frequency across time: the term immigrant was less frequent in the past than in the present, while emigrant was more frequent in the past. Asylum seeker and refugee are two relatively recent terms (at least in their use).</p><p>Italian keywords As for Italian, we needed to select a set of comparable search terms between English and Italian <ref type="bibr" target="#b15">[16]</ref>. We initially checked different sources (newspaper articles, glossaries, books on migration) in order to identify a preliminary list of plausible candidate query terms. We identified the following words: migrante/i, immigrato/i/a/e, immigrante/i, emigrante/i, emigrato/i/a/e as translations of migrant/s, immigrant/s, and emigrant/s, and rifugiato/i/a/e, profugo/i/a/e, clandestino/i/a/e and richiedente/i asilo as translations of refugee/s and asylum seeker/s. We excluded<ref type="foot" target="#foot_9">13</ref> straniero/i/a/e (foreigner) because, as argued by <ref type="bibr" target="#b14">[15]</ref>, it is used more in its adjectival function than as a noun, and this would introduce data into the corpus which are not relevant for our research purposes. We consulted four different corpora: the diachronic Diacoris Corpus,<ref type="foot" target="#foot_10">14</ref> a 15 million word collection of written Italian texts produced between 1861 and 1945; the Paisà corpus,<ref type="foot" target="#foot_11">15</ref> a 250 million word corpus of Italian web texts produced in 2010; ItTenTen16,<ref type="foot" target="#foot_12">16</ref> a 5 million word collection of Italian web texts produced in 2016; and La Repubblica,<ref type="foot" target="#foot_13">17</ref> a 380 million word corpus of Italian newspaper texts published between 1985 and 2000. 
These corpora can be regarded as representative datasets of the Italian language, covering both the 20th and 21st centuries, because together they span more than 150 years, from 1861 to 2016. Focusing on the aforementioned terms, we looked at the most frequent words over time in order to define a representative set of search terms for both the past and the present. In this way, we discarded terms which did not have a significant relative frequency. The comparison of the relative frequencies of the selected terms (the full table is not displayed for space reasons) in the aforementioned corpora shows that the best candidate translations for migrant, immigrant and emigrant were migrant*, immigrat*, immigrant*, emigrant*, emigrat*; for refugee and asylum seeker the candidate Italian terms were rifugiat*, profug*, clandestin* and richiedent* asil*.</p></div>
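The wildcard notation used for the search terms (e.g. refugee*, immigrat*) can be expanded mechanically into whole-word regular expressions for retrieval. The paper does not show its retrieval code; the following Python sketch is a hypothetical illustration (function and variable names are ours, not the authors'):

```python
import re

# English search terms from the paper; '*' stands for any word-final
# suffix (plural, inflection, gender ending for Italian).
SEARCH_TERMS = ["refugee*", "asylum seeker*", "immigrant*", "migrant*", "emigrant*"]

def term_to_regex(term: str) -> re.Pattern:
    """Turn a wildcard search term into a whole-word, case-insensitive regex."""
    pattern = re.escape(term).replace(r"\*", r"\w*")
    return re.compile(r"\b" + pattern + r"\b", re.IGNORECASE)

PATTERNS = [term_to_regex(t) for t in SEARCH_TERMS]

def is_relevant(text: str) -> bool:
    """An article is kept if it matches at least one search term."""
    return any(p.search(text) for p in PATTERNS)

articles = [
    "Thousands of immigrants arrived at Ellis Island.",
    "The stock market closed higher today.",
    "Asylum seekers waited at the border.",
]
corpus = [a for a in articles if is_relevant(a)]  # keeps the first and third
```

Note that a filter like this embodies the precision/recall trade-off discussed above: tightening the term list drops irrelevant texts at the cost of missing relevant ones.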
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Corpora Statistics</head><p>After the identification of the two sets of query terms, we compiled four datasets.</p><p>In Table <ref type="table" target="#tab_0">1</ref>, we show a summary of the statistics for each corpus. The token and type values represent the total number of occurrences and the number of unique words, respectively. We report the type/token ratio (TTR), which serves as an indicator of lexical diversity <ref type="bibr" target="#b2">[3]</ref>. The differences between the older and the newer datasets were unexpectedly large and are very unlikely to be due to chance. As shown in Table <ref type="table" target="#tab_0">1</ref>, the older datasets, relative to the period 1900-1914, show a dramatically higher TTR.</p><p>A careful analysis of a sample of texts showed that both of the old corpora contained many misspellings or non-meaningful words caused by the OCR software which produced those documents. For example, there are many occurrences of the sequence tbe instead of the in the English corpus, as well as many occurrences of cho instead of che (that) in the Italian corpus. In the following section, we describe the semi-automatic procedure for the detection and correction of these OCR errors.</p></div>
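The TTR reported in Table 1 is simply the number of types divided by the number of tokens. A minimal Python sketch (the naive tokenizer is our assumption, not the authors' preprocessing) also shows why OCR noise inflates the ratio:

```python
import re

def ttr(text: str) -> float:
    """Type/token ratio: unique word forms over total word occurrences."""
    tokens = re.findall(r"[a-zà-ù]+", text.lower())  # naive tokenizer (assumption)
    return len(set(tokens)) / len(tokens)

# Each misspelling of a frequent word ("tbe", "tne" for "the") adds a new
# type without adding a new token, so noisy OCR output has a higher TTR.
clean = "the cat saw the dog and the bird"
noisy = "the cat saw tbe dog and tne bird"
assert ttr(noisy) > ttr(clean)
```

This is exactly the effect visible in Table 1, where the OCR-derived 1900-1914 corpora show TTRs several times higher than the born-digital 2000-2014 ones.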
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">OCR Error Detection and Correction</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Background</head><p>As argued by <ref type="bibr" target="#b11">[12]</ref>, there are two types of errors in an OCR-scanned document:</p><p>-Non-word errors: invalid lexicon entries which are not attested in any dictionary of the analysed language; -Real-word errors: valid words which occur in the wrong context.</p><p>The former are easier to identify but more difficult to correct. Conversely, the latter are easier to correct but more difficult to identify.</p><p>The main approaches to OCR post-processing error correction are 1. a dictionary-based approach, which aims at the correction of isolated errors without considering the context <ref type="bibr" target="#b0">[1]</ref>;</p><p>2. a context-based approach, which takes into account the grammatical context of occurrence <ref type="bibr" target="#b10">[11]</ref>.</p><p>The former approach is not able to capture all the real-word errors. For example, the English expression a flood of irritants is not recognized as an error because all the words are part of the dictionary. However, analyzing the context, it should be corrected to a flood of immigrants. The latter approach intends to overcome the problems of the dictionary-based one; however, it requires more effort in terms of time and energy and achieves a lower level of automation. Moreover, the procedures which are generally adopted to overcome OCR errors <ref type="bibr" target="#b16">[17,</ref><ref type="bibr" target="#b0">1]</ref> do not work properly in these particular cases, because these methods make use of the linguistic context, which is itself compromised and uncorrected. For this reason, it is necessary to develop a semi-automatic approach which mixes quantitative and qualitative methodologies.</p></div>
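The distinction above can be made concrete with a toy illustration (this is not the authors' code, and the tiny dictionary is invented for the example): a purely dictionary-based detector flags non-word errors but is blind to real-word errors such as irritants for immigrants.

```python
# Toy vocabulary for the example; a real system would use a full lexicon.
DICTIONARY = {"a", "flood", "of", "irritants", "immigrants", "the", "came"}

def nonword_errors(tokens):
    """Dictionary-based detection: flags only out-of-vocabulary tokens."""
    return [t for t in tokens if t not in DICTIONARY]

# 'tbe' is a non-word error and is caught...
assert nonword_errors(["tbe", "flood"]) == ["tbe"]
# ...but 'irritants' for 'immigrants' is a real-word error: every token is
# in the dictionary, so nothing is flagged without looking at the context.
assert nonword_errors(["a", "flood", "of", "irritants"]) == []
```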
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Our Proposal</head><p>In this paper, we propose a semi-automatic mixed approach to OCR detection which brings together the dictionary-based and the context-based approaches.</p><p>The first problem in our case study concerns the fact that we did not have the corresponding ground truth version of the corpora. Therefore, we decided to use the contemporary corpora, where the text was digital from the beginning. The error detection and correction task consisted of a three-step procedure:</p><p>1. Definition of a list of plausible error candidates by comparing the word list of the old corpus with that of the new corpus. The words that do not appear in the latter, or that have a statistically significant difference in frequency, compose a list of plausible error candidates; for example, the previously mentioned expressions tbe or te in the English dataset, and cho and olla in the Italian dataset. Subsequently, each error candidate was qualitatively analysed by manual observation through concordance lines, within its context of occurrence, in order to verify whether it was an error. Lastly, a list of detected errors was produced. 2. Analysis and categorization of the errors in the list of candidates. Each error is categorised according to three categories: i) Standard mapping: the error contains the same number of characters as the respective correct form, for example 'hear' (correct) vs 'jear' (error); ii) Non-standard mapping: the error contains a larger or smaller number of characters than the correct form, for example 'main' (correct) vs 'rnain' (error); iii) Split errors: the word is interpreted by the OCR as two distinct words, for example 'department' vs 'depart' and 'ment'. This is a very common error when digitizing newspapers because of the shape of the columns in which articles are written. 3. Definition of the error correction rule as a regular expression to match the pattern of the error (i.e. 
jear) and substitute it with the (supposedly) correct form (i.e. hear).<ref type="foot" target="#foot_14">18</ref></p></div>
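Step 3 above amounts to an ordered list of pattern/replacement pairs. The following Python sketch is a hypothetical rendering of such rules (the paper implements them in R; the error/correction pairs are taken from the paper's own examples):

```python
import re

# Each manually validated error becomes a regular-expression rule mapping
# the error pattern, as a whole word, to its correction.
RULES = [
    (re.compile(r"\btbe\b"), "the"),    # standard mapping, English
    (re.compile(r"\bcho\b"), "che"),    # standard mapping, Italian
    (re.compile(r"\brnain\b"), "main"), # non-standard mapping
]

def apply_rules(text: str) -> str:
    """Apply every correction rule to the text, in order."""
    for pattern, correction in RULES:
        text = pattern.sub(correction, text)
    return text

assert apply_rules("tbe rnain street") == "the main street"
```

The word-boundary anchors (`\b`) are important: without them a rule like tbe → the would also corrupt valid words that merely contain the error string.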
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">A 'Tidy' Implementation</head><p>The implementation of this procedure follows the principles described by <ref type="bibr" target="#b17">[18]</ref>, where the idea is to efficiently and effectively mine textual information from large text collections by means of pipelines, in order to allow for a sequential process of text analysis. For our experiments, we used the R programming language, which has a set of packages, named 'tidyverse',<ref type="foot" target="#foot_15">19</ref> that implements this idea of pipelines in a clear way. For space reasons, we do not describe the code in detail; the source code used in our experiments will be made available online.<ref type="foot" target="#foot_16">20</ref> </p></div>
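The authors' pipeline is written in R with tidyverse pipes; as a language-neutral illustration of the same pipelined style (not the paper's code), here is a pure-Python sketch in which each processing step is a small chained transformation:

```python
from collections import Counter
from functools import reduce

def pipe(value, *steps):
    """Thread a value through a sequence of functions, tidyverse-style."""
    return reduce(lambda acc, fn: fn(acc), steps, value)

docs = ["tbe cat", "cho cosa", "the dog"]  # toy documents

freq = pipe(
    docs,
    lambda ds: (tok for d in ds for tok in d.split()),  # tokenize
    lambda toks: (t.lower() for t in toks),             # normalize
    Counter,                                            # count word types
)
```

The benefit, as in the tidyverse, is that the whole analysis reads top to bottom as one sequential process, which makes the experiments easier to reproduce.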
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4">Post-hoc analysis</head><p>A total of 476 errors for English and 80 errors for Italian were manually identified and, respectively, as many correcting rules were written for each language.<ref type="foot" target="#foot_17">21</ref> The automatic application of the rules produced 722,371 substitutions for English and 99,255 substitutions for Italian. As Table <ref type="table" target="#tab_1">2</ref> shows, for both the American and the Italian corpora, the numbers of tokens and types changed moderately. As a general comment, it is not easy to predict in what way OCR correction will affect these counts. On the one hand, an increase in the number of tokens may happen because many errors, such as p/r or th/, were not previously recognized as valid tokens. On the other hand, the number of types is in general reduced for both corpora, since different errors are mapped to the same type. For example, the English article the was misspelled in many ways: tne, tha, tbe, tna. These four words are counted as four different types; by correcting all of them to the, the number of types is reduced by three. Similarly, the Italian feminine article la was misspelled as jd, ja, ln. These three words are counted as three different types; by correcting all of them to la, the number of types is reduced by two. The correction task was repeated four times for the English corpus and twice for the Italian one. Ambiguous and dubious cases, such as ii and ih in Italian, which could have multiple plausible corrections, were not corrected, so as not to compromise the validity of the corpora.</p></div>
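The post-hoc check amounts to counting tokens and types before and after applying the corrections, as in Table 2. A minimal Python sketch of that bookkeeping (the word lists here are toy data, not the paper's corpora):

```python
def stats(tokens):
    """Token and type counts for a list of word tokens."""
    return {"tokens": len(tokens), "types": len(set(tokens))}

MISSPELLINGS = {"tne", "tha", "tbe", "tna"}  # variants of "the" from the paper

before = ["tne", "tha", "tbe", "tna", "cat"]
after = ["the" if t in MISSPELLINGS else t for t in before]

assert stats(before) == {"tokens": 5, "types": 5}
# Four misspellings collapse onto one type, so the type count drops by three
# while the token count is unchanged.
assert stats(after) == {"tokens": 5, "types": 2}
```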
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Final Remarks and Future Work</head><p>In this paper, we described a procedure for collecting and creating corpora for discourse analysis from old paper documents and we presented a semi-automatic method for the detection and correction of OCR errors. The semi-automatic approach for correcting OCR errors developed for this project has proved to be effective. Even though the rules produced for the corrections may be less useful with other corpora, the methodology itself is applicable to different contexts.</p><p>We are currently investigating how to evaluate the set of rules developed for the current project on other corpora, in order to verify whether it is successfully applicable to different contexts. Secondly, we will replicate the same methodology on other sets of documents and produce different sets of rules, to be compared with the ones developed for the current work. Our aim is to create rules which can be reused by everyone for correcting their own OCR-processed documents. Given the number of substitutions (for English we were close to a million substitutions), it is important to understand the number of false positives introduced. In this sense, we will explore how to evaluate the rules in a semi-automatic way and produce a ground truth. Recent papers have explored advanced automatic corrections based on edit distances, n-grams and neural models <ref type="bibr" target="#b11">[12]</ref>. They are indeed successful, but they all introduce some kind of error that may affect the qualitative analysis that CADS needs.</p><p>There are still open questions that we will investigate in this line of work: for example, how many documents have we missed during the compilation of the corpus, given that a search keyword may be subject to OCR errors as well? How can these types of keyword search errors affect a CADS analysis? 
For this reason, following <ref type="bibr" target="#b3">[4]</ref>, we intend to use error models as a means to measure the relative risk, posed by OCR errors, of a mismatch between search terms and the targeted resources. We also want to compare our analysis with recent approaches that make use of BERT pre-trained neural networks for post-hoc error correction <ref type="bibr" target="#b12">[13]</ref>, especially in those cases where the context is not clear given multiple OCR errors in the same paragraph.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 .</head><label>1</label><figDesc>Statistics about each corpus with Type/Token ratio (TTR)</figDesc><table><row><cell>Corpus</cell><cell cols="2">Years Documents</cell><cell>Tokens</cell><cell>Types TTR</cell></row><row><cell cols="2">New York Herald 1900-1914</cell><cell cols="3">9,119 64,061,101 3,085,080 4.82%</cell></row><row><cell>La Stampa</cell><cell>1900-1914</cell><cell cols="3">3,092 19,396,796 899,688 4.64%</cell></row><row><cell cols="2">New York Times 2000-2014</cell><cell cols="3">125 58,915,060 308,251 0.52%</cell></row><row><cell>La Stampa</cell><cell>2000-2014</cell><cell cols="3">62 15,324,728 282,318 1.84%</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 .</head><label>2</label><figDesc>Statistics about errors before and after OCR corrections.</figDesc><table><row><cell></cell><cell cols="4">Before OCR correction After OCR correction</cell><cell>Difference</cell></row><row><cell>Corpus</cell><cell>Tokens</cell><cell>Types</cell><cell>Tokens</cell><cell cols="2">Types ∆ Tokens ∆ Types</cell></row><row><cell cols="6">NY Herald 1900-1914 64,061,101 3,085,080 64,246,208 3,082,880 +0.29% -0.04%</cell></row><row><cell cols="6">La Stampa 1900-1914 19,396,796 899,688 19,396,558 899,676 ∼-0.0% ∼-0.0%</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3 .</head><label>3</label><figDesc>Examples of standard and non-standard errors and corrections.</figDesc><table><row><cell>Type</cell><cell>Error</cell><cell>Correction</cell></row><row><cell>Standard</cell><cell>olla</cell><cell>alla</cell></row><row><cell></cell><cell>alia</cell><cell>alla</cell></row><row><cell></cell><cell>cho</cell><cell>che</cell></row><row><cell></cell><cell>cne</cell><cell>che</cell></row><row><cell></cell><cell>ohe</cell><cell>che</cell></row><row><cell></cell><cell>die</cell><cell>che</cell></row><row><cell></cell><cell>clic</cell><cell>che</cell></row><row><cell></cell><cell>clie</cell><cell>che</cell></row><row><cell></cell><cell>Clie</cell><cell>che</cell></row><row><cell>Non-standard</cell><cell>colleglli</cell><cell>colleghi</cell></row><row><cell></cell><cell>eia</cell><cell>da</cell></row><row><cell></cell><cell>eli</cell><cell>di</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_0">http://chroniclingamerica.loc.gov</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_1">http://www.lexisnexis.com</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_2">By quality press we mean the more accurate newspapers which give detailed accounts of events, as well as reports on business, culture, and society, in contrast with tabloid newspapers, which are more devoted to giving sensational news.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_3">http://www.archiviolastampa.it/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_4">We use the symbol '*' to indicate the possibility of plural, or feminine/masculine for the Italian words.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="9" xml:id="foot_5">https://www.english-corpora.org/coha/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="10" xml:id="foot_6">https://www.english-corpora.org/scotus/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="11" xml:id="foot_7">https://www.english-corpora.org/coca/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="12" xml:id="foot_8">https://www.sketchengine.eu/sibol-corpus/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="13" xml:id="foot_9">Our main interest is in the people who actually migrate, not in the issue of migration in general.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="14" xml:id="foot_10">http://corpora.dslo.unibo.it/DiaCORIS/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="15" xml:id="foot_11">https://www.corpusitaliano.it/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="16" xml:id="foot_12">https://www.sketchengine.eu/ittenten-italian-corpus/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="17" xml:id="foot_13">https://corpora.dipintra.it/public/run.cgi/first_form</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="18" xml:id="foot_14">Ambiguous and dubious cases, where two or more plausible corrections were available, were not inserted in the list, to avoid compromising the validity of the corpora.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="19" xml:id="foot_15">https://www.tidyverse.org</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="20" xml:id="foot_16">https://github.com/gmdn</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="21" xml:id="foot_17">In this analysis, we excluded the split errors, since this type of error requires a longer evaluation procedure given the amount of false positives.</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">OCR post-processing error correction algorithm using google online spelling suggestion</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Bassil</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Alwani</surname></persName>
		</author>
		<idno>CoRR abs/1204.0191</idno>
		<ptr target="http://arxiv.org/abs/1204.0191" />
		<imprint>
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Assessing the impact of ocr errors in information retrieval</title>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">T</forename><surname>Bazzo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">A</forename><surname>Lorentz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Suarez Vargas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">P</forename><surname>Moreira</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Information Retrieval</title>
				<editor>
			<persName><forename type="first">J</forename><forename type="middle">M</forename><surname>Jose</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">E</forename><surname>Yilmaz</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Magalhães</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Castells</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><forename type="middle">J</forename><surname>Silva</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">F</forename><surname>Martins</surname></persName>
		</editor>
		<meeting><address><addrLine>Cham</addrLine></address></meeting>
		<imprint>
			<publisher>Springer International Publishing</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="102" to="109" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<author>
			<persName><forename type="first">V</forename><surname>Brezina</surname></persName>
		</author>
		<title level="m">Statistics in corpus linguistics: A practical guide</title>
				<imprint>
			<publisher>Cambridge University Press</publisher>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Impact of OCR errors on the use of digital libraries: Towards a better access to information</title>
		<author>
			<persName><forename type="first">G</forename><surname>Chiron</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Doucet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Coustaty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Visani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Moreux</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ACM/IEEE Joint Conference on Digital Libraries (JCDL)</title>
				<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="1" to="4" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><forename type="first">R</forename><surname>Cohen</surname></persName>
		</author>
		<idno type="DOI">10.1017/CBO9780511598289</idno>
		<ptr target="https://doi.org/10.1017/CBO9780511598289" />
		<title level="m">The Cambridge Survey of World Migration</title>
				<imprint>
			<publisher>Cambridge University Press</publisher>
			<date type="published" when="1995">1995</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Storia dell&apos;immigrazione straniera in Italia</title>
		<author>
			<persName><forename type="first">M</forename><surname>Colucci</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2019">2019</date>
			<publisher>Carocci</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<author>
			<persName><forename type="first">E</forename><surname>Comte</surname></persName>
		</author>
		<title level="m">The history of the European migration regime: Germany&apos;s strategic hegemony</title>
				<imprint>
			<publisher>Routledge</publisher>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Selecting query terms to build a specialised corpus from a restricted-access database</title>
		<author>
			<persName><forename type="first">C</forename><surname>Gabrielatos</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ICAME Journal</title>
		<imprint>
			<biblScope unit="volume">31</biblScope>
			<biblScope unit="page" from="5" to="44" />
			<date type="published" when="2007-04">Apr 2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<title level="m" type="main">Senza attraversare le frontiere: le migrazioni interne dall&apos;unità a oggi</title>
		<author>
			<persName><forename type="first">S</forename><surname>Gallo</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2012">2012</date>
			<publisher>Gius. Laterza &amp; Figli Spa</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Old content and modern tools - searching named entities in a Finnish OCRed historical newspaper collection 1771-1910</title>
		<author>
			<persName><forename type="first">K</forename><surname>Kettunen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Mäkelä</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Ruokolainen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kuokkala</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Löfberg</surname></persName>
		</author>
		<ptr target="http://www.digitalhumanities.org/dhq/vol/11/3/000333/000333.html" />
	</analytic>
	<monogr>
		<title level="j">Digit. Humanit. Q</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="issue">3</biblScope>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m" type="main">OCR error correction using character correction and feature-based word classification</title>
		<author>
			<persName><forename type="first">I</forename><surname>Kissos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Dershowitz</surname></persName>
		</author>
		<idno>CoRR abs/1604.06225</idno>
		<ptr target="http://arxiv.org/abs/1604.06225" />
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Deep statistical analysis of OCR errors for effective post-OCR processing</title>
		<author>
			<persName><forename type="first">T</forename><surname>Nguyen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Jatowt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Coustaty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Nguyen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Doucet</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ACM/IEEE Joint Conference on Digital Libraries (JCDL)</title>
				<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="29" to="38" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Neural machine translation with BERT for post-OCR error detection and correction</title>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">T H</forename><surname>Nguyen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Jatowt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">V</forename><surname>Nguyen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Coustaty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Doucet</surname></persName>
		</author>
		<idno type="DOI">10.1145/3383583.3398605</idno>
		<ptr target="https://doi.org/10.1145/3383583.3398605" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020</title>
				<meeting>the ACM/IEEE Joint Conference on Digital Libraries in 2020<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="333" to="336" />
		</imprint>
	</monogr>
	<note>JCDL &apos;20</note>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Patterns and meanings in discourse: Theory and practice in corpus-assisted discourse studies (CADS)</title>
		<author>
			<persName><forename type="first">A</forename><surname>Partington</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Duguid</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Taylor</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="s">Studies in Corpus Linguistics</title>
		<imprint>
			<biblScope unit="volume">55</biblScope>
			<date type="published" when="2013">2013</date>
			<publisher>John Benjamins</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">The representation of immigrants in the Italian press</title>
		<author>
			<persName><forename type="first">C</forename><surname>Taylor</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">CIRCaP Occasional Papers</title>
		<imprint>
			<biblScope unit="volume">21</biblScope>
			<biblScope unit="page" from="1" to="40" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Investigating the representation of migrants in the UK and Italian press: A cross-linguistic corpus-assisted discourse analysis</title>
		<author>
			<persName><forename type="first">C</forename><surname>Taylor</surname></persName>
		</author>
		<idno type="DOI">10.1075/ijcl.19.3.03tay</idno>
		<ptr target="http://sro.sussex.ac.uk/id/eprint/50044/" />
	</analytic>
	<monogr>
		<title level="j">International Journal of Corpus Linguistics</title>
		<imprint>
			<biblScope unit="volume">19</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="368" to="400" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">A statistical approach to automatic OCR error correction in context</title>
		<author>
			<persName><forename type="first">X</forename><surname>Tong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">A</forename><surname>Evans</surname></persName>
		</author>
		<ptr target="https://www.aclweb.org/anthology/W96-0108" />
	</analytic>
	<monogr>
		<title level="m">Fourth Workshop on Very Large Corpora</title>
				<imprint>
			<date type="published" when="1996">1996</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Text Analysis Pipelines - Towards Ad-hoc Large-Scale Text Mining</title>
		<author>
			<persName><forename type="first">H</forename><surname>Wachsmuth</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-319-25741-9</idno>
		<ptr target="https://doi.org/10.1007/978-3-319-25741-9" />
	</analytic>
	<monogr>
		<title level="s">Lecture Notes in Computer Science</title>
		<imprint>
			<biblScope unit="volume">9383</biblScope>
			<date type="published" when="2015">2015</date>
			<publisher>Springer</publisher>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
