<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Who Advertises in Newspapers? Data Criticism in Mining Historical Job Ads</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Klara</forename><surname>Venglarova</surname></persName>
							<email>klara.venglarova@uni-graz.at</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Graz</orgName>
								<address>
									<country key="AT">Austria</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Raven</forename><surname>Adam</surname></persName>
							<email>raven.adam@uni-graz.at</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Graz</orgName>
								<address>
									<country key="AT">Austria</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Wiltrud</forename><surname>Mölzer</surname></persName>
							<email>wiltrud.moelzer@uni-graz.at</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Graz</orgName>
								<address>
									<country key="AT">Austria</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Saranya</forename><surname>Balasubramanian</surname></persName>
							<email>saranya.balasubramanian@uni-graz.at</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Graz</orgName>
								<address>
									<country key="AT">Austria</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Jörn</forename><surname>Kleinert</surname></persName>
							<email>joern.kleinert@uni-graz.at</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Graz</orgName>
								<address>
									<country key="AT">Austria</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Manfred</forename><surname>Füllsack</surname></persName>
							<email>manfred.fuellsack@uni-graz.at</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Graz</orgName>
								<address>
									<country key="AT">Austria</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Georg</forename><surname>Vogeler</surname></persName>
							<email>georg.vogeler@uni-graz.at</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Graz</orgName>
								<address>
									<country key="AT">Austria</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Who Advertises in Newspapers? Data Criticism in Mining Historical Job Ads</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">B2BE1345F2F4E7191731491DDE459899</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T19:49+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>digitized newspapers, historical job advertisements, historical labour market, data criticism, optical character recognition, page segmentation, post-processing, ground truth</term>
					<term>0009-0007-6441-7795 (K. Venglarova)</term>
					<term>0000-0001-7841-2601 (R. Adam)</term>
					<term>0009-0002-9517-4531 (W. Mölzer)</term>
					<term>0000-0001-7516-7671 (S. Balasubramanian)</term>
					<term>0000-0002-1167-9245 (J. Kleinert)</term>
					<term>0000-0002-7772-4061 (M. Füllsack)</term>
					<term>0000-0002-1726-1712 (G. Vogeler)</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Digitized newspapers are a source of unique and rich historical data but pose significant challenges in the interpretation of results obtained through their mining. The JobAds project (FWF P35783) explores the evolution of the labor market through job advertisements from digitized newspapers between 1850 and 1950, aiming to reveal regional and temporal trends in job offers, required skills, media strategies, and social aspects such as gender-specific ads. Using the ANNO corpus, we selected 29 newspapers with the most editions. Their processing involved preselection of pages containing job ads, layout segmentation, optical character recognition (OCR), and post-correction, each introducing potential biases due to the varying efficiency of these processes. Additionally, the inherent bias of newspapers as historical sources must be considered, as they reflect only a subset of the job market dynamics of their time. This paper identifies these biases, quantifies their impact, and proposes solutions for steps from corpus selection to data preparation for subsequent text-mining and analysis. We discuss and exemplify the implications of these biases on research outcomes and suggest methodological adjustments to mitigate their effects, ensuring more reliable insights into the historical labor market. We also make a dataset of nearly 15 000 manually annotated job ads available as ground-truth data as part of this paper.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Digitized newspapers as a source of data bring many opportunities but also many challenges and pitfalls. In the JobAds project (FWF P35783), we investigate the evolution of the Austrian labor market through historical job advertisements from digitized newspapers between 1850 and 1950. Through their analysis, we aim to gain insights into the regional and temporal trends in positions offered and sought, the skills and qualifications required and offered, media strategies, and social aspects such as gender-specific job offers.</p><p>Our project aims to cover as much as possible of the Austrian labor market in the defined time period. In this paper, we discuss the difficulties which typically arise when one tries to access data from digitized newspapers. We refer to these difficulties as 'biases', as they have great potential to skew results and their interpretation.</p><p>We used the 29 periodicals (see Appendix) with the largest number of editions within our time span from the ANNO corpus <ref type="bibr" target="#b15">[16]</ref>. 
From each newspaper, we preselected pages containing job advertisements and converted them to machine-readable text through page segmentation, optical character recognition (OCR), and post-correction.</p><p>Each processing step introduces potential biases: the selection of periodicals may not be representative, a classifier may systematically misclassify pages containing certain types of ads, segmentation quality may vary with layout complexity or scan quality, OCR may be affected by font use or style particularities such as white letters on a black background, and post-correction effectiveness may vary between newspapers or years because of different typical mistakes.</p><p>Biases also arise from the historical context of newspapers, which were only one of several channels realizing matches between job seekers and vacancies <ref type="bibr" target="#b24">[25]</ref>. Some jobs were advertised through newspapers more often than others, since highly specialized employees were harder to find within a short radius than, e.g., unqualified workers, who could be placed through a bourse system <ref type="bibr" target="#b12">[13]</ref>.</p><p>This paper addresses how we confront a research question with the technical and historical reality, from corpus selection to post-processing of the OCRed text. Section 2 describes related work, and section 3 provides details about dataset creation. Sections 4 and 5 discuss biases arising from historical context and processing steps respectively, and section 6 outlines future work and concludes the paper.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>Oberbichler and Pfanzelter <ref type="bibr" target="#b16">[17]</ref> discuss a large number of 'biases that come along with the processing and datafication of historical newspapers' (p.127) and illustrate them in a case study about return migration. In a keyword-based search, the first challenge is to select the right terms, which is hindered by word inflections and spelling variants, but also by semantic uncertainty, e.g. false positive hits if the term is too broad or missed occurrences if the term is too narrow. They show how missing data and varying OCR quality skew absolute frequencies, and therefore use relative frequencies instead. They also propose improving OCR quality with tools like Transkribus <ref type="bibr" target="#b17">[18]</ref> and argue for adding metadata, contextualizing the source documents, providing information about the limitations of the collection, and offering additional tools in the interfaces.</p><p>Wijfjes <ref type="bibr" target="#b27">[28]</ref> discusses the relationship between traditional humanities and new digital methods. The most prominent obstacle in using digitized newspapers as a research source is the incompleteness of digitized collections, caused by factors such as costs, time or copyright issues. Relying on the available collections regardless of their broader context can result in working with a 'convenience sample' (p.16), closely related to 'digital laziness' (p.10) which arises from an overreliance on easily accessible digital information. 
The author also mentions unreliability in OCR and the need for 'complete and uniform data' (p.21).</p><p>The errors in Optical Layout Recognition (OLR) and Optical Character Recognition (OCR) are crucial problems in machine-readable text creation that have been addressed by several scholars ( <ref type="bibr" target="#b22">[23]</ref>; <ref type="bibr" target="#b26">[27]</ref>; <ref type="bibr" target="#b7">[8]</ref>; <ref type="bibr" target="#b5">[6]</ref>; <ref type="bibr" target="#b20">[21]</ref>). Noisy OCR or its varying quality across the corpus poses problems not only in a keyword search, but also in subsequent tasks, such as Part-of-Speech (POS) tagging, dependency parsing, Named Entity Recognition (NER) or topic modeling. Therefore, it is necessary to assess the influence of the OCR quality on the outcomes of NLP tasks, as discussed e.g. by ( <ref type="bibr">[22]</ref>; <ref type="bibr" target="#b19">[20]</ref>; <ref type="bibr" target="#b25">[26]</ref>).</p><p>Cordell <ref type="bibr" target="#b3">[4]</ref> discusses digitized collections at length in the broader context of the digitization process and the decisions that determine which material is digitized. He distinguishes between printed and digital editions and argues for taking digitized text 'seriously within its own medium' (p.217). Using the example of The Raven by E. A. Poe, he shows how the results of a keyword search depend on the quality of the OCR output.</p><p>We are aware of all these challenges and handle them where they occur. For example, working with relative frequencies of job titles can help to capture the changing demand for and supply of a specific position, rather than unintentionally identifying changing trends in advertising jobs in newspapers. Considering the previous research and our own experience, we search for strategies for mitigating the bias in our data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Dataset</head><p>Our dataset comprises 29 newspaper titles from the ANNO corpus, a collection of digitized newspapers provided by the Austrian National Library, from 1850-1950. The 29 newspapers were selected based on the largest number of editions, with a minimum publication period of 10 years. These newspapers are predominantly in German, containing small numbers of ads in French, English, Czech, Hungarian, Italian and other languages. Because of the time needed for processing the pages, we initially preselected pages containing job advertisements manually, based on observed patterns in the appearance of the advertisement sections, and later refined the preselection automatically using a transformer-based classifier.</p><p>From the preselected pages, we randomly sampled one page per year for each newspaper available for that year, resulting in 3 300 pages. On these pages, we manually annotated all job advertisements using the doccano software <ref type="bibr" target="#b14">[15]</ref>, resulting in 14 985 annotated job ads. These annotations serve as our ground-truth data. The annotated ads were OCRed with the frak2021 model <ref type="bibr" target="#b9">[10]</ref> and manually corrected using Transkribus.</p></div>
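<div xmlns="http://www.tei-c.org/ns/1.0"><p>The sampling step described above (one random page per newspaper and year) can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name, the record layout (newspaper, year, page identifier), the page identifiers and the fixed seed are all assumptions made for the example.</p><p>
```python
import random
from collections import defaultdict


def sample_one_page_per_year(pages, seed=0):
    """Pick one random page for every (newspaper, year) pair."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for newspaper, year, page_id in pages:
        groups[(newspaper, year)].append(page_id)
    # Sort keys and candidates so the result is reproducible for a given seed.
    return {key: rng.choice(sorted(groups[key])) for key in sorted(groups)}


# Hypothetical page records: (newspaper, year, page identifier).
pages = [
    ("Arbeiterwille", 1900, "aw-1900-p1"),
    ("Arbeiterwille", 1900, "aw-1900-p2"),
    ("Arbeiterwille", 1901, "aw-1901-p1"),
    ("Prager Tagblatt", 1900, "pt-1900-p1"),
]
sample = sample_one_page_per_year(pages)
```
</p><p>Grouping by newspaper and year before drawing keeps every title and year represented by exactly one page, regardless of how many preselected pages each one contributes.</p></div>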
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Biases from Historical Context</head><p>In this section, we explore biases arising from the historical context and the nature of newspapers as a medium. As these biases are beyond direct control, we have to gather comprehensive information to adjust our research questions based on available data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Newspapers as a Medium</head><p>The bias in our research does not start with the selection of the 29 newspaper titles; it starts with the decision to use newspapers as our primary data source to understand labor market development. The matches between the job seekers and vacancies were realized through several channels <ref type="bibr" target="#b4">[5]</ref>, with different proportions for various types of jobs.</p><p>To address this bias, it is crucial to quantify the extent to which job seekers and vacancies were matched through newspaper ads compared to other channels and to understand which jobs were underrepresented or missing in them. If sufficient facts concerning these aspects cannot be found, it is necessary to redefine the research goals. Instead of aiming to describe the entire labor market, we can narrow our scope to explore specific job categories, time periods, or regions.</p><p>For our period, approximately 30% of the matches between job seekers and employers were realized through job advertisements in newspapers <ref type="bibr" target="#b12">[13]</ref>. Other channels to find matches were personal contacts, asking around, the bourse system, municipal placement services and commercial brokers, which came increasingly under administrative control to reduce abuse and were finally replaced by the public job services. Search via newspaper ads dominated among white-collar jobs <ref type="bibr" target="#b4">[5]</ref>. Blue-collar workers were underrepresented in this channel <ref type="bibr" target="#b12">[13]</ref>. Based on these facts, we can aim to compare e.g. qualifications offered with requirements presented in job searches and job offers for white-collar workers.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Newspapers and their focus</head><p>The focus of newspapers should also be considered. To gain a comprehensive understanding of our data, we should address the following questions regarding the newspapers:</p><p>• What is the political orientation of the newspapers? • Which geographical area do they cover? • What is their temporal coverage? • What is their social focus? • Who are the intended readers?</p><p>Access to such meta-information about newspapers provides crucial context and facts. While no single newspaper can fully represent the labor market, a selection of heterogeneous titles covering various aspects can collectively offer a broader perspective. Our selected newspapers were issued in several geographical regions (e.g. Arbeiterwille in Graz, Arbeiter Zeitung in Wien, Linzer Volksblatt in Linz, Salzburger Chronik für Stadt und Land in Salzburg) and cover longer periods (see Fig. <ref type="figure" target="#fig_0">1</ref>). Most of them are daily newspapers, but some focus on travel (Fremden-Blatt) or workers (Arbeiterwille). We included newspapers with various orientations, such as social-democratic (e.g. Arbeiterwille, Arbeiterzeitung), liberal-democratic (Prager Tagblatt), or nationalistic (Salzburger Volksblatt: die unabhängige Tageszeitung für Stadt und Land Salzburg). However, achieving an ideal representation still remains extremely challenging due to real-world complexities. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Who Advertises in Newspapers?</head><p>To draw unbiased conclusions from job advertisements about the entire labor market, the underlying assumption would be that people offering and searching through job ads (or the job ads that we analyze) have the same characteristics of interest as job offerers or seekers who do not use newspaper ads. Since we often cannot know that, we need to ask who actually advertises in newspapers.</p><p>For instance, in a small town with only one factory, there may be no need to advertise for unqualified workers in newspapers. Similarly, workers in such towns may not need to advertise either, as they can rely on local announcements, neighbors or family. Consequently, these types of ads may be missing or underrepresented in our corpus. By contrast, if the factory seeks highly specialized personnel, it may expand its outreach by using newspapers.</p><p>A similar scenario applies to a city baker. If they seek someone locally, they might rely on the people in their surroundings. However, if they require an apprentice from the countryside, as in Fig. <ref type="figure" target="#fig_1">2</ref>, they would more likely advertise in newspapers. This highlights that advertisements involving a greater distance might be overrepresented in our corpus, e.g. geographical distance (a baker from a city seeks an apprentice from the countryside) or social distance (a higher-class person looking for a servant). In other cases, despite a short geographical distance in large cities, the complexity of urban society and anonymity play a role. Also, the ads in our corpus may be extreme from some point of view, e.g. seeking a highly qualified person, a factory with bad working conditions, high demand for workers for seasonal work, or a person who has struggled to find a job for a long time. On the other hand, some ads may be missing, e.g. 
from people who could not afford to pay for an ad in newspapers.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Biases from Data Processing</head><p>This section describes biases that data processing introduces, and presents strategies to mitigate them. Our processing pipeline contains steps from corpus selection to the cleaned data, which can be used for meaningful economic analysis.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Corpus creation</head><p>The initial step involves building a corpus. While in the ideal case, researchers would work with the entire population of data, practical limitations in terms of time and computation power or missing newspapers in digitized collections make this hardly feasible.</p><p>We come as close as we can to the entire population given our resource constraints by selecting the newspapers with the most editions that were issued for at least 10 years, which allows for a comparison over time. In corpus selection, attention needs to be paid to including heterogeneous newspapers that cover a wide scope. For further information, see subsections 4.1 and 4.2.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Preselection of Relevant Pages</head><p>Initially, we manually preselected pages likely to contain job advertisements by examining sampled issues from all newspapers and identifying patterns in the job ad sections. In a later stage, we refined this process using a transformer-based model for the same task, by fine-tuning the microsoft/dit-large-finetuned-rvlcdip model <ref type="bibr" target="#b8">[9]</ref> on our data. The fine-tuned model reached an F1 score of 0.88 and a recall of 0.89 on the testing data.</p><p>Preselection can be a dangerous process that can lead to excluding relevant data and bias the results. Our strategy against this pitfall was to aim for the highest recall possible, favoring the inclusion of non-relevant pages over the exclusion of relevant ones. Our model's results, which indicated that only about 34% of the preselected 4,000,000 pages actually contained job advertisements, give us confidence that we have captured most relevant pages.</p></div>
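<div xmlns="http://www.tei-c.org/ns/1.0"><p>The reported F1 score and recall are linked through the definition F1 = 2PR / (P + R), so the precision they imply can be recovered by rearranging it. The sketch below does this for the values stated above; the function names are illustrative, and because the reported scores are rounded, the resulting precision is only an approximation and is not a figure reported by the model evaluation itself.</p><p>
```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def precision_from_f1(f1, recall):
    """Solve F1 = 2PR / (P + R) for P, giving P = F1 * R / (2R - F1)."""
    return f1 * recall / (2 * recall - f1)


# Values reported for the fine-tuned page classifier on the testing data.
implied_precision = precision_from_f1(0.88, 0.89)
print(f"implied precision: {implied_precision:.3f}")
```
</p><p>With these inputs the implied precision is roughly 0.87, i.e. close to the recall, which fits a classifier tuned to include doubtful pages rather than drop them.</p></div>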
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.">Segmentation process</head><p>Page segmentation is a step which has a direct impact on the OCR quality ( <ref type="bibr" target="#b10">[11]</ref>; <ref type="bibr" target="#b11">[12]</ref>; <ref type="bibr" target="#b0">[1]</ref>; <ref type="bibr" target="#b2">[3]</ref>). However, segmentation quality is often assessed through subjective visual control <ref type="bibr" target="#b6">[7]</ref>, which does not offer detailed insights into the quality of the segmented data.</p><p>To address this task, we adopt a methodology for segmentation evaluation from <ref type="bibr" target="#b23">[24]</ref>. This method is not based only on the area of intersection between the annotated and predicted region, as this can be skewed if graphical elements or large blank spaces are present in the data. Instead, it also uses information about the presence of text in the non-intersecting parts of the predicted region and its ground truth.</p><p>We manually annotated nearly 15 000 job ads from our corpus to create ground truth data. We make the annotated data publicly available in the GitHub repository: https://github.com/JobAds-FWFProject/Ground-Truth-CHR2024. This allows us to identify newspapers with lower segmentation accuracy and generate additional ground-truth data to fine-tune our segmentation algorithm effectively. The results of the ongoing work on the evaluation of the segmentation quality will be published separately.</p></div>
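<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the limitation of a purely area-based comparison concrete, the sketch below computes intersection over union (IoU) for two axis-aligned boxes. This is only the area-only baseline that the text contrasts against, not the evaluation method of [24]; the box format (x0, y0, x1, y1) and the coordinates are assumptions chosen for illustration.</p><p>
```python
def box_area(box):
    """Area of an axis-aligned box (x0, y0, x1, y1); zero if it is empty."""
    x0, y0, x1, y1 = box
    return max(0, x1 - x0) * max(0, y1 - y0)


def intersection_over_union(a, b):
    """Area-only overlap score between two boxes, in the range 0..1."""
    overlap = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    inter = box_area(overlap)
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical case: the prediction contains the whole annotated ad
# plus an equally large blank margin below it.
annotated = (0, 0, 100, 50)
predicted = (0, 0, 100, 100)
iou = intersection_over_union(annotated, predicted)
```
</p><p>Here the IoU is 0.5 even though the predicted region contains the complete ad text, which is the kind of distortion by blank space that motivates also checking for text in the non-intersecting parts.</p></div>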
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4.">OCR Quality</head><p>OCR quality is the most prominent source of bias, as highlighted in prior research ( <ref type="bibr" target="#b16">[17]</ref>; <ref type="bibr" target="#b22">[23]</ref>; <ref type="bibr" target="#b26">[27]</ref>). Variations in OCR accuracy can lead to discrepancies in e.g. keyword searches and affect data reliability.</p><p>The first step to mitigate this bias is quantifying the OCR quality by e.g. a character error rate (see Fig. <ref type="figure" target="#fig_2">3</ref>). Although an approximation of the OCR quality can be obtained by checking words against a dictionary <ref type="bibr" target="#b20">[21]</ref>, we decided to manually check and correct a sample of the advertisements. This cleaned data allows us to (1) quantify the quality of the OCR, (2) obtain high-quality data for text-mining experiments and (3) learn about the most common recognition mistakes, which we can use for automatic post-correction. A pure dictionary-based approach would introduce a bias through words missing in the dictionary, e.g. abbreviations or names, both of which are frequent in our dataset. Based on our manually transcribed ground truth, our OCR reaches a SacreBLEU score of 67.5%, a word error rate of 30.6% and a character error rate of 12.2%. However, apart from this overall evaluation, it is crucial to compare the OCR quality across newspapers and years. </p></div>
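<div xmlns="http://www.tei-c.org/ns/1.0"><p>Character and word error rates are commonly defined as the edit distance between the OCR output and the reference, divided by the reference length. The following sketch implements that common definition with a plain Levenshtein distance; it is an assumption that this matches the tooling used for the figures above, which the text does not specify. The example strings are the first ground-truth and OCR pair from Table 1.</p><p>
```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Character-level edits divided by the reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


def word_error_rate(reference, hypothesis):
    """Word-level edits divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


reference = "Geübte Büglerinnen und Lehrmädchen"
ocr_output = "Geübte Bualerinnen und Lehrmädchen"
cer = character_error_rate(reference, ocr_output)
wer = word_error_rate(reference, ocr_output)
```
</p><p>In this pair, two substituted characters fall into one of four words, so the word error rate (0.25) is much higher than the character error rate. The same pattern is visible in the overall figures, where the word error rate exceeds the character error rate.</p></div>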
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.5.">Post-correction</head><p>Post-correction is the last step in our pipeline that affects data quality. Starting with a varying OCR quality across newspaper titles or years, the post-correction can either reduce or amplify the discrepancies in data quality. To quantify this problem, we measure the error rate after different post-correction steps on the samples that were manually checked and corrected. As this sample contains ads across all newspaper titles and years, we will be able to compare text quality before and after the post-correction step.</p><p>For the post-correction process, we fine-tune the hmbyt5-preliminary model<ref type="foot" target="#foot_0">1</ref> on the ICDAR2019-POCR dataset for OCR correction <ref type="bibr" target="#b18">[19]</ref>, which significantly improves text accuracy. On the ICDAR data, we reach a SacreBLEU score of 72.25% compared to only 10.83% achieved by the original OCRed text.</p><p>While the model significantly improves the quality of very poor OCR output, our OCR already reaches a higher score (67.5%) before the post-correction step. As there is a large discrepancy between the OCR quality of the training data and our OCR data, the post-correction model itself sometimes introduces new errors (see Tab. 1). Some mistakes concern only punctuation, some change letters, leading to words with no meaning, but some introduce a semantically different result, such as turning ironers (Büglerinnen) into painters (Malerinnen). We also need to consider that not all characters can be corrected through post-processing. For example, for incorrectly recognized numbers, the original information is lost and cannot be recovered by post-processing (see Fig. <ref type="figure" target="#fig_3">4</ref>). This potentially affects our ability to analyze details about e.g. offered salary where available. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.6.">Example: Looking for a Paperhanger</head><p>To illustrate how technical biases can affect historical data interpretation, we examine the demand for five different positions by comparing the frequency of job ads mentioning them between 1850-1900 and 1901-1950. We use a sample of 2779 job ads with both raw OCR and manually corrected transcriptions. These ads are divided into two data sets: one for 1850-1900 and another for 1901-1950.</p><p>First, we compare the absolute frequencies of the positions in the OCRed text of both data sets, with the results in Tab. 2. For example, 'Tapezierer' (paperhanger) appears 2 and 4 times in the two data sets, respectively. For the paperhanger, the OCR data thus suggests a low frequency of ads, with a slight increase in the 20th century. This might lead to the conclusion that paperhanger jobs were rare and that demand for them doubled in the first half of the 20th century. However, when we look at the manually corrected versions of the text, we obtain the following results (Tab. 3):</p><p>Manually corrected data also indicates an increase in demand over time. However, absolute frequencies may reflect the amount of available data rather than employment trends. To account for this, we divide absolute frequencies by the number of ads available for these time periods in our sample, which yields a different picture (Tab. 4): When adjusted for the total number of ads, the relative frequency of paperhanger ads shows that the demand remained almost constant. The apparent increase in absolute numbers is due to the larger volume of data available for the later period, not an actual rise in demand. The example of the paperhanger, and similarly of the other positions, demonstrates the need for cautious interpretation of historical job ads and shows that OCR errors do not impact all positions in the same way. 
Although similar tests need to be done on a larger scale, this example illustrates how easily we can arrive at a false interpretation.</p></div>
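<div xmlns="http://www.tei-c.org/ns/1.0"><p>The normalization used in this example can be reproduced directly from the numbers in Tab. 3 and Tab. 4: each absolute count from the manually corrected text is divided by the number of ads in the respective period. The sketch below is illustrative only; the variable and function names are not taken from the project code.</p><p>
```python
# Number of job ads per period in the sample (Tab. 4).
ADS_PER_PERIOD = {"1850-1900": 1016, "1901-1950": 1763}

# Absolute frequencies in the manually corrected text (Tab. 3).
ABSOLUTE = {
    "Tapezierer": {"1850-1900": 6, "1901-1950": 10},
    "Stubenmädchen": {"1850-1900": 7, "1901-1950": 28},
    "Verkäuferin": {"1850-1900": 8, "1901-1950": 21},
    "Bäcker": {"1850-1900": 4, "1901-1950": 37},
    "Vertreter": {"1850-1900": 10, "1901-1950": 21},
}


def relative_frequencies(absolute, totals):
    """Convert absolute counts to percentages of all ads per period."""
    return {
        position: {period: 100 * count / totals[period]
                   for period, count in counts.items()}
        for position, counts in absolute.items()
    }


relative = relative_frequencies(ABSOLUTE, ADS_PER_PERIOD)
for position, shares in relative.items():
    print(position, {period: f"{share:.3f}%" for period, share in shares.items()})
```
</p><p>For 'Tapezierer', the counts rise from 6 to 10, but the shares are 0.591% and 0.567%, which is the near-constant demand discussed above.</p></div>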
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion</head><p>In this paper, we addressed the biases encountered in our research of the labor market based on job advertisements from digitized newspapers. Our investigation focused on two sources of bias: those arising from the historical context of the newspapers and those introduced during data processing. The first set of biases stems from the nature of newspapers as a data source. This includes the selection of newspaper titles, their political and social orientations, geographic reach, and the profiles of advertisers and readers. These factors shape the type of job ads available, leading to over- or underrepresentation of certain job types and demographic groups. We emphasize that understanding these contextual elements is crucial. Researchers must adjust their research questions to align with the data's actual scope and representation rather than assuming a comprehensive view of the labor market.</p><p>The second set of biases arises from the various stages of data processing. This encompasses corpus creation, preselection of relevant pages, page segmentation, OCR quality, and post-correction processes. Each stage presents potential pitfalls that can skew results. We highlighted the importance of high recall in data preselection, of evaluating segmentation and OCR accuracy, and of manual corrections and error analysis. Through a practical example, we demonstrated how technical biases can skew historical job market analyses.</p><p>Our ongoing work includes gathering meta-information about each newspaper title and the labor market to better understand the matching of job applicants with vacancies. We also continue with further manual corrections of OCRed text to ensure consistent data quality across our corpus. 
We plan to investigate the post-processing aspects more deeply, such as the transferability of the model's skills in cases where there is a great difference between the OCR quality of the training and testing data. Additionally, we plan to implement strategies like data augmentation for underrepresented job categories.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Temporal distribution of selected newspaper titles according to their issuance years. Data source: [16].</figDesc><graphic coords="5,89.28,84.17,416.72,333.37" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Job advertisement for a baker apprentice from the countryside. Source: [2].</figDesc><graphic coords="6,141.37,109.70,312.54,168.87" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Character Error Rate of the OCRed text in Prager Abendblatt.</figDesc><graphic coords="8,89.28,123.19,416.72,238.12" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Example of the original and post-corrected text. Although the 'K' was turned into a number, the correct character is the '1' and not the '6'. Source: <ref type="bibr" target="#b13">[14]</ref>.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Mistakes introduced through the post-processing step on two text examples.</figDesc><table><row><cell>Ground Truth</cell><cell>OCR</cell><cell>Post-Correction</cell></row><row><cell>Geübte Büglerinnen und Lehrmädchen</cell><cell>Geübte Bualerinnen und Lehrmädchen</cell><cell>Geübte Malerinnen und Lehrmädchen</cell></row><row><cell>Büglerinnen und Lehrmädchen auf neue Herrenhemden</cell><cell>Büglerinnen und Lehrmädchen auf neue Herrenhemden</cell><cell>Züglerinnen und Lehrmädchen auf neue Herrenhemden</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>Absolute frequencies of positions found in the OCRed text for the periods 1850-1900 and 1901-1950.</figDesc><table><row><cell></cell><cell>1850-1900</cell><cell>1901-1950</cell></row><row><cell>Tapezierer (Paperhanger)</cell><cell>2</cell><cell>4</cell></row><row><cell>Stubenmädchen (Maid)</cell><cell>5</cell><cell>25</cell></row><row><cell>Verkäuferin (Shop Assistant f.)</cell><cell>7</cell><cell>14</cell></row><row><cell>Bäcker (Baker)</cell><cell>4</cell><cell>22</cell></row><row><cell>Vertreter (Agent/Representative)</cell><cell>6</cell><cell>18</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3</head><label>3</label><figDesc>Absolute frequencies of positions found in the manually corrected text for the periods 1850-1900 and 1901-1950.</figDesc><table><row><cell></cell><cell>1850-1900</cell><cell>1901-1950</cell></row><row><cell>Tapezierer (Paperhanger)</cell><cell>6</cell><cell>10</cell></row><row><cell>Stubenmädchen (Maid)</cell><cell>7</cell><cell>28</cell></row><row><cell>Verkäuferin (Shop Assistant f.)</cell><cell>8</cell><cell>21</cell></row><row><cell>Bäcker (Baker)</cell><cell>4</cell><cell>37</cell></row><row><cell>Vertreter (Agent/Representative)</cell><cell>10</cell><cell>21</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 4</head><label>4</label><figDesc>Relative frequencies of positions found in the manually corrected text for the periods 1850-1900 and 1901-1950.</figDesc><table><row><cell></cell><cell>1850-1900</cell><cell>1901-1950</cell></row><row><cell>Number of job ads</cell><cell>1016</cell><cell>1763</cell></row><row><cell>Tapezierer (Paperhanger)</cell><cell>0.591%</cell><cell>0.567%</cell></row><row><cell>Stubenmädchen (Maid)</cell><cell>0.689%</cell><cell>1.588%</cell></row><row><cell>Verkäuferin (Shop Assistant f.)</cell><cell>0.787%</cell><cell>1.191%</cell></row><row><cell>Bäcker (Baker)</cell><cell>0.394%</cell><cell>2.099%</cell></row><row><cell>Vertreter (Agent/Representative)</cell><cell>0.984%</cell><cell>1.191%</cell></row></table></figure>
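<div xmlns="http://www.tei-c.org/ns/1.0"><p>The relative frequencies in Table 4 appear to follow from dividing each absolute count in Table 3 by the number of job ads in the respective period. The Python lines below are a minimal sketch of that arithmetic under this assumption; the names counts, total_ads and relative_frequency are illustrative and are not taken from the project code.</p>
<p><code lang="python">
```python
# Absolute counts from Table 3 (manually corrected text),
# given as (1850-1900, 1901-1950).
counts = {
    "Tapezierer": (6, 10),
    "Stubenmädchen": (7, 28),
    "Verkäuferin": (8, 21),
    "Bäcker": (4, 37),
    "Vertreter": (10, 21),
}

# Number of job ads per period, as listed in Table 4.
total_ads = (1016, 1763)


def relative_frequency(count, total):
    """Return the share of job ads naming a position, in percent."""
    return round(100 * count / total, 3)


# Apply the same division to both periods of every position.
relative = {
    position: tuple(
        relative_frequency(count, total)
        for count, total in zip(pair, total_ads)
    )
    for position, pair in counts.items()
}

for position, (early, late) in relative.items():
    print(f"{position}: {early:.3f}% / {late:.3f}%")
```
</code></p>
<p>For Bäcker, for instance, the sketch gives 0.394% and 2.099%, which agrees with the values reported in Table 4.</p></div>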
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">hmbyt5-preliminary model [18.6.2024]</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>We thank the Austrian National Library (ÖNB) for providing the data. We would also like to thank Meike Linnewedel, Clara Hochreiter and Melanie Frauendorfer for their efforts in correcting and annotating the ground truth. This work was supported by the FWF under grant number P 35783.</p></div>
			</div>


			<div type="availability">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Data Availability</head><p>The annotated ground truth data is publicly available at: https://github.com/JobAds-FWFProject/Ground-Truth-CHR2024.</p></div>
			</div>

			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Appendix. List of newspaper titles present in our corpus</head><p>Arbeiterwille; Arbeiter-Zeitung; Bregenzer Tagblatt / Vorarlberger Tagblatt; Das Vaterland; Deutsches Volksblatt; Die Presse; Freie Stimmen; Fremden-Blatt; Grazer Tagblatt; Grazer Volksblatt; Illustrierte Kronen Zeitung; Innsbrucker Nachrichten; Linzer Tages-Post; Linzer Volksblatt; Morgen-Post; Neue Freie Presse; Neues Wiener Journal: unparteiisches Tagblatt; Neues Wiener Tagblatt (Tagesausgabe); Neuigkeits-Welt-Blatt; Pester Lloyd; Pilsner Tagblatt; Prager Abendblatt; Prager Tagblatt; Reichspost; Salzburger Chronik fuer Stadt und Land; Salzburger Volksblatt: die unabhaengige Tageszeitung fuer Stadt und Land Salzburg; Vorarlberger Landes-Zeitung; Vorarlberger Volks-Blatt; Wiener Zeitung</p></div>			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Combining Visual and Textual Features for Semantic Segmentation of Historical Newspapers</title>
		<author>
			<persName><forename type="first">R</forename><surname>Barman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ehrmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Clematide</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Oliveira</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Kaplan</surname></persName>
		</author>
		<idno type="DOI">10.46298/jdmdh.6107</idno>
	</analytic>
	<monogr>
		<title level="j">Journal of Data Mining &amp; Digital Humanities HistoInformatics</title>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title/>
		<author>
			<persName><forename type="first">W</forename><surname>Blatt</surname></persName>
		</author>
		<ptr target="https://anno.onb.ac.at/cgi-content/anno?aid=nwb&amp;datum=19210730&amp;seite=8" />
		<imprint>
			<date type="published" when="1921">1921</date>
			<biblScope unit="page">7</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">CNN-Based Page Segmentation and Object Classification for Counting Population in Ottoman Archival Documentation</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Can</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kabadayi</surname></persName>
		</author>
		<idno type="DOI">10.3390/jimaging6050032</idno>
	</analytic>
	<monogr>
		<title level="j">Journal of Imaging</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="page">32</biblScope>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">&quot;Q i-jtb the Raven&quot;: Taking Dirty OCR Seriously</title>
		<author>
			<persName><forename type="first">R</forename><surname>Cordell</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Book History</title>
		<imprint>
			<biblScope unit="volume">20</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="188" to="225" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">Arbeitsmarktpolitik im deutschen Kaiserreich: Arbeitsvermittlung, Arbeitsbeschaffung und Arbeitslosenunterstützung 1890-1918</title>
		<author>
			<persName><forename type="first">A</forename><surname>Faust</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1986">1986</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Cultural heritage as digital noise: nineteenth century newspapers in the digital archive</title>
		<author>
			<persName><forename type="first">J</forename><surname>Jarlbrink</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Snickars</surname></persName>
		</author>
		<idno type="DOI">10.1108/jd-09-2016-0106</idno>
		<ptr target="https://doi.org/10.1108/JD-09-2016-0106" />
	</analytic>
	<monogr>
		<title level="j">Journal of Documentation</title>
		<imprint>
			<biblScope unit="volume">73</biblScope>
			<biblScope unit="issue">6</biblScope>
			<biblScope unit="page" from="1228" to="1243" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Distance Measures for Image Segmentation Evaluation</title>
		<author>
			<persName><forename type="first">X</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Marti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Irniger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Bunke</surname></persName>
		</author>
		<idno type="DOI">10.1155/asp/2006/35909</idno>
		<ptr target="https://doi.org/10.1155/ASP/2006/35909" />
	</analytic>
	<monogr>
		<title level="j">EURASIP Journal on Advances in Signal Processing</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page">35909</biblScope>
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Interacting with digitised historical newspapers: understanding the use of digital surrogates as primary sources</title>
		<author>
			<persName><forename type="first">E</forename><surname>Late</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kumpulainen</surname></persName>
		</author>
		<idno type="DOI">10.1108/jd-04-2021-0078</idno>
	</analytic>
	<monogr>
		<title level="j">Journal of Documentation</title>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Building a test collection for complex document information processing</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">D</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Agam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">E</forename><surname>Argamon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Frieder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">A</forename><surname>Grossman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Heard</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval</title>
				<meeting>the 29th annual international ACM SIGIR conference on Research and development in information retrieval</meeting>
		<imprint>
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title/>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">U</forename><surname>Library</surname></persName>
		</author>
		<idno>frak2021-0.905. 2021</idno>
		<ptr target="https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/frak2021/tessdata_best/frak2021-0.905.traineddata" />
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">An Evaluation of DNN Architectures for Page Segmentation of Historical Newspapers</title>
		<author>
			<persName><forename type="first">B</forename><surname>Liebl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Burghardt</surname></persName>
		</author>
		<idno type="DOI">10.1109/icpr48806.2021.9412571</idno>
	</analytic>
	<monogr>
		<title level="m">2020 25th International Conference on Pattern Recognition (ICPR)</title>
				<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="5153" to="5160" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Building an efficient OCR system for historical documents with little training data</title>
		<author>
			<persName><forename type="first">J</forename><surname>Martínek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Lenc</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Král</surname></persName>
		</author>
		<idno type="DOI">10.1007/s00521-020-04910-x</idno>
		<ptr target="https://doi.org/10.1007/s00521-020-04910-x" />
	</analytic>
	<monogr>
		<title level="j">Neural Computing and Applications</title>
		<imprint>
			<biblScope unit="volume">32</biblScope>
			<biblScope unit="issue">23</biblScope>
			<biblScope unit="page" from="17209" to="17227" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<title level="m" type="main">Emergence of the Austrian labor market</title>
		<author>
			<persName><forename type="first">W</forename><surname>Mölzer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kleinert</surname></persName>
		</author>
		<ptr target="https://static.uni-graz.at/fileadmin/_files/_project_sites/_historical-job-ads/Emergence_Austrian_labor_market.pdf" />
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title/>
		<idno>.7.1911. 1911</idno>
		<ptr target="https://anno.onb.ac.at/cgi-content/anno?aid=ibn&amp;datum=19110701&amp;seite=38" />
	</analytic>
	<monogr>
		<title level="j">I. Nachrichten</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<title level="m" type="main">doccano: Text Annotation Tool for Human</title>
		<author>
			<persName><forename type="first">H</forename><surname>Nakayama</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Kubo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kamura</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Taniguchi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Liang</surname></persName>
		</author>
		<ptr target="https://github.com/doccano/doccano" />
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<author>
			<persName><forename type="first">Ö</forename><surname>Nationalbibliothek</surname></persName>
		</author>
		<ptr target="https://anno.onb.ac.at/" />
		<title level="m">ANNO Historische Zeitungen und Zeitschriften</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Tracing Discourses in Digital Newspaper Collections</title>
		<author>
			<persName><forename type="first">S</forename><surname>Oberbichler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Pfanzelter</surname></persName>
		</author>
		<idno type="DOI">10.1515/9783110729214-007</idno>
		<ptr target="https://doi.org/10.1515/9783110729214-007" />
	</analytic>
	<monogr>
		<title level="m">Digitised Newspapers - A New Eldorado for Historians</title>
				<editor>
			<persName><forename type="first">E</forename><surname>Bunout</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Ehrmann</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">F</forename><surname>Clavert</surname></persName>
		</editor>
		<imprint>
			<publisher>De Gruyter Oldenbourg</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="125" to="152" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Transkribus - A Service Platform for Transcription, Recognition and Retrieval of Historical Documents</title>
		<author>
			<persName><forename type="first">P</forename><surname>Kahle</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Colutto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Hackl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Mühlberger</surname></persName>
		</author>
		<idno type="DOI">10.1109/icdar.2017.307</idno>
	</analytic>
	<monogr>
		<title level="m">2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR)</title>
				<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="volume">04</biblScope>
			<biblScope unit="page" from="19" to="24" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">ICDAR 2019 Competition on Post-OCR Text Correction</title>
		<author>
			<persName><forename type="first">C</forename><surname>Rigaud</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Doucet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Coustaty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-P</forename><surname>Moreux</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 15th International Conference on Document Analysis and Recognition</title>
				<meeting>the 15th International Conference on Document Analysis and Recognition</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="1588" to="1593" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<title level="m" type="main">Comparison of Named Entity Recognition tools for raw OCR text</title>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">J</forename><surname>Rodriguez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bryant</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Blanke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Luszczynska</surname></persName>
		</author>
		<idno type="DOI">10.13140/2.1.2850.3045</idno>
		<imprint>
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Mining for the Meanings of a Murder: The Impact of OCR Quality on the Use of Digitized Historical Newspapers</title>
		<author>
			<persName><forename type="first">C</forename><surname>Strange</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>McNamara</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wodak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Wood</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Digital Humanities Quarterly</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Assessing the Impact of OCR Quality on Downstream NLP Tasks</title>
		<author>
			<persName><forename type="first">D</forename><surname>Strien</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Beelen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Coll Ardanuy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Hosseini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>McGillivray</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Colavizza</surname></persName>
		</author>
		<idno type="DOI">10.5220/0009169004840496</idno>
	</analytic>
	<monogr>
		<title level="j">SciTePress</title>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Mapping Texts: Examining the Effects of OCR Noise on Historical Newspaper Collections</title>
		<author>
			<persName><forename type="first">A</forename><surname>Torget</surname></persName>
		</author>
		<idno type="DOI">10.1515/9783110729214-003</idno>
	</analytic>
	<monogr>
		<title level="m">Digitised Newspapers - A New Eldorado for Historians</title>
				<editor>
			<persName><forename type="first">E</forename><surname>Bunout</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Ehrmann</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">F</forename><surname>Clavert</surname></persName>
		</editor>
		<imprint>
			<publisher>De Gruyter Oldenbourg</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="47" to="66" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<monogr>
		<title level="m" type="main">Quantifying Page Segmentation Quality in Historical Job Advertisements Retrieval</title>
		<author>
			<persName><forename type="first">K</forename><surname>Venglarova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Adam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Balasubramanian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Vogeler</surname></persName>
		</author>
		<ptr target="https://inria.hal.science/hal-04560463" />
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">The Making of Public Labour Intermediation: Job Search, Job Placement, and the State in Europe, 1880-1940</title>
		<author>
			<persName><forename type="first">S</forename><surname>Wadauer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Buchner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mejstrik</surname></persName>
		</author>
		<idno type="DOI">10.1017/s002085901200048x</idno>
	</analytic>
	<monogr>
		<title level="j">International Review of Social History</title>
		<imprint>
			<biblScope unit="volume">57</biblScope>
			<biblScope unit="issue">S20</biblScope>
			<biblScope unit="page" from="161" to="189" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Evaluating Models of Latent Document Semantics in the Presence of OCR Errors</title>
		<author>
			<persName><forename type="first">D</forename><surname>Walker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Lund</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Ringger</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Association for Computational Linguistics</title>
				<imprint>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page">240</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Mining Historical Advertisements in Digitised Newspapers</title>
		<author>
			<persName><forename type="first">M</forename><surname>Wevers</surname></persName>
		</author>
		<idno type="DOI">10.1515/9783110729214-011</idno>
	</analytic>
	<monogr>
		<title level="m">Digitised Newspapers - A New Eldorado for Historians</title>
				<editor>
			<persName><forename type="first">E</forename><surname>Bunout</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Ehrmann</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">F</forename><surname>Clavert</surname></persName>
		</editor>
		<imprint>
			<publisher>De Gruyter Oldenbourg</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="227" to="252" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Digital Humanities and Media History: A Challenge for Historical Newspaper Research</title>
		<author>
			<persName><forename type="first">H</forename><surname>Wijfjes</surname></persName>
		</author>
		<idno type="DOI">10.18146/2213-7653.2017.277</idno>
	</analytic>
	<monogr>
		<title level="j">TMG Journal for Media History</title>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
