<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>From digitised sources to digital data: Behind the scenes of (critically) enriching a digital heritage collection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lorella Viola</string-name>
          <email>lorella.viola@uni.lu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DH - University of Luxembourg) - Thinkering Grant and the Luxembourg National Research Fund</institution>
          ,
          <addr-line>13307816</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Luxembourg Centre for Contemporary and Digital History (C2DH), University of Luxembourg</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>and Antonio Maria Fiscarelli</institution>
        </aff>
      </contrib-group>
      <fpage>51</fpage>
      <lpage>64</lpage>
      <abstract>
        <p>Digitally available repositories are becoming not only more and more widespread but also larger and larger. Although there are both digitally-born collections and digitised material, the digital heritage scholar is typically confronted with the latter. This immediately presents new challenges, one of the most urgent being how to find the meaningful elements that are hidden underneath such an unprecedented mass of digital data. One way to respond to this challenge is to contextually enrich the digital material, for example through deep learning. Using the enrichment of the digital heritage collection ChroniclItaly 3.0 [10] as a concrete example, this article discusses the complexities of this process. Specifically, combining statistical and critical evaluation, it describes the gains and losses resulting from the decisions made by the researcher at each step, and it shows how, in the passage from digitised sources to enriched material, much is gained (e.g., preservation, wider and enhanced access, more material) but some is also lost (e.g., original layout and composition, loss of information due to pre-processing steps). The article concludes that it is only through a critical approach that the digital heritage scholar can successfully meet the interpretive challenges presented by the digital, and that the digital heritage sector can fulfil the second most important purpose of digitisation, that is, to enhance access.</p>
      </abstract>
      <kwd-group>
        <kwd>Digital heritage</kwd>
        <kwd>Contextual enrichment</kwd>
        <kwd>Deep learning</kwd>
        <kwd>Digital humanities</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>As digitally available repositories grow ever larger, finding the meaningful elements hidden within such an unprecedented mass of digital data is increasingly challenging. Moreover, if the collection is digitised (as opposed to born digital), further complexity is added to the picture, typically due to the often low quality of the digitised source (e.g., OCR mistakes, markings on the pages, low readability of unusual fonts, or a generally poor condition of the original text). One way to respond to the pressing need for enhancing access and making collections retrievable in valuable ways is to enrich the digital material with contextual information, for example by using deep learning. This process, however promising and continually advancing, is not free from challenges of its own which, to be effectively tackled, require technical expertise but, more importantly, full critical engagement.</p>
      <p>
        This article uses the enrichment of the digital heritage collection ChroniclItaly 3.0 [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] as a concrete example to discuss such challenges. ChroniclItaly 3.0 is a corpus of Italian American immigrant newspapers published between 1898 and 1936. Specifically, combining statistical and critical evaluation, the article describes the gains and losses resulting from the decisions and interventions made by the researcher at every step of the enrichment process, from the so-called pre-processing steps (tokenization, lowercasing, stemming, lemmatization, removing stopwords, removing noise) to the enrichment itself (Named Entity Recognition - NER, geo-coding, Entity Sentiment Analysis - ESA). The enrichment process presented in this article was carried out in the context of a larger project, DeepTextMiner (DeXTER), which was supported by the Luxembourg Centre for Contemporary and Digital History (C2DH - University of Luxembourg) Thinkering Grant. DeXTER aims to offer a methodological contribution to digital humanities (DH) by exploring the value of reusing and advancing existing knowledge. Specifically, it intends to experiment with different state-of-the-art NLP and deep learning techniques for both enriching digital data with contextual information (e.g., referential, geographical, emotional, topical, relational) and visualising it. The long-term goal is to devise a generalisable, interoperable workflow that could assist researchers in both these tasks.
      </p>
      <p>The discussion will show how the passage from digitised sources to enriched material, while aiming to make collections more engaging and valuable on the whole through preservation and wider and enhanced access, is not free from disadvantages such as loss of the original layout and structure, loss of information due to pre-processing steps, and the introduction of new errors. The article ultimately shows that it is only through an active, critical engagement with the digital sources that the digital heritage scholar can successfully meet the interpretive challenges presented by the digital, and that the digital heritage sector can fulfil the second most important purpose of digitisation after preservation, that is, to enhance access.</p>
    </sec>
    <sec id="sec-2">
      <title>Digital source and methodology</title>
      <p>This article describes the technical interventions and critical choices made towards enriching the digital heritage collection ChroniclItaly 3.0 with contextual information (i.e., referential entities, geo-coding information, sentiment). By means of statistical and critical evaluation, it aims to show that digitally enabled research is highly dependent on the critical engagement of the scholar, who is never 'completely removed' from the computational. From pre-processing the digital material to the enrichment itself, the paper documents the decisions and interventions made by the researcher at every step of the process and how these affected the material and inevitably impacted the final output, in turn impacting access, analysis and reuse of the collection itself.</p>
      <sec id="sec-2-1">
        <title>ChroniclItaly 3.0</title>
        <p>
          ChroniclItaly 3.0 [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] is a corpus of Italian immigrant newspapers published in the United States between 1898 and 1936. The corpus includes the front pages of the following titles: L'Italia, Cronaca Sovversiva, Il Patriota, La Libera Parola, La Rassegna, La Ragione, L'Indipendente, La Sentinella, La Sentinella del West, and La Tribuna del Connecticut, for a total of 8,653 issues. The collection includes further issues as well as three additional titles compared with its two previous versions, ChroniclItaly [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] and ChroniclItaly 2.0 [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Like the previous versions, ChroniclItaly 3.0 has been machine-harvested from Chronicling America, an Internet-based U.S. directory<sup>1</sup> of digitised historical newspapers published in the United States from 1789 to 1963. The corpus features prominenti (mainstream), sovversivi (anarchic), and independent newspapers,<sup>2</sup> thus providing a very nuanced picture of the Italian immigrant community in the United States at the turn of the twentieth century.
        </p>
        <p>Immigrant newspapers were continually fighting against the risk of bankruptcy and owners were often forced to discontinue the titles; for this reason, some titles such as L'Italia - one of the most mainstream Italian immigrant publications in the U.S. at the time - managed to last for years, while others like La Rassegna or La Ragione could survive only for a few months. This is reflected in the composition of the collection, which therefore presents gaps across titles (Figure 1). Also due to the newspapers' economic struggles, the number of issues varies greatly across titles, with some titles publishing thousands of issues and others only a few hundred. The overall coverage of issues is nonetheless relatively evenly distributed across the whole period, and titles of different orientation co-exist at different points in time, thus ensuring that a degree of balance is kept throughout the collection. Users should however take into account factors such as over- or under-representation of some titles, potential polarisation of topics, etc. when engaging with the resource.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Methodology</title>
        <p>This paper describes the process of contextually enriching the digital heritage collection ChroniclItaly 3.0 from the so-called pre-processing steps to the enrichment itself. Combining statistics and critical reflection, the aim is to provide an insider perspective on the technical and analytical challenges posed by the task of enriching digital heritage material. The discussion is divided into two parts: the first part discusses the pre-processing steps (tokenization, removing words with fewer than two characters, and removing numbers, dates and special characters); the second part is concerned with the enrichment itself (NER, geo-coding, ESA).</p>
        <sec id="sec-2-2-1">
          <title>1 https://chroniclingamerica.loc.gov/</title>
          <p>
            2 For further information about the classification of the titles based on their political orientation, please refer to [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ].
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Towards enrichment: Pre-processing</title>
      <p>Before working towards enriching a digital (textual) collection, and indeed before performing any Natural Language Processing (NLP) task, the digital material needs to be prepared. It is all too often assumed that this operation, also ambiguously referred to as cleaning, is not part of the enrichment process and can therefore be fully automatic, unsupervised and launched as a one-step pipeline. In this section, we show that it is on the contrary paramount that this phase is tackled critically, as each action taken will have consequences on the material, on how the algorithms will process such material and therefore on the output and, finally, on how the collection will be accessed and the information retrieved and interpreted.</p>
      <p>Deciding which of the pre-processing operations should be performed and how depends on many factors such as the language of the data-set, the type of material, and the specific enrichment task, to name but a few. Typical operations that are considered part of this step are tokenization, lowercasing, stemming, lemmatization, removing stopwords, and removing noise (e.g., numbers, punctuation marks, special characters). In principle, all these interventions are optional as the algorithms will process whichever version of the data-set is used. In reality, however, pre-processing the digital material is key to the subsequent operations for several reasons. First and foremost, pre-processing the data will remove most OCR mistakes, which are always present in digital textual collections to various degrees. This is especially true for corpora such as historical collections, repositories of under-documented languages, or digitised archives from handwritten texts. Second, it will reduce the size of the collection, thus decreasing the required processing power and time. Third, it is de facto a data exploration step which allows the digital heritage scholar to look more closely at the material.</p>
      <p>It is important to remember that each one of these steps is an additional layer of manipulation and has direct, heavy consequences on the material and therefore on the following operations. It is critical that digital scholars assess carefully to what degree they want to intervene on the material and how. For this reason, this part of the process of contextual enrichment should not be considered as separate from the enrichment itself; on the contrary, it is an integral part of the entire process.</p>
      <p>The specific pre-processing actions taken towards enriching ChroniclItaly 3.0 were: tokenization, removing numbers and dates, removing words with fewer than two characters, and removing special characters. Numbers and dates were removed because they are not only typically considered irrelevant to NER and ESA, but may sometimes even interfere with the algorithm's performance, thus potentially worsening the quality of the output. The last two operations were performed because it was found that special and isolated characters (characters delimited by spaces) were in fact OCR mistakes.</p>
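      <p>As an illustration only, the pre-processing actions listed above could be sketched in Python as follows. This is not the code used for ChroniclItaly 3.0: the whitespace tokenizer, the two regular expressions and the function name preprocess are assumptions made for the example, and punctuation is deliberately kept.</p>
      <preformat>
```python
import re

# Assumption: sentence punctuation and apostrophes are kept, as are hyphens
# inside words; every other non-alphanumeric symbol counts as "special".
PUNCT = ".,;:!?'"
SPECIAL = re.compile(r"[^\w\s'.,;:!?-]|_")
# Assumption: a number or numeric date is digits optionally joined by . or -
NUMBER_OR_DATE = re.compile(r"^\d+([.-]\d+)*$")


def preprocess(text):
    """Tokenise on whitespace and drop numbers, dates, special characters
    and isolated single characters (treated here as OCR noise)."""
    kept = []
    for token in text.split():
        token = SPECIAL.sub("", token)      # strip special characters
        if not token:                       # token was only special characters
            continue
        core = token.strip(PUNCT)           # token without edge punctuation
        if core and (NUMBER_OR_DATE.match(core) or len(core) == 1):
            continue                        # number, date or isolated character
        kept.append(token)                  # punctuation-only tokens are kept
    return kept


sample = "Il 12 maggio 1905 § New York , e l'Italia 3/4/1905 ~ x"
tokens = preprocess(sample)
```
      </preformat>
      <p>Note that in this sketch one-letter Italian words such as the conjunction "e" are discarded together with genuine OCR debris, which is one concrete way in which information is lost at this stage.</p>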
      <p>Other actions included merging words wrongfully separated by a newline, a white space or punctuation. Once this step was performed, the collection totalled 21,454,455 words. Figure 2 shows how this step impacted each title per year in terms of the percentage of material removed and preserved, while Figure 3 displays these percentages aggregated for each title.</p>
      <p>As the figures show - particularly Figure 3 - on average, after this step 30% of the material was removed from each title. This means that, if on the one hand the step was likely to result in more reliable and interpretable material, on the other it came at the expense of potentially important information. In this respect, the challenge does not lie so much in performing the pre-processing itself as in assessing which operations minimise the loss of potentially useful information and maximise the enhancement of the resource. As the technology is still not perfect, digital heritage scholars and institutions must respond to this challenge by carefully pondering the pros and cons of enriching digital collections and duly warning users about the interventions performed.</p>
      <p>[Figure: percentage of material preserved and not preserved after pre-processing, per title (Cron.S., Indip., Italia, Lib.Par., Patriota, Ragione, Rass., Sent., Sent.W., Trib.C.).]</p>
      <p>As for the specific pre-processing operations to perform, we chose not to remove stopwords and punctuation, even though they are typically considered not semantically salient and in fact even detrimental to the models' performance. The choice was motivated by the enrichment actions to follow, namely NER, geo-coding, and ESA: prepositions and articles are often part of locations, organisations and names, while punctuation is a typical sentence delimiter, that is, it sets the sentence boundaries that are indispensable for performing sentiment analysis. Therefore, removing such material from the collection could have had a negative impact on the enrichment quality. We also decided not to lowercase the material. Lowercasing can be a double-edged sword; for instance, if lowercasing is not performed, the algorithm will treat 'USA', 'Usa', 'usa', 'UsA', 'uSA', etc. as distinct tokens, even though they may all refer to the same entity. On the other hand, once the text is lowercased, it may become difficult for the algorithm to recognise entities, thus outputting many false negatives, and for the scholar to distinguish between homonyms, thus potentially skewing the output. As entities such as persons, locations and organisations are typically capitalised, we decided not to perform lowercasing in preparation for NER and geo-coding. Once these steps were completed, however, the entities were lowercased so that the issue of multiple items referring to the same entity (e.g., 'USA' and 'Usa') could be overcome (cf. Section 5).</p>
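      <p>The effect of lowercasing the entities only after NER can be sketched as below. The counts are invented for illustration and merge_case_variants is a hypothetical helper, not a function of the project.</p>
      <preformat>
```python
from collections import Counter


def merge_case_variants(entity_counts):
    """Sum the counts of entities that differ only in capitalisation."""
    merged = Counter()
    for entity, count in entity_counts.items():
        merged[entity.lower()] += count
    return merged


# Hypothetical frequencies of entities as returned by the tagger.
counts = {"USA": 5, "Usa": 2, "usa": 1, "Italia": 4}
merged = merge_case_variants(counts)
```
      </preformat>
      <p>Because the merge happens after tagging, the tagger still sees the original capitalisation, while the final counts treat 'USA' and 'Usa' as one entity.</p>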
    </sec>
    <sec id="sec-4">
      <title>Enrichment: Named Entity Recognition</title>
      <p>
        The growing digitisation of material and immaterial cultural heritage has proportionally increased the relevance of Artificial Intelligence (AI) for researchers in the humanities as well as libraries and cultural heritage institutions [
        <xref ref-type="bibr" rid="ref13 ref3">3, 13</xref>
        ]. AI offers users and researchers methods to analyse, explore and make sense of the different layers of information sometimes hidden in large quantities of material. For instance, one effective application of AI is the possibility to enrich the digital material with data that could allow for in-depth cultural analyses. One example of such text enrichment is Named Entity Recognition (NER), that is, using contextual information to identify referential entities such as names of persons, locations and organisations.
      </p>
      <p>
        NER methods have undergone major changes over the years; however, it has been repeatedly shown that machine learning algorithms based on neural networks outperform all previous methods. For this reason, the NER enrichment of ChroniclItaly 3.0 was performed by using a deep learning sequence tagging tool built on TensorFlow [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The algorithm combines Bidirectional Long Short-Term Memory (BiLSTM) and Conditional Random Fields (CRF) as top layer with character embeddings, a combination which was found to outperform CRFs with hand-coded features. Methodologically, the character embeddings are trained alongside pre-trained word embeddings while training the model itself. The character and subword based word embeddings are computed with FastText [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] as it was found that, by retrieving embeddings for unknown words through the incorporation of subword information, issues with out-of-vocabulary words are significantly alleviated.
      </p>
      <p>The sequence tagging model for the Italian language was trained on I-CAB<sup>3</sup> (Italian Content Annotation Bank), an open access corpus annotated for entities (i.e. persons-PER, organisations-ORG, locations-LOC, and geopolitical entities-GPE), temporal expressions, and relations between entities. I-CAB contains 525 news articles taken from the Italian newspaper L'Adige and totals around 180,000 words. Embeddings were computed using Italian Wikipedia and trained using FastText with 300 dimensions.</p>
      <sec id="sec-4-1">
        <title>3 http://ontotext.fbk.eu/icab.html</title>
        <p>The scores for the Italian models are: accuracy: 98.15%; precision: 83.64%; recall: 82.14%; F1: 82.88. Figure 4 shows the tags of the NER algorithm, Figure 5 shows the accuracy scores per entity, while Figure 6 shows the final output of the sequence tagger on ChroniclItaly 3.0.</p>
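        <p>As a quick check on how these figures relate, the F1 score is the harmonic mean of precision and recall. The short sketch below recomputes it from the two percentages reported above; the function name f1_score is ours.</p>
        <preformat>
```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for the Italian models (in percent).
f1 = f1_score(83.64, 82.14)
```
        </preformat>
        <p>Rounded to two decimals, the result agrees with the reported F1 of 82.88.</p>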
        <p>The NER algorithm retrieved 547,667 entities, which occurred 1,296,318 times across the ten titles. A close analysis of the entities, however, revealed a number of issues which required a manipulation of the output. These issues included: entities that had been assigned the wrong tag (e.g., New York - PER), multiple entities referring to the same entity (e.g., Woodraw Wilson, President Woodraw Wilson), and elements wrongfully tagged as entities (e.g., venerdì 'Friday' - ORG). Therefore, a list of these exceptions was compiled and the results adjusted accordingly. Once the data were modified, the data-set counted 521,954 unique entities which occurred 1,205,880 times. Figure 7 shows how the intervention affected the distribution of entities across the four categories - geopolitical, persons, locations, organisations - per title.</p>
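        <p>A minimal sketch of how such a list of exceptions might be applied to the tagger output is given below. The three rule tables are hypothetical and contain only the examples mentioned above; in particular, retagging New York as GPE is our assumption, since the text does not state which tag was assigned after correction.</p>
        <preformat>
```python
from collections import Counter

# Hypothetical exception rules, one table per kind of issue.
RETAG = {("New York", "PER"): "GPE"}                      # wrong tag (assumed fix)
ALIASES = {"President Woodraw Wilson": "Woodraw Wilson"}  # same referent
NOT_ENTITIES = {("venerdì", "ORG")}                       # not an entity at all


def apply_exceptions(tagged_counts):
    """Return corrected (name, tag) counts after applying the rules."""
    cleaned = Counter()
    for (name, tag), count in tagged_counts.items():
        if (name, tag) in NOT_ENTITIES:
            continue                        # drop false entities
        tag = RETAG.get((name, tag), tag)   # correct the tag if listed
        name = ALIASES.get(name, name)      # collapse variants of one entity
        cleaned[(name, tag)] += count
    return cleaned


# Invented counts, for illustration only.
raw = Counter({
    ("New York", "PER"): 3,
    ("New York", "GPE"): 10,
    ("President Woodraw Wilson", "PER"): 2,
    ("Woodraw Wilson", "PER"): 5,
    ("venerdì", "ORG"): 7,
})
cleaned = apply_exceptions(raw)
```
        </preformat>
        <p>Keeping the rules in plain tables of this kind makes each manual decision explicit, so that it can be documented and revisited by other users of the collection.</p>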
        <p>As can be seen, the redistribution of entities varied across categories and titles, in some cases dramatically. For instance, in La Rassegna the number of entities in the LOC category significantly decreased, whereas it increased in L'Italia. This action required a combination of expert knowledge and technical ability, as the entities had to be carefully analysed and historically triangulated in order to make informed decisions on how to intervene on the output without introducing errors. Although time-consuming and in principle optional, this critical evaluation nevertheless significantly improved the accuracy of the tags, thus increasing the overall quality of the NER output in preparation for the following stages of the enrichment, namely geo-coding and ESA, and ultimately adding more value to the user's experience of access and reuse. It is therefore a highly recommended operation.</p>
        <p>Fig. 7. Distribution of entities per title after intervention. Positive bars indicate a decreased number of entities after the process, negative bars indicate an increased number.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Enrichment: Geo-coding</title>
      <p>
        The relevance of geo-coding for digital heritage collections lies in what has been referred to as the Spatial turn, the study of space and place [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] as distinct entities: place is seen as created through social experiences and can therefore be both real and imagined, whereas space is essentially geographic. Spatial humanities is a recently emerged interdisciplinary field within digital humanities which, following the Spatial turn, focuses on geographic and conceptual space, particularly from a historical perspective. Based on Geographic Information Systems (GIS), locations in data-sets are geo-coded and displayed on a map. Especially in the case of large collections with hundreds of thousands of geographical entities, the visualisation is believed to help scholars access the different layers of information that may lie behind geo-references. Indeed, such a process of cross-referencing geo-locations with other types of data (e.g., historical, social, temporal) has provided researchers working in fields such as environmental history, historical demography, and economic, urban and medieval history with new insights, leading them to propose alternatives to traditional positions and even explore new research avenues [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. The theoretical relevance of using NER to enrich digital heritage collections lies precisely in its great potential for discovering the cultural significance underneath referential units and how that may have changed over time.
      </p>
      <p>
        Another challenge posed to digital heritage scholars is that of the language of the collection. In the case of ChroniclItaly 3.0, for example, almost all choices made by the researcher towards enriching, including NER and geo-coding, were conditioned by the fact that the language of the collection is not English. The relative lack of appropriate computational resources available for languages other than English often dictates which tools and platforms can be used for specific tasks. For geo-coding, for instance, it was found that setting the API language as the language of the data-set improves the accuracy of the geo-coding results [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. For this reason, the geographical entities in ChroniclItaly 3.0 have been geo-coded using the Google Cloud Natural Language API within the Google Cloud Platform Console, which provides a range of NLP technologies in a wide range of languages, including Italian.
      </p>
      <p>Moreover, when enriching collections in languages other than English, digital heritage scholars often find themselves confronted with the additional challenge of having to choose between training their own models or using existing models. While the former scenario may sometimes not be an option due, for instance, to the scholar's lack of specific training or to time or money limits, the latter alternative may also not be ideal. Even when the model was trained in the target language (as in the case of ChroniclItaly 3.0), typically the training would have occurred within the context of another project, for different purposes, possibly using data-sets with very different specifics from the one the scholar is enriching. Researchers are then usually forced to sacrifice the possibility to achieve a potentially much higher quality and/or accuracy of their results in the interest of time, money or both. For example, as shown in Figure 5, although the overall F1 score of the NER algorithm for Italian models was satisfactory (82.88), the individual performance for the entity LOC was rather poor (54.19). This may depend on several factors (e.g., lack of this type of locations in the data-set used for training the model, different locations tagged as LOC) which are related to the wider challenge of accessing already available trained models in the desired language. Because of the low score of the category LOC, we decided to geo-code only GPE entities. Though not optimal, the decision was made also considering that GPE entities are generally more informative as they typically refer to countries and cities (though the category was found to retrieve also counties and States), while LOC entities are typically rivers, lakes, and geographical areas (e.g., the Pacific Ocean). Future work on the collection could focus on performing NER using a more fine-tuned algorithm and geo-coding the LOC-type entities.</p>
      <p>In total, 2,160 GPE-type entities were geo-coded; these are referred to 283,879 times throughout the whole corpus. Figure 8 shows the distribution of unique GPE entities per title.</p>
    </sec>
    <sec id="sec-6">
      <title>Enrichment: Entity Sentiment Analysis</title>
      <p>
        Another enriching technique used to add value to digital heritage collections is Entity Sentiment Analysis (ESA), a way to identify the prevailing emotional attitude of the writer towards referential entities. While Sentiment Analysis (SA) identifies the prevailing emotional opinion within a given text, ESA adds a more targeted and sophisticated quality to the enrichment as it allows users to identify the layers of meaning humans attached historically to people, organisations, and geographical spaces. Understanding the meaning humans invested in such entities and, in the case of historical collections such as ChroniclItaly 3.0, how that meaning may have changed over time, provides digital heritage scholars with powerful means to access part of collective narratives of fear, pride, longing, and loss. ESA enables us to connect these subjective connotations to the named entities extracted from digitised texts [
        <xref ref-type="bibr" rid="ref12 ref2 ref6 ref7">12, 2, 6, 7</xref>
        ], thus offering a more engaging and complete experience of the information and making it possible to tackle new questions or revise old assumptions. For example, the ESA of ChroniclItaly 3.0 may allow researchers to investigate how immigrants made sense of their diasporic identities within the host community and how their relationship with the homeland may have changed over time.
      </p>
      <p>The process of performing ESA on the collection required several steps:</p>
      <list list-type="bullet">
        <list-item>
          <p>Identify the sentence delimiters (i.e., full stop, semicolon, colon, exclamation mark, question mark) and divide the textual material accordingly. At the end of this step, 677,030 sentences were obtained;</p>
        </list-item>
        <list-item>
          <p>Select the most frequent entities for each category and in each title. As each title differs in size, we used a logarithmic function to obtain a more representative number of entities per title (2*log2 function used). At the end of this step, 228 entities were obtained, distributed across titles as shown in Figure 9;</p>
        </list-item>
        <list-item>
          <p>Select only the sentences that contained the entities identified in the previous step. This step was done to limit the number of API requests and reduce processing time and costs. The selection returned 133,281 sentences distributed across titles as shown in Figure 10;</p>
        </list-item>
        <list-item>
          <p>Perform ESA on the sentences so selected. When the study was carried out, no suitable SA models trained on Italian were found, therefore this step was performed once again using the Google Cloud Platform Console.</p>
        </list-item>
      </list>
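      <p>The first three steps above can be sketched as follows. This is an illustration under stated assumptions rather than the project's code: the text does not say what the 2*log2 function is applied to, so the sketch assumes it is the number of distinct entities in a title and category; the function names are ours, and entities are matched by plain substring search. The final step, the call to the sentiment service, is not shown.</p>
      <preformat>
```python
import math
import re
from collections import Counter

# Full stop, semicolon, colon, exclamation mark, question mark.
SENTENCE_DELIMITERS = re.compile(r"[.;:!?]+")


def split_sentences(text):
    """Divide the text at the sentence delimiters, dropping empty pieces."""
    return [s.strip() for s in SENTENCE_DELIMITERS.split(text) if s.strip()]


def most_frequent_entities(entity_counts):
    """Keep the top k entities, with k = 2 * log2(number of distinct
    entities). What the logarithm is applied to is an assumption."""
    if not entity_counts:
        return []
    k = max(1, round(2 * math.log2(len(entity_counts))))
    return [name for name, _ in Counter(entity_counts).most_common(k)]


def select_sentences(sentences, entities):
    """Keep only the sentences that mention at least one selected entity."""
    return [s for s in sentences if any(e in s for e in entities)]


text = "Roma è lontana. Il presidente Wilson parla; New York cresce! Che tempo fa?"
sentences = split_sentences(text)
selected = select_sentences(sentences, ["Wilson", "New York"])
```
      </preformat>
      <p>Filtering before any request is sent is what keeps the number of API calls, and therefore processing time and cost, proportional to the selected sentences rather than to the whole collection.</p>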
    </sec>
    <sec id="sec-7">
      <title>Conclusions</title>
      <p>
        This article discussed the technical and analytical challenges posed by the task of enriching a digital heritage collection and used the enrichment of ChroniclItaly 3.0 [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] as a concrete example. Specifically, the article combined statistical analysis and critical evaluation to describe the impact that the decisions and interventions made at every step of the enrichment process had on the collection. Through such discussion, the aim was to show that, paradoxically, enriching a digital heritage collection means sacrificing content. Therefore, an active, critical engagement of the digital heritage scholar is an absolutely necessary requirement to minimise such losses and ensure that the enrichment actions truly increase the value of the collections, making the users' experience more engaging and thus widening and enhancing access.
      </p>
      <p>The enrichment actions discussed here are part of a wider process of enrichment within DeXTER. Additional enrichment operations include topic modelling and network analysis, which were not discussed here due to word count limitations. Though this discussion is not exhaustive, we nevertheless hope to have shown that it is only through a continuous, critical engagement with the digital sources that the digital heritage scholar can successfully meet the challenges presented by the digital and fulfil the main purposes of digitisation: preservation and knowledge access.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bojanowski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grave</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joulin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Enriching word vectors with subword information (</article-title>
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Donaldson</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gregory</surname>
            ,
            <given-names>I.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Taylor</surname>
            ,
            <given-names>J.E.</given-names>
          </string-name>
          :
          <article-title>Locating the beautiful, picturesque, sublime and majestic: spatially analysing the application of aesthetic terminology in descriptions of the English Lake District</article-title>
          .
          <source>Journal of Historical Geography</source>
          <volume>56</volume>
          ,
          <fpage>43</fpage>
          -
          <lpage>60</lpage>
          (
          <year>2017</year>
          ), doi: 10.1016/j.jhg.2017.01.006
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Fiorucci</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khoroshiltseva</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pontil</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Traviglia</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Del Bue</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>James</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Machine learning for cultural heritage: A survey</article-title>
          .
          <source>Pattern Recognition Letters</source>
          <volume>133</volume>
          ,
          <fpage>102</fpage>
          –
          <lpage>108</lpage>
          (
          <year>2020</year>
          ). https://doi.org/10.1016/j.patrec.2020.02.017, http://www.sciencedirect.com/science/article/pii/S0167865520300532
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Murrieta-Flores</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martins</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>The geospatial humanities: past, present and future</article-title>
          .
          <source>International Journal of Geographical Information Science</source>
          <volume>33</volume>
          (
          <issue>12</issue>
          ),
          <fpage>2424</fpage>
          –
          <lpage>2429</lpage>
          (
          <year>2019</year>
          ). https://doi.org/10.1080/13658816.2019.1645336
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Riedl</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pado</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>A named entity recognition shootout for German</article-title>
          .
          <source>In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)</source>
          . pp.
          <fpage>120</fpage>
          –
          <lpage>125</lpage>
          (
          <year>2018</year>
          ), doi: 10.18653/v1/P18-2020
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Tally</surname>
            <suffix>Jr</suffix>
            ,
            <given-names>R.T.</given-names>
          </string-name>
          :
          <article-title>Geocritical explorations: Space, place, and mapping in literary and cultural studies</article-title>
          .
          <source>Springer</source>
          (
          <year>2011</year>
          ), doi: 10.1057/9780230337930
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Taylor</surname>
            ,
            <given-names>J.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Donaldson</surname>
            ,
            <given-names>C.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gregory</surname>
            ,
            <given-names>I.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Butler</surname>
            ,
            <given-names>J.O.</given-names>
          </string-name>
          :
          <article-title>Mapping digitally, mapping deep: exploring digital literary geographies</article-title>
          .
          <source>Literary Geographies</source>
          <volume>4</volume>
          (
          <issue>1</issue>
          ),
          <fpage>10</fpage>
          –
          <lpage>19</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Viola</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>ChroniclItaly. A corpus of Italian language newspapers published in the United States between 1898 and 1922</article-title>
          . Utrecht University (
          <year>2018</year>
          ), doi: 10.24416/UU01-T4YMOW
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Viola</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>ChroniclItaly 2.0. A corpus of Italian American newspapers annotated for entities, 1898-1920</article-title>
          . Utrecht University (
          <year>2019</year>
          ), doi: 10.24416/UU01-4MECRO
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Viola</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>ChroniclItaly 3.0. A contextually enriched digital heritage collection of Italian immigrant newspapers published in the USA, 1898-1936</article-title>
          (In press)
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Viola</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verheul</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Mining ethnicity: Discourse-driven topic modelling of immigrant discourses in the USA, 1898–1920</article-title>
          .
          <source>Digital Scholarship in the Humanities</source>
          <volume>35</volume>
          (
          <issue>4</issue>
          ),
          <fpage>921</fpage>
          –
          <lpage>943</lpage>
          (
          <year>2019</year>
          ), doi: 10.1093/llc/fqz068
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Viola</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verheul</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Machine learning to geographically enrich understudied sources: A conceptual approach</article-title>
          .
          <source>In: Proceedings of the 12th International Conference on Agents and Artificial Intelligence - Volume 1: ARTIDIGH</source>
          . pp.
          <fpage>469</fpage>
          –
          <lpage>475</lpage>
          .
          <publisher-name>SCITEPRESS</publisher-name>
          (
          <year>2020</year>
          ), doi: 10.5220/0009094204690475
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Weber</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ameryan</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wolstencroft</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stork</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heerlien</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schomaker</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Towards a digital infrastructure for illustrated handwritten archives</article-title>
          .
          <source>In: Digital Cultural Heritage</source>
          , pp.
          <fpage>155</fpage>
          –
          <lpage>166</lpage>
          . Springer (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>