From digitised sources to digital data: Behind the scenes of (critically) enriching a digital heritage collection⋆

Lorella Viola1 and Antonio Maria Fiscarelli1

1 Luxembourg Centre for Contemporary and Digital History (C2DH - University of Luxembourg), Maison des Sciences Humaines, 11, Porte des Sciences, Esch-sur-Alzette, L-4366, Luxembourg
lorella.viola@uni.lu, antonio.fiscarelli@uni.lu

⋆ This research is supported by the Luxembourg Centre for Contemporary and Digital History (C2DH - University of Luxembourg) - Thinkering Grant and the Luxembourg National Research Fund (13307816).

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).




Abstract. Digitally available repositories are becoming not only more and more widespread but also larger and larger. Although there are both digitally-born collections and digitised material, the digital heritage scholar is typically confronted with the latter. This immediately presents new challenges, one of the most urgent being how to find the meaningful elements hidden underneath such an unprecedented mass of digital data. One way to respond to this challenge is to contextually enrich the digital material, for example through deep learning. Using the enrichment of the digital heritage collection ChroniclItaly 3.0 [10] as a concrete example, this article discusses the complexities of this process. Specifically, combining statistical and critical evaluation, it describes the gains and losses resulting from the decisions made by the researcher at each step, and it shows how, in the passage from digitised sources to enriched material, much is gained (e.g., preservation, wider and enhanced access, more material) but some is also lost (e.g., the original layout and composition, information removed by pre-processing steps). The article concludes that it is only through a critical approach that the digital heritage scholar can successfully meet the interpretive challenges presented by the digital, and the digital heritage sector can fulfil the second most important purpose of digitisation, that is, to enhance access.

                           Keywords: Digital heritage · Contextual enrichment · Deep learning ·
                           Digital humanities


                  1      Introduction
As digitally available repositories are becoming larger and larger, finding the meaningful elements hidden within such an unprecedented mass of digital data is increasingly challenging. Moreover, if the collection is digitised (as





opposed to being born digital), then further complexity is added to the picture, typically due to the often low quality of the digitised source (e.g., OCR mistakes, markings on the pages, low readability of unusual fonts, or the generally poor condition of the original text). One way to respond to the pressing need to enhance access and make collections retrievable in valuable ways is to enrich the digital material with contextual information, for example by using deep learning. This process, however promising and continually advancing, is not free from challenges of its own, which, to be tackled effectively, require technical expertise and, more importantly, full critical engagement.
    This article uses the enrichment of the digital heritage collection ChroniclItaly 3.0 [10] as a concrete example to discuss such challenges. ChroniclItaly 3.0 is a corpus of Italian American immigrant newspapers published between 1898 and 1936. Specifically, combining statistical and critical evaluation, the article describes the gains and losses resulting from the decisions and interventions made by the researcher at every step of the enrichment process, from the so-called pre-processing steps (tokenization, lowercasing, stemming, lemmatization, removing stopwords, removing noise) to the enrichment itself (Named Entity Recognition - NER, geo-coding, Entity Sentiment Analysis - ESA). The enrichment process presented in this article was carried out in the context of a larger project, DeepTextMiner (DeXTER), which was supported by the Luxembourg Centre for Contemporary and Digital History (C2DH - University of Luxembourg) Thinkering Grant. DeXTER aims to offer a methodological contribution to digital humanities (DH) by exploring the value of reusing and advancing existing knowledge. Specifically, it experiments with different state-of-the-art NLP and deep learning techniques for both enriching digital data with contextual information (e.g., referential, geographical, emotional, topical, relational) and visualising it. The long-term goal is to devise a generalisable, interoperable workflow that could assist researchers in both these tasks.
    The discussion will show how the passage from digitised sources to enriched material, while aiming to make collections more engaging and valuable on the whole through preservation and wider and enhanced access, is not free from disadvantages, such as the loss of the original layout and structure, the loss of information due to pre-processing steps, and the introduction of new errors. The article ultimately shows that it is only through an active, critical engagement with the digital sources that the digital heritage scholar can successfully meet the interpretive challenges presented by the digital, and the digital heritage sector can fulfil the second most important purpose of digitisation after preservation, that is, to enhance access.


2    Digital source and methodology

This article describes the technical interventions and critical choices made towards enriching the digital heritage collection ChroniclItaly 3.0 with contextual information (i.e., referential entities, geo-coding information, sentiment). By means of statistical and critical evaluation, it aims to show that digitally enabled research is highly dependent on the critical engagement of the scholar, who is never 'completely removed' from the computational. From pre-processing the digital material to the enrichment itself, the paper documents the decisions and interventions made by the researcher at every step of the process and how these affected the material and inevitably impacted the final output, in turn affecting access, analysis and reuse of the collection itself.


2.1     ChroniclItaly 3.0

ChroniclItaly 3.0 [10] is a corpus of Italian immigrant newspapers published in the United States between 1898 and 1936. The corpus includes the front pages of the following titles: L'Italia, Cronaca Sovversiva, Il Patriota, La Libera Parola, La Rassegna, La Ragione, L'Indipendente, La Sentinella, La Sentinella del West, and La Tribuna del Connecticut, for a total of 8,653 issues. The collection includes further issues as well as three additional titles compared with its two previous versions, ChroniclItaly [8] and ChroniclItaly 2.0 [9]. Like its previous versions, ChroniclItaly 3.0 has been machine-harvested from Chronicling America, an Internet-based U.S. directory1 of digitised historical newspapers published in the United States from 1789 to 1963. The corpus features prominenti (mainstream), sovversivi (anarchist), and independent newspapers,2 thus providing a very nuanced picture of the Italian immigrant community in the United States at the turn of the twentieth century.
    Immigrant newspapers were continually fighting against the risk of bankruptcy, and owners were often forced to discontinue their titles; for this reason, some titles such as L'Italia - one of the most mainstream Italian immigrant publications in the U.S. at the time - managed to last for years, while others like La Rassegna or La Ragione survived only a few months. This is reflected in the composition of the collection, which therefore presents gaps across titles (Figure 1). Also due to the newspapers' economic struggles, the number of issues varies greatly across titles, with some titles publishing thousands of issues and others only a few hundred. The overall coverage of issues is nonetheless relatively evenly distributed across the whole period, and titles of different orientations co-exist at different points in time, thus ensuring that a degree of balance is kept throughout the collection. Users should however take into account factors such as the over- or under-representation of some titles, the potential polarisation of topics, etc. when engaging with the resource.


1 https://chroniclingamerica.loc.gov/
2 For further information about the classification of the titles based on their political orientation, please refer to [11].


Fig. 1. Distribution of issues within ChroniclItaly 3.0 per title. Red lines indicate at least one issue in a three-month period.


2.2   Methodology

This paper describes the process of contextually enriching the digital heritage collection ChroniclItaly 3.0, from the so-called pre-processing steps to the enrichment itself. Combining statistics and critical reflection, the aim is to provide an


insider perspective on the technical and analytical challenges posed by the task of enriching digital heritage material. The discussion is divided into two parts: the first discusses the pre-processing steps, which include tokenization and the removal of words with fewer than two characters, of numbers and dates, and of special characters; the second is concerned with the enrichment itself (NER, geo-coding, ESA).


3    Towards enrichment: Pre-processing
Before working towards enriching a digital (textual) collection, and indeed before performing any Natural Language Processing (NLP) task, the digital material needs to be prepared. It is all too often assumed that this operation, also ambiguously referred to as cleaning, is not part of the enrichment process and can therefore be fully automatic, unsupervised, and launched as a one-step pipeline. In this section, we show that it is, on the contrary, paramount that this phase be tackled critically, as each action taken will have consequences for the material, for how the algorithms will process it, and therefore for the output and, finally, for how the collection will be accessed and its information retrieved and interpreted.
    Deciding which pre-processing operations to perform, and how, depends on many factors, such as the language of the data-set, the type of material, and the specific enrichment task, to name but a few. Typical operations considered part of this step are tokenization, lowercasing, stemming, lemmatization, removing stopwords, and removing noise (e.g., numbers, punctuation marks, special characters). In principle, all these interventions are optional, as the algorithms will process whichever version of the data-set is used. In reality, however, pre-processing the digital material is key to the subsequent operations for several reasons. First and foremost, pre-processing the data will remove most OCR mistakes, which are present in digital textual collections to varying degrees. This is especially true for corpora such as historical collections, repositories of under-documented languages, or archives digitised from handwritten texts. Second, it will reduce the size of the collection, thus decreasing the required processing power and time. Third, it is de facto a data exploration step which allows the digital heritage scholar to look more closely at the material.
    It is important to remember that each of these steps is an additional layer of manipulation and has direct, heavy consequences for the material and therefore for the subsequent operations. It is critical that digital scholars carefully assess to what degree they want to intervene on the material and how. For this reason, this part of the process of contextual enrichment should not be considered separate from the enrichment itself; on the contrary, it is an integral part of the entire process.
    The specific pre-processing actions taken towards enriching ChroniclItaly 3.0 were: tokenization; removal of numbers and dates; and removal of words with fewer than two characters and of special characters. Numbers and dates were removed because they are not only typically considered irrelevant to NER and ESA, but may sometimes even interfere with the algorithms' performance, thus potentially worsening the quality of the output. The last two operations were performed because it was found that special and isolated characters (characters delimited by spaces) were in fact OCR mistakes.
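    To make these operations concrete, the snippet below gives a minimal sketch of such a token-level filter in Python. It is illustrative only: the regular expressions and the two-character threshold are assumptions rather than the exact rules used in the project, and sentence-delimiting punctuation is deliberately kept because it is needed later for ESA (see Section 6).

    import re

    DATE_OR_NUMBER = re.compile(r"\d+([/.-]\d+)*")   # e.g. '1915', '12/05/1915'
    SPECIAL = re.compile(r"[^A-Za-zÀ-ÿ'.;:!?]")      # keep letters, apostrophes,
                                                     # and sentence delimiters

    def preprocess(text: str) -> list[str]:
        cleaned = []
        for token in text.split():          # naive whitespace tokenization
            if DATE_OR_NUMBER.fullmatch(token):
                continue                    # drop numbers and dates
            token = SPECIAL.sub("", token)  # strip special characters
            if len(token) < 2:
                continue                    # drop isolated characters (often OCR noise)
            cleaned.append(token)
        return cleaned

    print(preprocess("Il 12/05/1915 l'Italia dichiarò guerra § 3"))
    # ['Il', "l'Italia", 'dichiarò', 'guerra']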
    Other actions included merging words wrongfully separated by a newline, a white space or punctuation. Once this step was performed, the collection totalled 21,454,455 words. Figure 2 shows how this step impacted each title per year in terms of the percentage of material removed and preserved, while Figure 3 displays these percentages aggregated for each title.
    As the figures show - particularly Figure 3 - on average 30% of the material was removed from each title after this step. This means that, if on the one hand the step was likely to result in more reliable and interpretable material, on the other it came at the expense of potentially important information. In this respect, the challenge lies not so much in performing the pre-processing itself as in assessing which operations minimise the loss of potentially useful information and maximise the enhancement of the resource. As the technology is still not perfect, digital heritage scholars and institutions must respond to this challenge by carefully pondering the pros and cons of enriching digital collections and duly warning users about the interventions performed.
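    For completeness, the bookkeeping behind Figures 2 and 3 can be sketched as follows; the word counts used here are invented for illustration.

    # Hypothetical word counts per title before and after pre-processing,
    # from which the percentages of preserved and removed material derive.
    counts = {"L'Italia": (9_500_000, 6_650_000),
              "La Rassegna": (120_000, 84_000)}

    for title, (before, after) in counts.items():
        removed = 100 * (before - after) / before
        print(f"{title}: {100 - removed:.0f}% preserved, {removed:.0f}% removed")
    # L'Italia: 70% preserved, 30% removed
    # La Rassegna: 70% preserved, 30% removed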

[Figure: stacked bars showing, per title (Cronaca Sovversiva, L'Indipendente, L'Italia, La Libera Parola, Il Patriota, La Ragione, La Rassegna, La Sentinella, La Sentinella del West, La Tribuna del Connecticut) and per year (ca. 1900-1930), the percentage of material preserved vs. not preserved.]

Fig. 2. Impact of pre-processing ChroniclItaly 3.0 per title per year.

    As for the specific pre-processing operations to perform, we chose not to remove stopwords and punctuation, even though they are typically considered not semantically salient and even detrimental to model performance. The choice was motivated by the enrichment actions to follow, namely NER, geo-coding, and ESA: prepositions and articles are often part of the names of locations, organisations and persons, while punctuation is a typical sentence delimiter, that is, it marks sentence boundaries, which are indispensable for performing sentiment analysis.
Therefore, removing such material from the collection could have had a negative impact on the quality of the enrichment. We also decided not to lowercase the material. Lowercasing can be a double-edged sword; for instance, if lowercasing is not performed, the algorithm will treat 'USA', 'Usa', 'usa', 'UsA', 'uSA', etc. as distinct tokens, even though they may all refer to the same entity. On the other hand, once the material is lowercased, it may become difficult for the algorithm to recognise entities, thus outputting many false negatives, and for the scholar to distinguish between homonyms, thus potentially skewing the output. As entities such as persons, locations and organisations are typically capitalised, we decided not to perform lowercasing in preparation for NER and geo-coding. Once these steps were completed, however, the entities were lowercased so that the issue of multiple items referring to the same entity (e.g., 'USA' and 'Usa') could be overcome (cf. Section 5).
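    As an illustration of this post-NER normalisation, a minimal sketch is given below; the tagged pairs are invented for the example.

    from collections import Counter

    # Hypothetical NER output: (surface form, tag) pairs in which several
    # surface forms refer to the same entity only because of casing.
    tagged = [("USA", "GPE"), ("Usa", "GPE"), ("usa", "GPE"),
              ("New York", "GPE"), ("NEW YORK", "GPE")]

    # Lowercase only after NER and geo-coding, so that casing can still help
    # the tagger; case variants of the same entity are then merged.
    merged = Counter((form.lower(), tag) for form, tag in tagged)
    print(merged)
    # Counter({('usa', 'GPE'): 3, ('new york', 'GPE'): 2})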



[Figure: horizontal stacked bars per title showing the percentage of material preserved vs. removed: La Tribuna del Connecticut 68%/32%, La Sentinella del West 69%/31%, La Sentinella 71%/29%, La Rassegna 70%/30%, La Ragione 68%/32%, La Libera Parola 70%/30%, L'Italia 70%/30%, L'Indipendente 68%/32%, Il Patriota 70%/30%, Cronaca Sovversiva 71%/29%.]

Fig. 3. Impact of pre-processing on ChroniclItaly 3.0 per title.


4    Enrichment: Named Entity Recognition

The growing digitisation of material and immaterial cultural heritage has proportionally increased the relevance of Artificial Intelligence (AI) for researchers


in the humanities as well as for libraries and cultural heritage institutions [3, 13]. AI offers users and researchers methods to analyse, explore and make sense of the different layers of information sometimes hidden in large quantities of material. For instance, one effective application of AI is the possibility of enriching the digital material with data that allow for in-depth cultural analyses. One example of such text enrichment is Named Entity Recognition (NER), that is, the use of contextual information to identify referential entities such as names of persons, locations and organisations.
    NER methods have undergone major changes over the years, and it has been repeatedly shown that machine learning algorithms based on neural networks outperform all previous methods. For this reason, the NER enrichment of ChroniclItaly 3.0 was performed using a deep learning sequence tagging tool implemented in TensorFlow [5]. The algorithm combines Bidirectional Long Short-Term Memory (BiLSTM) and Conditional Random Fields (CRF) as a top layer with character embeddings, a combination found to outperform CRFs with hand-coded features. Methodologically, the character embeddings are trained together with pre-trained word embeddings while training the model itself. The character- and subword-based word embeddings are computed with FastText [1], as it was found that retrieving embeddings for unknown words through the incorporation of subword information significantly alleviates issues with out-of-vocabulary words.
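    The advantage of subword information can be illustrated with the fasttext library; the file names below are assumptions (any FastText binary for Italian would do), while load_model, train_unsupervised and get_word_vector are the library's standard calls.

    import fasttext

    # Train or load subword-aware embeddings for Italian; the corpus file and
    # model name are assumptions (the project used Italian Wikipedia, dim=300).
    # model = fasttext.train_unsupervised("itwiki.txt", model="skipgram", dim=300)
    model = fasttext.load_model("it_wikipedia_300.bin")

    # Even for an OCR-damaged word absent from the training vocabulary,
    # FastText composes a vector from its character n-grams, alleviating
    # the out-of-vocabulary problem discussed above.
    vector = model.get_word_vector("Connecticutt")  # note the OCR-style typo
    print(vector.shape)  # (300,)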
    The sequence tagging model for the Italian language was trained on I-CAB3 (Italian Content Annotation Bank), an open access corpus annotated for entities (i.e., persons-PER, organisations-ORG, locations-LOC, and geopolitical entities-GPE), temporal expressions, and relations between entities. I-CAB contains 525 news articles taken from the Italian newspaper L'Adige and totals around 180,000 words. Embeddings were computed on Italian Wikipedia and trained using FastText with 300 dimensions. The resulting scores for the Italian model are: accuracy 98.15%; precision 83.64%; recall 82.14%; F1 82.88. Figure 4 shows the output tags of the NER algorithm, Figure 5 shows the accuracy scores per entity, and Figure 6 shows the final output of the sequence tagger on ChroniclItaly 3.0.

3 http://ontotext.fbk.eu/icab.html




                     Fig. 4. Output tags of the NER algorithm.




     Fig. 5. Accuracy scores of the NER algorithm for Italian models per entity.




             Fig. 6. Output of the NER algorithm on ChroniclItaly 3.0.


    The NER algorithm retrieved 547,667 entities, which occurred 1,296,318 times across the ten titles. A close analysis of the entities, however, revealed a number of issues which required a manipulation of the output. These issues included: entities that had been assigned the wrong tag (e.g., New York - PER), multiple items referring to the same entity (e.g., Woodrow Wilson, President Woodrow Wilson), and elements wrongfully tagged as entities (e.g., venerdì 'Friday' - ORG). Therefore, a list of these exceptions was compiled and the results adjusted accordingly. Once the data were modified, the data-set counted 521,954 unique entities, which occurred 1,205,880 times. Figure 7 shows how the intervention affected the distribution of entities across the four categories - geopolitical, persons, locations, organisations - per title.
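    A minimal sketch of how such an exceptions list can be applied is given below; the entries are invented for the example, whereas the project's actual list was compiled through careful historical triangulation.

    # Hypothetical exceptions list: surface form -> corrected tag (None = drop).
    EXCEPTIONS = {
        "new york": "GPE",   # wrongly tagged PER by the model
        "venerdì": None,     # 'Friday': not an entity, wrongly tagged ORG
    }
    # Hypothetical aliases mapping variants to a canonical entity.
    ALIASES = {"president woodrow wilson": "woodrow wilson"}

    def adjust(entities):
        """Apply the exceptions list to (surface form, tag) pairs."""
        for form, tag in entities:
            key = form.lower()
            if key in EXCEPTIONS:
                tag = EXCEPTIONS[key]
                if tag is None:          # wrongly tagged as an entity: discard
                    continue
            yield ALIASES.get(key, key), tag

    print(list(adjust([("New York", "PER"), ("venerdì", "ORG"),
                       ("President Woodrow Wilson", "PER")])))
    # [('new york', 'GPE'), ('woodrow wilson', 'PER')]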
    As can be seen, the redistribution of entities varied across categories and titles, in some cases dramatically. For instance, in La Rassegna the number of entities in the LOC category significantly decreased, whereas in L'Italia it increased. This action required a combination of expert knowledge and technical ability, as the entities had to be carefully analysed and historically triangulated in order to make informed decisions on how to intervene on the output without introducing errors. Although time-consuming and in principle optional, this critical evaluation significantly improved the accuracy of the tags, thus increasing the overall quality of the NER output in preparation for the following stages of the enrichment, namely geo-coding and ESA, and ultimately adding more value to the user's experience for access and reuse. It is therefore a highly recommended operation.



[Figure: bar panels per title (Cronaca Sovversiva, Il Patriota, L'Indipendente, L'Italia, La Libera Parola, La Ragione, La Rassegna, La Sentinella, La Sentinella del West, La Tribuna del Connecticut) showing, for each entity type (GPE, LOC, ORG, PER), the percentage change in the number of entities (roughly -0.2 to 0.4).]

Fig. 7. Distribution of entities per title after intervention. Positive bars indicate a decreased number of entities after the process, negative bars indicate an increased number.

5    Enrichment: Geo-coding
The relevance of geo-coding for digital heritage collections lies in what has been referred to as the Spatial turn, the study of space and place [4] as distinct entities: place is seen as created through social experiences, and can therefore be both real and imagined, whereas space is essentially geographic. Spatial humanities is a recently emerged interdisciplinary field within digital humanities which, following the Spatial turn, focuses on geographic and conceptual space, particularly from a historical perspective. Using Geographic Information Systems (GIS), locations in data-sets are geo-coded and displayed on a map. Especially in the case of large collections with hundreds of thousands of geographical entities, such visualisation is believed to help scholars access the different layers of information that may lie behind geo-references. Indeed, this process of cross-referencing geo-locations with other types of data (e.g., historical, social, temporal) has provided researchers working in fields such as environmental history, historical demography, and economic, urban and medieval history with new insights, leading them to propose alternatives to traditional positions and even to explore new research avenues [12]. The theoretical relevance of using NER to enrich digital heritage collections lies precisely in its great potential for discovering the cultural significance beneath referential units and how that significance may have changed over time.
    Another challenge posed to digital heritage scholars is that of the language of the collection. In the case of ChroniclItaly 3.0, for example, almost all the choices made by the researcher towards enriching the collection, including NER and geo-coding, were conditioned by the fact that its language is not English. The relative lack of appropriate computational resources for languages other than English often dictates which tools and platforms can be used for specific tasks. For geo-coding, for instance, it was found that setting the API language to the language of the data-set improves the accuracy of the geo-coding results [12]. For this reason, the geographical entities in ChroniclItaly 3.0 were geo-coded using the Google Cloud Natural Language API within the Google Cloud Platform Console, which provides a range of NLP technologies in a wide range of languages, including Italian.
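    As an illustration, the sketch below geo-codes a single entity with the language parameter set to Italian. It uses the public Google Geocoding REST endpoint as a stand-in for the exact Google Cloud Platform workflow used in the project, and YOUR_API_KEY is a placeholder.

    import requests

    params = {
        "address": "Firenze",
        "language": "it",   # set the API language to the language of the data-set
        "key": "YOUR_API_KEY",
    }
    response = requests.get(
        "https://maps.googleapis.com/maps/api/geocode/json",
        params=params, timeout=30)
    result = response.json()["results"][0]
    print(result["formatted_address"], result["geometry"]["location"])
    # e.g. 'Firenze FI, Italia' {'lat': 43.76..., 'lng': 11.25...}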
    Moreover, when enriching collections in languages other than English, digital heritage scholars often find themselves confronted with the additional challenge of having to choose between training their own models or using existing ones. While the former scenario may not always be an option, due for instance to the scholar's lack of specific training or to time or budget limits, the latter alternative may also not be ideal. Even when a model has been trained in the target language (as in the case of ChroniclItaly 3.0), the training will typically have occurred within the context of another project, for different purposes, possibly using data-sets with very different specifics from the one the scholar is enriching. Researchers are then usually forced to sacrifice a potentially much higher quality and/or accuracy of their results in the interest of time, money or both. For example, as shown in Figure 5, although the overall F1 score of the NER algorithm for the Italian model was satisfactory (82.88), the individual performance for the entity LOC was rather poor (54.19). This may depend on several factors (e.g., a lack of this type of location in the data-set used for training the model, different locations tagged as LOC) which are related to the wider challenge of accessing already available trained models in the desired language. Because of the low score for the category LOC, we decided to geo-code only GPE entities. Though not optimal, the decision was also made considering that GPE entities are generally more informative, as they typically refer to countries and cities (though the process was found to retrieve counties and states as well), while LOC entities are typically rivers, lakes, and geographical areas (e.g., the Pacific Ocean). Future work on the collection could focus on performing NER with a more fine-tuned algorithm and geo-coding the LOC-type entities.
    In total, 2,160 GPE-type entities were geo-coded; these are referred to 283,879 times throughout the whole corpus. Figure 8 shows the distribution of unique GPE entities per title.


[Figure: horizontal bars of the number of unique GPE entities (0 to 1,500) per title.]

Fig. 8. Distribution of unique GPE entities per title.




6        Enrichment: Entity Sentiment Analysis

Another enriching technique used to add value to digital heritage collections is Entity Sentiment Analysis (ESA), a way to identify the prevailing emotional attitude of the writer towards referential entities. While Sentiment Analysis (SA) identifies the prevailing emotional opinion within a given text, ESA adds a more targeted and sophisticated quality to the enrichment, as it allows users to identify the layers of meaning humans have historically attached to people, organisations, and geographical spaces. Understanding the meaning humans invested in such entities and, in the case of historical collections such as ChroniclItaly 3.0, how that meaning may have changed over time provides digital heritage scholars with powerful means to access part of the collective narratives of fear, pride, longing, and loss. ESA enables us to connect these subjective connotations to the named entities extracted from digitised texts [12, 2, 6, 7], thus offering a more engaging and complete experience of the information as well as making it possible to tackle new questions or revise old assumptions. For example, the ESA of ChroniclItaly 3.0 may allow researchers to investigate how immigrants made sense of their diasporic identities within the host community and how their relationship with the homeland may have changed over time.
    The process of performing ESA on the collection required several steps (a minimal sketch follows the list):

 – Identify the sentence delimiters (i.e., full stop, semicolon, colon, exclamation mark, question mark) and divide the textual material accordingly. At the end of this step, 677,030 sentences were obtained;
 – Select the most frequent entities for each category and in each title. As each title differs in size, we used a logarithmic function (2*log2) to obtain a more representative number of entities per title. At the end of this step, 228 entities were obtained, distributed across titles as shown in Figure 9;
 – Select only the sentences that contain the entities identified in the previous step. This step was done to limit the number of API requests and to reduce processing time and costs. The selection returned 133,281 sentences, distributed across titles as shown in Figure 10;
 – Perform ESA on the sentences so selected. When the study was carried out, no suitable SA models trained on Italian were found; this step was therefore also performed using the Google Cloud Platform Console.
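    The sketch below illustrates these steps end to end. The sentence-splitting rule and the argument of the 2*log2 formula (here, the number of sentences in a title) are assumptions made for this sketch, while analyze_entity_sentiment is the standard call of the Google Cloud Natural Language Python client.

    import math
    import re
    from collections import Counter
    from google.cloud import language_v1

    def esa_pipeline(title_text: str, entity_counts: Counter):
        # 1. Split into sentences on the delimiters listed above.
        sentences = [s.strip()
                     for s in re.split(r"[.;:!?]", title_text) if s.strip()]

        # 2. Select the most frequent entities; the argument of 2*log2 is an
        #    assumption made for this sketch.
        n = max(1, int(2 * math.log2(max(2, len(sentences)))))
        top_entities = [e for e, _ in entity_counts.most_common(n)]

        # 3. Keep only sentences mentioning a selected entity,
        #    limiting the number of API requests.
        selected = [s for s in sentences
                    if any(e in s.lower() for e in top_entities)]

        # 4. Run Entity Sentiment Analysis on each selected sentence.
        client = language_v1.LanguageServiceClient()
        for sentence in selected:
            document = {"content": sentence,
                        "type_": language_v1.Document.Type.PLAIN_TEXT,
                        "language": "it"}
            response = client.analyze_entity_sentiment(
                request={"document": document})
            for entity in response.entities:
                print(sentence, entity.name, entity.sentiment.score)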



7    Conclusions

This article discussed the technical and analytical challenges posed by the task of enriching a digital heritage collection, using the enrichment of ChroniclItaly 3.0 [10] as a concrete example. Specifically, the article combined statistical analysis and critical evaluation to describe the impact that the decisions and interventions made at every step of the enrichment process had on the collection. Through this discussion, the aim was to show that, paradoxically, enriching a digital heritage collection means sacrificing content. An active, critical engagement on the part of the digital heritage scholar is therefore absolutely necessary to minimise such losses and to ensure that the enrichment actions truly increase the value of the collections, making the users' experience more engaging and thus widening and enhancing access.


[Figure: horizontal bars of the number of selected entities (0 to 80) per title.]

Fig. 9. Distribution of selected entities for ESA across titles.


    The enrichment actions discussed here are part of a wider process of enrichment within DeXTER. Additional enrichment operations include topic modelling and network analysis, which were not discussed here due to word count limitations. Though this account is not exhaustive, we hope to have shown that it is only through a continuous, critical engagement with the digital sources that the digital heritage scholar can successfully meet the challenges presented by the digital and fulfil the main purposes of digitisation: preservation and knowledge access.

References
 1. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5, 135–146 (2017)
 2. Donaldson, C., Gregory, I.N., Taylor, J.E.: Locating the beautiful, picturesque, sublime and majestic: spatially analysing the application of aesthetic terminology in descriptions of the English Lake District. Journal of Historical Geography 56, 43–60 (2017), doi: 10.1016/j.jhg.2017.01.006
 3. Fiorucci, M., Khoroshiltseva, M., Pontil, M., Traviglia, A., Del Bue, A., James, S.: Machine learning for cultural heritage: A survey. Pattern Recognition Letters 133, 102–108 (2020), doi: 10.1016/j.patrec.2020.02.017
 4. Murrieta-Flores, P., Martins, B.: The geospatial humanities: past, present and future. International Journal of Geographical Information Science 33(12), 2424–2429 (2019), doi: 10.1080/13658816.2019.1645336
 5. Riedl, M., Padó, S.: A named entity recognition shootout for German. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). pp. 120–125 (2018), doi: 10.18653/v1/P18-2020
[Figure: horizontal bars of the ratio (0.0 to 0.2) of sentences containing selected entities per title.]

Fig. 10. Ratio of sentences containing selected entities across titles.


 6. Tally Jr, R.T.: Geocritical explorations: Space, place, and mapping in literary and
    cultural studies. Springer (2011), doi: 10.1057/9780230337930
 7. Taylor, J.E., Donaldson, C.E., Gregory, I.N., Butler, J.O.: Mapping digitally, map-
    ping deep: exploring digital literary geographies. Literary Geographies 4(1), 10–19
    (2018)
 8. Viola, L.: ChroniclItaly. A corpus of Italian language newspapers published in the United States between 1898 and 1922. Utrecht University (2018), doi: 10.24416/UU01-T4YMOW
 9. Viola, L.: ChroniclItaly 2.0. A corpus of Italian American newspapers annotated
    for entities, 1898-1920. Utrecht University (2019), doi: 10.24416/UU01-4MECRO
10. Viola, L.: ChroniclItaly 3.0. A contextually enriched digital heritage collection of
    Italian immigrant newspapers published in the USA, 1898-1936 (In press)
11. Viola, L., Verheul, J.: Mining ethnicity: Discourse-driven topic modelling of immigrant discourses in the USA, 1898–1920. Digital Scholarship in the Humanities 35(4), 921–943 (2019), doi: 10.1093/llc/fqz068
12. Viola, L., Verheul, J.: Machine learning to geographically enrich understudied
    sources: A conceptual approach. In: Proceedings of the 12th International Con-
    ference on Agents and Artificial Intelligence-Volume 1: ARTIDIGH. pp. 469–475.
    SCITEPRESS (2020), doi: 10.5220/0009094204690475
13. Weber, A., Ameryan, M., Wolstencroft, K., Stork, L., Heerlien, M., Schomaker, L.:
    Towards a digital infrastructure for illustrated handwritten archives. In: Digital
    Cultural Heritage, pp. 155–166. Springer (2018)