                                Combining Automatic Annotation with Human
                                Validation for the Semantic Enrichment of Cultural
                                Heritage Metadata
                                Eirini Kaldeli, Alexandros Chortaras, Vassilis Lyberatos, Jason Liartis,
                                Spyridon Kantarelis and Giorgos Stamou
                                AI and Learning Systems Lab, School of Electrical and Computer Engineering, National Technical University of
                                Athens, Greece


                                            Abstract
                                            The addition of controlled terms from linked open datasets and vocabularies to metadata can increase
                                            the discoverability and accessibility of digital collections. However, the task of semantic enrichment
                                            requires a lot of effort and resources that cultural heritage organizations often lack. State-of-the-art AI
                                            technologies can be employed to analyse textual metadata and match it with external semantic resources.
                                            Depending on the data characteristics and the objective of the enrichment, different approaches may
                                            need to be combined to achieve high-quality results. What is more, human inspection and validation
                                            of the automatic annotations should be an integral part of the overall enrichment methodology. In the
                                            current paper, we present a methodology and supporting digital platform, which combines a suite of
                                            automatic annotation tools with human validation for the enrichment of cultural heritage metadata
                                            within the European data space for cultural heritage. The methodology and platform have been applied
                                            and evaluated on a set of datasets on crafts heritage, leading to the publication of more than 133K
                                            enriched records to the Europeana platform. A statistical analysis of the achieved results is performed,
                                            which allows us to draw some interesting insights as to the appropriateness of annotation approaches in
                                            different contexts. The process also led to the creation of an openly available annotated dataset, which
                                            can be useful for the in-domain adaptation of ML-based enrichment tools.

                                            Keywords
                                            semantic enrichment, cultural heritage metadata, named entity recognition and disambiguation




                                1. Introduction
                                Semantic enrichment is the process of adding new semantics to unstructured data, such as free
                                text, so that machines can make sense of it and build connections to it. In the case of the meta-
                                data that describes Cultural Heritage (CH) items, unstructured data comes in the form of free
                                text that details several aspects of the item, for example its main characteristics, its location,
                                creator, etc. Through the process of semantic enrichment, those textual descriptions are an-
                                alyzed and augmented with controlled terms from Linked Open datasets, such as Wikidata1
                                and Geonames2 , or controlled vocabularies, such as the Getty Art & Architecture Thesaurus3
                                CHR 2024: Computational Humanities Research Conference, December 4–6, 2024, Aarhus, Denmark
                                Email: ekaldeli@image.ece.ntua.gr (E. Kaldeli)
                                ORCID: 0000-0001-7045-2588 (E. Kaldeli)
                                          © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
                                1
                                  https://www.wikidata.org/
                                2
                                  https://www.geonames.org/
                                3
                                  https://vocab.getty.edu/aat/




CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
(AAT). Those terms represent concepts and attributes (e.g. “costume”, “Renaissance”, colors),
named entities, such as persons, locations, and organisations, or chronological periods. For
example, the strings “Leonardo da Vinci” and “da Vinci, Leonardo” can be both linked to the
Wikidata term representing the Italian Renaissance polymath. This additional piece of infor-
mation associated with a CH resource is commonly referred to as an annotation, which links
the CH object with some URI (Uniform Resource Identifier) derived from vocabularies or open
data sources.
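To make this concrete, an annotation can be viewed as a small record tying a matched surface form in a metadata field to a URI. The sketch below is a simplified illustration, not SAGE's or Europeana's actual data model; Q762 is the Wikidata identifier of Leonardo da Vinci.

```python
# Illustrative only: a simplified annotation structure, not the actual
# SAGE/Europeana data model. Q762 is Wikidata's Leonardo da Vinci entity.
WD = "http://www.wikidata.org/entity/"

def make_annotation(record_id, field, matched_text, uri, confidence):
    return {
        "target": {"record": record_id, "field": field, "text": matched_text},
        "body": uri,              # the linked open-data URI
        "confidence": confidence, # score assigned by the automatic tool
    }

# Two different surface forms resolving to the same entity:
a1 = make_annotation("rec-1", "dc:creator", "Leonardo da Vinci", WD + "Q762", 0.93)
a2 = make_annotation("rec-2", "dc:creator", "da Vinci, Leonardo", WD + "Q762", 0.88)
same_entity = a1["body"] == a2["body"]
```

Comparing on the URI rather than the string is what makes both surface forms resolve to the same entity.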
   Semantic enrichment adds meaning and context to digital collections and makes them more
easily discoverable. Given its importance, it has been a main concern and focus of efforts by
the Europeana digital library4 as well as individual data aggregators and providers. Firstly,
linked data makes the meaning of textual metadata unambiguous [25]. For example, the string
“Leonardo da Vinci” may refer, depending on the context, to the Italian Renaissance polymath
or the homonymous airport in Fiumicino, Italy, or a battleship with the same name. By linking
the text with the correct URI, it becomes clear what the text refers to. Secondly, linked data
allows us to retrieve additional information about a certain entity in an automated way, build
connections between different resources and contextualize them [9]. For example, it allows
us to link items tagged with the term “ring” with the broader concept of “jewelry” and, thus,
interconnect them with items enriched with the term “bracelet”, which is also an instance of
“jewelry”. Moreover, linked data usually comes with translated labels, thus improving the
capabilities for multilingual search [10, 12].
   Semantic enrichment is a labour-intensive process, which requires effort and resources that
CH institutions often lack. State-of-the-art AI technologies can be employed to automate the
time-consuming and often mundane process of manual metadata enrichment. Natural lan-
guage processing (NLP) tools can be used to analyse textual metadata and detect and classify
concepts or named entities mentioned in unstructured text. Machine Learning (ML) approaches
are extensively used for the task of disambiguation, i.e. for deciding whether a mention of
“Leonardo da Vinci” in the text refers to the Italian polymath or to the battleship.
However, the accuracy of the automatic results hinges heavily on the specific task at hand
vis-à-vis the algorithm applied. For example, short textual descriptions, which are common in
CH metadata, lack context and thus ML algorithms trained on Wikipedia articles may result
in many incorrect matches. For similar reasons, they may often miss domain-specific matches
that are relevant in the specific CH context. What’s more, even if the automatically detected
links are correct, they may be considered undesirable for a certain case study. For example,
linking metadata records with terms representing colours may be important for a fashion col-
lection, but it may be undesirable for describing a manuscript that happens to mention a certain
colour.
   As a result, depending on a number of factors, such as the text characteristics (e.g. its length
and language), the vocabulary that we wish to link it to, and the type of entities to detect (e.g. do
we wish to identify a broad variety of concepts or to limit ourselves to certain domain-specific
terms?), a different combination of tools and steps is required to achieve the best possible re-
sults for each specific task. For example, for certain tasks with a well-defined restricted context,
a simple lemmatisation and string matching approach may be more appropriate than complex

4
    https://www.europeana.eu




ML-based algorithms. Besides the need for flexibility in combining and experimenting with dif-
ferent approaches and tools, another crucial aspect that needs to be considered is the need to
make human inspection and validation an integral part of the end-to-end semantic enrichment
workflow [13]. Given that manual validation is a resource-consuming task, in practice evaluation
focuses on an appropriately selected sample of the automatic annotations; depending
on the collected feedback and the objective, appropriate filtering criteria are then applied.
   To address the aforementioned challenges, in this paper, we define, implement, and test a
methodology and associated digital platform, called SAGE5 , which combines automatic anno-
tation tools with human validation for the enrichment of CH items at scale. SAGE is an open
source tool6 that streamlines and facilitates the whole workflow of semantic enrichment, from
data import and the automatic production of semantic annotations to human validation and
data publication. The platform has been configured to serve the needs of the cultural sector
and supports seamless interoperability with the common European data space for CH7 and in
particular with Europeana.
   The methodology and platform have been applied to enrich the metadata records from
datasets on various aspects of crafts heritage (ranging from furniture and jewelry to costumes and clocks)
coming from 8 different CH organisations, including the Fashion Museum Antwerp, the Nether-
lands Institute for Sound and Vision, the Open University of the Netherlands, the Greek Na-
tional Documentation Centre, the Museum of Arts and Crafts in Zagreb, the Palais Galliera and
Mobilier National in Paris, and the Textile Museum of Prato. The rest of the paper is structured
as follows. After discussing related work, we present the steps of the methodology to semantic
enrichment that we followed along with the technical architecture and the supporting SAGE
platform, the evaluation performed and the results achieved. Finally, we conclude the paper
with some general lessons learned.


2. Related Work
State-of-the-art Natural Language Processing and Machine Learning technologies have been
extensively used in the CH domain to analyze unstructured text and extract structured informa-
tion from it. To achieve automated subject indexing, Annif [22] is an open-source multilingual
toolkit by the National Library of Finland that automatically assigns documents with subjects
from a controlled vocabulary. In [1], a topic detection approach is applied to group historical
documents into thematic collections. Additionally, the HerCulB system [23] has been devel-
oped to automatically annotate the Balkans’ intangible CH. Other approaches propose the use
of semi-automatic tools to assist humans in the task of manual annotation by identifying align-
ments between vocabularies, such as CultuurLINK [16].
   Among information retrieval approaches, there have been several attempts to apply Named
Entity Recognition (NER) as well as Disambiguation (NED) in the CH and digital humanities
sectors, considering different types of data. In [11], NERD is applied to enrich metadata for the
5
  https://pro.europeana.eu/post/close-encounters-with-ai-an-interview-on-automatic-semantic-enrichment
6
  Source code: https://github.com/ails-lab/sage-backend and https://github.com/ails-lab/sage-frontend
  Documentation: https://ails-lab.github.io/SAGE_Documentation/ and https://www.youtube.com/playlist?list=PLZhh656xkjIsxMKShH7aV7aR8TAwmU508
7
  https://dataspace-culturalheritage.eu/




exhibits of the Smithsonian Cooper–Hewitt National Design Museum in New York. In [8], an
overview of NER approaches applied to historical documents is provided. An entity matching
approach that works at the level of structured knowledge graphs, aiming to identify duplicate
entities in data sources containing historical data is presented in [2]. In [3], the authors con-
duct a comparative study of different NERD tools on digital archive collections in order to
link English textual metadata to Wikidata entities. In their study, the multilingual NERD tool
mGENRE [6], which we employ in the current study, outperforms other approaches including
BLINK [24] and EDGEL [14]. The need to deal with multilingual text is another important
concern in the CH domain, e.g. named entity recommendation has been explored as a means
to enhance multilingual retrieval on Europeana [10]. In this respect, the multilingual autore-
gressive entity linking approach employed by mGENRE is another advantage of the particular
tool.
   It should also be noted that NERD tools are trained on generic corpora [6, 24], which have
limited overlap with CH-related textual metadata [12]. Adapting these tools to new domains
by fine-tuning them requires large amounts of well-annotated data, with labels that need to
be generated or validated by domain experts, as well as large computational power, time and
funds. These challenges are extensively discussed in [21] for the domain of Digital Humani-
ties. Although domain adaptation of ML models is beyond the scope of the current paper, the
methodology we advocate can lead to the production of high-quality ground truth data with
reduced costs: validators are provided with datasets that have been already automatically an-
notated, an approach that highly facilitates their manual task, which becomes more focused
and less cumbersome. This process allows us to make openly available a selection of appropri-
ately processed annotated metadata from the CH domain (see Section 4), thus contributing to
increasing the availability of annotated metadata that can be used for the in-domain tuning of
NERD tools.
   As the uptake of AI tools is expanding, there is an increasing need for validation and modera-
tion by humans to overcome the errors of the machine and achieve higher quality results [20].
Crowdsourcing methods and tools have been employed by CH organisations in this respect [13]
as a means to mobilise human participants in the evaluation and correction of AI algorithm
outcomes, also leading to the preparation of ground-truth data [15, 12]. For tasks that require
specialised expertise, in [7] a niche-sourcing methodology and tool for the annotation of CH
metadata is proposed, which, similar to our approach, uses an RDF triple store to store the re-
sults. However, as opposed to the current work, the methodology relies solely on manual
selections by experts with no use of automatic annotation tools.
   Overall, our work distinguishes itself from previous work on semantic enrichment mainly
in that it is based on a generic data management approach, which allows the combination of
various annotation tools with flexible parameterisation capabilities (such as the definition of
string matching and filtering rules); in that it includes human validation as an integral part
of its workflow; and in that it supports integrations with other CH-specific data representations
and platforms, making it readily reusable in the CH data space. It should be noted that the
integration with external annotation tools and CH-related platforms is loosely coupled, via in-
teractions with the APIs (Application Programming Interface) and SPARQL endpoints exposed
by the third-party components.




3. Methodology and Technical Architecture
The methodology we followed for the semantic enrichment of CH metadata consists of the
following high-level steps:
      1. A: Data aggregation and requirements analysis
         The first step concerns the preparatory tasks of aggregating the data and specifying the
         requirements for the enrichment (e.g. which metadata fields to analyse, which
         vocabularies to link to, etc.).
      2. B: Automatic metadata enrichment
         The second step involves the automatic analysis of the textual metadata, with the aim to
         derive useful annotations in line with the identified requirements.
      3. C: Human validation
         Humans are solicited to review and validate the automatically generated annotations as
         well as to manually add new annotations, that the automatic algorithm has not been able
         to detect.
      4. D: Filtering and data publication
         The outcomes of the human validation are analysed to establish appropriate thresholds
         for filtering and the filtered annotations are embedded as enrichments to the metadata
         records. The enriched metadata records are ultimately published to the Europeana plat-
         form.
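The four steps can be sketched end to end as follows. The functions are hypothetical stand-ins (the toy annotator matches vocabulary labels verbatim, whereas SAGE combines NERD tools, thesauri, and SPARQL endpoints); the sketch only shows how steps B, C, and D compose.

```python
def annotate(text, vocabulary):
    """Step B (toy stand-in): link vocabulary labels found verbatim in the text."""
    return [{"text": label, "uri": uri, "accepted": None}
            for label, uri in vocabulary.items() if label in text.lower()]

def validate(annotations, judgments):
    """Step C: apply human accept/reject decisions (keyed by URI here)."""
    for a in annotations:
        if a["uri"] in judgments:
            a["accepted"] = judgments[a["uri"]]
    return annotations

def filter_by_precision(annotations, min_precision=0.7):
    """Step D: if the precision observed on the reviewed sample clears the
    threshold, keep everything not explicitly rejected; otherwise keep only
    the explicitly accepted annotations."""
    reviewed = [a for a in annotations if a["accepted"] is not None]
    accepted = [a for a in reviewed if a["accepted"]]
    precision = len(accepted) / len(reviewed) if reviewed else 0.0
    if precision >= min_precision:
        return [a for a in annotations if a["accepted"] is not False]
    return accepted
```

For instance, annotating "A gold ring" against a one-entry vocabulary yields a single annotation, which survives filtering once a validator accepts it.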

  Figure 1 provides an overview of the main digital components that support the above method-
ology.


[Figure 1 diagram: source records enter the semantic analysis and enrichment component, which draws on vocabularies and knowledge-base APIs to produce automatic annotations over EDM records; human validation supplies feedback, and the filtered annotations flow into the data aggregation, mapping, and publication component, which publishes the enriched datasets.]
Figure 1: Architectural Overview


  MINT is a metadata management tool8 that is part of the data space for CH and is used by
several aggregators to prepare and publish their data to Europeana. It acts as the link between
8
    https://mint-wordpress.image.ntua.gr/




SAGE and Europeana and supports steps A and D of the aforementioned methodology, serving
the following purposes: (i) aggregating the metadata records from the data providers and
mapping them to the Europeana Data Model (EDM) [4], before passing them to SAGE; and (ii)
embedding the annotations produced by SAGE, after filtering in light of the human feedback,
into the original metadata records, in line with the EDM extension that accommodates
enrichments9 , and ultimately publishing the results to Europeana. It should be noted that
data already published on the Europeana platform can also be sourced directly by SAGE for
annotation, via a direct interconnection with the Europeana search API10 .

3.1. The SAGE tool for automatic enrichment and validation
SAGE is a web-based platform for generating, enriching, validating, publishing, and searching
RDF data. In the context of our methodology, it is responsible for the core steps B and C.
The RDF data can be produced from heterogeneous data sources and data formats using the
D2RML mapping language [5], and enriched using annotators that wrap web-based or other
third party services. The enrichments can then be manually validated, and finally, the entire
data can be published in an RDF store and indexed. The SAGE platform has been configured to
facilitate the semantic enrichment of CH metadata. In this respect, it offers a suite of already
set-up annotators, i.e. parameterisable enrichment templates, that are connected with relevant
in-domain vocabularies and knowledge bases. It also facilitates the direct import/publication
of metadata from/to platforms of the European data space for CH, including Europeana and
MINT, making use of established APIs and formats.
   A dataset is annotated per property, i.e. the user can select from the schema preview a prop-
erty that links entities to values, and execute an annotator on the values of that property. An
annotator in SAGE is a mediator that retrieves all desired values from the triple store where
the dataset content is published, generates the appropriate calls to the web or other service,
and transforms the results to the RDF annotation specification. As in the case of datasets, the
results of an annotator execution are Terse RDF Triple Language11 files stored in the file system
of SAGE. In the framework of the data space of CH, annotations are also expressed in a JSON-
LD equivalent representation model12 , which is based on the W3C Web Annotation Model13
supported by Europeana. The annotation model is generic enough to accommodate various
enrichment types (e.g. annotations resulting from automatic translation tools, from image
analysis, etc.) and provides sufficient provenance information, including information about the
annotations’ confidence scores and the validation feedback provided by humans. For meta-
data records that are compatible with EDM, the annotations are ultimately embedded in the
metadata in line with the EDM extension that instructs the representation of metadata state-
ments resulting from semantic enrichment14 . This way, the enrichments can be appropriately
9
 https://pro.europeana.eu/files/Europeana_Professional/Share_your_data/Technical_requirements/EDM_profiles/
 EDM_provenance_profile_external_202111.pdf
10
   https://www.europeana.eu/en/apis
11
   https://www.w3.org/TR/turtle/
12
   https://docs.google.com/document/d/1Cq1Qqx0ji7Vw8iwLVis1CfpYKtv-72ojkcvjnQzrKjs/edit?usp=sharing
13
   https://www.w3.org/TR/annotation-model/
14
   https://pro.europeana.eu/files/Europeana_Professional/Share_your_data/Technical_requirements/EDM_profile
   s/EDM_provenance_profile_external_202111.pdf




handled and presented to the end-user by the Europeana platform.
   SAGE supports three main types of annotators, which can be parameterised with respect to
different aspects (e.g. vocabulary, language, preprocessing functions etc) to serve different case
studies:

        • Thesaurus annotators: They link texts to URIs from thesauri that can be imported to the
          platform by performing smart string matching on the thesaurus labels using lemmatiz-
          ers (such as the ones provided by the Stanza library15 ) and other functions to produce
          improved results (e.g. apply dedicated Regex rules). They are appropriate for applica-
          tion both on generic textual fields and on focused short fields. By selecting thesauri that
          represent concepts referring to specific domains (e.g. fashion), it is more likely that the
          extracted terms are relevant to the object in question. Moreover, such annotators can
          perform massive enrichments in a very short time compared to the other annotators,
          since they rely on locally stored data. Figure 3 provides an overview of how a Thesaurus
          Annotator works on a specific example.
        • Generic NERD annotators: They employ pre-trained NERD tools to detect named enti-
          ties and link them to respective entities from Wikidata. SAGE supports two different
          pipelines for generic NERD. The first pipeline makes use of the AIDA tool [18] for entity
          detection and disambiguation. The second pipeline makes use of the spaCy library16 for
          performing the NER part for different languages, i.e. for recognising entities and their
          string boundaries within a sentence, and then of the multilingual mGENRE model [6]
          for the disambiguation stage and for linking with a URI from Wikidata. Such annota-
          tors can be used as they are, with minimal or no configurations and are appropriate for
          general-purpose enrichments. They conduct disambiguation by using the context con-
          tained in longer texts (e.g. description), since they are trained on textual corpora such as
          Wikipedia articles. At the same time, this process is more likely than the other annota-
          tors to link with terms that are too generic or irrelevant in the context of a specific case
          study, while it is hard to infer with sufficient accuracy the type of the extracted entity
          and its relation to the object in question (e.g. whether it represents the item’s creator, a
          place of display etc). As a result, in practice, they often produce more accurate results
          when applied in fields with pre-specified focused semantics.
        • SPARQL Annotators: SPARQL annotators communicate with external knowledge bases
          (such as Wikidata and Geonames) through SPARQL endpoints. Thus, they are the best
          fit when dealing with large knowledge bases that cannot be downloaded locally. They
          can be applied on focused fields that refer to a single entity. The values of such fields
          often follow certain patterns (e.g. “surname, name”, “city/region/country” etc) and, thus,
          pre-processing with Regex is key to the success of the method, so that a normal form of
          the entity name can be extracted. An example of a query that matches Wikidata entities
          with the occupation of a painter is presented in Figure 2.
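The core of a thesaurus annotator — lemmatise tokens, then match them against thesaurus labels — can be sketched as follows. SAGE uses Stanza lemmatizers and configurable Regex rules; here a trivial plural-stripping function stands in for the lemmatizer, and the thesaurus entry is a hypothetical example URI.

```python
import re

def lemmatise(token):
    # Trivial stand-in for a real lemmatizer (SAGE uses Stanza):
    # strip a plural "s" so that "costumes" matches the label "costume".
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def thesaurus_annotate(text, thesaurus):
    """Return (matched_token, uri) pairs for thesaurus labels found in text."""
    tokens = re.findall(r"[a-zA-Z]+", text.lower())
    return [(tok, thesaurus[lemmatise(tok)])
            for tok in tokens if lemmatise(tok) in thesaurus]
```

Because everything runs against locally stored labels, matching of this kind is also why thesaurus annotators can perform massive enrichments quickly.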




15
     https://stanfordnlp.github.io/stanza/
16
     https://spacy.io/




Figure 2: An example of a SPARQL query searching Wikidata. It matches labels with English Wikidata
labels of items having an occupation (wdt:P106) painter (wd:Q1028181). It also estimates a confidence
score as 1/(𝑛𝑢𝑚𝑏𝑒𝑟_𝑜𝑓 _𝑚𝑎𝑡𝑐ℎ𝑒𝑠).
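A sketch of how a SPARQL annotator could assemble the query described in Figure 2 and compute the confidence score. The query template below is illustrative (the exact query in the paper may differ), and the live Wikidata endpoint is deliberately not called here.

```python
# Illustrative template after Figure 2: match an input label against English
# labels of Wikidata items whose occupation (wdt:P106) is painter (wd:Q1028181).
PAINTER_QUERY = """
SELECT ?item WHERE {
  ?item wdt:P106 wd:Q1028181 ;
        rdfs:label ?label .
  FILTER (lang(?label) = "en" && str(?label) = "%s")
}
"""

def build_query(label):
    return PAINTER_QUERY % label

def confidence(matched_items):
    """Confidence as defined in Figure 2: 1 / number_of_matches."""
    return 1.0 / len(matched_items) if matched_items else 0.0
```

A real annotator would POST the query to the knowledge base's SPARQL endpoint and score the returned bindings with `confidence`.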


3.2. Human Validation
Human validation was conducted via a dedicated environment provided by SAGE (see Fig-
ure 4). Humans are invited to inspect the automatic annotations produced by the AI tools and
accept or reject them. Moreover, they can add missed annotations, i.e. relevant annotations
that the automatic algorithm failed to identify. During the validation of the results of the se-
mantic analysis, validators are also able to edit the predefined target metadata field in which
the URI will end up. It should be noted that SAGE groups together annotations repeated across
many records in a dataset and flags annotations referring to URIs that are already included in
the metadata. In total, 14 CH professionals with specialized knowledge about the considered
collections participated in the validation process, with two to three validators per collection.
Participants were instructed to accept or reject annotations based on what they consider as
desirable for inclusion in the final metadata. That is, they evaluated annotations not only for
correctness of the match but also for relevance (e.g. matches with the term “human”




Figure 3: Overview of the SAGE Thesaurus Annotator workflow on a metadata description.




may be considered too generic).
   The appropriate size and characteristics of the sample to be validated depend on the avail-
able resources that can be invested in the validation process and the nature of the use case.
What is considered a “sufficient” amount hinges on many factors, including the total number
of automatically produced annotations, their characteristics (e.g. what metadata fields they
refer to, their granularity, etc), the characteristics of the automated algorithm that produced
them (e.g. its accuracy, the reliability of the automatic confidence scores it assigned to them),
the number of participants and the amount of time they can devote to the task. The following
criteria were used to guide the selection of the annotations sample to be validated, so as to
ensure representativeness across various parameters:

    • Inspect annotations that appear in a high number of records and thus will have a high
      impact.
    • Ensure a balanced representation of metadata fields, including fields with varying seman-
      tics and expected text length.
    • Take into consideration automatic confidence levels assigned by automatic algorithms,
      if available: inspect annotations with a rather low confidence score but also a sufficient
      number of annotations with a rather high one.
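One possible encoding of these selection criteria; the field and count attributes and the per-field quotas are illustrative assumptions, not the exact procedure used in the campaigns.

```python
def select_sample(annotations, per_field=8):
    """Pick high-impact annotations first, then a mix of low- and
    high-confidence ones, balanced across metadata fields."""
    by_field = {}
    for a in annotations:
        by_field.setdefault(a["field"], []).append(a)
    sample = []
    for anns in by_field.values():
        # criterion 1: annotations appearing in many records come first
        ranked = sorted(anns, key=lambda a: a["record_count"], reverse=True)
        top, rest = ranked[: per_field // 2], ranked[per_field // 2 :]
        # criterion 3: take some low- and some high-confidence annotations
        rest = sorted(rest, key=lambda a: a["confidence"])
        quarter = per_field // 4
        lows = rest[:quarter]
        highs = rest[quarter:][-quarter:] if quarter else []
        sample += top + lows + highs  # criterion 2: one batch per field
    return sample
```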




Figure 4: Screenshot from the SAGE validation environment. Strings of metadata that have been
matched are shown on the left and the URI(s) they have been linked to on the right.




3.3. Analysis and Filtering of Annotations
Validation feedback was analysed with the aim of establishing thresholds for annotations that
are considered acceptable for publication. To this end, the following metrics have been calcu-
lated per dataset, per Annotator, and per analysed metadata field:
        • Precision considering only unique annotations, that is, unique triples of field textual
          values, matched sub-string, and identified URI.
        • Precision considering all annotations, that is, without grouping together identical field
          textual values (in other words, counting all times the same annotation, defined as a triple,
          may appear in different items).
In both cases, precision was calculated as TP/(TP + FP), where TP is the number of accepted
annotations and FP the number of rejected ones. Precision was used as a threshold for filtering
out non-reviewed annotations on a field or Annotator basis. What is considered a sufficiently high precision
depends on the requirements of each case study and the expectations of the data provider.
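The two precision variants can be computed as below; the tuple layout for a reviewed annotation is an illustrative assumption.

```python
def precision(reviewed):
    """reviewed: (field_value, matched_substring, uri, accepted) tuples."""
    tp = sum(1 for *_, acc in reviewed if acc)
    fp = sum(1 for *_, acc in reviewed if not acc)
    return tp / (tp + fp) if (tp + fp) else 0.0

def unique_precision(reviewed):
    # Deduplicate on the (field value, sub-string, URI) triple; repeated
    # occurrences of the same annotation in different items count once.
    seen = {}
    for value, sub, uri, acc in reviewed:
        seen[(value, sub, uri)] = acc
    return precision([(*key, acc) for key, acc in seen.items()])
```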
   For the use cases we considered, most human experts did not focus on the manual insertion
of new annotations and the few manually added annotations we collected do not allow us to
sufficiently estimate false negatives and thus compute recall and the F-score. It should also be
noted that for publication to Europeana, a threshold based on precision is considered the most
appropriate metric to be used17 .
   Human judgments can also be used as a means to assess the trustworthiness of the automatic
confidence scores assigned by the AI algorithms. For example, if humans tend to accept all
sample annotations above a certain score, then we may conclude that all annotations above that
score can be regarded as acceptable. In this vein, we explored whether there is a correlation
between the automatic confidence scores, when available, and human judgments. We therefore
plotted the logistic regression between the two variables considering the following metrics:
        • The p-value [19]. A value greater than 0.05 means that no statistically significant rela-
          tionship between the automatic scores and the human judgments was observed.
        • The expected automatic score for which the predicted probability is greater than 0.7, that
          is annotations above this score have a probability above 0.7 to be accepted by humans,
          based on the sample data.
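A minimal, dependency-free sketch of this analysis follows. A statistics package such as statsmodels would normally be used instead, not least because it also reports the p-value of the fitted slope; here we only fit the curve and solve for the score at which the predicted acceptance probability reaches the target:

```python
import math

def fit_logistic(scores, accepted, lr=0.5, steps=5000):
    """Fit p(accept | score) = sigmoid(b0 + b1 * score) by gradient ascent
    on the log-likelihood. `accepted` holds 1 (accepted) or 0 (rejected)."""
    b0, b1 = 0.0, 0.0
    n = len(scores)
    for _ in range(steps):
        g0 = g1 = 0.0
        for s, y in zip(scores, accepted):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * s)))
            g0 += y - p          # gradient w.r.t. intercept
            g1 += (y - p) * s    # gradient w.r.t. slope
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

def score_threshold(b0, b1, target=0.7):
    """Smallest automatic score whose predicted acceptance probability
    exceeds `target`: solve sigmoid(b0 + b1 * s) = target for s."""
    return (math.log(target / (1.0 - target)) - b0) / b1
```

If the p-value reported by a proper statistical fit is above 0.05, the threshold returned here should not be trusted as a filtering criterion.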


4. Results on Crafts Heritage Datasets
The aforementioned methodology and supporting tools have been applied to metadata records
describing crafts heritage items as mentioned in Section 1. The analysed metadata comes in the
following languages: Dutch, Italian, French, Greek, English and Croatian. In total, the SAGE
annotators were applied on 216,115 metadata records, giving rise to 915,472 total annotations
and 549,402 unique annotations. It should be noted that figures calculated on unique anno-
tations are considered more reliable, since counts over all item occurrences are skewed towards
textual values that are repeated in multiple items. In total, 12 experts from 8 CH organisations
took part in the validation campaigns. Overall, 30,910 unique annotations referring to
17
     https://pro.europeana.eu/post/methodology-for-validating-enrichments




more than 15K records were reviewed via SAGE (i.e. 5.6% of the automatically produced unique
annotations), with the sample being selected following the criteria outlined in Section 3.2. Of
those annotations, 23,426 were accepted and 7,474 were rejected.
   The overall precision, defined as the number of all accepted automatic annotations produced
by SAGE over the number of reviewed automatic annotations, is 0.76 when considering unique
annotations. If all annotations are counted, the overall precision is 0.82. Precision varied
considerably depending on the analysed metadata field, the type of annotator that was used,
and the datasets that were analysed. Table 1 provides an overview of the results achieved by
the different annotators. The minimum and maximum precision reported in the table refer to a
per-metadata-field level.

Table 1
Precision of used SAGE annotators
     Annotator                                     Min Precision      Max Precision        Avg precision /
                                                    (all/unique)       (all/unique)         (all/unique)
     Fashion Thesaurus Annotator                     0.436/0.672       0.964/0.943          0.801/0.832
     AAT Annotator                                   0.644/0.658       0.994/0.987          0.819/0.822
     Greek Crafts Thesaurus Annotator                0.982/0.947       0.982/0.947          0.982/0.947
     EUScreen Thesaurus Annotator                    0.607/0.878       0.952/0.927          0.779/0.902
     Wikidata SPARQL                                 0.894/0.817           1/1              0.981/0.963
     Generic NERD with Wikidata - mGENRE                0.4/0.4            1/1              0.935/0.748


   The choice of the vocabulary used by the thesaurus annotators depended on the respective
dataset characteristics and providers’ objectives. The following vocabularies were used: the
Europeana fashion thesaurus18 ; AAT; the EUScreen vocabulary on audiovisual heritage19 ; and
a SKOS vocabulary on Greek crafts heritage20 . Thesaurus annotators were applied to both
longer (e.g. dc:description, dc:title) and shorter fields (e.g. dc:format, dc:type), often
after case-appropriate regex pre-processing, giving rise to generally satisfactory results in both
cases. SPARQL queries on Wikidata were used to retrieve creators for the dc:creator and loca-
tions for the dc:spatial fields. Although in most cases it did not produce a high number of an-
notations, it scored a high precision. mGENRE and AIDA were applied to dc:description and
dc:title fields as well as shorter fields (including dc:creator, dc:spatial, and dc:rights).
They both produced similar results, performing well for short fields but poorly for longer ones.
In the latter case, they both struggled with disambiguation between multiple candidate entities
and, even when producing matches that were in principle correct, those were often too generic
and considered irrelevant by validators.
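For illustration, a Wikidata lookup for a dc:spatial value might be built along the following lines; the query shape and the class constraint (Q2221906, "geographic location") are assumptions for the sketch, not the exact query used by SAGE:

```python
def spatial_lookup_query(label, lang="en"):
    """Build a SPARQL query that resolves a dc:spatial string to candidate
    Wikidata place URIs via an exact label match, restricted to entities
    that are (subclasses of) geographic locations. The query would be sent
    to the public endpoint at https://query.wikidata.org/sparql."""
    return f"""
    SELECT ?place ?placeLabel WHERE {{
      ?place rdfs:label "{label}"@{lang} ;
             wdt:P31/wdt:P279* wd:Q2221906 .
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "{lang}". }}
    }}
    LIMIT 5
    """
```

Restricting candidates to a class such as geographic locations is what keeps the precision of this approach high: a label like "Paris" cannot resolve to a person or a film.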
   For annotators for which an automatic score was produced, we also attempted to plot the
logistic regression between the automatic score and the human judgments. However, no cor-
relation was found between the two variables and therefore automatic scores were not used
as factors in the filtering rules. A possible explanation for this is that in the case of thesauri
18
   http://thesaurus.europeanafashion.eu/
19
   http://thesaurus.euscreen.eu/EUscreenXL/v1
20
   https://www.semantics.gr/authorities/vocabularies/craft-item-types/vocabulary-entries




Figure 5: View on Europeana of an item provided by the Museum of Arts and Crafts in Zagreb. The
’Reed’, ’Wood’, and ’Beech Wood’ terms are all automatic enrichments added by SAGE that are visible
on the item page.


annotators, scores are usually quite high for all annotations: they reflect the string difference
(1 − Levenshtein distance [17]) between word endings (since the matching is based on the lem-
matised versions of the textual metadata and the thesaurus terms). For the generic NERD tools
the scores turn out to be quite unreliable: they are inversely proportional to the number of
candidate URIs and do not sufficiently account for disambiguation.
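The scoring scheme described above for the thesaurus annotators can be illustrated as follows; normalising the edit distance by the longer string length is an assumption of this sketch:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def match_score(lemma, term):
    """Confidence as 1 minus the normalised edit distance. Since matching
    is done on lemmatised forms, usually only word endings differ, so
    scores cluster near 1 for almost every thesaurus match."""
    if not lemma and not term:
        return 1.0
    return 1.0 - levenshtein(lemma, term) / max(len(lemma), len(term))
```

The clustering near 1 is exactly why these scores carry little signal for filtering: they barely discriminate between correct and incorrect matches.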
   Annotations have been filtered by discarding all annotations rejected by humans (consider-
ing a majority vote), while including all explicitly accepted ones. For non-reviewed annota-
tions, a precision threshold between 0.75 and 0.8 (considering unique annotations) was con-
sidered acceptable by data providers. In total, 549,460 annotations have been regarded as
acceptable, leading to the enrichment of 133,405 out of the 216,115 analysed records. All
enriched records have been published to Europeana. Enrichments have been indexed to become
searchable and are visible as part of the item view via distinct tags, thus contributing to
making the respective items more discoverable, contextual, and multilingual. Figure 5 shows
what automatic annotations look like on the Europeana platform.
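A simplified sketch of this filtering policy follows; the data structures (annotation dicts, a verdict map, and precomputed per-field precision) are illustrative:

```python
def filter_annotations(annotations, verdicts, field_precision, threshold=0.75):
    """Keep an annotation if (a) humans accepted it by majority vote, or
    (b) it was never reviewed and the measured precision of its
    (annotator, field) combination meets the threshold. Annotations
    rejected by majority are discarded."""
    kept = []
    for ann in annotations:
        votes = verdicts.get(ann["id"])
        if votes:  # reviewed: the majority vote decides
            if sum(1 for v in votes if v == "accept") > len(votes) / 2:
                kept.append(ann)
        elif field_precision.get((ann["annotator"], ann["field"]), 0.0) >= threshold:
            kept.append(ann)  # unreviewed but from a trusted combination
    return kept
```

The per-combination precision values would come from the reviewed sample, as described in Section 3.2.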
   Although domain adaptation is beyond the scope of the current case study, the dataset that
resulted from the validation process can be valuable for the training and fine-tuning of NERD
tools in the field of CH. To this end, a curated selection of annotated metadata enriched and




validated via SAGE has been made openly available21 under a CC0 license, so that it can be
freely reused as data amenable for computational purposes. The dataset includes more than
10K unique annotations (pairs of analysed textual values and URIs). The in-domain adaptation
of NERD tools so that they can more effectively deal with the particular characteristics of CH
metadata [12], such as short text and specialised terminology, remains part of future work.


5. Conclusions
In the current paper, we present a generic and reusable methodology and supporting digital
platform that combines automatic annotators with human expertise in order to enrich cultural
heritage metadata with terms from various linked data sources. The methodology has been applied and evaluated
on a case study involving crafts heritage datasets, leading to measurable improvements in the
quality of metadata and enhancing the discoverability and usability of the respective resources
on Europeana. Building on the practical experience we gained, the current case study allows
us to draw some lessons learned, which can prove useful for interested stakeholders who may
wish to follow a similar process to enrich their datasets.
   Before proceeding to the actual enrichment, it is crucial to scrutinise the data to be analysed,
gain a deep understanding of its characteristics and define feasible and meaningful enrichment
objectives. One should define the expected benefit of possible enrichments and how they will
bring value to the collection. In this respect, one should ask questions such as: What kind of
concepts are useful to detect (e.g. persons, locations, domain-specific concepts etc)? Which
metadata fields contain relevant information (e.g. descriptions make frequent references to
techniques and materials used)? In what languages are the metadata? It should also be noted
that the quality of the original metadata affects the quality of the automatic enrichment. If
the text contains many typos or is misaligned with the intended semantics of the respective
metadata field, then the outputs of the automatic enrichment tools will be less accurate. This
step is also crucial for detecting patterns in data that can be exploited in order to produce
annotations.
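Such a scrutiny pass can be partly automated. The sketch below assumes records are dictionaries mapping field names to lists of values, an illustrative layout rather than the actual EDM serialisation; it surfaces field coverage and the most repeated values, which hints at short controlled fields worth enriching:

```python
from collections import Counter, defaultdict

def profile_records(records, fields=("dc:title", "dc:description", "dc:type")):
    """Report, per field, the fraction of records that populate it and the
    most frequently repeated values. Highly repeated short values often
    signal controlled terms suitable for thesaurus matching."""
    coverage = Counter()
    top_values = defaultdict(Counter)
    for rec in records:
        for f in fields:
            values = rec.get(f, [])
            if values:
                coverage[f] += 1
            for v in values:
                top_values[f][v] += 1
    return {
        f: {
            "coverage": coverage[f] / len(records),
            "most_common": top_values[f].most_common(3),
        }
        for f in fields
    }
```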
   The next step involves the selection and set-up of the semantic annotators that are most
appropriate for the specific use case, considering the advantages and disadvantages of each
approach as presented in Section 3.1. The selection of knowledge bases and vocabularies that
have the case-appropriate granularity and coverage is crucial. Generally, the more focused
the automatic enrichment, considering the terminology used (e.g. link with a domain-specific
vocabulary versus general-purpose NERD) and the metadata property that is parsed (e.g. topic-
specific fields such as dc:creator versus longer ones such as dc:description), the lower the
risk of producing too many irrelevant or too generic enrichments and the more accurate the
disambiguation. One should opt for knowledge bases that are accessible on the Web
via an open license, well-documented, and compliant with Linked Data best practices. Their
multilingual coverage (in relation to the language of the metadata) is also an important
aspect that should be taken into consideration.
   After the production of the automatic annotations, the validation process should be carefully

21
     See https://github.com/ails-lab/ai4culture-datasets for the actual dataset and the process that was used for the
     data curation.




organised. The background of the validators is crucial: some tasks may require expert skills
(e.g., knowledge of a particular language, domain expertise etc.), while others can be performed
by appealing to a general audience. In the former case, it is wiser to keep the validation process
closed within a team of experts, while in the latter, organizing an open crowdsourcing cam-
paign will mobilize more people and thus speed up the process. The selection of the sample to
be validated is crucial: it does not need to be large but it should be well-balanced, following
the criteria outlined in Section 3.2.
   The final step involves the filtering of the automatic annotations in light of the acquired hu-
man feedback. For annotations reviewed by humans, majority vote can typically be used to
determine acceptability. Depending on the annotation type, additional criteria might be en-
forced (e.g. for public validation campaigns where untrustworthy feedback is suspected, we
may require that an annotation is reviewed by multiple users). Automatic annotations that
have not been reviewed by humans or lack a reliable confidence score should be filtered using
automatic evaluation metrics. The appropriate metrics depend on the nature of the task, but
precision is a typical choice when correctness is critical. Thresholds should be established
depending on what is considered acceptable given the specific use case requirements.
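A possible voting policy for such campaigns, with an illustrative minimum-review requirement of three, could look as follows:

```python
def crowd_verdict(votes, min_reviews=3):
    """Decide acceptability from crowd votes ('accept'/'reject'). A minimum
    number of reviews is required before the majority is trusted, guarding
    against untrustworthy feedback in open campaigns. Returns True or
    False, or None when the annotation needs further review."""
    if len(votes) < min_reviews:
        return None
    accepts = sum(1 for v in votes if v == "accept")
    if accepts * 2 == len(votes):
        return None  # tie: leave undecided
    return accepts * 2 > len(votes)
```

Undecided annotations (None) would then fall back to the precision-based threshold discussed above.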


Acknowledgments
The work is co-funded by the European Union, under the projects “CRAFTED: Enrich and pro-
mote traditional and contemporary crafts” and “AI4Culture: An AI platform for the cultural
heritage data space”. We would like to thank all partners of the CRAFTED project, and partic-
ularly Panagiotis Tzortzis, for their valuable contributions to this work.


References
 [1] M. Andresel, S. Gordea, S. Stevanetic, and M. Schütz. “An Approach for Curating Collec-
     tions of Historical Documents with the Use of Topic Detection Technologies”. In: Int. J.
     Digit. Curation 17.1 (2022), p. 12.
 [2] J. Baas, M. M. Dastani, and A. Feelders. “Entity Matching in Digital Humanities Knowl-
     edge Graphs”. In: Proc. of the Conf. on Computational Humanities Research, CHR2021.
     Vol. 2989. CEUR Workshop Proceedings. 2021, pp. 1–15.
 [3] Y. Benkhedda, A. Skapars, V. Schlegel, G. Nenadic, and R. Batista-Navarro. “Enriching the
     Metadata of Community-Generated Digital Content through Entity Linking: An Evalua-
     tive Comparison of State-of-the-Art Models”. In: Proc. of the 8th Joint SIGHUM Workshop
     on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Liter-
     ature. St. Julians, Malta: Association for Computational Linguistics, 2024, pp. 213–220.
 [4] V. Charles, A. Isaac, V. Tzouvaras, and S. Hennicke. “Mapping Cross-Domain Metadata
     to the Europeana Data Model (EDM)”. In: Research and Advanced Technology for Digital
     Libraries. Springer Berlin Heidelberg, 2013, pp. 484–485.




 [5] A. Chortaras and G. Stamou. “D2RML: Integrating Heterogeneous Data and Web Ser-
     vices into Custom RDF Graphs”. In: Workshop on Linked Data on the Web co-located with
     The Web Conference. Vol. 2073. CEUR Workshop Proceedings. 2018.
 [6] N. De Cao, L. Wu, K. Popat, M. Artetxe, N. Goyal, M. Plekhanov, L. Zettlemoyer, N. Can-
     cedda, S. Riedel, and F. Petroni. “Multilingual Autoregressive Entity Linking”. In: Trans-
     actions of the Association for Computational Linguistics 10 (2022), pp. 274–290.
 [7] C. Dijkshoorn, V. de Boer, L. Aroyo, and G. Schreiber. “Accurator: Nichesourcing for
     Cultural Heritage”. In: Hum. Comput. 6 (2019), pp. 12–41.
 [8] M. Ehrmann, A. Hamdi, E. L. Pontes, M. Romanello, and A. Doucet. “Named Entity Recog-
     nition and Classification in Historical Documents: A Survey”. In: ACM Computing Sur-
     veys 56.2 (2023).
 [9] N. Freire and A. Isaac. “Technical Usability of Wikidata’s Linked Data”. In: Business In-
     formation Systems Workshops. Ed. by W. Abramowicz and R. Corchuelo. Springer Inter-
     national Publishing, 2019, pp. 556–567.
[10]   S. Gordea, M. L. Paramita, and A. Isaac. “Named Entity Recommendations to Enhance
       Multilingual Retrieval in Europeana.eu”. In: Foundations of Intelligent Systems. Springer
       International Publishing, 2020, pp. 102–112.
[11]   S. van Hooland, M. De Wilde, R. Verborgh, T. Steiner, and R. Van de Walle. “Exploring Entity
       Recognition and Disambiguation for Cultural Heritage Collections”. In: Literary and Lin-
       guistic Computing (2013).
[12]   E. Kaldeli, M. Garcı́a-Martı́nez, A. Isaac, P. S. Scalia, A. Stabenau, I. L. Almor, C. G. Lacal,
       M. B. Ordóñez, A. Estela, and M. Herranz. “Europeana Translate: Providing multilingual
       access to digital cultural heritage”. In: Proc. of the 23rd Annual Conference of the European
       Association for Machine Translation, EAMT. European Association for Machine Transla-
       tion, 2022, pp. 297–298.
[13]   E. Kaldeli, O. Menis-Mastromichalakis, S. Bekiaris, M. Ralli, V. Tzouvaras, and G. Sta-
       mou. “CrowdHeritage: Crowdsourcing for Improving the Quality of Cultural Heritage
       Metadata”. In: Information 12.2 (2021).
[14]   N. Lai. “LMN at SemEval-2022 Task 11: A Transformer-based System for English Named
       Entity Recognition”. In: Proc. of the 16th International Workshop on Semantic Evaluation
       (SemEval-2022). Seattle, United States: Association for Computational Linguistics, 2022,
       pp. 1438–1443.
[15]   V. Lyberatos, S. Kantarelis, E. Kaldeli, S. Bekiaris, P. Tzortzis, O. Menis - Mastromicha-
       lakis, and G. Stamou. “Employing Crowdsourcing for Enriching a Music Knowledge Base
       in Higher Education”. In: Artificial Intelligence in Education Technologies: New Develop-
       ment and Innovative Practices. Springer Nature, 2023, pp. 224–240.




[16]   H. Manguinhas, V. Charles, A. Isaac, T. Miles, A. Lima, A. Neroulidis, V. Ginouvès, D.
       Atsidis, M. Hildebrand, M. Brinkerink, and S. Gordea. “Linking Subject Labels in Cul-
       tural Heritage Metadata to MIMO Vocabulary using CultuurLink”. In: Proc. of the 15th
       European Networked Knowledge Organization Systems Workshop (NKOS) co-located with
       the 20th Int. Conf. on Theory and Practice of Digital Libraries (TPDL). Vol. 1676. CEUR
       Workshop Proceedings. 2016, pp. 32–35.
[17]   F. P. Miller, A. F. Vandome, and J. McBrewster. Levenshtein Distance: Information theory,
       Computer science, String (computer science), String metric, Damerau–Levenshtein distance,
       Spell checker, Hamming distance. Alpha Press, 2009.
[18]   D. B. Nguyen, J. Hoffart, M. Theobald, and G. Weikum. “AIDA-light: High-Throughput
       Named-Entity Disambiguation”. In: Proc. of the Workshop on Linked Data on the Web
       co-located with the 23rd Int. World Wide Web Conf. (WWW). Vol. 1184. CEUR Workshop
       Proceedings. 2014.
[19]   S. Silvey. Statistical Inference. Monographs on statistics and applied probability. Chapman
       & Hall, 2003.
[20]   J. Stiller, V. Petras, M. Gäde, and A. Isaac. “Automatic Enrichments with Controlled Vo-
       cabularies in Europeana: Challenges and Consequences”. In: Digital Heritage. Progress
       in Cultural Heritage: Documentation, Preservation, and Protection. Springer International
       Publishing, 2014, pp. 238–247.
[21]   O. Suissa, A. Elmalech, and M. Zhitomirsky-Geffet. “Text analysis using deep neural net-
       works in digital humanities and information science”. In: Journal of the Association for
       Information Science and Technology 73 (2021).
[22]   O. Suominen, J. Inkinen, and M. Lehtinen. “Annif and Finto AI: Developing and Imple-
       menting Automated Subject Indexing”. In: Italian Journal of Library, Archives and Infor-
       mation Science 13.1 (2022), pp. 265–282.
[23]   I. Tanasijević and G. Pavlović-Lažetić. “HerCulB: content-based information extraction
       and retrieval for cultural heritage of the Balkans”. In: The electronic library 38.5/6 (2020),
       pp. 905–918.
[24]   L. Wu, F. Petroni, M. Josifoski, S. Riedel, and L. Zettlemoyer. “Scalable Zero-shot En-
       tity Linking with Dense Entity Retrieval”. In: Proc. of the Conf. on Empirical Methods in
       Natural Language Processing (EMNLP). 2020, pp. 6397–6407.
[25]   M. Wu, H. Brandhorst, M.-C. Marinescu, J. M. Lopez, M. Hlava, and J. Busch. “Automated
       metadata annotation: What is and is not possible with machine learning”. In: Data Intel-
       ligence 5.1 (2023), pp. 122–138.



