<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Automatic metadata curation of the cultural heritage resources</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Matteo Lorenzini</string-name>
          <email>m.lorenzini@fbk.eu</email>
          <email>matteo.lorenzini@unitn.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fondazione Bruno Kessler</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Università degli Studi di Trento</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents my thesis proposal on automatic metadata curation of cultural heritage resources. Metadata curation is one aspect of the "metadata management life-cycle", together with ingestion, maintenance and dissemination. The general objective of metadata curatorship is to ensure that users can access objects of interest from a repository, digital library, catalogue, etc. using well-assigned metadata values aligned with an appropriately chosen schema. Ideally, in a repository or digital archive, all objects should be described with the same accuracy and data structure, so that the user can retrieve all the information, the objects of interest and the related items as the result of a single search. However, this is very rare: objects are described at different levels of detail, and cataloguers may use different data model designs. In such cases, after the metadata quality control process, the curator must correct or normalize all the objects with errors or inconsistencies in the metadata schema. New approaches based on the semantic enrichment of digital resources can help the resource aggregator during the metadata curation process and can also improve resource discovery.</p>
      </abstract>
      <kwd-group>
        <kwd>CIDOC-CRM</kwd>
        <kwd>Digital Library</kwd>
        <kwd>Metadata Curation</kwd>
        <kwd>Dublin Core</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The traditional concept of the library has undergone a profound change: from a
collection of physical information resources (mostly books) to a collection of
digital resources. In addition, the notion of a digital resource includes not only
texts in digital form but, more generally, any kind of multimedia resource<sup>3</sup>.
Hence, making cultural heritage resources accessible requires metadata schemas
that are rich in semantics and a structure able to cover both the heterogeneity of
the material and the variety of memory institutions (libraries, archives, museums).
However, we often face problems related to the low quality of the metadata used
to describe digital resources: wrong definitions, inconsistent
resources, or resources described using only the minimal mandatory
metadata entities<sup>4</sup>. There may be many reasons for this, all completely valid:
for example, in many cases these institutions have few human resources to work on
improving metadata, and they are often not themselves the source of the metadata.</p>
      <p><sup>3</sup> Such collections may be composed of texts written on different materials, paintings,
photographs, 3D objects, sound recordings, maps or even digital objects.</p>
      <p>The goal of the present work is to improve metadata quality by integrating
semantic web principles into the metadata curation process.</p>
    </sec>
    <sec id="sec-2">
      <title>State of the art</title>
      <p>Digital curation, broadly interpreted, is about maintaining and adding value to
a trusted body of digital information for both current and future use: in other
words, it is the active management and appraisal of digital information over its
entire lifecycle.</p>
      <p>
        The necessity of a lifecycle approach, to ensure the continuity of digital
material, is discussed by Pennock [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. A lifecycle approach ensures that all the
required stages are identified and planned, and that the necessary actions are implemented in
the correct sequence. This helps maintain the authenticity, reliability,
integrity and usability of digital material, which in turn maximises the return
on the investment in its creation.
      </p>
      <p>
        The curation framework developed by Bruce and Hillmann [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is considered
a benchmark in the pursuit of quality assessment. This framework defines seven
parameters to measure the quality of metadata: Completeness, Accuracy,
Conformance to Expectations, Logical Consistency and Coherence,
Accessibility, Timeliness and Provenance. In the digital libraries domain, these parameters
are fundamental for evaluating metadata quality before the curation
process. The evaluation helps curators to systematically identify metadata
problems. This could be straightforwardly applied to the Europeana Digital Library<sup>5</sup>
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] or the ARIADNE project<sup>6</sup> [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        In the context of linguistic resources, the CLARIN<sup>7</sup> consortium supports
metadata curation by developing a curation module that facilitates the
metadata ingestion and curation process of the Virtual Language Observatory (VLO).
In the context of the Semantic Web, metrics were often not explicitly defined and
did not consist of precise statistical measures. Moreover, only a few approaches were
actually accompanied by an implemented tool, and none of them covered all the
data quality dimensions [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        From the literature analysis it can be inferred that the existing approaches
are either too abstract or narrowly focused on a single dimension, e.g. completeness
[
        <xref ref-type="bibr" rid="ref1 ref2">1,2</xref>
        ].
      </p>
      <p><sup>4</sup> The minimal metadata set able to guarantee the harvesting process from the content
provider to the resource aggregator.</p>
      <p><sup>5</sup> https://www.europeana.eu/portal/it</p>
      <p><sup>6</sup> http://www.ariadne-infrastructure.eu</p>
      <p><sup>7</sup> https://www.clarin.eu</p>
    </sec>
    <sec id="sec-3">
      <title>Problem statement</title>
      <p>Wrong and incomplete mappings affect the discoverability and accessibility of the
resources for users, so metadata curation plays an essential role in improving
metadata quality. In a standard cycle, the resources are checked by curators, who
analyze the metadata quality by applying the metrics described in the previous section.
Depending on the type of issue, curators intervene
as follows:</p>
      <list list-type="bullet">
        <list-item><p>Send the datasets or records back to the content provider in order to fix the
inconsistencies, gaps or errors.</p></list-item>
        <list-item><p>Fix the inconsistencies, gaps or errors by hand.</p></list-item>
        <list-item><p>Normalize the resources using a controlled vocabulary.</p></list-item>
      </list>
      <p>This approach, even if it can be considered a standard procedure, limits
the effective impact of the curation process: in most
cases curation is done by hand, quality analysis involves only
the mandatory metadata elements<sup>8</sup>, and it does not
consider all the parameters suggested by Bruce and Hillmann.</p>
      <p>To put these problems in context, we can briefly refer to Cultura
Italia<sup>9</sup>, the Italian digital library. Resources are integrated in Cultura Italia in
the form of metadata using the PICO<sup>10</sup> profile. It is based on the international
standard Dublin Core<sup>11</sup>, which can describe, in a single scheme, every
type of cultural resource, both physical and digital. Dublin Core consists of 15
main elements. The PICO profile adds to these another 37 elements conceived for the
application of Dublin Core in Cultura Italia. We can identify two types of
errors:</p>
      <list list-type="bullet">
        <list-item><p>Low metadata completeness: objects are described using only the 6
mandatory metadata elements required by PICO for metadata harvesting. Elements such as
dc:description or dc:author, which are considered optional, are
under-represented. For example, in the dataset from "Regione Piemonte", which consists of 71,710
records, the element dc:description is never used.</p></list-item>
        <list-item><p>Low accuracy: metadata fields are filled with non-exhaustive information, such as
dc:title "photo".</p></list-item>
      </list>
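      <p>As a minimal sketch of how these two error types could be surfaced in a harvested dataset, the following Python fragment computes, for each element, the share of records carrying a non-empty value, and lists the most frequent values of one element. The record layout (one dictionary per record), the function names and the toy records are assumptions made for illustration; they are not part of PICO or Cultura Italia.</p>
      <preformat>
```python
from collections import Counter


def _text(value):
    """Return the stripped string value, or '' when the value is missing."""
    return value.strip() if isinstance(value, str) else ""


def element_coverage(records, elements):
    """Share of records in which each element holds a non-empty value."""
    filled = Counter()
    for record in records:
        for element in elements:
            if _text(record.get(element)):
                filled[element] += 1
    total = len(records)
    return {e: (filled[e] / total if total else 0.0) for e in elements}


def most_common_values(records, element, n=3):
    """Most frequent non-empty values of one element, case-insensitively."""
    values = [_text(r.get(element)).lower() for r in records]
    return Counter(v for v in values if v).most_common(n)


# Hypothetical toy records; real ones would come from a metadata harvest.
records = [
    {"dc:title": "photo", "dc:description": ""},
    {"dc:title": "Veduta di Torino", "dc:description": None},
    {"dc:title": "Photo"},
]

coverage = element_coverage(records, ["dc:title", "dc:description"])
print(coverage)
print(most_common_values(records, "dc:title", n=1))
```
      </preformat>
      <p>An element whose coverage is close to zero points to low completeness. A generic value that dominates an element (here "photo") is only a rough hint of low accuracy and would still need review by a curator.</p>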
      <p>In light of the problems outlined above, our research question is: "Can
we improve metadata quality with automatic curation techniques and semantic
web technologies?". This topic will be treated in the domain of cultural
heritage.</p>
      <p><sup>8</sup> Crucial information such as "description" is not covered by quality evaluation.</p>
      <p><sup>9</sup> http://www.culturaitalia.it</p>
      <p><sup>10</sup> http://www.culturaitalia.it/opencms/documentazione_tecnica_it.jsp?language=it&amp;tematica=static</p>
      <p><sup>11</sup> http://dublincore.org</p>
    </sec>
    <sec id="sec-4">
      <title>Proposed solution</title>
      <p>
        The solution we propose is a framework which, using the
metrics from Bruce and Hillmann, aims to check automatically<sup>12</sup> the issues arising
from wrong cataloguing processes in order to optimize the metadata curation
workflow. Moreover, it aims to improve metadata quality by reporting
missing metadata elements or errors to the curators. Considering the Cultura Italia
scenario described in the previous section, the curation process
will be characterized by the following main tasks:
      </p>
      <list list-type="bullet">
        <list-item>
          <p>
            Automation of the quality metrics: this PhD project will mainly
focus on the quantitative definition of the Completeness, Accuracy, and
Logical Consistency and Coherence metrics. For each of these metrics we will
define a customized algorithm [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ] able to measure the quality of the
metadata against the metadata profile of the dataset.
          </p>
          <list list-type="bullet">
            <list-item><p>Completeness, statistical approach: each metadata standard, for example
Dublin Core or PICO, defines a number of possible fields (15 for Simple
Dublin Core, 37 for PICO). Completeness will be computed by
counting the number of fields in each metadata instance that contain a non-null
value. In the case of multi-valued fields, the field is considered complete if at
least one instance exists.</p></list-item>
            <list-item><p>Accuracy, natural language processing approach:
a vector space model will be defined, in which the distance
between metadata instances and the domain of the resource or dataset is computed. A
shorter distance corresponds to a higher accuracy of the metadata instance.</p></list-item>
            <list-item><p>Logical consistency and coherence, semantic web technologies approach:
consistency will be calculated as the degree to which the resource
description matches the metadata standard schema and definition.
Coherence will be measured at the instance level, as the degree
to which all the fields describe the same object in a similar way, by analyzing the
correlation between text and metadata elements. A challenging issue here
will be the evaluation of the logical consistency and coherence of the textual
entities with respect to the domain of the digital object.</p></list-item>
          </list>
        </list-item>
        <list-item><p>Suggestion of potentially correct metadata values: usually, because of the
licences governing the re-use of metadata, the aggregator cannot modify
the metadata supplied by the data provider. The errors and the potentially
correct metadata values will therefore be reported in a log to the metadata creator,
who can then decide whether to accept the suggestions.</p></list-item>
        <list-item>
          <p>Evaluation methodology: validation will concern two aspects:</p>
          <list list-type="bullet">
            <list-item><p>Metadata schema: compliance of the new elements with the standard
structure of the metadata profile.</p></list-item>
            <list-item><p>Consistency of the new elements with respect to the context of the
digital object: evaluation performed using the logical consistency and coherence
parameter.</p></list-item>
          </list>
          <p>In order to achieve the best results, two different methods will be tested:</p>
          <list list-type="bullet">
            <list-item><p>Definition of a "gold standard" dataset, used to train the
framework on what correct metadata should look like in terms
of structure and content.</p></list-item>
            <list-item><p>Intentional insertion of errors into well-defined resources, in order
to check whether the proposed system can recognize them.</p></list-item>
          </list>
        </list-item>
      </list>
      <p><sup>12</sup> Manual curation processes are of course limited in their coverage: the number of
objects to be curated in datasets like those of Cultura Italia or other
aggregators is too high for the human resources available.</p>
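      <p>The completeness and accuracy metrics described in this section can be sketched in Python as follows. This is an illustration under stated assumptions, not the proposed implementation: the proposal does not specify the vector space model, so plain bag-of-words term frequencies with cosine distance are assumed here, and the profile subset, the domain description, the records and all function names are hypothetical.</p>
      <preformat>
```python
import math
import re
from collections import Counter


def is_filled(value):
    """A field is filled if it holds at least one non-empty string."""
    if isinstance(value, (list, tuple)):  # multi-valued field
        return any(is_filled(v) for v in value)
    return isinstance(value, str) and bool(value.strip())


def completeness(record, profile_fields):
    """Share of the profile's fields that are filled in the record."""
    if not profile_fields:
        return 0.0
    filled = sum(1 for f in profile_fields if is_filled(record.get(f)))
    return filled / len(profile_fields)


def term_vector(text):
    """Bag-of-words term frequencies (assumed, very simple vector space)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_distance(a, b):
    """1 minus the cosine similarity of two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # an empty text shares nothing with the domain
    return 1.0 - dot / (norm_a * norm_b)


def record_text(record):
    """Concatenate all string values of a record, flattening lists."""
    parts = []
    for value in record.values():
        items = value if isinstance(value, (list, tuple)) else [value]
        parts.extend(v for v in items if isinstance(v, str))
    return " ".join(parts)


def accuracy_distance(record, domain_text):
    """Distance between a record and a text describing its domain."""
    return cosine_distance(term_vector(record_text(record)),
                           term_vector(domain_text))


# Hypothetical profile subset, domain description and records.
PROFILE = ["dc:title", "dc:description", "dc:subject", "dc:creator"]
DOMAIN = "castello medievale veduta fotografia"
good = {"dc:title": "Veduta del castello", "dc:subject": ["castello", ""]}
poor = {"dc:title": "photo", "dc:subject": []}

print(completeness(good, PROFILE), completeness(poor, PROFILE))
print(accuracy_distance(good, DOMAIN), accuracy_distance(poor, DOMAIN))

# Intentionally inject an error: blanking the title lowers completeness.
broken = {**good, "dc:title": ""}
print(completeness(broken, PROFILE))
```
      </preformat>
      <p>In this toy setting the record titled "photo" shares no term with the domain description and therefore lies at the maximum distance, while blanking a field in a well-described record lowers its completeness score, which mirrors the error-injection test described above. A realistic vector space would need weighting and language-aware tokenization, which this sketch leaves out.</p>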
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>Considering the number of digital archives, the problems related to metadata
curation become evident. The reasons vary: there is no curation task
force, the metadata curation activity is delegated to the content providers, or
curation is done by hand. The development of an automatic
process will enable curators not only to obtain snapshots of the quality of a
repository, but also to constantly monitor its evolution and how different events
affect it, without the need for costly human effort. This could lead to the
creation of innovative applications based on metadata quality that would improve
the final user experience.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bruce</surname>
            ,
            <given-names>T.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hillmann</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>The Continuum of Metadata Quality: Defining, Expressing, Exploiting</article-title>
          .
          <source>In: Metadata in Practice</source>
          . ALA Editions. (
          <year>2004</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Kiraly</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>A Metadata Quality Assurance Framework</article-title>
          . (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Ochoa</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duval</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Automatic evaluation of metadata quality in digital repositories</article-title>
          .
          <source>In: International Journal on Digital Libraries</source>
          ,
          <volume>10</volume>
          , pp.
          <fpage>67</fpage>
          -
          <lpage>91</lpage>
          . (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Ostojic</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sugimoto</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Durco</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>The Curation Module and Statistical Analysis on VLO Metadata Quality</article-title>
          .
          <source>In: CLARIN annual conference 2016</source>
          , pp.
          <fpage>90</fpage>
          -
          <lpage>100</lpage>
          . CLARIN consortium. (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Pennock</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Digital curation: A life-cycle approach to managing and preserving usable digital information</article-title>
          .
          <source>In: Library and Archives Journal</source>
          , vol.
          <volume>1</volume>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Sompel</surname>
            ,
            <given-names>H. V. D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nelson</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lagoze</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Warner</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Resource Harvesting within the OAI-PMH Framework</article-title>
          .
          <source>In: D-Lib Magazine</source>
          ,
          <volume>10</volume>
          . (
          <year>2004</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Zaveri</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rula</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maurino</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pietrobon</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Quality assessment for Linked Data: A Survey</article-title>
          .
          <source>In: Semantic Web</source>
          ,
          <volume>7</volume>
          , pp.
          <fpage>63</fpage>
          -
          <lpage>93</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Radulovic</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mihindukulasooriya</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>García-Castro</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Gómez-Pérez</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>A comprehensive quality model for Linked Data</article-title>
          .
          <source>Semantic Web</source>
          ,
          <volume>9</volume>
          ,
          <fpage>3</fpage>
          -
          <lpage>24</lpage>
          . (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Melo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Völker</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paulheim</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Type Prediction in Noisy RDF Knowledge Bases Using Hierarchical Multilabel Classification with Graph and Latent Features</article-title>
          .
          <source>International Journal on Artificial Intelligence Tools</source>
          <volume>26</volume>
          (
          <issue>2</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>32</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Paulheim</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Knowledge graph refinement: A survey of approaches and evaluation methods</article-title>
          .
          <source>Semantic Web</source>
          <volume>8</volume>
          (
          <issue>3</issue>
          ):
          <fpage>489</fpage>
          -
          <lpage>508</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>