=Paper=
{{Paper
|id=Vol-289/paper-5
|storemode=property
|title=Semantic Excavation of the City of Books
|pdfUrl=https://ceur-ws.org/Vol-289/p05.pdf
|volume=Vol-289
|dblpUrl=https://dblp.org/rec/conf/kcap/TordaiOS07a
}}
==Semantic Excavation of the City of Books==
Anna Tordai, Borys Omelayenko, Guus Schreiber
VU University, 1081a De Boelelaan, Amsterdam, Netherlands
atordai@cs.vu.nl, b.omelayenko@cs.vu.nl, schreiber@cs.vu.nl
ABSTRACT
As the Semantic Web gains momentum, so does the interest in making knowledge kept in various repositories available. In this paper we describe a case study that uses a methodological approach for porting cultural repositories to the Semantic Web. The approach consists of thesaurus conversion, metadata schema mapping, metadata value mapping, and thesaurus alignment. It is derived from our experience with a number of conversions performed for the E-Culture project, and in this paper we apply it to a collection of data about images related to book printing.

1. INTRODUCTION
In this work we present a case study based on the four activities presented as a poster in [5] at K-Cap 2007. These activities are necessary for converting cultural heritage data into RDF/OWL. The context of this work is the MultimediaN E-Culture project [4] (http://e-culture.multimedian.nl), a leading Semantic Web project that won the Semantic Web Challenge in 2006. The objective of this project is to create a large virtual collection of cultural heritage objects that supports semantic search. Metadata and vocabularies are represented in RDF/OWL. The project demonstrator (see the demonstrator at the project website) includes multiple vocabularies which are partially semantically aligned.

This paper builds on earlier conversions of metadata and thesauri and their commonalities. There are currently 5 collections and 6 thesauri that are part of the E-Culture demonstrator. Among them are the collections from the Royal Tropical Institute (KIT, http://www.kit.nl/) in Amsterdam and the National Museum of Ethnology (RMV, http://www.rmv.nl/) in Leiden. The thesauri include three from Getty (http://www.getty.edu/research/conducting_research/vocabularies/): the Art and Architecture Thesaurus (AAT), the Thesaurus of Geographic Names (TGN) and the Union List of Artist Names (ULAN), as well as the thesaurus of the Dutch Ethnographic Collection Foundation (SVCN, an acronym for Stichting Volkenkundige Collectie Nederland, http://www.svcn.nl/thesaurus.asp). These form "standard" vocabularies in the cultural heritage field, meaning that various institutions have agreed upon and approved their usage. "Local" thesauri or vocabularies, on the other hand, are often created or maintained by a single institution or person.

The objective of the present work is to describe the conversion of the Bibliopolis collection (http://www.bibliopolis.nl/; Latin for city of books) and its alignment to existing vocabularies, performed within the E-Culture project. We follow the four-step process described in [5] to convert the thesaurus and metadata such that these become an interoperable part of the virtual collection. The Bibliopolis collection consists of images related to book printing, ranging from photographs of publishing houses to illustrations of the printing process, and a local thesaurus of keywords. It is a good example of the range of data we come across when dealing with cultural heritage collections and vocabularies.

To represent the collections the project uses a specialization of Dublin Core (DC) for visual resources (all objects in the virtual collection are required to have an image as their data representation) as the guiding metadata scheme. This Dublin Core specialization is named the Visual Resources Association Core (VRA, http://www.vraweb.org/) scheme, which follows the Dublin Core dumb-down principle (i.e. it is a proper specialization and does not contain extensions). Likewise, we model collection-specific metadata schemes as specializations of VRA.

For the representation of thesauri the project uses the SKOS Core Schema (http://www.w3.org/TR/swbp-skos-core-guide/). It was designed to support vocabulary interoperability and is currently undergoing standardization by the World-Wide Web Consortium (W3C). SKOS has already been adopted by large organizations such as NASA.

This paper is organized as follows. We discuss related work in Section 2. We present our approach in Section 3, followed by a short presentation of the Bibliopolis data in Section 4. Next, we devote four sections to the case study, one for each of the four activities: thesaurus conversion, metadata schema mapping, metadata mapping and thesaurus alignment. Finally, we conclude this paper with a discussion in Section 9.

2. RELATED WORK
In the area of thesaurus conversion, Miles et al. [3] propose guidelines for migrating thesauri to the Semantic Web using the SKOS Core schema. They distinguish between standard and non-standard thesauri, and propose to preserve all information in the thesaurus by using sub-class and sub-property statements where necessary.

The work of Van Assem et al. [6] is based on these guidelines. They propose a three-step method consisting of the analysis of the thesaurus, the mapping to the SKOS schema, and the creation of the conversion program. Their case studies show, however, that non-standard thesauri are more difficult to convert completely, as some features cannot be mapped to the SKOS schema.

The problem of interoperability between two collections has been discussed in [1]. Within the SIMILE project, Butler et al. report on the conversion and linkage of a visual works dataset and a learning object dataset using XSLT. The first dataset was converted using the VRA schema and the second using Dublin Core, although non-standard properties were created as extensions. The issues discussed range from the creation of URIs to dealing with hierarchical terms.

In [2] Hyvönen et al. describe the MuseumFinland project, which encompasses multiple collections and ontologies. The collections of various Finnish museums and additional ontologies were converted into RDF/OWL. The metadata of the collections was transformed using a common term ontology, while the additional ontologies form an additional semantic link between the collections and were further enhanced by manual editing and enrichment.

3. APPROACH
The process developed within the E-Culture project for converting datasets to an interoperable Semantic Web format was presented in [5]. Once again, our goal is syntactic and semantic integration of data. In achieving this goal we are driven by the practical needs of the E-Culture project: the need to integrate multiple collections. Accordingly, we follow a practical bottom-up approach in which we enrich real-world data with a thin layer of semantics to achieve interoperability. This approach may be seen as an alternative to the top-down approach that is very common in the Semantic Web community. With the top-down approach we would first need to develop a conceptual model of the cultural heritage world in order to be able to perform semantic enrichment of the data. This ontology development effort has not been started yet, and such efforts would take several years to finish. However, a number of thesauri are available at the moment which are widely used by the cultural communities. In our approach we perform syntactic integration and take the first step towards semantic integration by performing terminological integration. The task of integrating collections and vocabularies from both a structural and a terminological perspective has evolved into four activities, which are summarized in Fig. 1:

• Thesaurus conversion, including thesaurus schema mapping. This step is a relatively well-researched area, e.g. [6], with SKOS being the default option for the thesaurus schema.
• Metadata schema mapping. Here we are looking at generic schemas like Dublin Core and its specializations to the cultural domain, such as VRA.
• Metadata conversion. At this step the data values are converted and looked up in the local thesaurus or external vocabularies using information extraction techniques. Data interpretation is also common here, especially for data that does not directly fit the standard vocabularies.
• Thesaurus alignment. Here we align the thesaurus to external (standard) vocabularies with ontology alignment techniques.

Figure 1: The four activities for converting a collection. (Diagram boxes: thesaurus schema mapping, metadata schema mapping, metadata mapping, thesaurus alignment.)

Structural integration is performed during thesaurus schema mapping for vocabularies, and during metadata schema mapping for collections. The terminological integration performed during metadata mapping and thesaurus alignment depends on the schema mapping activities, which we denote with vertical arrows. As vocabularies tend to be used in collection metadata, making this link explicit is part of the semantic enrichment process. Collection metadata in turn may contain implicit vocabularies hidden in data values that are candidates for thesaurus alignment.

4. BIBLIOPOLIS DATA
The Bibliopolis data from the Koninklijke Bibliotheek (KB), the National Library of the Netherlands, consists of two XML files: collection and thesaurus. The collection file contains the metadata of 1,645 images related to the printing of books and book illustrations. The thesaurus contains 1,033 terms used as keywords for indexing images. These two files are a part of the Bibliopolis website. Both the thesaurus and the metadata are bilingual (English and Dutch).

Thesaurus. The thesaurus contains core terms, augmented with their synonyms in plural, and variants of these terms in singular, along with a descriptive note. Each record may also contain related, broader and narrower terms. Additionally, it contains some administrative data: the initials of the record creator, the date of entry, and the date of modification. A sample XML element for the term university printer is shown in Fig. 2.

Figure 2: Thesaurus record for term University printer. Element values, in document order: 2; academiedrukkers; academiedrukker; universiteitsdrukker; aan een universiteit verbonden...; academische geschriften; overheidsdrukkers; university printer; emo; 12/13/01; universiteitsdrukkers; drukkers; university printers; university printer; academy printer; academic printer; a printer appointed by...; academy printers; academic printers.
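
A record like the one in Fig. 2 can be pictured as a flat mapping from field names to values, which Table 1 then maps to SKOS properties. The following is a minimal Python sketch of that idea, not the converter used in the project. The `to_skos` and `concept_uri` helpers, the underscore spelling of the English field names (e.g. `TWOND_EN`) and the replacement of spaces by underscores in URIs are assumptions made for illustration; the field-to-property pairs are taken from Table 1.

```python
# Illustrative only: property names follow Table 1; the helper names and
# the underscore spelling of field names (TWOND_EN, ...) are assumptions.

LITERAL_FIELDS = {
    "TWOND": ("skos:prefLabel", "nl"),
    "TWSYN": ("skos:altLabel", "nl"),
    "TWVAR": ("skos:altLabel", "nl"),
    "DEF": ("skos:definition", "nl"),
    "TWOND_EN": ("skos:prefLabel", "en"),
    "TWSYN_EN": ("skos:altLabel", "en"),
    "TWVAR_EN": ("skos:altLabel", "en"),
    "DEF_EN": ("skos:definition", "en"),
}
RELATION_FIELDS = {
    "TWBT": "skos:broader",
    "TWNT": "skos:narrower",
    "TWRT": "skos:related",
}
# ENG, INVOERDER and INVDAT are deliberately absent (not converted).
# NUM is also left out of this sketch.


def concept_uri(term):
    """Build a bp: name from a Dutch preferred term (spaces -> underscores)."""
    return "bp:" + term.strip().replace(" ", "_")


def to_skos(record):
    """Return N3-style statements for one thesaurus record.

    `record` maps a field name to a list of string values.
    Literal values are not escaped in this sketch.
    """
    subject = concept_uri(record["TWOND"][0])
    lines = [f"{subject} rdf:type skos:Concept ."]
    for field, values in record.items():
        if field in LITERAL_FIELDS:
            prop, lang = LITERAL_FIELDS[field]
            for value in values:
                lines.append(f'{subject} {prop} "{value}"@{lang} .')
        elif field in RELATION_FIELDS:
            prop = RELATION_FIELDS[field]
            for value in values:
                lines.append(f"{subject} {prop} {concept_uri(value)} .")
    return lines


record = {
    "NUM": ["2"],
    "TWOND": ["academiedrukkers"],
    "TWSYN": ["universiteitsdrukkers"],
    "TWBT": ["drukkers"],
    "TWRT": ["overheidsdrukkers"],
    "TWOND_EN": ["university printers"],
    "INVOERDER": ["emo"],
}
statements = to_skos(record)
for statement in statements:
    print(statement)
```

Keeping the field-to-property pairs in two small lookup tables means that fields absent from both tables, such as the administrative ones, are skipped without special handling.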

Metadata. The metadata describes images related to book printing. The data consists of titles and descriptions of the objects, and the names of their creator(s) with markers for their roles, such as "a" for author. The works are also classified according to the technique used, their type, and a library classification of the subject matter. The metadata includes copyright information, measurements and other administrative information. An example collection object plus corresponding metadata is shown in Fig. 3.

Figure 3: A fragment of a real XML record depicting a Delft Bible dated 10 January 1477, originating from Delft, classified with category 'bibles'. (Certain fields may be empty.) Image: © Koninklijke Bibliotheek (http://www.kb.nl/); Den Haag, Koninklijke Bibliotheek, 169 E 56. Element values, in document order: 6; Delftse Bijbel...; Delft Bible...; Yemantszoon, Mauricius : d; tekstbladzijde; boekdruk; 10 jan. 1477; D; Bijbel. Oude Testament...; typografische vormgeving; bijbels; Delft; Eerste bijbel die in het Nederlands verscheen...; The first Bible to appear in the Dutch language...; 27 x 20 cm; ...

5. THESAURUS CONVERSION
Thesaurus schema mapping and conversion is a relatively well-researched area. In our work we used the method for thesaurus conversion proposed by van Assem [6]. As for the thesaurus schema, we use SKOS within the E-Culture project.

Mapping the Bibliopolis thesaurus turned out to be relatively straightforward, as it fits the SKOS template. Table 1 shows the details of the mapping of the thesaurus representation in Fig. 2 to SKOS. Two XML elements were not converted, as they contain bookkeeping information and are not meant for public consumption. One XML element (ENG in Table 1) turned out to be a duplicate piece of information and was therefore omitted. It should be noted that this conversion was guided by the requirements of the project, which do not include complete conversion of the data.

Table 1: Mapping thesaurus data to SKOS
Data Item | Function | Activity | Source | Target Property/Class
NUM | Internal identifier | Create literal | 2 | vra:location.refId "2" ;
TWOND | Preferred term in Dutch | Create URI, literal and language tag | academiedrukkers | bp:academiedrukkers rdf:type skos:Concept ; skos:prefLabel "academiedrukkers"@nl ;
TWSYN | Synonym in Dutch | Create literal and language tag | universiteitsdrukkers | skos:altLabel "universiteitsdrukkers"@nl ;
TWVAR | Term in singular form in Dutch | Create literal and language tag | academiedrukker | skos:altLabel "academiedrukker"@nl ;
DEF | Definition in Dutch | Create literal and language tag | aan een universiteit verbonden... | skos:definition "aan een universiteit verbonden..."@nl ;
TWBT | Broader term | Look up concept URI and add URI | drukkers | skos:broader bp:drukkers ;
TWNT | Narrower term | Look up concept URI and add URI | narrower term | skos:narrower bp:narrower term ;
TWRT | Related term | Look up concept URI and add URI | overheidsdrukkers | skos:related bp:overheidsdrukkers ;
TWOND EN | Preferred term in English | Create literal and language tag | university printers | skos:prefLabel "university printers"@en ;
TWSYN EN | Synonym in English | Create literal and language tag | academy printers | skos:altLabel "academy printers"@en ;
TWVAR EN | Term in singular form in English | Create literal and language tag | university printer | skos:altLabel "university printer"@en ;
DEF EN | Definition in English | Create literal and language tag | a printer appointed by... | skos:definition "a printer appointed by..."@en ;
ENG | English translation of term | Not converted: duplicate information | university printer | (none)
INVOERDER | Entered by | Not converted: not part of requirements | emo | (none)
INVDAT | Date of entry | Not converted: not part of requirements | 12/13/01 | (none)

The creation of the URI deserves special mention. When creating a URI we derive it from the real term identifier, followed by the disambiguation signature and the thesaurus version. For example, in the Bibliopolis case the real identifiers are stored in field TWOND (and not in NUM, which contains
a file-specific index rather than the real term identifier), they are unambiguous, and we have a single version.

6. METADATA SCHEMA MAPPING
In this activity we map the original record fields (see Fig. 3) to a metadata schema. In the E-Culture project we use the VRA Core scheme, which is a specialization of Dublin Core (http://dublincore.org/) for visual resources (our target type of resources).

Before mapping to the schema we analyze the metadata (including examination of any additional documentation, websites, and interviews with experts). The meaning of the fields needs to be understood to find a correct correspondence within the target schema. The first impression of the meaning of a field might be misleading. For example, the TWGEO field was initially mapped to vra:location, i.e., the DC/VRA element indicating where the work was created. However, the documentation showed that the field actually gives information about the location related to the subject, and not the place of creation. We finally used the VRA Core v4 element vra:subject.geographicPlace, which gives the correct interpretation. This element is a subproperty of the DC/VRA subject element.

An important additional consideration is that certain records or fields may contain confidential or administrative information, such as acquisition or bookkeeping information. For example, the amount for which an object is insured should not be publicly visible. This situation did not occur with the Bibliopolis data.

Table 2 shows an overview of the mapping from the XML record fields to a VRA metadata schema, with examples. Here we face two situations. First, in the simplest case, there is an exact semantic match between an original field and a VRA field. Second, if this is not the case, the field should be specified as a specialization of an existing VRA element. In the Bibliopolis case this occurs with the ORIGINAL, REPRODUCTION and CLASSIFICATION fields (for readability we use the English field names in the text where they are close to the Dutch equivalents, e.g. "original" vs. "origineel"). The first two are specific "titles", the third one is a specific "subject" description. In Table 2 we see that the RDF/OWL specification contains property definitions in the Bibliopolis namespace (bp:) paired with a statement about the subproperty relationship with a VRA element.

One field requires some deeper study. The MAKER field contains not only the creator of the work, but also a character indicating the role that the person played in creating the work. As shown in the example record in Fig. 3, the MAKER field has the value Yemantszoon, Mauricius : d, where "d" stands for "drukker", Dutch for "printer". To preserve the roles of the creators we specialize the VRA property vra:creator with properties that correspond to the roles found in the Bibliopolis data. This resulted in a set of RDF/OWL definitions such as:

bp:drukker rdfs:subPropertyOf vra:creator .
bp:origineel rdfs:subPropertyOf vra:title .
bp:reproductie rdfs:subPropertyOf vra:title .
bp:classificatie rdfs:subPropertyOf vra:subject .
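
The treatment of the MAKER field can be illustrated with a short Python sketch. It is a hedged illustration rather than the project's conversion code: the `parse_maker`, `name_uri` and `maker_statements` helpers, the `ROLE_PROPERTIES` table and the underscore-based URI spelling are assumptions. Of the role markers, only "d" (bp:drukker) is taken from the text; other roles would need their own entries.

```python
import re

# Role marker -> role property. Only "d" -> bp:drukker comes from the
# text; the other roles found in the collection are not listed here.
ROLE_PROPERTIES = {"d": "bp:drukker"}


def parse_maker(value):
    """Split 'Surname, Forename : marker' into (name, marker)."""
    name, sep, marker = value.rpartition(":")
    if not sep:  # no role marker present
        return value.strip(), None
    return name.strip(), marker.strip()


def name_uri(name):
    """Hypothetical URI scheme: drop punctuation, join words with '_'."""
    return "bp:" + "_".join(re.findall(r"\w+", name))


def maker_statements(work_uri, value):
    """Emit N3-style statements linking a work to its creator by role."""
    name, marker = parse_maker(value)
    # Fall back to the generic VRA property when the marker is unknown.
    prop = ROLE_PROPERTIES.get(marker, "vra:creator")
    person = name_uri(name)
    label = " ".join(re.findall(r"\w+", name))
    out = []
    if prop != "vra:creator":
        out.append(f"{prop} rdfs:subPropertyOf vra:creator .")
    out.append(f"{work_uri} {prop} {person} .")
    out.append(f'{person} rdfs:label "{label}" .')
    return out


result = maker_statements("bp:6", "Yemantszoon, Mauricius : d")
for statement in result:
    print(statement)
```

Because every role property is declared as a subproperty of vra:creator, an application that only understands VRA can still retrieve the creator, while the specific role remains available to applications that know the bp: properties.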

(The property definitions above use the RDF N3 notation.)

Dublin Core has excellent general coverage. In all collections we have tackled so far, we were able to find for each field a Dublin Core / VRA element which was either an equivalent or could act as a superproperty of a local specialization. This characteristic makes Dublin Core a powerful tool for metadata interoperability.

7. METADATA VALUE CONVERSION
After the schema is created, the data values of the fields have to be converted. As discussed in [5], we have two kinds of fields: those that contain free-text literal values, such as a description field, and those that contain values from (implicit) vocabularies, such as the fields for keywords or geographic places. In the latter case we distinguish between three kinds of vocabularies to which the field value can be converted:

1. A local vocabulary.
2. A vocabulary that is implicitly present in the field values.
3. Terms that may belong to a vocabulary.

In the Bibliopolis dataset we had the following situations for metadata value mappings:

Converting to a local vocabulary concept. Option 1 is exemplified by the values of the field TWOND, which represent thesaurus concepts. This relationship is explicitly present in the source data and is preserved during the metadata value conversion. We create the RDF/OWL representations and use the corresponding URIs of these entries in the Bibliopolis thesaurus. Once again, these URIs are composed of text, as the records refer to the (unique) Dutch text label of the concept and not to the concept identifier. This is relevant information for the choice of the URI naming scheme for vocabulary concepts (cf. Section 5).

Converting to an implied vocabulary concept. In this case we map field values to resources which form new vocabularies implicitly present in the data. In the Bibliopolis data there were two fields whose values formed an implicit vocabulary.

In Table 2 we see the value "D" in the field CLASSIFICATIE. Further analysis revealed that these single-letter values actually represent a small vocabulary for library-type classifications of the subject. This information is not part of the XML data, but is only shown on the Bibliopolis website. This classification vocabulary also has some broader/narrower relations. We represented this vocabulary using the SKOS template and mapped the field values to concepts from this vocabulary.

The RDF example in Fig. 4 shows the SKOS specification of a subset of such classification subjects, including the D concept. The M concept ("secondary subjects") has a hierarchical substructure.

bp:A rdf:type skos:Concept .
bp:A skos:prefLabel "General works"@en .

bp:D rdf:type skos:Concept .
bp:D skos:prefLabel "History of the art of printing"@en .

bp:M rdf:type skos:Concept .
bp:M skos:prefLabel "Secondary subjects"@en .

bp:M1 rdf:type skos:Concept .
bp:M1 skos:prefLabel "Philosophy, psychology"@en ;
skos:broader bp:M .

bp:M4 rdf:type skos:Concept .
bp:M4 skos:prefLabel "language and literature"@en ;
skos:broader bp:M .

bp:M41 rdf:type skos:Concept .
bp:M41 skos:prefLabel "English"@en ;
skos:broader bp:M4 .

bp:M41 rdf:type skos:Concept .
bp:M41 skos:prefLabel "German"@en ;
skos:broader bp:M4 .

Figure 4: RDF specification (in N3 notation) of some sample classification concepts. The "M" concept is the top concept of a BT/NT hierarchy.

The other implicit vocabulary present within the data is that of roles. The field MAKER contains the name of the creator along with his or her role (e.g. Yemantszoon, Mauricius : d, where d stands for printer), which is one of 14 roles. We create RDF representations of these terms as SKOS concepts.

Converting into a typed resource. Again, we create new RDF resources from field values that are potentially part of some vocabulary. We create a unique URI by adding the field name to the field value. For example, for values of the field TECHNIQUE this results in &bp;techniek_boekdruk, which is part of the bp: namespace. The field name is included because the values of TECHNIQUE and OBJECT sometimes coincide; for example, foto is a technique as well as an object type. The vocabulary in question can be an existing standard vocabulary such as the AAT, in which case an alignment between the new resource and the vocabulary has to be performed. In the Bibliopolis data a number of values of the fields TECHNIQUE, OBJECT and TWGEO can be aligned to the AAT and TGN. There were a small number of unmapped values of field TECHNIQUE (13) and of field OBJECT (5), as can be seen in Table 3. These terms can be added to the AAT by extending it. The alignment and extension are further discussed in Section 8.

We also create resources from field values where the vocabulary the values belong to is unknown or the mapping is not performed. This allows for the option of creating future semantic extensions, although as a result we have a number of resources we do not use. In general, these may be names of organizations or persons, places, cultures or historical periods. In Bibliopolis the values of MAKER and TWNAAM contain person names. These names can possibly be linked to the ULAN vocabulary. We create resources out of these
names with URIs in the bp: namespace, removing invalid characters and spaces. The concepts are of type ulan:person and the human-readable label contains the name.

Converting to a literal. Finally, pieces of text such as titles and descriptions are converted to literals. In Bibliopolis the values of the TITLE and DESCRIPTION fields were converted into literals with language tags, as the titles and descriptions of works are available both in English and in Dutch.

Table 2: Part of the Bibliopolis metadata with examples, function and RDFS property/classes
Data Item | Function | Activity | Source | Target Property/Class
NUMMER | Record Id | Create URI and additional project-specific triples (&vra;Work) | 6 | bp:6 rdf:type vra:Work .
TITEL | Title in Dutch | Create literal and language tag | Delftse Bijbel... | vra:title "Delftse Bijbel..."@nl ;
TITEL EN | Title in English | Create literal and language tag | Delft Bible... | vra:title "Delft Bible..."@en ;
MAKER | Creator and his marker for role | Extract name and role marker, create URI and label for name and convert marker to role, create role as subproperty of vra:creator | Yemantszoon, Mauricius : d (d stands for drukker, meaning printer) | bp:drukker bp:Yemantszoon Mauricius ; bp:Yemantszoon Mauricius rdf:type ulan:person ; rdfs:label "Yemantszoon Mauricius" .
OBJECT | Object type | Map to AAT or create local extension to AAT and mapping | tekstbladzijde (text page) | vra:type bp:object_tekstbladzijde ; bp:tekstbladzijde rdf:type skos:Concept ; skos:prefLabel "tekstbladzijde"@nl ; skos:broader AAT:pages ;
TECHNIEK | Technique used | Map to AAT or create local extension to AAT and mapping | boekdruk (book printing) | vra:technique bp:techniek_boekdruk ; bp:boekdruk rdf:type skos:Concept ; skos:prefLabel "boekdruk"@nl ; skos:broader AAT:printing ;
DATERING | Date | Interpret and filter data | 10 jan. 1477 | vra:date "10-01-1477"
ORIGINEEL or REPRODUCTIE | Title of the original or reproduction (book) containing the image | (The title, author, date, place and page number can be extracted) | Bijbel. Oude Testament... | bp:origineel "Bijbel. Oude Testament..."@en ;
CLASSIFICATIE | Classification of the work in librarian terms using a code | Interpret code, create URI with code, use interpretation as label, keep identifier and create resource | D (code interpreted as History of book printing) | bp:classificatie bp:D ;
TWNAAM | Person used as subject for work | Interpret name and create URI | John Do | vra:subject.personalName bp:John Do ; bp:John Do rdf:type ulan:person ; rdfs:label "John Do" .
TWOND | Thesaurus term used as subject | Create mapping to thesaurus | typografische vormgeving | vra:subject bp:typografische vormgeving ;
TWGEO | Place used as subject for work | Create mapping to TGN where possible or keep literal with language tag | Delft | vra:subject.geographicPlace tgn:7006804 ;
OMSCHRIJVING or OMSCHRIJVING EN | Dutch or English description | Create literal and language tag | Eerste bijbel die... | vra:description "Eerste bijbel die..."@nl ;
AFMETINGEN | Size of the work | Create literal | 27 x 20 cm | vra:measurements.dimensions "27 x 20 cm" .

8. THESAURUS ALIGNMENT
The local thesaurus and the resources containing techniques, object types and locations extracted from the data during the metadata conversion process need to be aligned with standard vocabularies.

We aligned the Bibliopolis thesaurus to AAT by syntactically matching the Dutch skos:prefLabel to the Dutch translation of AAT preferred terms, and mapped 209 concepts out of 1033, as presented in Table 3.

Table 3: Mappings between the Bibliopolis data and other vocabularies
Source Data | Vocabulary | Terms Mapped | Terms Total | Instances Mapped | Instances Total
Thesaurus | AAT | 209 | 1033 | - | -
Metadata technique | AAT | 15 | 28 | 1332 | 1468
Metadata object type | AAT | 14 | 19 | 978 | 1507
Metadata subject place | TGN | 32 | 69 | 349 | 480

Then, we need to identify the relation between the matched terms. The OWL owl:sameAs relation is typically an overstatement that we try to avoid, as ambiguity is quite common. The SKOS Mapping Vocabulary specification (http://www.w3.org/2004/02/skos/mapping/spec/) was created for the purpose of linking thesauri to each other. It specifies relationships such as skos:exactMatch, skos:broadMatch, skos:narrowMatch and more for aligning vocabularies. For this alignment the mappings are still based on the lexical match of term labels, which corresponds to the relation skos:exactMatch.
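
The lexical matching step can be sketched as follows. This is a minimal Python illustration under stated assumptions: the `align_by_label` function, the case-insensitive normalization and the `aat:` identifiers in the toy data are hypothetical, and a real run would read the labels from the RDF representations of both vocabularies rather than from dictionaries. A label with exactly one candidate yields a skos:exactMatch; a label with several candidates is set aside for manual inspection.

```python
from collections import defaultdict


def normalize(label):
    """Lower-case and collapse whitespace so trivial variants compare equal."""
    return " ".join(label.lower().split())


def align_by_label(local_labels, target_labels):
    """Match local concepts to target concepts on identical Dutch labels.

    Both arguments map a concept URI to its Dutch preferred label.
    Returns (matches, ambiguous, unmapped).
    """
    index = defaultdict(list)
    for uri, label in target_labels.items():
        index[normalize(label)].append(uri)

    matches, ambiguous, unmapped = [], {}, []
    for uri, label in local_labels.items():
        candidates = index.get(normalize(label), [])
        if len(candidates) == 1:
            matches.append((uri, "skos:exactMatch", candidates[0]))
        elif candidates:
            ambiguous[uri] = candidates  # left for manual review
        else:
            unmapped.append(uri)
    return matches, ambiguous, unmapped


# Toy data: the aat: identifiers below are placeholders, not real AAT IDs.
local = {
    "bp:drukkers": "drukkers",
    "bp:bijbels": "Bijbels",
    "bp:academiedrukkers": "academiedrukkers",
}
target = {"aat:x1": "drukkers", "aat:x2": "bijbels", "aat:x3": "bijbels"}

matches, ambiguous, unmapped = align_by_label(local, target)
coverage = len(matches) / len(local)
print(matches, ambiguous, unmapped, coverage)
```

The same coverage ratio applied to the thesaurus row of Table 3 gives 209 / 1033, or about 20 percent of the terms mapped.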

The field TWGEO contains geographic names, which were mapped to TGN. As the values of this field are in Dutch, we extended TGN by adding the Dutch labels to the proper concepts. For example, the value Parijs is the Dutch label of Paris in TGN. Such extensions had to be performed manually, while the mapping of values to cities in the Netherlands could be performed automatically, as the labels in TGN contain the Dutch language version. We used syntactic matching for finding appropriate mappings, along with some additional techniques to reduce ambiguity, such as restricting the search to cities instead of provinces and using background knowledge like the vernacular names of cities. We automatically mapped only unambiguous terms and mapped ambiguous terms manually. Background knowledge of the collection data helped in resolving ambiguity, as it restricted the places the data could be associated with.

The values of the fields TECHNIQUE and OBJECT were also aligned with AAT using syntactic matching, once more using the skos:exactMatch relation. As can be seen in Table 3, a number of terms were not mapped. We extend the AAT by adding the leftover terms to some part of the vocabulary if possible. For instance, the technique boekdruk (book printing) is not part of AAT but is a special kind of printing technique; therefore the AAT concept printing is selected as the broader term. We use the SKOS template to represent the extension.

From Table 3 we can see that a large number of resources are created without being linked to vocabularies. Such resources might be seen as unnecessary overhead, but they can be used in the future when new vocabularies are added or when they are mapped manually. Almost 80 percent of the thesaurus terms were not mapped to AAT, and while a number of terms could be linked with skos:broadMatch, this would require additional manual work which could take up a significant amount of time while yielding few matches. This is not the case for the values of the TECHNIQUE, OBJECT and TWGEO fields, where manually aligning 13, 5 and 37 terms, respectively, would yield complete alignments. For OBJECT, linking 5 terms would align roughly another 500 occurrences of these terms in the metadata, which is one third of the total occurrences and well worth the manual effort.

9. DISCUSSION
Interoperability is becoming one of the key issues in the open Web world. Many research programs, such as the IST program of the EU, have interoperability high on the agenda. However, real interoperability between collections is still scarce. Until now, many approaches have focused on interoperability as a problem between two collections.

In this paper we take a different approach. We assume that a multitude of collections will become part of the interoperable space; the activities we present can to a large extent be carried out by studying an individual collection. Mapping to other existing vocabularies requires knowledge of other components, but there is no need for these to be complete. For vocabulary alignment the adage "a little semantics goes a long way" (a quote from J. Hendler) holds. Also, one should not view this as a one-shot activity. Metadata and vocabularies change, so extensions will take place at regular intervals. This also means that tool support should be in place to support this process, allowing updates to be generated semi-automatically, similar to AnnoCultor (http://annocultor.sourceforge.net/), which is currently being developed within the E-Culture project.

For the E-Culture virtual collection we have now carried out this process a number of times. This paper should be viewed as a post-hoc rationalization of this work. Our goal is to provide a set of methods and tools that allow collection owners (museums, archives) to carry out this process. Cultural-heritage institutions are now often bound to closed content management systems; the "three-O" paradigm (open access, open data, open standards) is gaining support, but we have to provide the owners of collections with the necessary support facilities.

We see two potential weaknesses of this work. Firstly, our process still requires much more tool support. In particular for vocabulary alignment we need to explore how existing tools, such as the ones participating in the OAEI contest, perform on this data set. Our current work is still too much based on manual work and only uses simple syntactic tools.

Secondly, the use of Dublin Core as "top-level ontology" for the structure of the metadata can also be perceived as a risk. What if the collection has metadata fields that fit with none of the DC elements? However, this was not a problem in any of these six collections. For the moment it seems Dublin Core is indeed a key resource in information interoperability. It is, however, a challenge to construct reasoners that make use of the collection-specific specializations.

This article does not show the actual added value of the converted collection content. For this, readers are encouraged to visit the E-Culture online demonstrator, which contains the Bibliopolis data.

10. ACKNOWLEDGMENTS
We are grateful to our colleagues from the MultimediaN E-Culture team: Alia Amin, Lora Aroyo, Victor de Boer, Lynda Hardman, Michiel Hildebrand, Marco de Niet, Annelies van Nispen, Marie France van Orsouw, Jacco van Ossenbruggen, Annemiek Teesing, Jan Wielemaker and Bob Wielinga. We would also like to thank Mark van Assem for his input. The project is a collaboration between the Free University Amsterdam, the Centre of Mathematics and Computer Science (CWI), the University of Amsterdam, Digital Heritage Netherlands (DEN) and the Netherlands Institute for Cultural Heritage (ICN). The MultimediaN project is funded through the BSIK programme of the Dutch government.

We are especially thankful to Marieke van Delft of the Koninklijke Bibliotheek (National Library of the Netherlands) for her cooperation in the Bibliopolis case.

11. REFERENCES
[1] M. H. Butler, J. Gilbert, A. Seaborne, and K. Smathers. Data conversion, extraction and record linkage using XML and RDF tools in Project SIMILE. Technical report, Digital Media Systems Laboratory and HP Laboratories, August 2004.
[2] E. Hyvönen, E. Mäkelä, M. Salminen, A. Valo, K. Viljanen, S. Saarela, M. Junnila, and S. Kettula. MuseumFinland - Finnish museums on the Semantic Web. Web Semantics: Science, Services and Agents on the World Wide Web, 3(2-3):224-241, October 2005.
[3] A. J. Miles, N. Rogers, and D. Beckett. Migrating thesauri to the Semantic Web - guidelines and case studies for generating RDF encodings of existing thesauri.
[4] G. Schreiber, A. Amin, M. van Assem, V. de Boer, L. Hardman, M. Hildebrand, L. Hollink, Z. Huang, J. van Kersen, M. de Niet, B. Omelayenko, J. van Ossenbruggen, R. Siebes, J. Taekema, J. Wielemaker, and B. J. Wielinga. MultimediaN E-Culture demonstrator. In I. F. Cruz, S. Decker, D. Allemang, C. Preist, D. Schwabe, P. Mika, M. Uschold, and L. Aroyo, editors, International Semantic Web Conference, volume 4273 of Lecture Notes in Computer Science, pages 951-958. Springer, 2006.
[5] A. Tordai, B. Omelayenko, and G. Schreiber. Thesaurus and metadata alignment for a semantic e-culture application. 2007.
[6] M. van Assem, V. Malaisé, A. Miles, and G. Schreiber. A method to convert thesauri to SKOS. In Y. Sure and J. Domingue, editors, ESWC, volume 4011 of Lecture Notes in Computer Science, pages 95-109. Springer, 2006.