Semantic Excavation of the City of Books

Anna Tordai, VU University, De Boelelaan 1081a, Amsterdam, Netherlands, atordai@cs.vu.nl
Borys Omelayenko, VU University, De Boelelaan 1081a, Amsterdam, Netherlands, b.omelayenko@cs.vu.nl
Guus Schreiber, VU University, De Boelelaan 1081a, Amsterdam, Netherlands, schreiber@cs.vu.nl

ABSTRACT
As the Semantic Web gains momentum, so grows the interest in making knowledge kept in various repositories available. In this paper we describe a case study using a methodological approach for porting cultural repositories to the Semantic Web. The approach consists of thesaurus conversion, metadata schema mapping, metadata value mapping, and thesauri alignment. It is derived from our experience collected in a number of conversions we have performed for the E-Culture project, and in this paper we apply it to a collection of data about images related to book printing.

1. INTRODUCTION
In this work we present a case study based on the four activities presented as a poster in [5] at K-Cap 2007. These activities are necessary for converting cultural heritage data into RDF/OWL. The context of this work is the MultimediaN E-Culture project [4]¹, a leading Semantic Web project that won the Semantic Web Challenge in 2006. The objective of this project is to create a large virtual collection of cultural heritage objects that supports semantic search. Metadata and vocabularies are represented in RDF/OWL. The project demonstrator (available at the project website) includes multiple vocabularies which are partially semantically aligned.

This paper builds on earlier conversions of metadata and thesauri and their commonalities. There are currently 5 collections and 6 thesauri that are part of the E-Culture demonstrator. Among them are the collections from the Royal Tropical Institute (KIT)² in Amsterdam and the National Museum of Ethnology (RMV)³ in Leiden. The thesauri include three from Getty⁴: the Art and Architecture Thesaurus (AAT), the Thesaurus of Geographical Names (TGN) and the Union List of Artist Names (ULAN), as well as the Dutch Ethnographic Collection Foundation (SVCN)⁵ thesaurus. These form "standard" vocabularies in the cultural heritage field, meaning that various institutions have agreed upon and approved their usage. "Local" thesauri or vocabularies, on the other hand, are often created or maintained by a single institution or person.

The objective of the present work is to describe the conversion of the Bibliopolis⁶ collection (Latin for city of books) and its alignment to existing vocabularies performed within the E-Culture project. We follow the four-step process described in [5] to convert the thesaurus and metadata such that these become an interoperable part of the virtual collection. The Bibliopolis collection consists of images related to book printing, ranging from photographs of publishing houses to illustrations of the printing process, and a local thesaurus of keywords. It is a good example of the range of data we come across when dealing with cultural heritage collections and vocabularies.

To represent the collections the project uses a specialization of Dublin Core (DC) for visual resources (all objects in the virtual collection are required to have an image as their data representation) as the guiding metadata scheme. This Dublin Core specialization is named the Visual Resources Association Core (VRA)⁷ scheme, which follows the Dublin Core dumb-down principle (i.e. it is a proper specialization and does not contain extensions). Likewise, we model collection-specific metadata schemes as specializations of VRA.

For the representation of thesauri the project uses the SKOS Core Schema⁸. It was designed to support vocabulary interoperability and is currently undergoing standardization by the World-Wide Web Consortium (W3C). SKOS has already been adopted by large organizations such as NASA.

This paper is organized as follows. We discuss related work in Section 2. We present our approach in Section 3, followed by a short presentation of the Bibliopolis data in Section 4. Next, we devote four sections to describing the case study based on the following four activities: thesaurus conversion, metadata schema mapping, metadata mapping and thesaurus alignment. Finally, we conclude this paper with a discussion in Section 9.

¹ http://e-culture.multimedian.nl
² http://www.kit.nl/
³ http://www.rmv.nl/
⁴ http://www.getty.edu/research/conducting_research/vocabularies/
⁵ Acronym for Stichting Volkenkundige Collectie Nederland, http://www.svcn.nl/thesaurus.asp
⁶ http://www.bibliopolis.nl/
⁷ http://www.vraweb.org/
⁸ http://www.w3.org/TR/swbp-skos-core-guide/

2. RELATED WORK
In the area of thesaurus conversion Miles et al. [3] propose guidelines for migrating thesauri to the Semantic Web using the SKOS Core schema. They distinguish between standard and non-standard thesauri, and propose to preserve all information in the thesaurus by using sub-class and sub-property statements where necessary.

The work of Van Assem et al. [6] is based on these guidelines. They propose a three-step method consisting of the analysis of the thesaurus, mapping to the SKOS schema and the creation of the conversion program. The case studies do show, however, that non-standard thesauri are more difficult to convert completely, as some features cannot be mapped to the SKOS schema.

The problem of interoperability between two collections has been discussed in [1]. Within the SIMILE project Butler et al. report on the conversion and linkage of a visual works dataset and a learning object dataset using XSLT. The first dataset was converted using the VRA schema and the second using Dublin Core, although non-standard properties were created as extensions. Issues discussed range from the creation of URIs to dealing with hierarchical terms.

In [2] Hyvönen et al. describe the MuseumFinland project encompassing multiple collections and ontologies. The collections of various Finnish museums and additional ontologies were converted into RDF/OWL. The metadata of the collections was transformed using a common term ontology, while the additional ontologies form an additional semantic link between the collections and were further enhanced by manual editing and enrichment.

3. APPROACH
The process developed within the E-Culture project for converting datasets to an interoperable Semantic Web format was presented in [5]. Once again, our goal is syntactic and semantic integration of data. In achieving this goal we are driven by the practical needs of the E-Culture project: the need to integrate multiple collections. Accordingly, we follow a practical bottom-up approach where we enrich real-world data with a thin layer of semantics to achieve interoperability. This approach may be seen as an alternative to the top-down approach that is very common in the Semantic Web community. With the top-down approach we would first need to develop a conceptual model of the cultural heritage world in order to be able to perform semantic enrichment of the data. This ontology development effort has not been started yet and would take several years to finish. However, there are a number of thesauri available at the moment which are widely used by the cultural communities. In our approach we perform syntactic integration and take the first step towards semantic integration by performing terminological integration. The task of integrating collections and vocabularies from both a structural and a terminological perspective has evolved into four activities, which are summarized in Fig. 1:

• Thesaurus conversion, including thesaurus schema mapping. This step is a relatively well-researched area, e.g. [6], with SKOS being the default option for the thesaurus schema.
• Metadata schema mapping. Here we are looking at generic schemas like Dublin Core and its specializations to the cultural domain, such as VRA.
• Metadata conversion. At this step the data values are converted and looked up in the local thesaurus or external vocabularies using information extraction techniques. Data interpretation is also common here, especially for data that does not directly fit the standard vocabularies.
• Thesaurus alignment. Here we align the thesaurus to external (standard) vocabularies with ontology alignment techniques.

[Figure 1 diagram: boxes labelled Thesaurus schema mapping, Metadata schema mapping, Metadata mapping and Thesaurus alignment.]
Figure 1: The four activities for converting a collection.

Structural integration is performed during thesaurus schema mapping for vocabularies, and metadata schema mapping for collections. The terminological integration performed during metadata mapping and thesaurus alignment is dependent on the schema mapping activities, which we denote with vertical arrows. As vocabularies tend to be used in collection metadata, making this link explicit is part of the semantic enrichment process. Collection metadata in turn may contain implicit vocabularies hidden in data values that are candidates for thesaurus alignment.

4. BIBLIOPOLIS DATA
The Bibliopolis data from the Koninklijke Bibliotheek (KB), the National Library of the Netherlands, consists of two XML files: collection and thesaurus. The collection file contains the metadata of 1,645 images related to the printing of books and book illustrations. The thesaurus contains 1,033 terms used as keywords for indexing images. These two files are a part of the Bibliopolis website. Both the thesaurus and the metadata are bilingual (English and Dutch).

Thesaurus.
The thesaurus contains core terms, augmented with their synonyms in plural, and variants of these terms in singular along with a descriptive note. Each record may also contain related, broader and narrower terms. Additionally, it contains some administrative data: initials of the record creator, the date of entry, and the date of modification. A sample XML element for the term university printer is shown in Fig. 2.

[XML record; field values: 2; academiedrukkers; academiedrukker; universiteitsdrukker; aan een universiteit verbonden...; academische geschriften; overheidsdrukkers; university printer; emo; 12/13/01; universiteitsdrukkers; drukkers; university printers; university printer; academy printer; academic printer; a printer appointed by...; academy printers; academic printers]
Figure 2: Thesaurus record for term University printer

Metadata.
The metadata forms the description of images related to book printing. The data consists of titles and descriptions of the objects, names of their creator(s) with signatures of their roles, such as a for author. The works are also classified according to the technique used, their type, and a library classification of the subject matter. The metadata includes copyright information, measurements and other administrative information. An example collection object plus corresponding metadata is shown in Fig. 3.

[Image of the object. © Koninklijke Bibliotheek (http://www.kb.nl/); Den Haag, Koninklijke Bibliotheek, 169 E 56]
[XML record; field values: 6; Delftse Bijbel...; Delft Bible...; Yemantszoon, Mauricius : d; tekstbladzijde; boekdruk; 10 jan. 1477; D; Bijbel. Oude Testament...; typografische vormgeving; bijbels; Delft; Eerste bijbel die in het Nederlands verscheen...; The first Bible to appear in the Dutch language...; 27 x 20 cm; ...]
Figure 3: A fragment of a real XML record depicting a Delft Bible dated 10 January 1477, originated from Delft, classified with category 'bibles'. (Certain fields may be empty)

5. THESAURUS CONVERSION
Thesaurus schema mapping and conversion is a relatively well-researched area. In our work we used the method for thesauri conversion proposed by van Assem et al. [6]. As for the thesaurus schema, we use SKOS within the E-Culture project.

Mapping the Bibliopolis thesaurus turned out to be relatively straightforward as it fit the SKOS template. Table 1 shows the details of the mapping of the thesaurus representation in Fig. 2 to SKOS. Two XML elements (INVOERDER and INVDAT) were not converted, as they contained bookkeeping information and are not meant for public consumption. One XML element (ENG, see Table 1) turned out to be a duplicate piece of information and was therefore omitted. It should be noted that this conversion was guided by the requirements of the project, which do not include complete conversion of the data.

The creation of the URI deserves special mention. When creating a URI we derive it from the real term identifier followed by the disambiguation signature and the thesaurus version. For example, in the Bibliopolis case the real identifiers are stored in field TWOND (and not NUM, which contains a file-specific index rather than the real term identifier), they are unambiguous, and we have a single version. The URI therefore reduces to the TWOND value in the bp: namespace, e.g. bp:academiedrukkers.

Data Item | Function | Activity | Source and Target Property/Class
NUM | Internal identifier | Create literal | source: 2; target: vra:location.refId "2" ;
TWOND | Preferred term in Dutch | Create URI, literal and language tag | source: academiedrukkers; target: bp:academiedrukkers rdf:type skos:Concept ; skos:prefLabel "academiedrukkers"@nl ;
TWSYN | Synonym in Dutch | Create literal and language tag | source: universiteitsdrukkers; target: skos:altLabel "universiteitsdrukkers"@nl ;
TWVAR | Term in singular form in Dutch | Create literal and language tag | source: academiedrukker; target: skos:altLabel "academiedrukker"@nl ;
DEF | Definition in Dutch | Create literal and language tag | source: aan een universiteit verbonden...; target: skos:definition "aan een universiteit verbonden..."@nl ;
TWBT | Broader term | Look up concept URI and add URI | source: drukkers; target: skos:broader bp:drukkers ;
TWNT | Narrower term | Look up concept URI and add URI | source: narrower term; target: skos:narrower bp:narrower term ;
TWRT | Related term | Look up concept URI and add URI | source: overheidsdrukkers; target: skos:related bp:overheidsdrukkers ;
TWOND EN | Preferred term in English | Create literal and language tag | source: university printers; target: skos:prefLabel "university printers"@en ;
TWSYN EN | Synonym in English | Create literal and language tag | source: academy printers; target: skos:altLabel "academy printers"@en ;
TWVAR EN | Term in singular form in English | Create literal and language tag | source: university printer; target: skos:altLabel "university printer"@en ;
DEF EN | Definition in English | Create literal and language tag | source: a printer appointed by...; target: skos:definition "a printer appointed by..."@en ;
ENG | English translation of term | Not converted; duplicate information | source: university printer
INVOERDER | Entered by | Not converted: not part of requirements | source: emo
INVDAT | Date of entry | Not converted: not part of requirements | source: 12/13/01
Table 1: Mapping thesaurus data to SKOS

6. METADATA SCHEMA MAPPING
In this activity we map the original record fields (see Fig. 3) to a metadata schema. In the E-Culture project we use the VRA Core scheme, which is a specialization of Dublin Core⁹ for visual resources (our target type of resources).

Before mapping to the schema we analyze the metadata (including examination of any additional documentation, websites, and interviews with experts). The meaning of the fields needs to be understood to find a correct correspondence within the target schema. The first impression of the meaning of a field might be misleading. For example, the TWGEO field was initially mapped to vra:location, i.e., the DC/VRA element indicating where the work was created. However, the documentation showed that the field actually gives information about the location related to the subject, and not the creation place. We finally used the VRA Core v4 element vra:subject.geographicPlace, which gives the correct interpretation. This element is a subproperty of DC/VRA subject.

An important additional consideration is that certain records or fields may contain confidential or administrative information such as acquisition or bookkeeping information. For example, the amount for which an object is insured should not be publicly visible. This situation did not occur with the Bibliopolis data.

Table 2 shows an overview of the mapping from the XML record fields to a VRA metadata schema with examples. Here we face two situations. First, in the simplest case, there is an exact semantic match between an original field and a VRA field. Second, if this is not the case, the field should be specified as a specialization of an existing VRA element. In the Bibliopolis case this occurs with the ORIGINAL¹⁰, REPRODUCTION and CLASSIFICATION fields. The first two are specific "titles", the third one is a specific "subject" description. In Table 2 we see that the RDF/OWL specification contains property definitions in the Bibliopolis namespace (bp:) paired with a statement about the subproperty relationship with a VRA element.

One field requires some deeper study. The MAKER field not only contains the creator of the work, but also a character indicating the role that the person played in creating the work. As shown in the example record in Fig. 3, the MAKER field has the value Yemantszoon, Mauricius : d, where "d" stands for "drukker", Dutch for "printer". To preserve the roles of the creators we specialize the VRA property vra:creator with the properties that correspond to the roles found in the Bibliopolis data. This resulted in a set of RDF/OWL definitions such as:

    bp:drukker rdfs:subPropertyOf vra:creator .
    bp:origineel rdfs:subPropertyOf vra:title .
    bp:reproductie rdfs:subPropertyOf vra:title .
    bp:classificatie rdfs:subPropertyOf vra:subject .

(The example uses the RDF N3 notation.)

Dublin Core has excellent general coverage. In all collections we have tackled so far, we were able to find for each field a Dublin Core / VRA element which was either an equivalent, or could act as superproperty of a local specialization. This characteristic makes Dublin Core a powerful tool for metadata interoperability.

⁹ http://dublincore.org/
¹⁰ For readability we use the English in the text, in cases where it is close to the Dutch equivalent ("original" vs. "origineel")

7. METADATA VALUE CONVERSION
After the schema is created the data values of the fields have to be converted. As discussed in [5] we have two kinds of fields: those that contain free-text literal values, such as a description field, and those that contain values from (implicit) vocabularies, such as the fields for keywords or geographic places. In the latter case we distinguish between three kinds of vocabularies to which the field value can be converted:

1. A local vocabulary.
2. A vocabulary that is implicitly present in the field values.
3. Terms that may belong to a vocabulary.

In the Bibliopolis dataset we had the following situations for metadata value mappings:

Converting to a local vocabulary concept. Option 1 is exemplified by the values of the field TWOND, which represent thesaurus concepts. This relationship is explicitly present in the source data and is preserved during the metadata value conversion. We create the RDF/OWL representations and use the corresponding URIs of these entries in the Bibliopolis thesaurus. Once again, these URIs are composed of text, as the records refer to the (unique) Dutch text label of the concept and not to the concept identifier. This is relevant information for the choice of the URI naming scheme for vocabulary concepts (cf. Section 5).

Converting to an implied vocabulary concept. In this case we map field values to resources which form new vocabularies implicitly present in the data. In the Bibliopolis data there were two fields whose values formed an implicit vocabulary.

In Table 2 we see the value "D" in the field CLASSIFICATIE. Further analysis revealed that these single-letter values actually represent a small vocabulary for library-type classifications of the subject. This information is not part of the XML data, but is only shown on the website of Bibliopolis. This classification vocabulary also has some broader/narrower relations. We represented this vocabulary using the SKOS template and mapped the field values to concepts from this vocabulary.

The RDF example in Fig. 4 shows the SKOS specification of a subset of such classification subjects, including the D concept. The M concept ("secondary subjects") has a hierarchical substructure.

    bp:A rdf:type skos:Concept .
    bp:A skos:prefLabel "General works"@en .
    bp:D rdf:type skos:Concept .
    bp:D skos:prefLabel "History of the art of printing"@en .
    bp:M rdf:type skos:Concept .
    bp:M skos:prefLabel "Secondary subjects"@en .
    bp:M1 rdf:type skos:Concept .
    bp:M1 skos:prefLabel "Philosophy, psychology"@en ;
        skos:broader bp:M .
    bp:M4 rdf:type skos:Concept .
    bp:M4 skos:prefLabel "language and literature"@en ;
        skos:broader bp:M .
    bp:M41 rdf:type skos:Concept .
    bp:M41 skos:prefLabel "English"@en ;
        skos:broader bp:M4 .
    bp:M41 rdf:type skos:Concept .
    bp:M41 skos:prefLabel "German"@en ;
        skos:broader bp:M4 .

Figure 4: RDF specification (in N3 notation) of some sample classification concepts. The "M" concept is the top concept of a BT/NT hierarchy

The other implicit vocabulary present within the data is that of roles. The field MAKER contains the name of the creator along with its role (e.g. Yemantszoon, Mauricius : d, where d stands for printer), which is one of 14 roles. We create RDF representations of these terms as SKOS concepts.

Converting into a typed resource. Again, we create new RDF resources from field values that are potentially part of some vocabulary. We create a unique URI by adding the field name to the field value. For example, for values of the field TECHNIQUE this results in &bp;techniek_boekdruk, which is part of the bp: namespace. The reason for this is that the values of TECHNIQUE and OBJECT sometimes coincide; for example, foto is a technique as well as an object type. This vocabulary can be an existing standard vocabulary such as the AAT, in which case an alignment between the new resource and the vocabulary has to be performed. In the Bibliopolis data a number of values of the fields TECHNIQUE, OBJECT and TWGEO can be aligned to the AAT and TGN. There were a small number of unmapped values of field TECHNIQUE (13) and of field OBJECT (5), as can be seen in Table 3. These terms can be added to the AAT by extending it. The alignment and extension are further discussed in Section 8.

We also create resources from field values where the vocabulary the values belong to is unknown or the mapping is not performed. This allows for the option of creating future semantic extensions, although as a result we have a number of resources we do not use. In general, these may be names of organizations or persons, places, cultures or historical periods. In Bibliopolis the values of MAKER and TWNAAM contain person names.
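The two mechanical steps described for typed resources and roles, prefixing the field name to the value when minting a URI and splitting a MAKER value into name and role marker, can be sketched as follows. This is a minimal illustration and not the project's conversion code: the namespace URI, the function names and the lower-casing of values are assumptions, and only the role code "d" (drukker) is taken from the text; the remaining codes would have to come from the collection documentation.

```python
import re
from typing import Optional, Tuple

# Hypothetical namespace URI; the paper only gives the prefix "bp:".
BP = "http://example.org/bibliopolis/"

# Role marker -> role property name. Only "d" is documented in the text;
# the other role codes are deliberately left out.
ROLE_PROPERTIES = {"d": "drukker"}


def typed_resource_uri(field_name: str, value: str) -> str:
    """Mint a URI by prefixing the field name to the field value,
    so that equal values in different fields stay distinct."""
    local = re.sub(r"\s+", "_", value.strip().lower())
    return f"{BP}{field_name.lower()}_{local}"


def split_maker(raw: str) -> Tuple[str, Optional[str]]:
    """Split 'Name : code' into the name and the role property.
    Returns None as the role when the marker is missing or unknown."""
    name, sep, code = raw.rpartition(":")
    if not sep:
        return raw.strip(), None
    return name.strip(), ROLE_PROPERTIES.get(code.strip())
```

For instance, `typed_resource_uri("techniek", "foto")` and `typed_resource_uri("object", "foto")` yield different URIs, which is the reason given above for including the field name.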
These person names can possibly be linked to the ULAN vocabulary. We create resources out of these names with URIs in the bp: namespace, removing invalid characters and spaces. The concepts are of type ulan:person and the human-readable label contains the name.

Converting to a literal. Finally, pieces of text such as titles and descriptions are converted to literals. In Bibliopolis the values of TITLE and DESCRIPTION fields were converted into literals with language tags, as the title and description of works are both in English and in Dutch.

Data Item | Function | Activity | Source and Target Property/Class
NUMMER | Record Id | Create URI and additional project specific triples (&vra;Work) | source: 6; target: bp:6 rdf:type vra:Work .
TITEL | Title in Dutch | Create literal and language tag | source: Delftse Bijbel...; target: vra:title "Delftse Bijbel..."@nl ;
TITEL EN | Title in English | Create literal and language tag | source: Delft Bible...; target: vra:title "Delft Bible..."@en ;
MAKER | Creator and his marker for role | Extract name and role marker, create URI and label for name and convert marker to role, create role as subproperty of vra:creator | source: Yemantszoon, Mauricius : d (d stands for drukker meaning printer); target: bp:drukker bp:Yemantszoon Mauricius ; bp:Yemantszoon Mauricius rdf:type ulan:person ; rdfs:label "Yemantszoon Mauricius" .
OBJECT | Object type | Map to AAT or create local extension to AAT and mapping | source: tekstbladzijde (text page); target: vra:type bp:object tekstbladzijde ; bp:tekstbladzijde rdf:type skos:Concept ; skos:prefLabel "tekstbladzijde"@nl ; skos:broader AAT:pages ;
TECHNIEK | Technique used | Map to AAT or create local extension to AAT and mapping | source: boekdruk (book printing); target: vra:technique bp:techniek_boekdruk ; bp:boekdruk rdf:type skos:Concept ; skos:prefLabel "boekdruk"@nl ; skos:broader AAT:printing ;
DATERING | Date | Interpret and filter data | source: 10 jan. 1477; target: vra:date "10-01-1477"
ORIGINEEL or REPRODUCTIE | Title of the original or reproduction (book) containing the image | (The title, author, date, place and page number can be extracted) | source: Bijbel. Oude Testament...; target: bp:origineel "Bijbel. Oude Testament..."@en ;
CLASSIFICATIE | Classification of the work in librarian terms using a code | Interpret code, create URI with code, use interpretation as label, keep identifier and create resource | source: D (code interpreted as History of book printing); target: bp:classificatie bp:D ;
TWNAAM | Person used as subject for work | Interpret name and create URI | source: John Do; target: vra:subject.personalName bp:John Do ; bp:John Do rdf:type ulan:person ; rdfs:label "John Do" .
TWOND | Thesaurus term used as subject | Create mapping to thesaurus | source: typografische vormgeving; target: vra:subject bp:typografische vormgeving ;
TWGEO | Place used as subject for work | Create mapping to TGN where possible or keep literal with language tag | source: Delft; target: vra:subject.geographicPlace tgn:7006804 ;
OMSCHRIJVING or OMSCHRIJVING EN | Dutch or English description | Create literal and language tag | source: Eerste bijbel die...; target: vra:description "Eerste bijbel die..."@nl ;
AFMETINGEN | Size of the work | Create literal | source: 27 x 20 cm; target: vra:measurements.dimensions "27 x 20 cm" .
Table 2: Part of the Bibliopolis metadata with examples, function and RDFS property/classes

Source Data | Vocabulary | Terms Mapped | Terms Total | Instances Mapped | Instances Total
Thesaurus | AAT | 209 | 1033 | - | -
Metadata technique | AAT | 15 | 28 | 1332 | 1468
Metadata object type | AAT | 14 | 19 | 978 | 1507
Metadata subject place | TGN | 32 | 69 | 349 | 480
Table 3: Mappings between the Bibliopolis data and other vocabularies
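Three of the value conversions listed in Table 2 (interpreting a DATERING value, minting a person URI with invalid characters and spaces removed, and creating a language-tagged literal) can be sketched as below. This is an illustrative sketch under stated assumptions rather than the conversion actually used: the namespace URI, the function names, the underscore used as separator in person URIs, and the table of Dutch month abbreviations are all assumptions (the source only shows "jan."). Returning None for values that cannot be interpreted is one possible reading of "interpret and filter".

```python
import re
import unicodedata

# Hypothetical namespace URI; the paper only gives the prefix "bp:".
BP = "http://example.org/bibliopolis/"

# Assumed Dutch month abbreviations; only "jan." appears in the source.
MONTHS = {"jan": 1, "feb": 2, "mrt": 3, "apr": 4, "mei": 5, "jun": 6,
          "jul": 7, "aug": 8, "sep": 9, "okt": 10, "nov": 11, "dec": 12}


def interpret_date(raw):
    """Return 'DD-MM-YYYY' for values shaped like '10 jan. 1477'.
    Anything else is filtered out (None); no calendar validation is done."""
    m = re.fullmatch(r"\s*(\d{1,2})\s+([a-z]+)\.?\s+(\d{4})\s*", raw.lower())
    if not m or m.group(2) not in MONTHS:
        return None
    return f"{int(m.group(1)):02d}-{MONTHS[m.group(2)]:02d}-{m.group(3)}"


def person_uri(name):
    """Mint a URI from a person name, dropping accents and replacing
    runs of non-alphanumeric characters (commas, spaces) by '_'."""
    ascii_name = (unicodedata.normalize("NFKD", name)
                  .encode("ascii", "ignore").decode("ascii"))
    return BP + re.sub(r"[^A-Za-z0-9]+", "_", ascii_name).strip("_")


def literal(text, lang):
    """Format a language-tagged literal in N3/Turtle syntax."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang}'
```

With these definitions, `interpret_date("10 jan. 1477")` gives the `vra:date` value shown in Table 2, and `literal("Delft Bible...", "en")` gives the form used for `vra:title`.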
8. THESAURUS ALIGNMENT
The local thesaurus and the resources containing techniques, object types and locations extracted from the data during the metadata conversion process need to be aligned with standard vocabularies.

We aligned the Bibliopolis thesaurus to AAT by syntactically matching the Dutch skos:prefLabel to the Dutch translation of AAT preferred terms and mapped 209 concepts out of 1033, as presented in Table 3.

Then, we need to identify the relation between the matched terms. The OWL owl:sameAs relation is typically an overstatement that we try to avoid, as ambiguity is quite common. The SKOS Mapping Vocabulary specification¹¹ was created for the purpose of linking thesauri to each other. It specifies relationships such as skos:exactMatch, skos:broadMatch, skos:narrowMatch and more for aligning vocabularies. For this alignment the mappings are still based on the lexical match of term labels, which corresponds to the relation skos:exactMatch.

The field TWGEO contains geographic names which were mapped to TGN. As the values of this field are in Dutch, we extended TGN by adding the Dutch label terms to the proper concept. For example, the value Parijs is the Dutch label of Paris in TGN. Such extensions had to be performed manually, while the mapping of values to cities in the Netherlands could be performed automatically, as the labels in TGN contain the Dutch language version. We used syntactic matching for finding appropriate mappings along with some additional techniques to reduce ambiguity, such as restricting the search to cities instead of provinces and the use of background knowledge like the vernacular names of cities. We only automatically mapped unambiguous terms, and mapped ambiguous terms manually. Background knowledge of the collection data helped in resolving ambiguity, as it restricted the places the data could be associated with.

The values of the fields TECHNIQUE and OBJECT were also aligned with AAT using syntactic matching, once more using the skos:exactMatch relation. As can be seen in Table 3, a number of terms were not mapped. We extend the AAT by adding the leftover terms to some part of the vocabulary if possible. For instance, the technique boekdruk (book printing) is not part of AAT but is a special kind of printing technique; therefore the AAT concept printing is selected as broader term. We use the SKOS template to represent the extension.

From Table 3 we can see that a large number of resources are created without being linked to vocabularies. Such resources might be seen as an unnecessary overhead, but they can be used in the future when new vocabularies are added or mapped manually. Almost 80 percent of the thesaurus terms were not mapped to AAT, and while a number of terms could be linked with skos:broadMatch, this would require additional manual work which could take up a significant amount of time while yielding few matches. This is not the case for the values of the TECHNIQUE, OBJECT and TWGEO fields, where manually aligning 13, 5 and 37 terms respectively would yield complete alignments. For OBJECT, linking 5 terms would yield an alignment of another 500 occurrences of the term in the metadata, which is one third of the total occurrences and well worth the manual effort.

¹¹ http://www.w3.org/2004/02/skos/mapping/spec/

9. DISCUSSION
Interoperability is becoming one of the key issues in the open Web world. Many research programs, such as the IST program of the EU, have interoperability high on the agenda. However, real interoperability between collections is still scarce. Until now, many approaches have focused on interoperability as a problem between two collections.

In this paper we take a different approach. We assume a multitude of collections will become part of the interoperable space; the activities we present can to a large extent be carried out by studying an individual collection. Mapping to other existing vocabularies requires knowledge of other components, but there is no need for these to be complete. For vocabulary alignment the adage "a little semantics goes a long way"¹² holds. Also, one should not view this as a one-shot thing. Metadata and vocabularies change, so extensions will take place at regular intervals in time. This also means that tool support should be in place to support this process, allowing updates to be generated semi-automatically, similar to the AnnoCultor¹³ tool that is currently being developed within the E-Culture project.

For the E-Culture virtual collection we have now carried out this process a number of times. This paper should be viewed as a post-hoc rationalization of this work. Our goal is to provide a set of methods and tools that allow collection owners (museums, archives) to carry out this process. Cultural-heritage institutions are now often bound to closed content management systems; the "three-O" paradigm (open access, open data, open standards) is gaining support, but we have to provide the owners of collections with the necessary support facilities.

We see two potential weaknesses of this work. Firstly, our process still requires much more tool support. In particular for vocabulary alignment we need to explore how existing tools, such as the ones participating in the OAEI contest, perform on this data set. Our current work is still too much based on manual work and only uses simple syntactic tools.

Secondly, the use of Dublin Core as "top-level ontology" for the structure of metadata can also be perceived as a risk. What if the collection has metadata fields that fit with none of the DC elements? However, this was not a problem in any of these six collections. For the moment it seems Dublin Core is indeed a key resource in information interoperability. However, it is a challenge to construct reasoners that make use of the collection-specific specializations.

This article does not show the actual added value of the converted collection content. For this the readers are encouraged to visit the E-Culture online demonstrator, which contains the Bibliopolis data.

¹² Quote from J. Hendler
¹³ http://annocultor.sourceforge.net/

10. ACKNOWLEDGMENTS
We are grateful to our colleagues from the MultimediaN E-Culture team: Alia Amin, Lora Aroyo, Victor de Boer, Lynda Hardman, Michiel Hildebrand, Marco de Niet, Annelies van Nispen, Marie France van Orsouw, Jacco van Ossenbruggen, Annemiek Teesing, Jan Wielemaker and Bob Wielinga. We would also like to thank Mark van Assem for his input. The project is a collaboration between the Free University Amsterdam, the Centre of Mathematics and Computer Science (CWI), the University of Amsterdam, Digital Heritage Netherlands (DEN) and the Netherlands Institute for Cultural Heritage (ICN). The MultimediaN project is funded through the BSIK programme of the Dutch government.

We are especially thankful to Marieke van Delft of the Koninklijke Bibliotheek (National Library of the Netherlands) for her cooperation in the Bibliopolis case.

11. REFERENCES
[1] M. H. Butler, J. Gilbert, A. Seaborne, and K. Smathers. Data conversion, extraction and record linkage using XML and RDF tools in Project SIMILE. Technical report, Digital Media Systems Laboratory and HP Laboratories, August 2004.
[2] E. Hyvönen, E. Mäkelä, M. Salminen, A. Valo, K. Viljanen, S. Saarela, M. Junnila, and S. Kettula. MuseumFinland - Finnish museums on the Semantic Web. Web Semantics: Science, Services and Agents on the World Wide Web, 3(2-3):224-241, October 2005.
[3] A. J. Miles, N. Rogers, and D. Beckett. Migrating thesauri to the Semantic Web - guidelines and case studies for generating RDF encodings of existing thesauri.
[4] G. Schreiber, A. Amin, M. van Assem, V. de Boer, L. Hardman, M. Hildebrand, L. Hollink, Z. Huang, J. van Kersen, M. de Niet, B. Omelayenko, J. van Ossenbruggen, R. Siebes, J. Taekema, J. Wielemaker, and B. J. Wielinga. MultimediaN E-Culture demonstrator. In I. F. Cruz, S. Decker, D. Allemang, C. Preist, D. Schwabe, P. Mika, M. Uschold, and L. Aroyo, editors, International Semantic Web Conference, volume 4273 of Lecture Notes in Computer Science, pages 951-958. Springer, 2006.
[5] A. Tordai, B. Omelayenko, and G. Schreiber. Thesaurus and metadata alignment for a semantic e-culture application. 2007.
[6] M. van Assem, V. Malaisé, A. Miles, and G. Schreiber. A method to convert thesauri to SKOS. In Y. Sure and J. Domingue, editors, ESWC, volume 4011 of Lecture Notes in Computer Science, pages 95-109. Springer, 2006.