<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">A systematic approach towards higher quality linked open data at Nieuwe Instituut</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Nora</forename><surname>Abdelmageed</surname></persName>
							<email>n.abdelmageed@nieuweinstituut.nl</email>
							<affiliation key="aff0">
								<orgName type="department">Nieuwe Instituut</orgName>
								<address>
									<addrLine>Museumpark 25</addrLine>
									<postCode>3015 CB</postCode>
									<settlement>Rotterdam</settlement>
									<country key="NL">Netherlands</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Lois</forename><surname>Hutubessy</surname></persName>
							<email>l.hutubessy@nieuweinstituut.nl</email>
							<affiliation key="aff0">
								<orgName type="department">Nieuwe Instituut</orgName>
								<address>
									<addrLine>Museumpark 25</addrLine>
									<postCode>3015 CB</postCode>
									<settlement>Rotterdam</settlement>
									<country key="NL">Netherlands</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">A systematic approach towards higher quality linked open data at Nieuwe Instituut</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">B16ADFD390DE794D0B3BD81E4BC5B673</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T18:46+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Linked Open Data</term>
					<term>Cultural Heritage</term>
					<term>Data Quality</term>
					<term>Entity Linking</term>
					<term>Entity Resolution</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Nieuwe Instituut (NI) houses the Dutch National Collection of Architecture and Urban Planning. This collection consists of about 4 million objects, including design drawings, 3D models, and photographs. As part of the program Disclosing Architecture, which is now in its sixth and final year, the Linked Open Data (LOD) project aims to share the richness of data within the collection with the public through semantic web technologies. This project will ultimately facilitate the exchange of cultural heritage data with related national and international institutions. Currently, NI's collection management system contains inconsistent records due to changes in registration guidelines and the migration from older collection management tools. Without documentation of these guidelines, it is impossible to establish consistent rules for the entire dataset. Yet, clean data is crucial for effectively showcasing NI's collection to the public. In response, this paper introduces a framework for higher-quality LOD, the Data Cleaning Initiative (DCI). The first implementation of the DCI is a series of steps planned for the year 2024 with the goal of cleaning and enriching the collection data at NI.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>The Dutch Cultural Heritage sector has undergone a significant transformation by adopting semantic web technologies and converting its data into Linked Open Data (LOD). The dataset register<ref type="foot">1</ref> developed by Netwerk Digitaal Erfgoed<ref type="foot">2</ref> lists 71 publishers that expose their datasets in one or more linked data formats<ref type="foot">3</ref>. Rijksmuseum is a pioneer in this area: it was among the first to publish its collection data as LOD <ref type="bibr" target="#b0">[1]</ref>, thereby diversifying search results in other applications <ref type="bibr" target="#b1">[2]</ref>. Another notable effort is by Van Wissen et al. <ref type="bibr" target="#b2">[3]</ref>, who targeted named entity extraction from archival records with the help of LOD. However, ensuring high-standard and clean data remains an open challenge.</p><p>Nieuwe Instituut (NI)<ref type="foot">4</ref> houses the National Collection of Architecture and Urban Planning, which consists of approximately 4 million items ranging from design drawings to photographs and 3D models. As part of the program Disclosing Architecture (AD)<ref type="foot" target="#foot_0">5</ref>, which is now in its sixth and final year, the LOD project aims to share the richness of data within the collection with the public through semantic web technologies. In parallel, NI is developing a new online Collection platform that will further increase accessibility to a wider audience, facilitating discovery by design. While this platform is set to launch in November 2024, the underlying data is already exposed through a separate endpoint<ref type="foot" target="#foot_1">6</ref>, with more than 18 million triples available in the customized Triply<ref type="foot" target="#foot_2">7</ref> environment.</p><p>In this paper, we propose the Data Cleaning Initiative (DCI), our systematic approach at NI to enhance the data quality of our LOD. We explain its scope and tasks, and how we implement the approach in practice. The DCI is a vision that could be applied in any Cultural Heritage institution to clean and enrich its LOD.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1.">Motivation &amp; Problem Definition</head><p>We aim to increase the exposure of Architecture and Urban Planning data. For instance, we built a central point, or encyclopedia, on top of the Dutch Heritage Data collection at NI. However, due to human errors and changes in guidelines and collection registration tools, the data has become heterogeneous and inconsistent on a large scale. As such, publishing the current version of the data may not be the best approach. The data currently contained in the collection management system (Axiell Collections<ref type="foot" target="#foot_3">8</ref>), while suitable for internal purposes, would benefit from further cleaning and enrichment to ensure a higher quality of data for public use. This would facilitate collaborations with third parties and attract a wider audience.</p><p>Records in NI's collection management system are inconsistent due to changes in registration guidelines and migrations from older collection management tools. In addition, there is no documentation of these guidelines, making it difficult to establish rules for the entire dataset. Cleaning these records in their entirety is challenging for two reasons. On the one hand, understanding the meaning of sparse, heterogeneous records within the same catalog is difficult. For example, the "Library catalog" registers 16 types, including books and audio materials. On the other hand, the sheer volume of heritage records adds to the complexity.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2.">Objectives &amp; Tasks</head><p>The Data Cleaning Initiative (DCI) has two main objectives: (i) Splitting catalogs that hold broadly related data into smaller, closely related datasets. This facilitates the semantic grouping of the original catalogs within the collection management system. (ii) Applying cleaning and enrichment strategies to the resulting semantic categories to improve the data quality.</p><p>We propose four categories of data cleaning and enrichment tasks in the context of DCI. We define these main categories as:</p><p>1. Data Cleaning: This category involves all tasks concerning primitive data cleaning, e.g., handling inconsistencies such as different formats, or tackling missing values. The latter influences the guidelines for filling in this metadata, e.g., by discovering a potential required field.</p><p>2. Entity Resolution: This category aims at discovering and grouping similar entities. Since all data is manually entered by domain experts, they might use different representations to describe the same entity. E.g., Doesburg, Theo van is a different representation of Doesburg, Th. van.</p><p>3. Entity Linking: This category maps internal records of our heritage data collections to external resources, or Knowledge Graphs (KGs), e.g., Wikidata<ref type="foot" target="#foot_4">9</ref>. For example, Doesburg, Theo van would be mapped to wd:Q160422<ref type="foot" target="#foot_5">10</ref>.</p><p>4. Entity Enrichment: This category aims to fetch properties and pieces of information that exist in external resources but not in the local collection management system. E.g., we save the image of Doesburg from Wikidata in our Axiell Collections.</p><p>Table <ref type="table" target="#tab_0">1</ref> summarizes the scope of each task of the DCI. The first record has an empty name.type, an issue covered by the Data Cleaning task. Records 2, 4, and 5 represent the same person in different formats; Entity Resolution groups them all as the same person. After grouping, Entity Linking maps these records to the Wikidata page of Theodoor van Erp, wd:Q2759953<ref type="foot" target="#foot_6">11</ref>. Finally, we can store new information in our catalog to enrich the displayed data at the application level.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Approach</head><p>In this section, we describe the initiative's approach. First, we explain the concept of semantic grouping; then, we give details of the DCI pipeline.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Semantic Grouping</head><p>Currently, domain experts enter metadata for multiple semantic categories into the same Axiell catalog. For instance, they use "People &amp; Institutions" for registering persons, architects, publishers, universities, etc. The same applies to the "Library catalog", where domain experts register sixteen semantic types, including Books, Serials, Audio/Visual Material, and Articles. This yields a catalog that is large but sparse in terms of metadata. Thus, we decided to split each catalog into smaller chunks that share a semantic type, e.g., extracting only Books from the Library. The main idea is that the DCI relies on the semantic division of the Axiell catalogs.</p><p>This division facilitates human intervention and yields a better statistical view of the data. The target of this phase of the DCI is to obtain named groups to be cleaned. Creating a semantic group requires determining representative fields for each group. To find these groups, we investigated Axiell Collections' fields for each category. In addition, we held several meetings with domain experts to determine those fields and ensure their correctness and scope. For instance, the "People &amp; Institutions" catalog contains, among others: name, birthDate, birthPlace, deathDate, deathPlace, biography, ISBN_publisher_prefix. This set of fields represents two semantic groups: 1) Persons, which contains name, birthDate, birthPlace, deathDate, deathPlace, and biography. 2) Institutes, which contains name and ISBN_publisher_prefix.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Pipeline &amp; Workflow</head><p>Figure <ref type="figure" target="#fig_0">1</ref> depicts the proposed pipeline for DCI. The figure shows the separate steps that we follow for each semantic group until we reach the final goal or the end of the year. Our pipeline consists of six iterative steps and takes into consideration four scopes of work: data cleaning, entity resolution, entity linking, and entity enrichment. The steps are: 1) Export: we export the target semantic group, given its representative fields, to a CSV file. The CSV file allows batch processing and gives a general overview of the exported records. 2) Analyze: we analyze the data and log the encountered issues in our backlog system. 3) Propose: we propose a solution for each individual issue and determine whether it can be solved automatically or needs manual intervention. 4) Discuss: we discuss the proposed solution with domain experts and stakeholders to validate it. If the solution is not valid, we go back to the proposing stage; otherwise, we move on. 5) Solve: we solve the target issue by applying the proposed solution directly to the CSV file. 6) Approve and Import: we seek approval from our manager and, if granted, import the CSV with fixes back into the Axiell Acceptance environment.</p><p>Figure <ref type="figure" target="#fig_1">2</ref> shows a simplified data flow diagram that explains the interaction among DCI actors regarding their tasks. It starts with an executive who analyzes the target semantic group to be cleaned. Then, the executive proposes a solution, which starts an iterative process (discuss, validate, solve, and approve). The domain expert is the only actor required to validate a proposed solution. If and only if the application manager approves the solution and its results, the executive can import the fixed file back into the collection management system.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: DCI Proposed Pipeline</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Simplified workflow of the DCI</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>People &amp; Institutions Examples</figDesc><table><row><cell>No.</cell><cell>use_count</cell><cell>name</cell><cell>name.type</cell></row><row><cell>1</cell><cell>10</cell><cell>Erp, T. van</cell><cell>?</cell></row><row><cell>2</cell><cell>10</cell><cell>Erp, Th. van</cell><cell>person &amp; author</cell></row><row><cell>3</cell><cell>2</cell><cell>Vugt, Theo van</cell><cell>author</cell></row><row><cell>4</cell><cell>3</cell><cell>Erp, Theo van</cell><cell>person &amp; author</cell></row><row><cell>5</cell><cell>3</cell><cell>Erp, Theodoor van</cell><cell>person &amp; author</cell></row></table></figure>
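<div xmlns="http://www.tei-c.org/ns/1.0"><p>To illustrate how the Entity Resolution task could treat the names in Table 1, the following is a minimal sketch and not the implementation used at NI. It assumes that names follow the pattern "Surname, Given prefix" and groups records on a simple blocking key made of the surname, the first initial of the given name, and the surname prefix. The function names, the prefix list, and the key itself are hypothetical choices made only for this illustration.</p>
<p><code lang="python"><![CDATA[
```python
from collections import defaultdict

# Record numbers and names copied from Table 1.
RECORDS = [
    (1, "Erp, T. van"),
    (2, "Erp, Th. van"),
    (3, "Vugt, Theo van"),
    (4, "Erp, Theo van"),
    (5, "Erp, Theodoor van"),
]

# Hypothetical, incomplete list of Dutch surname prefixes (an assumption).
PREFIXES = {"van", "de", "der", "den", "ter", "te"}


def blocking_key(name):
    """Return (surname, first initial, prefix) for 'Surname, Given prefix'."""
    surname, _, rest = name.partition(",")
    tokens = rest.split()
    given = [t for t in tokens if t.lower() not in PREFIXES]
    prefix = " ".join(t.lower() for t in tokens if t.lower() in PREFIXES)
    initial = given[0][0].lower() if given else ""
    return (surname.strip().lower(), initial, prefix)


def candidate_groups(records):
    """Group record numbers that share the same blocking key."""
    groups = defaultdict(list)
    for number, name in records:
        groups[blocking_key(name)].append(number)
    return dict(groups)


if __name__ == "__main__":
    for key, numbers in candidate_groups(RECORDS).items():
        print(key, numbers)
```
]]></code></p>
<p>With this key, records 1, 2, 4, and 5 fall into one candidate group and record 3 into another. Such a coarse key only proposes candidates: whether record 1, whose name.type is unknown, really denotes the same person would still have to be validated by a domain expert, in line with the workflow described in Section 2.2.</p></div>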
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_0">https://nieuweinstituut.nl/en/projects/architectuur-dichterbij</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_1">https://collectiedata.hetnieuweinstituut.nl/the-other-interface/knowledge-graph/sparql</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_2">triply.cc</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_3">https://www.axiell.com/solutions/product/axiell-collections/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="9" xml:id="foot_4">https://www.wikidata.org/wiki/Wikidata:Introduction</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="10" xml:id="foot_5">http://www.wikidata.org/entity/Q160422</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="11" xml:id="foot_6">https://www.wikidata.org/wiki/Q2759953</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>We would like to thank our domain experts, Inge van Stokkom, Christel Leenen, Ernst des Bouvrie, Evelien Dekker, Kelly James, and program manager Gijs Broos, Nieuwe Instituut (NI). Moreover, we would like to thank the Dutch Ministry of Culture, Education, and Science for funding the Disclosing Architecture program.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">The rijksmuseum collection as linked data</title>
		<author>
			<persName><forename type="first">C</forename><surname>Dijkshoorn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Jongma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Aroyo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Van Ossenbruggen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Schreiber</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Ter Weele</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wielemaker</surname></persName>
		</author>
		<idno type="DOI">10.3233/SW-170257</idno>
		<ptr target="https://doi.org/10.3233/SW-170257" />
	</analytic>
	<monogr>
		<title level="j">Semantic Web</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="page" from="221" to="230" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Using linked data to diversify search results a case study in cultural heritage</title>
		<author>
			<persName><forename type="first">C</forename><surname>Dijkshoorn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Aroyo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Schreiber</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wielemaker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Jongma</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-319-13704-9_9</idno>
		<ptr target="https://doi.org/10.1007/978-3-319-13704-9_9" />
	</analytic>
	<monogr>
		<title level="m">Knowledge Engineering and Knowledge Management -19th International Conference, EKAW 2014</title>
		<title level="s">Lecture Notes in Computer Science</title>
		<meeting><address><addrLine>Linköping, Sweden</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2014">November 24-28, 2014</date>
			<biblScope unit="volume">8876</biblScope>
			<biblScope unit="page" from="109" to="120" />
		</imprint>
	</monogr>
	<note>Proceedings</note>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<author>
			<persName><forename type="first">L</forename><surname>Van Wissen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Latronico</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Zamborlini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Reinders</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Van Den Heuvel</surname></persName>
		</author>
		<ptr target="https://doi.org/10.5281/zenodo.3862817" />
		<title level="m">Unlocking the archives. A pipeline for scanning, transcribing and modelling entities of archival documents into linked open data</title>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note>DH Benelux 2020-Online:# Goesonline</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
