<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Supporting Serendipitous and Focused Search</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Junte</forename><surname>Zhang</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Meertens Institute</orgName>
								<orgName type="institution">Royal Academy of Arts and Sciences Amsterdam</orgName>
								<address>
									<country>The Netherlands</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Supporting Serendipitous and Focused Search</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">2FFDA455B7DEC0B1F77D0393ADEB35F1</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-23T21:55+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>H.3.3 [Information Search and Retrieval]: Search process</term>
					<term>H.3.7 [Digital Libraries]: Systems issues, user issues</term>
					<term>H.5.2 [Information interfaces and presentation]: Graphical user interfaces (GUI)</term>
					<term>Design, Human Factors</term>
					<term>information retrieval, metadata, user interfaces, ehumanities</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Humanities researchers are an example of people with complex information needs: they need advanced search engines to investigate their research questions. Much can be gained by combining research datasets, reusing tools, and serendipitously discovering new insights for further research. Humanities researchers have different (large-scale) research datasets and tools, which are described with metadata in different ways.</p><p>We present a highly interactive advanced search engine for Humanities researchers that semantically converges differently structured metadata records from different collections and institutions. It has features that support serendipitous and focused search in context, based on the structure of the metadata used. This single system serves Humanities researchers by allowing them to search interactively across as yet unexplored (research) data, discover patterns, locate relevant data for new insights, and find existing tools that could provide novel use cases.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">INTRODUCTION</head><p>The Common Language Resources and Technology Infrastructure (CLARIN) initiative seeks to establish an integrated and interoperable research infrastructure of language resources and its technology. <ref type="foot" target="#foot_0">1</ref> Descriptive metadata is used to characterize a large number of (legacy) research data resources (collections) and tools (e.g. Web services) to facilitate their management and discovery. The Search &amp; Develop (S&amp;D) project within CLARIN in the Netherlands uses the Component MetaData Infrastructure (CMDI; <ref type="bibr" target="#b4">[4]</ref>) with ISOcat <ref type="bibr">[6,</ref><ref type="bibr" target="#b12">12]</ref> to open up the sharing of resources and Web services for people and machines, first within the collections of a single institution, then across institutions in the Netherlands, and eventually across Europe as a whole. This infrastructure enables new research methods in language research and stimulates the Digital Humanities, where new insights can be gained by combining and reusing resources from different institutions and domains, and existing tools can be found and reused more effectively based on new insights.</p><p>How can the CMDI framework with ISOcat be used to search for data and services in a way that is understandable both to people from varying disciplines and to machines? The challenge is that the data is heterogeneous in both content and structure, and can be massive in volume. In <ref type="bibr" target="#b11">[11]</ref>, we show how to deal with such heterogeneously structured data in the CMDI MI Search Engine. Users of the CMDI framework are mostly Humanities researchers. What type of CMDI-driven system matches the search behavior of these users? This paper presents a proposition that has been implemented as a live system.</p><p>Presented at EuroHCIR2012. Copyright © 2012 for the individual papers by the papers' authors. Copying permitted only for private and academic purposes. This volume is published and copyrighted by its editors.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">USING CMDI FOR FOCUSED AND SEMANTIC ACCESS</head><p>CMDI has grown out of the need to facilitate access, reuse, and interoperability using metadata <ref type="bibr" target="#b4">[4]</ref>. A CMDI file in XML consists of a &lt;Header&gt;, a &lt;Resources&gt;, and a &lt;Components&gt; section. The first two are fixed in structure, while the content and structure within &lt;Components&gt; are flexible and can encapsulate any data in any structured form. An XML schema can be used to make CMDI files coherent in structure for a (sub)collection, and it contains references to ISOcat data categories (DC) stored in the Data Category Registry (DCR; <ref type="bibr" target="#b7">[7,</ref><ref type="bibr">6]</ref>). The DCR was established by ISO Technical Committee 37 (Terminology and other language and content resources), based on the ISO 12620:2009 standard. Because multiple elements may refer to the same DC, semantic interoperability can be achieved across different datasets. A specification that uses the DCR and is projected, for example, in an XML schema is called a metadata profile and can be (re)used for describing datasets and for eventual access. Moreover, RELcat <ref type="bibr" target="#b10">[10]</ref> goes a step further by allowing for the storage of arbitrary relationships between data categories to assist crosswalks and to specify ontological relationships for further semantic search, which in the future can be used in the CMDI MI Search Engine using field collapsing.</p><p>We have indexed 246,728 CMDI files from 18 different profiles, consisting of 143 different types of elements, in a single stream, which shows that our indexing method for CMDI files is robust enough to deal with complex data <ref type="bibr" target="#b11">[11]</ref>. By indexing metadata in CMDI on the XML element level, the search engine can provide focused access <ref type="bibr" target="#b8">[8]</ref>. We use straightforward information retrieval techniques only. The 'Liederenbank' (Dutch Song Database) alone has 9 different profiles (XML schemas), each of which is equivalent to a sub-collection, ranging from descriptions of songs to descriptions of singers, all structured very differently. How can we provide interactive access to such heterogeneously structured data for Humanities researchers?</p></div>
<figure xmlns="http://www.tei-c.org/ns/1.0"><head></head><label></label><figDesc>Figure 1(a): Query autocompletion based on the number of times a query occurs in a tag within the result set. By default the query box is content-centric, but searching directly in a tag is possible with Advanced Search (can be collapsed with a click). Users can express queries using the metadata, or search only the fulltext of the document by discarding autocompletion.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0"><head></head><label></label><figDesc>Figure 2(a): Retrieved list of results, displayed with 'fixed' contextual information, snippets and keywords in context within the last searched metadata label, and all used keywords in context given the fulltext. There are links to the fulltext of the metadata record and to the actual resource in the digital archive. Figure 2(b): For each retrieved result in the list, there is a recommendation (when available) of related results based on the content similarity of the last used metadata label. A recommendation consists of a link to the record, the collection it belongs to, and a snippet (can be collapsed with a click).</figDesc></figure>
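<div xmlns="http://www.tei-c.org/ns/1.0"><p>The idea that differently named elements can be searched together once they refer to the same data category can be sketched as follows. This is a minimal illustration and not the indexing code of the CMDI MI Search Engine: the element paths, the data category identifiers (DC-title, DC-name), the record layout, and the function names are all hypothetical, and the tokenizer is a plain whitespace split.</p>

```python
from collections import defaultdict

# Hypothetical mapping from (profile, element path) to a data category.
# Paths and DC identifiers are invented for illustration only.
ELEMENT_TO_DC = {
    ("song", "Components/Song/Title"): "DC-title",
    ("singer", "Components/Singer/Name"): "DC-name",
    ("corpus", "Components/Corpus/ResourceTitle"): "DC-title",
}

# Hypothetical records: (record id, profile, {element path: text}).
RECORDS = [
    ("r1", "song", {"Components/Song/Title": "Wilhelmus van Nassouwe"}),
    ("r2", "corpus", {"Components/Corpus/ResourceTitle": "Wilhelmus recordings"}),
    ("r3", "singer", {"Components/Singer/Name": "Wilhelmus Janssen"}),
]


def build_index(records):
    """Index every element separately, keyed by (data category, token)."""
    index = defaultdict(list)
    for record_id, profile, elements in records:
        for path, text in elements.items():
            dc = ELEMENT_TO_DC.get((profile, path))
            if dc is None:
                continue  # element carries no data category reference
            for token in text.lower().split():
                index[(dc, token)].append((record_id, path))
    return index


def search(index, dc, keyword):
    """Return (record id, element path) hits for a keyword within one data category."""
    return sorted(index.get((dc, keyword.lower()), []))
```

<p>In this sketch, searching for "wilhelmus" in the hypothetical category DC-title returns the song record and the corpus record even though their title elements have different names and paths, while the singer record is only found through DC-name.</p></div>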
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">SERENDIPITY IN CONTEXT</head><p>When a user with no a priori intentions interacts with a node of information and acquires useful information, serendipitous information retrieval occurs <ref type="bibr" target="#b9">[9]</ref>. The success of serendipitous discovery lies not just in the find itself, but in being able or willing to do something with it, so that users gain more insight and can enhance their domain expertise <ref type="bibr" target="#b1">[1]</ref>. Humanities researchers are the type of users who can be greatly supported in their research tasks by serendipitous IR, because their information-seeking behavior can be described as an idiosyncratic process of constant reading, "digging," searching, and following leads <ref type="bibr" target="#b2">[2]</ref>. This conforms to the Berrypicking model of <ref type="bibr" target="#b3">[3]</ref>, in which queries are not static but evolve, and users "gather information in bits and pieces instead of in one grand best retrieved set."</p><p>Since the CMDI MI Search Engine should serve Humanities researchers, we designed it to support serendipitous search and to be highly interactive. Our focus is on maximizing the user's ability to explore. The user interface of the system is depicted in Fig. <ref type="figure" target="#fig_1">1</ref>. It uses the JavaScript library AJAX Solr<ref type="foot" target="#foot_1">2</ref>, which we have heavily modified and extended with jQuery. It allows for faceted search <ref type="bibr" target="#b5">[5]</ref>, as we treat the indexed elements of the CMDI files as one large category hierarchy.</p><p>A user can improve the search episode (session) by effectively reducing the information space step by step. These steps are stored as part of the search trail, so the overview is kept. Different search strategies are possible. Users can search the fulltext by entering a query, which ensures that they can always search in everything. The query gets highlighted in context given the fulltext, but the dynamic tag cloud widget that supports query expansion is not activated, see Fig. <ref type="figure" target="#fig_1">1(a)</ref>. Users can also issue a focused search request by using structure, i.e. search within the content of a specified tag, and get the content of these tags returned. This can be content-centered: users enter a keyword and the autocompletion widget returns a list of suggestions, each consisting of keyword, field name, and hit count. It can also be structure-centered (using the Advanced Search option): users look up a tag and then enter a keyword, again with the autocompletion feature. When either of the last two options is used, keyword highlighting also occurs within the context of the retrieved snippets of the searched tag, see Fig. <ref type="figure" target="#fig_2">2(a)</ref>.</p><p>A challenge is how to support serendipitous search given the diversely structured metadata in CMDI. Hence, we propose the concept of serendipitous search in context. We can use the heterogeneous structure of different collections to provide context to the user in a single search engine. We propose the following contextual system features that aim to support serendipitous and focused search.</p><p>• Help users by automatically completing the query that the user is entering, while simultaneously and directly giving the hit count for the suggested queries in conjunction with a tag, see Fig. <ref type="figure" target="#fig_1">1(a)</ref>.</p><p>• Provide inline suggestions (Did you mean...) based on a spell checker whenever applicable.</p><p>• Suggest a new parallel search episode (You could also look for...) by presenting interesting terms based on the content of the first few retrieved results after each used query, see Fig. <ref type="figure" target="#fig_1">1(b)</ref>. These suggestions accumulate and become more focused as more queries are added to a search episode.</p><p>• Offer different overviews of the retrieved results and allow for query expansion by directly presenting a dynamic tag cloud of the aggregated content within the metadata label used and highlighting the query entered in this context, see Fig. <ref type="figure" target="#fig_1">1(c)</ref>.</p><p>• Preserve the overview of a search episode by storing the search selection (see Fig. <ref type="figure" target="#fig_1">1(b)</ref>), and the overview on collection level by the result type, e.g. the metadata profile 'lied' (song) in the Dutch Song Database, and the collection a document belongs to (see Fig. <ref type="figure" target="#fig_1">1(d)</ref>).</p><p>• Aggregate and visualize collection-specific search features in extra widgets, such as projecting and clustering the list of retrieved geo-referenced resources on a map (see Fig. <ref type="figure" target="#fig_1">1(c)</ref>), and displaying the date ranges of the documents in charts that can be clicked to narrow down a result set (see Fig. <ref type="figure" target="#fig_1">1(d)</ref>).</p><p>• Entice users to explore further by recommending related resources based on content similarity, presenting a link to the metadata record and a snippet for each recommendation, see Fig. <ref type="figure" target="#fig_2">2(b)</ref>.</p><p>So the context consists of different modalities and features that exist in the structure of the metadata of a collection and that are used in the retrieval and visualization of information. This context can be displayed on an aggregated level based on the set of retrieved results, and it can be displayed with different presentations of the result types given the metadata profile. Eventually, the user finds the links to the resources in the digital archive using the metadata, and can use the found resources for further research or development. However, there is no definite end to the search episode, as users can still continue searching using the system features proposed above.</p></div>
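<div xmlns="http://www.tei-c.org/ns/1.0"><p>The content-centered autocompletion described above, where each suggestion combines a keyword, a field name, and a hit count, can be sketched as follows. This is a simplified in-memory illustration under stated assumptions, not the AJAX Solr based implementation of the system: the sample data, the function name, and the ranking (hit count first, then alphabetical) are hypothetical.</p>

```python
from collections import Counter

# Hypothetical indexed content: (tag, text) pairs from the current result set.
INDEXED = [
    ("title", "periode van oorlog"),
    ("description", "periode 1600-1700"),
    ("description", "liederen uit de periode"),
    ("title", "perzik lied"),
]


def autocomplete(prefix, indexed, limit=5):
    """Suggest (term, tag, hit count) triples for a typed prefix."""
    prefix = prefix.lower()
    counts = Counter()
    for tag, text in indexed:
        # Count a term at most once per element, so the number is a hit count.
        for term in set(text.lower().split()):
            if term.startswith(prefix):
                counts[(term, tag)] += 1
    # Highest hit count first; ties are broken alphabetically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(term, tag, n) for (term, tag), n in ranked[:limit]]
```

<p>Calling autocomplete("per", INDEXED) on this sample data ranks the suggestion "periode" in the tag description (2 hits) above "periode" and "perzik" in the tag title (1 hit each), which mirrors how a suggestion list can steer the user towards a focused search within one tag.</p></div>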
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">CONCLUSIONS</head><p>We have presented a working proposition for serendipitous and focused search by describing the CMDI MI Search Engine. The novelty is that it provides semantic access to diversely structured language and digital heritage resources with different metadata schemas for users such as researchers with very specific and complex information (research) needs. The search engine provides faceted search and has serendipitous features that maximize the user's ability to explore any metadata in CMDI in context, such as query autocompletion, tag clouds, and recommendation of related resources, while keeping track of the search trail. It is a tool that provides interactive and focused access to heterogeneous metadata, gives new perspectives on legacy (research) data and tools, and provides new insights for research and development. It has been released as a live system and can be used at www.meertens.knaw.nl/cmdi/search.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>(b) The selection widget allows users to keep an overview of the search trail and to change it, which updates the result list. Here, the stored query is "periode" (period) within the tag time coverage→description. Interesting terms are suggested by presenting the top TF*IDF terms, which users can use to start a parallel search episode. (c) To further support query expansion and serendipitous information seeking, a dynamic tag cloud with keyword highlighting is generated based on the last retrieved result list and the metadata label used. Moreover, retrieved geo-referenced documents are projected on a map and clustered by markers. (d) The distribution of retrieved time-referenced documents (given the tags Century of Publication and Year of Publication) is visualized in bar or line charts. Users can click in the charts to narrow down the result set. The distribution of results over the tags collection and schema profile is always shown.</figDesc></figure>
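<div xmlns="http://www.tei-c.org/ns/1.0"><p>The suggestion of interesting terms from the top TF*IDF terms of the first few retrieved results can be sketched as follows. The paper does not specify the weighting variant, so this sketch assumes raw term frequency multiplied by the natural logarithm of the inverse document frequency, with whitespace tokenization; the function name and parameters are hypothetical.</p>

```python
import math
from collections import Counter


def interesting_terms(top_results, collection, k=3):
    """Rank the terms of the top results by TF*IDF against the whole collection."""
    n_docs = len(collection)
    # Document frequency: in how many collection documents a term occurs.
    df = Counter()
    for doc in collection:
        df.update(set(doc.lower().split()))
    # Term frequency, aggregated over the first few retrieved results.
    tf = Counter()
    for doc in top_results:
        tf.update(doc.lower().split())
    # Terms that occur in every document get a score of zero.
    scores = {t: f * math.log(n_docs / df[t]) for t, f in tf.items() if df[t]}
    # Highest score first; ties are broken alphabetically.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]
```

<p>Terms that are frequent in the top results but rare in the collection as a whole rise to the top, which is what makes them candidates for starting a parallel search episode; very common words score zero and drop out without a stop word list.</p></div>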
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: The CMDI MI Search Engine (1).</figDesc><graphic coords="2,59.66,341.46,235.99,273.53" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: The CMDI MI Search Engine (2).</figDesc><graphic coords="3,56.87,410.72,236.00,218.90" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head></head><label></label><figDesc>1(b)), and the overview on collection level by the result type, e.g. the metadata profile 'lied' (song) in the Dutch Song Database, and the collection a document belongs to (see Fig.1(d)).</figDesc></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">See http://www.clarin.eu/external/index.php?page=aboutclarin</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">See https://github.com/evolvingweb/ajax-solr</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">ACKNOWLEDGMENTS</head><p>This work is part of the Search &amp; Develop project at the Meertens Institute and is funded by CLARIN-NL.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Discovery is never by chance: designing for (un)serendipity</title>
		<author>
			<persName><forename type="first">P</forename><surname>André</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">C</forename><surname>Schraefel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Teevan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">T</forename><surname>Dumais</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the seventh ACM conference on Creativity and cognition, C&amp;C &apos;09</title>
				<meeting>the seventh ACM conference on Creativity and cognition, C&amp;C &apos;09<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="305" to="314" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">The information-seeking habits of graduate student researchers in the humanities</title>
		<author>
			<persName><forename type="first">A</forename><surname>Barrett</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">The Journal of Academic Librarianship</title>
		<imprint>
			<biblScope unit="volume">31</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="324" to="331" />
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">The design of browsing and berrypicking techniques for the online search interface</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J</forename><surname>Bates</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Online Review</title>
		<imprint>
			<biblScope unit="volume">13</biblScope>
			<biblScope unit="issue">5</biblScope>
			<biblScope unit="page" from="407" to="424" />
			<date type="published" when="1989">1989</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">A data category registry-and component-based metadata framework</title>
		<author>
			<persName><forename type="first">D</forename><surname>Broeder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kemps-Snijders</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Van Uytvanck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Windhouwer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Withers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Wittenburg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zinn</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">LREC</title>
				<imprint>
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Cat-a-cone: an interactive interface for specifying searches and viewing retrieval results using a large category hierarchy</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Hearst</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Karadi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">SIGIR</title>
				<meeting><address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="1997">1997</date>
			<biblScope unit="page" from="246" to="255" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">ISOcat: remodelling metadata for language resources</title>
		<author>
			<persName><forename type="first">M</forename><surname>Kemps-Snijders</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Windhouwer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Wittenburg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">E</forename><surname>Wright</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IJMSO</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="261" to="276" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Ensuring semantic interoperability on lexical resources</title>
		<author>
			<persName><forename type="first">M</forename><surname>Kemps-Snijders</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zinn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ringersma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Windhouwer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">LREC</title>
				<imprint>
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<title level="m" type="main">XML Retrieval. Synthesis Lectures on Information Concepts</title>
		<author>
			<persName><forename type="first">M</forename><surname>Lalmas</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2009">2009</date>
			<publisher>Morgan &amp; Claypool Publishers</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Serendipitous information retrieval</title>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">G</forename><surname>Toms</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">DELOS Workshop: Information Seeking, Searching and Querying in Digital Libraries</title>
				<imprint>
			<date type="published" when="2000">2000</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">RELcat: a relation registry for ISOcat data categories</title>
		<author>
			<persName><forename type="first">M</forename><surname>Windhouwer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">LREC</title>
				<imprint>
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">The CMDI MI Search Engine: Access to language resources and tools using heterogeneous metadata schemas</title>
		<author>
			<persName><forename type="first">J</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kemps-Snijders</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Bennis</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">TPDL</title>
		<title level="s">Lecture Notes in Computer Science</title>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2012">2012</date>
			<biblScope unit="volume">7489</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">The ISOcat registry reloaded</title>
		<author>
			<persName><forename type="first">C</forename><surname>Zinn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hoppermann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Trippel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The Semantic Web: Research and Applications</title>
		<title level="s">Lecture Notes in Computer Science</title>
		<meeting><address><addrLine>Berlin / Heidelberg</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2012">2012</date>
			<biblScope unit="volume">7295</biblScope>
			<biblScope unit="page" from="285" to="299" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
