<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Supporting Serendipitous and Focused Search</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Junte Zhang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Design</institution>
          ,
          <addr-line>Human Factors</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Meertens Institute, Royal Netherlands Academy of Arts and Sciences Amsterdam</institution>
          ,
          <country country="NL">the Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Humanities researchers are an example of people with complex information needs, who need advanced search engines to investigate their research questions. Much can be gained by combining research datasets, reusing tools, and serendipitously discovering new insights for further research. Humanities researchers have different (large-scale) research datasets and tools, which are described differently with metadata. We present a highly interactive advanced search engine for Humanities researchers that semantically converges differently structured metadata records from different collections and institutions. It has features that support serendipitous and focused search in context based on the structure of the metadata used. This single system serves Humanities researchers by allowing them to search interactively across yet unexplored (research) data, discover patterns, locate relevant data for new insights, and find existing tools that could provide novel use cases.</p>
      </abstract>
      <kwd-group>
        <kwd>information retrieval</kwd>
        <kwd>metadata</kwd>
        <kwd>user interfaces</kwd>
        <kwd>ehumanities</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Categories and Subject Descriptors</title>
      <p>H.3.3 [Information Search and Retrieval]: Search process; H.3.7 [Digital Libraries]: Systems issues, user issues; H.5.2 [Information interfaces and presentation]: Graphical user interfaces (GUI)</p>
    </sec>
    <sec id="sec-2">
      <title>1. INTRODUCTION</title>
      <p>Presented at EuroHCIR2012. Copyright © 2012 for the individual papers by the papers’ authors. Copying permitted only for private and academic purposes. This volume is published and copyrighted by its editors.</p>
      <p>
        The Common Language Resources and Technology
Infrastructure (CLARIN) initiative seeks to establish an
integrated and interoperable research infrastructure of language
resources and their technology.<sup>1</sup> Descriptive metadata is used
to characterize a large number of (legacy) research data
resources (collections) and tools (e.g. Web services) to
facilitate their management and discovery. The Search &amp; Develop
(S&amp;D) project within CLARIN in the Netherlands uses the
Component MetaData Infrastructure (CMDI; [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]) with
ISOcat [
        <xref ref-type="bibr" rid="ref12 ref6">6, 12</xref>
        ] to open up the sharing of resources and Web
services for people and machines, first within the collections of
a single institution, then across institutions in the
Netherlands, and eventually across Europe as a whole. This
infrastructure enables new research methods in language research
and stimulates the Digital Humanities, where new insights
can be gained by combining and reusing resources from
different institutions and domains, and existing tools can be
more effectively found and reused based on new insights.
      </p>
      <p>
        How can the CMDI framework be used with ISOcat to search
for data and services in a way that can be understood both by
people from varying disciplines and by machines? The challenge is
that the data is heterogeneous in both content and
structure, and can be massive in volume. In [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], we show how
to deal with such heterogeneously structured data in the
CMDI MI Search Engine. Users of the CMDI framework
are mostly Humanities researchers. What type of CMDI-driven
system is needed to match the search
behavior of these users? This paper presents a proposition that
has been implemented on a live system.
      </p>
    </sec>
    <sec id="sec-4">
      <title>2. USING CMDI FOR FOCUSED AND SEMANTIC ACCESS</title>
      <p>
        CMDI has grown out of the need to facilitate access,
reuse, and interoperability using metadata [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. A CMDI file
in XML consists of a &lt;Header&gt;, &lt;Resources&gt;, and
&lt;Components&gt; section. The first two are fixed in structure, while the
content and structure within &lt;Components&gt; are flexible and
can encapsulate any data in any structured form. An XML
schema can be used to make CMDI files coherent in
structure for a (sub)collection, and it contains references to ISOcat
data categories (DC) stored in the Data Category Registry (DCR; [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ]).
      </p>
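      <p>The three-part layout of a CMDI file can be illustrated with a minimal sketch. It is not the implementation used in the search engine: the sample record is heavily simplified, omits the XML namespaces that real CMDI files carry, and the element names below &lt;Components&gt; (Song, Title, TimeCoverage, Description) as well as the helper functions are hypothetical. The sketch separates the fixed parts from the flexible part and lists every leaf element of the flexible part together with its path.</p>
      <preformat><![CDATA[
```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified CMDI-like record. Real CMDI files use XML
# namespaces and profile-specific element names; those below are invented.
SAMPLE = """\
<CMD>
  <Header><MdCreator>example</MdCreator></Header>
  <Resources><ResourceProxyList/></Resources>
  <Components>
    <Song>
      <Title>Example song</Title>
      <TimeCoverage><Description>periode 1600-1700</Description></TimeCoverage>
    </Song>
  </Components>
</CMD>
"""


def split_record(xml_text):
    """Return the three top-level parts of a CMDI-like record by name."""
    root = ET.fromstring(xml_text)
    return {name: root.find(name) for name in ("Header", "Resources", "Components")}


def leaf_values(element, path=()):
    """Yield (path, text) for every non-empty leaf element below `element`."""
    here = path + (element.tag,)
    children = list(element)
    if not children:
        if element.text and element.text.strip():
            yield "/".join(here), element.text.strip()
        return
    for child in children:
        yield from leaf_values(child, here)


parts = split_record(SAMPLE)
component_leaves = list(leaf_values(parts["Components"]))
for path, text in component_leaves:
    print(path, "=", text)
```
]]></preformat>
      <p>Because only the content below &lt;Components&gt; varies between profiles, a generic traversal of this kind does not need to know the profile in advance.</p>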
      <p>The DCR was established by ISO Technical Committee
37, Terminology and other language and content resources,
based on the ISO 12620:2009 standard. Because multiple
elements may refer to the same DC, semantic interoperability
can be achieved across different datasets. A specification
using the DCR and projected, for example, in an XML schema
is called a metadata profile and can be (re)used for
describing datasets and for eventual access. Moreover, RELcat [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]
goes a step further by allowing for the storage of arbitrary
relationships between data categories to assist crosswalks
and to specify ontological relationships for further semantic
search, which in the future can be used in the CMDI MI
Search Engine using field collapsing.
      </p>
      <p><sup>1</sup> See http://www.clarin.eu/external/index.php?page=aboutclarin</p>
      <fig id="fig-1">
        <label>Figure 1</label>
        <caption>
          <p>(a) Query autocompletion based on the count that a query occurs in a tag within the result set. By default the query box is content-centric, but searching directly in a tag is possible with Advanced Search (can be collapsed with a click). Users can express queries using the metadata or only the fulltext of the document by discarding autocompletion.</p>
          <p>(b) The selection widget that allows users to keep an overview of the search trail and change it, while updating the result list. Here, the query stored is “periode” (period) within the tag time coverage → description. Interesting terms are suggested by presenting the top TF-IDF terms, which people can use to start a parallel search episode.</p>
          <p>(c) To further support query expansion and serendipitous information seeking, a dynamic tag cloud is generated based on the last retrieved result list and used metadata label with keyword highlighting. Moreover, retrieved geo-referenced documents are projected on a map and clustered by markers.</p>
          <p>(d) The distribution of retrieved time-referenced documents (given the tags Century of Publication and Year of Publication) is visualized in bar or line charts. Users can click in the charts to narrow down the result set. The distribution of results in the tags collection and schema profile always appears.</p>
        </caption>
      </fig>
      <fig id="fig-2">
        <label>Figure 2</label>
        <caption>
          <p>(a) Retrieved list of results with ‘fixed’ contextual information, snippets and keywords in context within the last searched metadata label, and the presentation of all used keywords in context given the fulltext. There are links to the fulltext of the metadata record and the actual resource in the digital archive.</p>
          <p>(b) For each retrieved result in the list, there is a recommendation (when available) of related results based on the content similarity of the last used metadata label. A recommendation consists of a link to the record, the collection it belongs to, and a snippet (can be collapsed with a click).</p>
        </caption>
      </fig>
      <p>
        We have indexed 246,728 CMDI files from 18 different
profiles consisting of 143 different types of elements in a single
stream, which shows that our indexing method for CMDI files is
robust enough to deal with complex data [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. By indexing
metadata in CMDI on the XML element level, the search
engine can provide focused access [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. We use only straightforward
information retrieval techniques. The ‘Liederenbank’
(Dutch Song Database) alone has 9 different profiles (XML
schemas), each equivalent to a sub-collection, ranging
from very differently structured descriptions of songs to
those of singers. How can interactive access to such
heterogeneously structured data be provided for Humanities researchers?
      </p>
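      <p>Indexing on the XML element level can be pictured as keeping one inverted index per element path, so that a query can be restricted to a single element (focused access) or run over all elements (fulltext). The following is a small sketch under stated assumptions, not the indexing method of the system itself: the records, element paths, and function names are hypothetical, and tokenization is reduced to lowercased word matching.</p>
      <preformat><![CDATA[
```python
import re
from collections import defaultdict

# Hypothetical records, already flattened to {element path: text}.
RECORDS = {
    "doc1": {"Song/Title": "Het daghet in den oosten", "Song/Singer": "Anna"},
    "doc2": {"Singer/Name": "Anna de Vries", "Singer/Place": "Amsterdam"},
    "doc3": {"Song/Title": "Anna in Amsterdam"},
}


def tokenize(text):
    """Split text into lowercased word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(records):
    """Map element path -> term -> set of document ids."""
    index = defaultdict(lambda: defaultdict(set))
    for doc_id, fields in records.items():
        for field, text in fields.items():
            for term in tokenize(text):
                index[field][term].add(doc_id)
    return index


def focused_search(index, field, term):
    """Return ids of documents containing `term` inside one element path."""
    return sorted(index.get(field, {}).get(term.lower(), set()))


def fulltext_search(index, term):
    """Return ids of documents containing `term` in any element."""
    hits = set()
    for postings in index.values():
        hits |= postings.get(term.lower(), set())
    return sorted(hits)


index = build_index(RECORDS)
print(focused_search(index, "Song/Title", "Anna"))
print(fulltext_search(index, "Anna"))
```
]]></preformat>
      <p>In this toy data, the focused query only matches the record that has the term in its title, while the fulltext query also matches records that mention it in other elements of differently structured profiles.</p>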
    </sec>
    <sec id="sec-5">
      <title>3. SERENDIPITY IN CONTEXT</title>
      <p>
        When a user with no a priori intentions interacts with a
node of information and acquires useful information,
serendipitous information retrieval occurs [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. The success
of serendipitous discovery lies not just in the find itself, but in
being able or willing to do something with it, so that users gain
more insight and can enhance their domain expertise [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
Humanities researchers are the type of users who can be greatly
supported in their research tasks by serendipitous IR,
because their information-seeking behavior can be described
as an idiosyncratic process of constant reading, “digging,”
searching, and following leads [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. This conforms to the
Berrypicking model of [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], in which queries are not static
but rather evolve, and users “gather information in bits and
pieces instead of in one grand best retrieved set.”
      </p>
      <p>
        Since the CMDI MI Search Engine should serve
Humanities researchers, we designed it to support serendipitous search
and to be highly interactive, with a focus on
maximizing the user's ability to explore.
The user interface of the system is depicted in Fig. 1. It uses
the JavaScript library AJAX Solr,<sup>2</sup> which we have
heavily modified and extended with jQuery. It allows for
faceted search [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] as we treat the indexed elements of the
CMDI files as one large category hierarchy.
      </p>
      <p><sup>2</sup> See https://github.com/evolvingweb/ajax-solr</p>
      <p>A user can improve the search episode (session) by
effectively reducing the information space step by step. These
steps are stored as part of the search trail, so the overview
is kept. Different search strategies are possible. Users
can search the fulltext by entering a query, which ensures that
users can always search in everything. The query gets
highlighted in context given the fulltext, but the dynamic tag
cloud widget that supports query expansion is not activated,
see Fig. 1(a). Users can also issue a focused search request by
using structure, i.e. searching within the content of a specified tag,
and get the content of these tags returned. This can be
content-centered: users enter a keyword and the
autocompletion widget returns a list consisting of keyword plus
field name and hit count. It can also be structure-centered
(using the Advanced Search option): users look up a tag and
then enter a keyword, also with the autocompletion
feature. When either of the last two options is used, keyword
highlighting also occurs within the context of the retrieved
snippets of the searched tag, see Fig. 2(a).</p>
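      <p>The content-centered strategy, in which a typed keyword is completed to a list of keyword, field name, and hit count, can be sketched as follows. This is an illustration only, assuming a small in-memory result set; the documents, tag names, and the function suggest are hypothetical, and a live system would more plausibly obtain such counts from the search back end.</p>
      <preformat><![CDATA[
```python
from collections import Counter

# Hypothetical result set: each document maps tag names to text.
RESULTS = [
    {"title": "periode van oorlog", "description": "lied uit de periode 1600"},
    {"title": "lied van de zee", "description": "periode onbekend"},
    {"description": "persoonlijke brief"},
]


def suggest(results, prefix, limit=5):
    """Return (term, tag, hit count) suggestions for a typed prefix."""
    prefix = prefix.lower()
    counts = Counter()
    for doc in results:
        for tag, text in doc.items():
            # Count each (term, tag) pair at most once per document.
            for term in set(text.lower().split()):
                if term.startswith(prefix):
                    counts[(term, tag)] += 1
    # Highest hit count first; ties are broken alphabetically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(term, tag, n) for (term, tag), n in ranked[:limit]]


suggestions = suggest(RESULTS, "per")
for term, tag, hits in suggestions:
    print(f"{term} in {tag} ({hits})")
```
]]></preformat>
      <p>Selecting a suggestion then fixes both the keyword and the tag, which corresponds to a focused search within that tag; ignoring the suggestions leaves the query as a fulltext search.</p>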
      <p>A challenge is how to support serendipitous search
given the diversely structured metadata in CMDI. Hence, we
propose the concept of serendipitous search in
context. We can use the heterogeneous structure of different
collections to provide context to the user in a single search
engine. We propose the following contextual system features
that aim to support serendipitous and focused search.</p>
      <list list-type="bullet">
        <list-item>
          <p>Help users by automatically completing the query that the user is entering, while directly giving the hit count for the suggested queries in conjunction with a tag, see Fig. 1(a).</p>
        </list-item>
        <list-item>
          <p>Provide inline suggestions (Did you mean...) based on a spell checker whenever applicable.</p>
        </list-item>
        <list-item>
          <p>Suggest a new parallel search episode (You could also look for...) by presenting interesting terms based on the content of the first few retrieved results after each used query, see Fig. 1(b). This increments and becomes more focused as a search episode gets more queries.</p>
        </list-item>
        <list-item>
          <p>Offer different overviews of the retrieved results and allow for query expansion by directly presenting a dynamic tag cloud of the aggregated content within the metadata label used and highlighting the query entered in this context, see Fig. 1(c).</p>
        </list-item>
        <list-item>
          <p>Preserve the overview of a search episode by storing the search selection (see Fig. 1(b)), and the overview on collection level by the result type, e.g. the metadata profile ‘lied’ (song) in the Dutch Song Database, and the collection a document belongs to (see Fig. 1(d)).</p>
        </list-item>
        <list-item>
          <p>Aggregate and visualize collection-specific search features in extra widgets, such as projecting and clustering the list of retrieved geo-referenced resources on a map (see Fig. 1(c)), and displaying the date ranges of the documents in charts that can be clicked to narrow down a result set (see Fig. 1(d)).</p>
        </list-item>
        <list-item>
          <p>Entice users to explore further by recommending related resources using the content similarity by presenting a link to the metadata record and a snippet of a recommendation, see Fig. 2(b).</p>
        </list-item>
      </list>
      <p>So the context consists of different modalities and features
existing in the structure of the metadata of a collection, and
used in the retrieval and visualization of information. This
can be displayed on an aggregated level based on the set of
retrieved results. And it can be displayed with different
displays of the result types given the metadata profile.
Eventually, the user finds the links to the resources in the digital
archive using the metadata, and can use the found resources
for further research or development. However, there is no
real definite end of the search episode, as people can still
continue searching using the system features proposed above.</p>
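      <p>The suggestion of interesting terms from the first few retrieved results can be illustrated with a plain TF-IDF ranking. The sketch below is an assumption about how such a ranking could look, not the scoring used by the system: the toy collection, the smoothed IDF formula, and the function interesting_terms with its exclude parameter are hypothetical.</p>
      <preformat><![CDATA[
```python
import math
import re
from collections import Counter

# Hypothetical toy collection; each string stands for one document.
COLLECTION = [
    "lied zee zee",
    "lied oorlog",
    "lied stad",
    "lied zee storm",
]


def tokenize(text):
    """Split text into lowercased word tokens."""
    return re.findall(r"\w+", text.lower())


def interesting_terms(top_results, collection, k=3, exclude=()):
    """Rank terms of the top results by summed TF * smoothed IDF."""
    n_docs = len(collection)
    doc_freq = Counter()
    for doc in collection:
        doc_freq.update(set(tokenize(doc)))
    scores = Counter()
    for doc in top_results:
        for term, tf in Counter(tokenize(doc)).items():
            if term in exclude:
                continue
            # Smoothing avoids division by zero for unseen terms.
            idf = math.log((1 + n_docs) / (1 + doc_freq[term]))
            scores[term] += tf * idf
    # Highest score first; ties are broken alphabetically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Suppose the query "zee" retrieved the first and the last document.
top = [COLLECTION[0], COLLECTION[3]]
print(interesting_terms(top, COLLECTION, k=2))
print(interesting_terms(top, COLLECTION, k=2, exclude={"zee"}))
```
]]></preformat>
      <p>Excluding the terms of the current query keeps the suggestions from merely repeating what the user already typed, so that each suggestion can start a parallel search episode.</p>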
    </sec>
    <sec id="sec-6">
      <title>4. CONCLUSIONS</title>
      <p>We have presented a working proposition for
serendipitous and focused search by describing the CMDI MI Search
Engine. The novelty is that it provides semantic access to
diversely structured language and digital heritage resources
with different metadata schemas for users such as researchers
with very specific and complex information (research) needs.
The search engine provides faceted search and has
serendipitous features that maximize the user's ability to explore any
metadata in CMDI in context, such as query
autocompletion, tag clouds, and recommendation of related resources,
while keeping track of the search trail. It is a tool that
provides interactive and focused access to heterogeneous
metadata, gives new perspectives on legacy (research) data and
tools, and provides new insights for research and
development. It has been released as a live system and can be used at
www.meertens.knaw.nl/cmdi/search.</p>
    </sec>
    <sec id="sec-7">
      <title>5. ACKNOWLEDGMENTS</title>
      <p>This work is part of the Search &amp; Develop project at the
Meertens Institute and is funded by CLARIN-NL.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>André</surname>
          </string-name>
          , m. schraefel, J. Teevan, and
          <string-name>
            <given-names>S. T.</given-names>
            <surname>Dumais</surname>
          </string-name>
          .
          <article-title>Discovery is never by chance: designing for (un)serendipity</article-title>
          .
          <source>In Proceedings of the seventh ACM conference on Creativity and cognition</source>
          , C&amp;C '09, pages
          <fpage>305</fpage>
          –
          <lpage>314</lpage>
          , New York, NY, USA,
          <year>2009</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Barrett</surname>
          </string-name>
          .
          <article-title>The information-seeking habits of graduate student researchers in the humanities</article-title>
          .
          <source>The Journal of Academic Librarianship</source>
          ,
          <volume>31</volume>
          (
          <issue>4</issue>
          ):
          <fpage>324</fpage>
          –
          <lpage>331</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Bates</surname>
          </string-name>
          .
          <article-title>The design of browsing and berrypicking techniques for the online search interface</article-title>
          .
          <source>Online Review</source>
          ,
          <volume>13</volume>
          (
          <issue>5</issue>
          ):
          <fpage>407</fpage>
          –
          <lpage>424</lpage>
          ,
          <year>1989</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Broeder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kemps-Snijders</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. V.</given-names>
            <surname>Uytvanck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Windhouwer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Withers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wittenburg</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Zinn</surname>
          </string-name>
          .
          <article-title>A data category registry- and component-based metadata framework</article-title>
          .
          <source>In LREC</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Hearst</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Karadi</surname>
          </string-name>
          .
          <article-title>Cat-a-cone: an interactive interface for specifying searches and viewing retrieval results using a large category hierarchy</article-title>
          .
          <source>In SIGIR</source>
          , pages
          <fpage>246</fpage>
          –
          <lpage>255</lpage>
          , New York, NY, USA,
          <year>1997</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kemps-Snijders</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Windhouwer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wittenburg</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Wright</surname>
          </string-name>
          .
          <article-title>ISOcat: remodelling metadata for language resources</article-title>
          .
          <source>IJMSO</source>
          ,
          <volume>4</volume>
          (
          <issue>4</issue>
          ):
          <fpage>261</fpage>
          –
          <lpage>276</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kemps-Snijders</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zinn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ringersma</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Windhouwer</surname>
          </string-name>
          .
          <article-title>Ensuring semantic interoperability on lexical resources</article-title>
          .
          <source>In LREC</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lalmas</surname>
          </string-name>
          .
          <source>XML Retrieval</source>
          . Synthesis Lectures on Information Concepts, Retrieval, and Services. Morgan &amp; Claypool Publishers,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>E. G.</given-names>
            <surname>Toms</surname>
          </string-name>
          .
          <article-title>Serendipitous information retrieval</article-title>
          .
          <source>In DELOS Workshop: Information Seeking, Searching and Querying in Digital Libraries</source>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Windhouwer</surname>
          </string-name>
          .
          <article-title>RELcat: a relation registry for ISOcat data categories</article-title>
          .
          <source>In LREC</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kemps-Snijders</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Bennis</surname>
          </string-name>
          .
          <article-title>The CMDI MI Search Engine: Access to language resources and tools using heterogeneous metadata schemas</article-title>
          .
          <source>In TPDL</source>
          , volume
          <volume>7489</volume>
          of Lecture Notes in Computer Science. Springer,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>C.</given-names>
            <surname>Zinn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hoppermann</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Trippel</surname>
          </string-name>
          .
          <article-title>The ISOcat registry reloaded</article-title>
          .
          <source>In The Semantic Web: Research and Applications</source>
          , volume
          <volume>7295</volume>
          of Lecture Notes in Computer Science, pages
          <fpage>285</fpage>
          –
          <lpage>299</lpage>
          . Springer Berlin / Heidelberg,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>