Supporting Serendipitous and Focused Search

                                                                Junte Zhang
                             Meertens Institute, Royal Netherlands Academy of Arts and Sciences
                                                  Amsterdam, the Netherlands


ABSTRACT
Humanities researchers are an example of people with complex information needs; they need
advanced search engines to investigate their research questions. Much can be gained by
combining research datasets, reusing tools, and serendipitously discovering new insights for
further research. Humanities researchers have different (large-scale) research datasets and
tools, which are described differently with metadata.
   We present a highly interactive advanced search engine for Humanities researchers that
semantically integrates differently structured metadata records from different collections
and institutions. It has features that support serendipitous and focused search in context,
based on the structure of the metadata used. This single system serves Humanities
researchers by allowing them to search interactively across as yet unexplored (research)
data, discover patterns, locate relevant data for new insights, and find existing tools that
could provide novel use cases.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Search process; H.3.7 [Digital Libraries]: Systems
issues, user issues; H.5.2 [Information interfaces and presentation]: Graphical user
interfaces (GUI)

General Terms
Design, Human Factors

Keywords
information retrieval, metadata, user interfaces, ehumanities

Presented at EuroHCIR2012. Copyright © 2012 for the individual papers by the papers’ authors.
Copying permitted only for private and academic purposes. This volume is published and
copyrighted by its editors.

1. INTRODUCTION
   The Common Language Resources and Technology Infrastructure (CLARIN) initiative seeks to
establish an integrated and interoperable research infrastructure of language resources and
their technology.¹ Descriptive metadata is used to characterize a large number of (legacy)
research data resources (collections) and tools (e.g. Web services) to facilitate their
management and discovery. The Search & Develop (S&D) project within CLARIN in the Netherlands
uses the Component MetaData Infrastructure (CMDI; [4]) with ISOcat [6, 12] to open up the
sharing of resources and Web services for people and machines, first within the collections
of a single institution, then across institutions in the Netherlands, and eventually across
Europe as a whole. This infrastructure enables new research methods in language research and
stimulates the Digital Humanities, where new insights can be gained by combining and reusing
resources from different institutions and domains, and existing tools can be more effectively
found and reused based on new insights.
   How can the CMDI framework with ISOcat be used to search for data and services in a way
that can be understood both by people from varying disciplines and by machines? The challenge
is that the data is heterogeneous in both content and structure, and can be massive in
amount. In [11], we show how to deal with such heterogeneously structured data in the CMDI MI
Search Engine. Users of the CMDI framework are mostly Humanities researchers. What type of
CMDI-driven system matches the search behavior of these users? This paper presents a
proposition that has been implemented on a live system.

¹ See http://www.clarin.eu/external/index.php?page=about-clarin

(a) Query autocompletion based on the number of times a query occurs in a tag within the
result set. By default the query box is content-centric, but searching directly in a tag is
possible with Advanced Search (can be collapsed with a click). Users can express queries
using the metadata, or using only the fulltext of the document by discarding autocompletion.

(b) The selection widget that allows users to keep an overview of the search trail and change
it, while updating the result list. Here, the query stored is “periode” (period) within the
tag time coverage→description. Interesting terms are suggested by presenting the top TF∗IDF
terms, which people can use to start a parallel search episode.

(c) To further support query expansion and serendipitous information seeking, a dynamic tag
cloud is generated based on the last retrieved result list and the metadata label used, with
keyword highlighting. Moreover, retrieved geo-referenced documents are projected on a map and
clustered by markers.

(d) The distribution of retrieved time-referenced documents (given the tags Century of
Publication and Year of Publication) is visualized in bar or line charts. Users can click in
the charts to narrow down the result set. The distribution of results over the tags
collection and schema profile is always shown.

Figure 1: The CMDI MI Search Engine (1).

2. USING CMDI FOR FOCUSED AND SEMANTIC ACCESS
   CMDI has grown out of the need to facilitate access, reuse, and interoperability using
metadata [4]. A CMDI file in XML consists of a <Header>, <Resources>, and <Components>. The
first two are fixed in structure, while the content and structure within <Components> is
flexible and can encapsulate any data in any structured form. An XML schema can be used to
make CMDI files coherent in structure for a (sub)collection, and it contains references to
ISOcat data categories (DCs) stored in the Data Category Registry (DCR; [7, 6]). The DCR was
established by ISO Technical Committee 37 (Terminology and other language and content
resources), based on the ISO 12620:2009 standard. Because multiple elements may refer to the
same DC, semantic interoperability can be achieved across different datasets. A specification
that uses the DCR and is expressed, for example, in an XML schema is called a metadata
profile and can be (re)used for describing datasets and for eventual access. Moreover, RELcat
[10] goes a step further by allowing for the storage of arbitrary relationships between data
categories to assist crosswalks and to specify ontological relationships for further semantic
search, which in the future can be used in the CMDI MI Search Engine using field collapsing.
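To make the CMDI layout and the role of shared data categories concrete, the following is a minimal sketch, not the actual CMDI tooling. It assumes simplified, namespace-free records, and it replaces the schema-level ISOcat references with a hypothetical lookup table (`DC_BY_PATH`) whose profile names, element names, and category identifiers are all invented for illustration.

```python
import xml.etree.ElementTree as ET

# Two simplified records from hypothetical profiles (all names are assumptions).
SONG = """<CMD>
  <Header><MdProfile>profile:song</MdProfile></Header>
  <Resources><ResourceRef>http://example.org/song/1</ResourceRef></Resources>
  <Components><Song><Title>Example song</Title><Year>1900</Year></Song></Components>
</CMD>"""

RECORDING = """<CMD>
  <Header><MdProfile>profile:recording</MdProfile></Header>
  <Resources><ResourceRef>http://example.org/rec/7</ResourceRef></Resources>
  <Components>
    <Recording><Name>Example recording</Name><Info><Date>1950</Date></Info></Recording>
  </Components>
</CMD>"""

# Hypothetical stand-in for the data category references that a real
# metadata profile would carry in its XML schema.
DC_BY_PATH = {
    ("profile:song", "Song/Title"): "datcat:title",
    ("profile:song", "Song/Year"): "datcat:date",
    ("profile:recording", "Recording/Name"): "datcat:title",
    ("profile:recording", "Recording/Info/Date"): "datcat:date",
}


def leaf_paths(element, prefix=""):
    """Yield (path, text) for every non-empty leaf element below `element`."""
    for child in element:
        path = f"{prefix}/{child.tag}" if prefix else child.tag
        if len(child):  # the child has sub-elements, so descend
            yield from leaf_paths(child, path)
        elif child.text and child.text.strip():
            yield path, child.text.strip()


def to_semantic_fields(xml_text):
    """Return (profile, {data category: [values]}) for one record."""
    root = ET.fromstring(xml_text)
    profile = root.findtext("Header/MdProfile")
    fields = {}
    for path, text in leaf_paths(root.find("Components")):
        category = DC_BY_PATH.get((profile, path))
        if category is not None:
            fields.setdefault(category, []).append(text)
    return profile, fields


for record in (SONG, RECORDING):
    print(to_semantic_fields(record))
```

Although the two records use different element names and nesting, both end up with values under the same category keys, which is the sense in which elements that refer to the same DC become comparable across datasets.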
   We have indexed 246,728 CMDI files from 18 different profiles, consisting of 143 different
types of elements, in a single stream, which shows that our indexing method for CMDI files is
robust enough to deal with complex data [11]. By indexing metadata in CMDI on the XML element
level, the search engine can provide focused access [8]. We use only straightforward
information retrieval techniques. The ‘Liederenbank’ (Dutch Song Database) alone has 9
different profiles (XML schemas), each equivalent to a sub-collection, with very differently
structured descriptions ranging from songs to singers. How can we provide interactive access
to such heterogeneously structured data for Humanities researchers?
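The idea of indexing on the XML element level can be illustrated with a small in-memory sketch. This is not the engine's actual indexing pipeline; the class name `ElementIndex`, the tokenizer, and the sample records are assumptions made for illustration. Each term is stored under the path of the element it occurs in, so that the same index answers both fulltext queries and queries focused on one tag.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


class ElementIndex:
    """Toy inverted index keyed by (element path, term)."""

    def __init__(self):
        self.postings = defaultdict(set)  # (path, term) -> document ids
        self.fulltext = defaultdict(set)  # term -> document ids

    def add(self, doc_id, xml_text):
        root = ET.fromstring(xml_text)
        self._walk(doc_id, root, root.tag)

    def _walk(self, doc_id, element, path):
        # Only the text directly inside an element is indexed under its path;
        # tail text in mixed content is ignored in this sketch.
        if element.text and element.text.strip():
            for term in tokenize(element.text):
                self.postings[(path, term)].add(doc_id)
                self.fulltext[term].add(doc_id)
        for child in element:
            self._walk(doc_id, child, f"{path}/{child.tag}")

    def search(self, term, path=None):
        """Search everywhere (path=None) or only within one element path."""
        term = term.lower()
        if path is None:
            return sorted(self.fulltext.get(term, set()))
        return sorted(self.postings.get((path, term), set()))


index = ElementIndex()
index.add("song-1", "<Song><Title>Spring song</Title><Note>sung in winter</Note></Song>")
index.add("singer-1", "<Singer><Name>Winter</Name><Bio>known for a spring song</Bio></Singer>")

print(index.search("winter"))               # fulltext: any element may match
print(index.search("winter", "Song/Note"))  # focused: only this element path
```

Because records with different structures simply contribute different paths, such an index does not need a shared schema, which is one way to read the claim that a single indexing stream can handle many profiles.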

3. SERENDIPITY IN CONTEXT
   When a user with no a priori intentions interacts with a node of information and acquires
useful information, serendipitous information retrieval occurs [9]. The success of
serendipitous discovery lies not just in the find itself, but in being able or willing to do
something with it, so that users gain more insight and can enhance their domain expertise
[1]. Humanities researchers are the type of users who can be greatly supported in their
research tasks by serendipitous IR, because their information-seeking behavior can be
described as an idiosyncratic process of constant reading, “digging,” searching, and
following leads [2]. This is consistent with the Berrypicking model of [3], in which queries
are not static but rather evolve, and users “gather information in bits and pieces instead of
in one grand best retrieved set.”
   Since the CMDI MI Search Engine should serve Humanities researchers, we designed it to
support serendipitous search and to be highly interactive. Our focus is on maximizing the
user’s ability to explore. The user interface of the system is depicted in Fig. 1. It uses
the JavaScript library AJAX Solr,² which we have heavily modified and extended with jQuery.
It allows for faceted search [5], as we treat the indexed elements of the CMDI files as one
large category hierarchy.

² See https://github.com/evolvingweb/ajax-solr

(a) The retrieved list of results, displayed with ‘fixed’ contextual information, snippets
and keywords in context within the last searched metadata label, and all used keywords in
context given the fulltext. There are links to the fulltext of the metadata record and to the
actual resource in the digital archive.

(b) For each retrieved result in the list, there is a recommendation (when available) of
related results based on the content similarity of the last used metadata label. A
recommendation consists of a link to the record, the collection it belongs to, and a snippet
(can be collapsed with a click).

Figure 2: The CMDI MI Search Engine (2).

   A user can improve the search episode (session) by effectively reducing the information
space step by step. These steps are stored as part of the search trail, so that the overview
is kept. Different search strategies are possible. Users can search the fulltext by entering
a query, which ensures that they can always search in everything. The query is highlighted in
context given the fulltext, but the dynamic tag cloud widget that supports query expansion is
not activated (see Fig. 1(a)). Users can also issue a focused search request by using
structure, i.e. searching within the content of a specified tag, and get the content of these
tags returned. This can be content-centered: users enter a keyword and the autocompletion
widget returns a list consisting of the keyword plus field name and hit count. It can also be
structure-centered (using the Advanced Search option): users look up a tag and then enter a
keyword, again with the autocompletion feature. When either of the last two options is used,
the keyword highlighting also occurs within the context of the retrieved
snippets of the searched tag (see Fig. 2(a)).
   A challenge is how we can support serendipitous search given the diversely structured
metadata in CMDI. Hence, we introduce and propose the concept of serendipitous search in
context. We can use the heterogeneous structure of different collections to provide context
to the user in a single search engine. We propose the following contextual system features
that aim to support serendipitous and focused search.

     • Help users by automatically completing the query that the user is entering, while
       simultaneously and directly giving the hit count for the suggested queries in
       conjunction with a tag (see Fig. 1(a)).

     • Provide inline suggestions (Did you mean...) based on a spell checker whenever
       applicable.

     • Suggest a new parallel search episode (You could also look for...) by presenting
       interesting terms based on the content of the first few retrieved results after each
       used query (see Fig. 1(b)). These suggestions accumulate and become more focused as a
       search episode gets more queries.

     • Offer different overviews of the retrieved results and allow for query expansion by
       directly presenting a dynamic tag cloud of the aggregated content within the metadata
       label used, and by highlighting the query entered in this context (see Fig. 1(c)).

     • Preserve the overview of a search episode by storing the search selection (see
       Fig. 1(b)), and the overview on collection level by the result type, e.g. the metadata
       profile ‘lied’ (song) in the Dutch Song Database, and the collection a document
       belongs to (see Fig. 1(d)).

     • Aggregate and visualize collection-specific search features in extra widgets, such as
       projecting and clustering the list of retrieved geo-referenced resources on a map (see
       Fig. 1(c)), and displaying the date ranges of the documents in charts that can be
       clicked to narrow down a result set (see Fig. 1(d)).

     • Entice users to explore further by recommending related resources based on content
       similarity, presenting a link to the metadata record and a snippet of each
       recommendation (see Fig. 2(b)).

   The context thus consists of different modalities and features that exist in the structure
of the metadata of a collection and are used in the retrieval and visualization of
information. It can be displayed at an aggregated level based on the set of retrieved
results, and it can be shown through different displays of the result types given the
metadata profile. Eventually, the user finds the links to the resources in the digital
archive using the metadata, and can use the found resources for further research or
development. However, there is no definite end to the search episode, as people can still
continue searching using the system features proposed above.

4. CONCLUSIONS
   We have presented a working proposition for serendipitous and focused search by describing
the CMDI MI Search Engine. The novelty is that it provides semantic access to diversely
structured language and digital heritage resources with different metadata schemas for users,
such as researchers, with very specific and complex information (research) needs. The search
engine provides faceted search and has serendipitous features that maximize the user’s
ability to explore any metadata in CMDI in context, such as query autocompletion, tag clouds,
and recommendation of related resources, while keeping track of the search trail. It is a
tool that provides interactive and focused access to heterogeneous metadata, gives new
perspectives on legacy (research) data and tools, and provides new insights for research and
development. It has been released as a live system and can be used at
www.meertens.knaw.nl/cmdi/search.

5. ACKNOWLEDGMENTS
   This work is part of the Search & Develop project at the Meertens Institute, and is funded
by CLARIN-NL.

6. REFERENCES
 [1] P. André, m. schraefel, J. Teevan, and S. T. Dumais. Discovery is never by chance:
     designing for (un)serendipity. In Proceedings of the seventh ACM conference on
     Creativity and cognition, C&C ’09, pages 305–314, New York, NY, USA, 2009. ACM.
 [2] A. Barrett. The information-seeking habits of graduate student researchers in the
     humanities. The Journal of Academic Librarianship, 31(4):324–331, 2005.
 [3] M. J. Bates. The design of browsing and berrypicking techniques for the online search
     interface. Online Review, 13(5):407–424, 1989.
 [4] D. Broeder, M. Kemps-Snijders, D. V. Uytvanck, M. Windhouwer, P. Withers,
     P. Wittenburg, and C. Zinn. A data category registry- and component-based metadata
     framework. In LREC, 2010.
 [5] M. A. Hearst and C. Karadi. Cat-a-cone: an interactive interface for specifying searches
     and viewing retrieval results using a large category hierarchy. In SIGIR, pages 246–255,
     New York, NY, USA, 1997. ACM.
 [6] M. Kemps-Snijders, M. Windhouwer, P. Wittenburg, and S. E. Wright. ISOcat: remodelling
     metadata for language resources. IJMSO, 4(4):261–276, 2009.
 [7] M. Kemps-Snijders, C. Zinn, J. Ringersma, and M. Windhouwer. Ensuring semantic
     interoperability on lexical resources. In LREC, 2008.
 [8] M. Lalmas. XML Retrieval. Synthesis Lectures on Information Concepts, Retrieval, and
     Services. Morgan & Claypool Publishers, 2009.
 [9] E. G. Toms. Serendipitous information retrieval. In DELOS Workshop: Information Seeking,
     Searching and Querying in Digital Libraries, 2000.
[10] M. Windhouwer. RELcat: a relation registry for ISOcat data categories. In LREC, 2012.
[11] J. Zhang, M. Kemps-Snijders, and H. Bennis. The CMDI MI Search Engine: Access to
     language resources and tools using heterogeneous metadata schemas. In TPDL, volume 7489
     of Lecture Notes in Computer Science. Springer, 2012.
[12] C. Zinn, C. Hoppermann, and T. Trippel. The ISOcat registry reloaded. In The Semantic
     Web: Research and Applications, volume 7295 of Lecture Notes in Computer Science, pages
     285–299. Springer Berlin / Heidelberg, 2012.