Exposing Ourselves: Displaying our Cultural Assets for Public Consumption

Gary Munnelly, Kevin Koidl, Séamus Lawless
ADAPT Centre, O'Reilly Building, Trinity College Dublin, Ireland
munnellg@tcd.ie, Kevin.Koidl@scss.tcd.ie, Seamus.Lawless@scss.tcd.ie

ABSTRACT
This paper discusses an early stage project to develop a new, enhanced interface for the Trinity College Dublin (TCD) Digital Collections website. We describe the current state of the portal and outline some of the unique issues observed when examining user engagement. A major factor in our development of enhanced search tools will be to leverage the entities present in the documents to establish more reliable connections between items in the collection. Not only do we expect that this will lead to better ranked search results, but we also wish to investigate how these entities may be used to encourage site visitors to explore the site beyond their initial research goal. The early stage of this project means that plans are still being finalised. Hence we also speculate about other methods which may be applied to this corpus.

Keywords
Entity Search; Digital Libraries; Information Retrieval

1. INTRODUCTION
In many ways, the vision of Digital Humanities with regards to cultural heritage is a noble one. It is one in which all people have free, unbridled access to primary sources from which they may learn about their heritage and the rich history of their origins. We are free to lose ourselves in the depths of a historical archive from the comfort of our computer screens, supported in our exploration by a host of intelligent information retrieval systems.

In theory, after the arduous process of digitising the collection, providing such functionality ought to be a simple task. Building and deploying a website has become a trivial process, and off-the-shelf tools such as Solr provide state-of-the-art text retrieval functionality with minimal effort. Given a suitable portal and a search box which returns ranked results, what more could a user want?

As it happens, this approach to curating documents has been found wanting in many ways. The most immediate problem with the query-response paradigm is that, in order to use the search interface, we must know exactly what we are looking for and the manner in which it is represented in the collection. The search engine retrieves documents that it judges to be pertinent to our query and returns them to us without any explanation as to why these might be relevant, nor any encouragement to continue our investigation in a particular direction. It is up to the user to interpret the results, to establish relationships within the collection and to identify worthwhile avenues of future research [7]. Given that their knowledge of the collection is probably quite limited to begin with, this is hardly helpful. As was aptly put by Mitchell Whitelaw [8], these interfaces are not "generous".

This need for a more generous interface is the focus of a project currently being undertaken by Trinity College Dublin (TCD) Digital Collections. At present the website provides the simple search box that we have come to expect, driven by a default deployment of Solr. After conducting a search, users can narrow their interests along a broad series of facets: genre, media type, Trinity department, date and subject area. This interface results in a limited search experience, particularly with regards to exploration. The effects of this are demonstrable simply by looking at where the majority of traffic flows through the site (Figure 2).
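To make the current baseline concrete, the snippet below is a minimal sketch of the kind of faceted Solr request such a portal issues. The core name and the facet field names are assumptions made for illustration; they are not taken from the actual Digital Collections schema.

```python
# Minimal sketch of the kind of faceted Solr query a default portal issues.
# The core name and facet field names are assumptions made for illustration;
# they are not taken from the actual Digital Collections schema.
import requests

SOLR_URL = "http://localhost:8983/solr/collections/select"   # hypothetical core

params = {
    "q": "book of kells",             # free-text query from the search box
    "wt": "json",                     # request a JSON response
    "facet": "true",                  # enable facet counts
    "facet.field": ["genre", "mediatype", "department", "subjectlcsh"],
    "fq": "mediatype:manuscript",     # example filter once a facet value is chosen
}

resp = requests.get(SOLR_URL, params=params).json()
for doc in resp["response"]["docs"]:
    print(doc.get("title", "<untitled>"))
print(resp["facet_counts"]["facet_fields"])   # the raw counts behind the facet menu
```

The response is a ranked list plus per-field value counts; nothing in it explains why a document matched or suggests where to look next, which is precisely the limitation described above.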
The most famous text on the Digital Collections portal is the Book of Kells [1]. A huge percentage of hits on the site can be attributed to this single page, and variants of the query string "Book of Kells" are consistently among the most frequent searches conducted. Indeed, it is worth noting that many visitors to the site land directly on the page for the Book of Kells, having been referred there from Google, Facebook, Twitter etc. They never even see the initial search box on the homepage. After viewing the book, most users then simply browse away from the portal, not realising that they have barely touched the tip of the iceberg with regards to the volume of information and material available to them.

[Figure 1: Graph of most popular search terms on the Digital Collections site]

[Figure 2: Graph of pages which site visitors first land on. Note the DRIS ID for the Book of Kells is MS58 003v, which ranks above the home page]

Hence our goal is twofold: to provide a better, more accurate, more supportive search experience to users who come to explore the TCD Digital Collections site, and to foster a sense of curiosity in those who come to see one artifact, but may have an interest in so many more.

2. CORPUS
The corpus comprises approximately 100,000 high resolution scans of various documents curated by the Digital Collections group. These range from manuscripts to illustrations, etchings, postcards, templates, graphs, musical scores and more, spanning more than 1,000 years of human history. Information extraction techniques such as optical character recognition (OCR) have not been applied to the renderings, but each image has meta-data associated with it describing important attributes of the artifact. This data is listed in a single XML file which has been provided to us and is the foundation upon which we must build a new search interface.

As is typical in collections of this type, many of the XML fields denote information such as page number, document ID, catalogue number etc. However, there has also been some effort made to make the collection semantically inclined, although not fully semantically linked. The names of several fields are designed to reflect the structure of four well established library cataloguing ontologies: the Library of Congress Name Authority File (NAF), the Library of Congress Subject Headings (LCSH), the Getty Art and Architecture Thesaurus (AAT) and the Getty Union List of Artist Names (ULAN). The choice of ontology for a particular field depends on the nature of the content it represents and the availability of information within the ontologies themselves. For example, if an artist's name cannot be found in NAF, then ULAN is used instead. Although the entries in these ontologies are not explicitly referenced by the meta-data (i.e. there are no URIs in the XML file), the names of various fields have been selected so that they may be related back to their ontological equivalents. For example, the field denoting the subject of a document is named subjectlcsh, indicating that the data stored there corresponds to the LCSH ontology. While this is not ideal, it does mean that semantically linking the collection is possible and has been made easier by this method of annotating the data.

In addition to these rigidly defined attribute fields, there are also a number of free text fields, abstract and description being the two most verbose. These free text fields contain additional information about the artifact, much of which is not actually captured in the more semantic attributes. They are human readable sections which describe the artifact in moderate detail, explaining its origins, who commissioned it, where it was commissioned, how it came to be in the library, or any other information which was available to the transcriber. Often these fields reference entities which are not mentioned in any of the other document attributes, meaning there is much information hidden in these fields which could be extracted and harnessed to power a more meaningful search experience.
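As a rough illustration of how this meta-data could be gathered for downstream processing, the sketch below reads the XML file and separates the ontology-aligned fields from the free text fields. Only the field name subjectlcsh comes from the collection itself; the file name, record element and remaining field names are assumptions made for the example.

```python
# Sketch of loading the collection meta-data and separating the structured,
# ontology-aligned fields from the free text fields that will be mined for
# entities. Apart from "subjectlcsh", the element and field names here are
# assumptions for illustration; the real schema may differ.
import xml.etree.ElementTree as ET

tree = ET.parse("digital_collections.xml")       # hypothetical export file name

records = []
for record in tree.getroot().iter("record"):     # assumed record element name
    fields = {child.tag: (child.text or "").strip() for child in record}
    records.append({
        "id": fields.get("dris_id"),              # assumed identifier field
        "subject": fields.get("subjectlcsh"),     # LCSH-aligned subject heading
        # free text fields that hold the entity mentions discussed above
        "abstract": fields.get("abstract"),
        "description": fields.get("description"),
    })

print(f"Loaded {len(records)} records from the meta-data file")
```

Separating the two kinds of field in this way makes it straightforward to feed the free text into the extraction steps described in the next section.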
3. METHOD
Fostering engagement and encouraging exploration means discerning what interests a user and presenting them with content which relates to that interest. It may also mean determining what is of interest to a community of people at large and using this group perspective to assist an individual whose exploration has stalled.

While we could use traditional language modelling or probabilistic methods to determine which documents may be discussing the same subject and then make recommendations based on that, it is much better if we can establish what real world, tangible entities are influencing the user's search and then trace those entities through the collection. In order to do this, we must know what entities are present in the corpus to begin with. We are fortunate that many potentially useful entities have been manually extracted and stored in the XML file for us. However, much information is also hidden in the free text fields spread throughout the meta-data. This presents some interesting opportunities to perform automatic information extraction and analysis on the collection.

Named Entity Recognition (NER) is a well established field in Natural Language Processing (NLP) concerned with locating references to known entities in a body of text [6]. In general we search for specific patterns, parts of speech or words which appear in a gazetteer of terms. Much like anything involving natural language and computers, the results can be noisy. However, after the results of NER have been sanitised, they may then be disambiguated against a suitable knowledge source [5, 2].

Within the Digital Collections corpus, identifying mentions of entities in the free text fields and disambiguating them to a common knowledge base will allow us to establish which documents are related to which entities and, by extension, which documents are related to each other.

Disambiguation involves more than just co-referencing these entities within the collection. It links the collection's entities to a higher knowledge base which may connect them by proxy to external knowledge sources such as Wikipedia. These external sources may assist the user in understanding the primary source material, making the content more accessible for those who are inexperienced with the collection. The challenge is to determine which entity in the knowledge base is being referred to by the mention found in the text.

While this focus on entities may be useful, it may also be of benefit to attempt to establish the larger context in which a user's search is taking place. While the corpus is large (the abstracts alone total almost 21,000,000 words), the vocabulary is highly constrained (a little over 10,000 unique terms), suggesting that topic modelling may also be a viable option for structuring the corpus and influencing search.

Accurate topic modelling is difficult to achieve. Determining exactly how much content is required for a topic model to stabilise can be hard [4], and even after the model has stabilised there is no guarantee that the topics will be of use. Nevertheless, it may still be worthwhile to perform topic analysis such as Latent Dirichlet Allocation [3] on the collection to see whether new, useful patterns beyond the broad facets already in use may be found.
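As a concrete sketch of the entity extraction step described above, the example below runs an off-the-shelf NER model (spaCy) over a free text field and resolves the surviving mentions against a toy gazetteer. The gazetteer and sample text are illustrative placeholders; a real pipeline would disambiguate mentions collectively against a full knowledge base, along the lines of [2, 5].

```python
# Sketch of the NER and entity linking step: an off-the-shelf spaCy model
# proposes mentions in a free text field, and a toy gazetteer stands in for
# the knowledge base lookup. The gazetteer and sample text are illustrative
# placeholders, not part of the actual pipeline.
import spacy

nlp = spacy.load("en_core_web_sm")   # assumes the small English model is installed

# Hypothetical gazetteer: surface form -> knowledge base URI
GAZETTEER = {
    "Book of Kells": "https://en.wikipedia.org/wiki/Book_of_Kells",
    "Trinity College": "https://en.wikipedia.org/wiki/Trinity_College_Dublin",
}

def extract_entities(text):
    """Return (mention, label, uri) triples found in one free text field."""
    results = []
    for ent in nlp(text).ents:
        if ent.label_ in {"PERSON", "ORG", "GPE", "WORK_OF_ART"}:
            uri = GAZETTEER.get(ent.text)   # None when the mention is unresolved
            results.append((ent.text, ent.label_, uri))
    return results

sample = ("The Book of Kells was produced around 800 AD and has been held "
          "at Trinity College since the seventeenth century.")
for mention, label, uri in extract_entities(sample):
    print(mention, label, uri)
```

Each resolved URI then acts as a shared key: two records whose free text fields yield the same URI can be linked and surfaced to the user as related items.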
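The topic modelling option could be prototyped in a similarly lightweight way. The sketch below fits a Latent Dirichlet Allocation model [3] with gensim over a handful of placeholder abstracts; the topic count is an arbitrary stand-in that would in practice be chosen with a stability analysis such as [4].

```python
# Sketch of the topic modelling option: fit a Latent Dirichlet Allocation
# model [3] with gensim over the free text fields. The three "abstracts"
# below are placeholders, and the topic count is an arbitrary stand-in.
from gensim import corpora
from gensim.models import LdaModel

abstracts = [
    "illuminated manuscript gospel book produced by monks",
    "etching of the front square of the college",
    "musical score for a traditional irish air",
]

tokenised = [text.lower().split() for text in abstracts]           # naive tokenisation
dictionary = corpora.Dictionary(tokenised)                         # term <-> id mapping
bow_corpus = [dictionary.doc2bow(tokens) for tokens in tokenised]  # bag-of-words vectors

lda = LdaModel(bow_corpus, num_topics=20, id2word=dictionary, passes=10)
for topic_id, top_terms in lda.print_topics(num_topics=5, num_words=5):
    print(topic_id, top_terms)
```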
[Figure 3: A screenshot of the current home page of the Digital Collections website]

4. CONCLUSIONS
As can be seen, there are several options for what can be done with a collection such as TCD's Digital Collections corpus. The quality with which we can automatically extract information and relationships from the collection is greatly dependent on the quality of the data itself. The quantity of data also plays a role in the accuracy of automatic methods. However, with the data extracted from the collection, we have more information at our disposal for assisting and engaging with the user as they search the collection.

Of course, even the best search technology can be felled by poor user interface design. This too will be a factor in the final development of the new Digital Collections portal.

5. ACKNOWLEDGMENTS
This research is supported by the Science Foundation Ireland (Grant 07/CE/I1142) in the ADAPT Centre (adaptcentre.ie) at Trinity College, Dublin.

6. REFERENCES
[1] Book of Kells. http://digitalcollections.tcd.ie/home/index.php?DRIS_ID=MS58_003v. [Online; accessed 30-May-2016].
[2] A. Alhelbawy and R. J. Gaizauskas. Graph ranking for collective named entity disambiguation.
[3] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.
[4] D. Greene, D. O'Callaghan, and P. Cunningham. How many topics? Stability analysis for topic models. In Machine Learning and Knowledge Discovery in Databases, pages 498-513. Springer, 2014.
[5] Z. Guo and D. Barbosa. Robust entity linking via random walks. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, pages 499-508. ACM, 2014.
[6] D. Nadeau and S. Sekine. A survey of named entity recognition and classification. Lingvisticae Investigationes, 30(1):3-26, 2007.
[7] R. W. White and R. A. Roth. Exploratory search: Beyond the query-response paradigm. Synthesis Lectures on Information Concepts, Retrieval, and Services, 1(1):1-98, 2009.
[8] M. Whitelaw. Generous interfaces for digital cultural collections. Digital Humanities Quarterly, 9(1), 2015.