Exposing Ourselves: Displaying our Cultural Assets for Public Consumption

Gary Munnelly, Kevin Koidl, Séamus Lawless
ADAPT Centre, O'Reilly Building, Trinity College Dublin, Ireland
munnellg@tcd.ie, Kevin.Koidl@scss.tcd.ie, Seamus.Lawless@scss.tcd.ie

ABSTRACT
This paper discusses an early stage project to develop a new, enhanced interface for the Trinity College Dublin (TCD) Digital Collections website. We describe the current state of the portal and outline some of the unique issues observed when examining user engagement. A major factor in our development of enhanced search tools will be to leverage the entities present in the documents to establish more reliable connections between items in the collection. Not only do we expect that this will lead to better ranked search results, but we also wish to investigate how these entities may be used to encourage site visitors to explore the site beyond their initial research goal. The early stage of this project means that plans are still being finalised. Hence we also speculate about other methods which may be applied to this corpus.

Keywords
Entity Search; Digital Libraries; Information Retrieval

1. INTRODUCTION
In many ways, the vision of Digital Humanities with regards to cultural heritage is a noble one. It is one in which all people have free, unbridled access to primary sources from which they may learn about their heritage and the rich history of their origins. We are free to lose ourselves in the depths of a historical archive from the comfort of our computer screens, supported in our exploration by a host of intelligent information retrieval systems.

In theory, after the arduous process of digitising the collection, providing such functionality ought to be a simple task. Building and deploying a website has become a trivial process, and off-the-shelf tools such as Solr provide state-of-the-art text retrieval functionality with minimal effort. Given a suitable portal and a search box which returns ranked results, what more could a user want?

As it happens, this approach to curating documents has been found wanting in many ways. The most immediate problem with the query-response paradigm is that, in order to use the search interface, we must know exactly what we are looking for and the manner in which it is represented in the collection. The search engine retrieves documents that it judges to be pertinent to our query and returns them to us without any explanation as to why these might be relevant, nor any encouragement to continue our investigation in a particular direction. It is up to the user to interpret the results, to establish relationships within the collection and to identify worthwhile avenues of future research [7]. Given that their knowledge of the collection is probably quite limited to begin with, this is hardly helpful. As was aptly put by Mitchell Whitelaw [8], these interfaces are not "generous".

This need for a more generous interface is the focus of a project currently being undertaken by Trinity College Dublin (TCD) Digital Collections. At present the website provides the simple search box that we have come to expect, driven by a default deployment of Solr. After conducting a search, users can narrow their interests along a broad series of facets: genre, media type, Trinity department, date and subject area. This interface results in a limited search experience, particularly with regards to exploration. The effects of this are demonstrable simply by looking at where the majority of traffic flows through the site (Figure 2).
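To make the current baseline concrete, the snippet below is a minimal sketch of the kind of faceted Solr request such a portal issues. The core name and the facet field names are assumptions made for illustration; they are not taken from the actual Digital Collections schema.

```python
# Minimal sketch of the kind of faceted Solr query a default portal issues.
# The core name and facet field names are assumptions made for illustration;
# they are not taken from the actual Digital Collections schema.
import requests

SOLR_URL = "http://localhost:8983/solr/collections/select"   # hypothetical core

params = {
    "q": "book of kells",             # free-text query from the search box
    "wt": "json",                     # request a JSON response
    "facet": "true",                  # enable facet counts
    "facet.field": ["genre", "mediatype", "department", "subjectlcsh"],
    "fq": "mediatype:manuscript",     # example filter once a facet value is chosen
}

resp = requests.get(SOLR_URL, params=params).json()
for doc in resp["response"]["docs"]:
    print(doc.get("title", "<untitled>"))
print(resp["facet_counts"]["facet_fields"])   # the raw counts behind the facet menu
```

The response is a ranked list plus per-field value counts; nothing in it explains why a document matched or suggests where to look next, which is precisely the limitation described above.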
The most famous text on the Digital Collections portal is the Book of Kells [1]. A huge percentage of hits on the site can be attributed to this single page, and variants of the query string "Book of Kells" are consistently among the most frequent searches conducted. Indeed, it is worth noting that many visitors to the site land directly on the page for the Book of Kells, having been referred there from Google, Facebook, Twitter etc. They never even see the initial search box on the homepage. After viewing the book, most users then simply browse away from the portal, not realising that they have barely touched the tip of the iceberg with regards to the volume of information and material available to them.

[Figure 1: Graph of most popular search terms on the Digital Collections site]

[Figure 2: Graph of pages which site visitors first land on. Note the DRIS ID for the Book of Kells is MS58 003v, which ranks above the home page]

Hence our goal is twofold: to provide a better, more accurate, more supportive search experience to users who come to explore the TCD Digital Collections site, and to foster a sense of curiosity in those who come to see one artifact, but may have an interest in so many more.

2. CORPUS
The corpus comprises approximately 100,000 high resolution scans of various documents curated by the Digital Collections group. These range from manuscripts to illustrations, etchings, postcards, templates, graphs, musical scores and more, spanning more than 1,000 years of human history. Information extraction techniques such as optical character recognition (OCR) have not been applied to the renderings, but each image has meta-data associated with it describing important attributes of the artifact. This data is listed in a single XML file which has been provided to us and is the foundation upon which we must build a new search interface.

As is typical in collections of this type, many of the XML fields denote information such as page number, document ID, catalogue number etc. However, there has also been some effort made to make the collection semantically inclined, although not fully semantically linked. The names of several fields are designed to reflect the structure of four well established library cataloguing ontologies: the Library of Congress Name Authority File (NAF), the Library of Congress Subject Headings (LCSH), the Getty Art and Architecture Thesaurus (AAT) and the Getty Union List of Artist Names (ULAN). The choice of ontology for a particular field depends on the nature of the content it represents and the availability of information within the ontologies themselves. For example, if an artist's name cannot be found in NAF, then ULAN is used instead. Although the entries in these ontologies are not explicitly referenced by the meta-data (i.e. there are no URIs in the XML file), the names of various fields have been selected so that they may be related back to their ontological equivalents. For example, the field denoting the subject of a document is named subjectlcsh, indicating that the data stored there corresponds to the LCSH ontology. While this is not ideal, it does mean that semantically linking the collection is possible and has been made easier by this method of annotating the data.

In addition to these rigidly defined attribute fields, there are also a number of free text fields, abstract and description being the two most verbose. These free text fields contain additional information about the artifact, much of which is not actually captured in the more semantic attributes. They are human readable sections which describe the artifact in moderate detail, explaining its origins, who commissioned it, where it was commissioned, how it came to be in the library, or any other information which was available to the transcriber. Often these fields reference entities which are not mentioned in any of the other document attributes, meaning there is much information hidden in these fields which could be extracted and harnessed to power a more meaningful search experience.
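As a rough illustration of how this meta-data could be gathered for downstream processing, the sketch below reads the XML file and separates the ontology-aligned fields from the free text fields. Only the field name subjectlcsh comes from the collection itself; the file name, record element and remaining field names are assumptions made for the example.

```python
# Sketch of loading the collection meta-data and separating the structured,
# ontology-aligned fields from the free text fields that will be mined for
# entities. Apart from "subjectlcsh", the element and field names here are
# assumptions for illustration; the real schema may differ.
import xml.etree.ElementTree as ET

tree = ET.parse("digital_collections.xml")       # hypothetical export file name

records = []
for record in tree.getroot().iter("record"):     # assumed record element name
    fields = {child.tag: (child.text or "").strip() for child in record}
    records.append({
        "id": fields.get("dris_id"),              # assumed identifier field
        "subject": fields.get("subjectlcsh"),     # LCSH-aligned subject heading
        # free text fields that hold the entity mentions discussed above
        "abstract": fields.get("abstract"),
        "description": fields.get("description"),
    })

print(f"Loaded {len(records)} records from the meta-data file")
```

Separating the two kinds of field in this way makes it straightforward to feed the free text into the extraction steps described in the next section.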
3. METHOD
Fostering engagement and encouraging exploration means discerning what interests a user and presenting them with content which relates to that interest. It may also mean determining what is of interest to a community of people at large and using this group perspective to assist an individual whose exploration has stalled.

While we could use traditional language modelling or probabilistic methods to determine which documents may be discussing the same subject and then make recommendations based on that, it is much better if we can establish what real world, tangible entities are influencing the user's search and then trace those entities through the collection. In order to do this, we must know what entities are present in the corpus to begin with. We are fortunate that many potentially useful entities have been manually extracted and stored in the XML file for us. However, much information is also hidden in the free text fields spread throughout the meta-data. This presents some interesting opportunities to perform automatic information extraction and analysis on the collection.

Named Entity Recognition (NER) is a well established field in Natural Language Processing (NLP) concerned with locating references to known entities in a body of text [6]. In general we search for specific patterns, parts of speech or words which appear in a gazetteer of terms. Much like anything involving natural language and computers, the results can be noisy. However, after the results of NER have been sanitised, they may then be disambiguated against a suitable knowledge source [5, 2].

Within the Digital Collections corpus, identifying mentions of entities in the free text fields and disambiguating them to a common knowledge base will allow us to establish which documents are related to which entities and, by extension, which documents are related to each other.

Disambiguation involves more than just co-referencing these entities within the collection. It links the collection's entities to a higher knowledge base which may connect them by proxy to external knowledge sources such as Wikipedia. These external sources may assist the user in understanding the primary source material, making the content more accessible for those who are inexperienced with the collection. The challenge is to determine which entity in the knowledge base is being referred to by the mention found in the text.

While this focus on entities may be useful, it may also be of benefit to attempt to establish the larger context in which a user's search is taking place. While the corpus is large (the abstracts alone total almost 21,000,000 words), the vocabulary is highly constrained (a little over 10,000 unique terms), suggesting that topic modelling may also be a viable option for structuring the corpus and influencing search.

Accurate topic modelling is difficult to achieve. Determining exactly how much content is required for a topic model to stabilise can be hard [4], and even after the model has stabilised there is no guarantee that the topics will be of use. Nevertheless, it may still be worthwhile to perform topic analysis such as Latent Dirichlet Allocation [3] on the collection to see whether new, useful patterns beyond the broad facets already in use may be found.
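As a concrete sketch of the entity extraction step described above, the example below runs an off-the-shelf NER model (spaCy) over a free text field and resolves the surviving mentions against a toy gazetteer. The gazetteer and sample text are illustrative placeholders; a real pipeline would disambiguate mentions collectively against a full knowledge base, along the lines of [2, 5].

```python
# Sketch of the NER and entity linking step: an off-the-shelf spaCy model
# proposes mentions in a free text field, and a toy gazetteer stands in for
# the knowledge base lookup. The gazetteer and sample text are illustrative
# placeholders, not part of the actual pipeline.
import spacy

nlp = spacy.load("en_core_web_sm")   # assumes the small English model is installed

# Hypothetical gazetteer: surface form -> knowledge base URI
GAZETTEER = {
    "Book of Kells": "https://en.wikipedia.org/wiki/Book_of_Kells",
    "Trinity College": "https://en.wikipedia.org/wiki/Trinity_College_Dublin",
}

def extract_entities(text):
    """Return (mention, label, uri) triples found in one free text field."""
    results = []
    for ent in nlp(text).ents:
        if ent.label_ in {"PERSON", "ORG", "GPE", "WORK_OF_ART"}:
            uri = GAZETTEER.get(ent.text)   # None when the mention is unresolved
            results.append((ent.text, ent.label_, uri))
    return results

sample = ("The Book of Kells was produced around 800 AD and has been held "
          "at Trinity College since the seventeenth century.")
for mention, label, uri in extract_entities(sample):
    print(mention, label, uri)
```

Each resolved URI then acts as a shared key: two records whose free text fields yield the same URI can be linked and surfaced to the user as related items.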
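The topic modelling option could be prototyped in a similarly lightweight way. The sketch below fits a Latent Dirichlet Allocation model [3] with gensim over a handful of placeholder abstracts; the topic count is an arbitrary stand-in that would in practice be chosen with a stability analysis such as [4].

```python
# Sketch of the topic modelling option: fit a Latent Dirichlet Allocation
# model [3] with gensim over the free text fields. The three "abstracts"
# below are placeholders, and the topic count is an arbitrary stand-in.
from gensim import corpora
from gensim.models import LdaModel

abstracts = [
    "illuminated manuscript gospel book produced by monks",
    "etching of the front square of the college",
    "musical score for a traditional irish air",
]

tokenised = [text.lower().split() for text in abstracts]           # naive tokenisation
dictionary = corpora.Dictionary(tokenised)                         # term <-> id mapping
bow_corpus = [dictionary.doc2bow(tokens) for tokens in tokenised]  # bag-of-words vectors

lda = LdaModel(bow_corpus, num_topics=20, id2word=dictionary, passes=10)
for topic_id, top_terms in lda.print_topics(num_topics=5, num_words=5):
    print(topic_id, top_terms)
```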
[Figure 3: A screenshot of the current home page of the Digital Collections website]

4. CONCLUSIONS
As can be seen, there are several options for what can be done with a collection such as TCD's Digital Collections corpus. The quality with which we can automatically extract information and relationships from the collection is greatly dependent on the quality of the data itself. The quantity of data also plays a role in the accuracy of automatic methods. However, with the data extracted from the collection, we have more information at our disposal for assisting and engaging with the user as they search the collection.

Of course, even the best search technology can be felled by poor user interface design. This too will be a factor in the final development of the new Digital Collections portal.

5. ACKNOWLEDGMENTS
This research is supported by the Science Foundation Ireland (Grant 07/CE/I1142) in the ADAPT Centre (adaptcentre.ie) at Trinity College, Dublin.

6. REFERENCES
[1] Book of Kells. http://digitalcollections.tcd.ie/home/index.php?DRIS_ID=MS58_003v. [Online; accessed 30-May-2016].
[2] A. Alhelbawy and R. J. Gaizauskas. Graph ranking for collective named entity disambiguation.
[3] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.
[4] D. Greene, D. O'Callaghan, and P. Cunningham. How many topics? Stability analysis for topic models. In Machine Learning and Knowledge Discovery in Databases, pages 498-513. Springer, 2014.
[5] Z. Guo and D. Barbosa. Robust entity linking via random walks. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, pages 499-508. ACM, 2014.
[6] D. Nadeau and S. Sekine. A survey of named entity recognition and classification. Lingvisticae Investigationes, 30(1):3-26, 2007.
[7] R. W. White and R. A. Roth. Exploratory search: Beyond the query-response paradigm. Synthesis Lectures on Information Concepts, Retrieval, and Services, 1(1):1-98, 2009.
[8] M. Whitelaw. Generous interfaces for digital cultural collections. Digital Humanities Quarterly, 9(1), 2015.