Enhancing Access to Media Collections and Archives Using Computational Linguistic Tools

James Pustejovsky, Marc Verhagen
Department of Computer Science, Brandeis University
E-mail: {jamesp,marc}@cs.brandeis.edu

Nancy Ide, Keith Suderman
Department of Computer Science, Vassar College
E-mail: {ide,suderman}@cs.vassar.edu

Abstract

In this paper, we outline the strategies, methodology, and infrastructure needed to bring advanced computational linguistic tools to researchers and archivists in the humanities. We discuss three use cases involving the application of the Language Application Grid (LAPPS), an open, web-based infrastructure providing interoperable access to hundreds of computational linguistic (CL) component web services, together with facilities for multi-step analyses via tools pipelining, performance evaluation, and resource delivery. These include: CL analysis of corpora restricted under copyright; the challenge posed by radio and television media collections; and the use of LAPPS for assisting archivists in their collection and cataloguing efforts. We believe that the adoption and use of CL platforms such as LAPPS by the digital humanities (DH) will help foster better communication, sharing, and research between the two communities.

1 Introduction

In the 1960s, the fields that are now called "computational linguistics" and "digital humanities" were not recognized as distinct [9, 19, 20]. In the 1970s, when computational linguistics (CL) began to be heavily influenced by advances in the field of Artificial Intelligence and adopted logic- and rule-based, symbolic methods, "Humanities Computing" retained the fundamentally statistical approach prevalent in the previous decade. Over the ensuing 40 years, the two fields have evolved in relative isolation. Some efforts were made in the early 1990s to reunite the two when CL once again took up statistical methods, using the argument that statistical methods adopted and adapted in CL had much to offer the field of digital humanities, and also that digital humanities, with its vast store of creative language data, provided a challenge to current methods that could yield fresh insights into the ways language conveys meaning. These efforts failed, and as a result, the two fields continued to evolve along their own paths. CL pushed empirical methods forward into machine learning and, most recently, "deep learning" involving neural networks and similar structures, while humanities computing evolved along a very different path, encompassing the creation, maintenance, and use of massive libraries of digitized data, including not only literary, historical, and similar texts but also images, audio, and video, representing artifacts relevant to the arts and humanities. Thus the term "digital humanities" was coined.

Within the past few years, Digital Humanities (DH) has looked to CL for methods to enable richer analysis of literary, historical, and other kinds of documents, recognizing that CL methods and procedures can in fact enhance the kinds and amount of information that can be automatically extracted from language data [22]. However, re-marrying the two disciplines has proven non-trivial [18, 20]. The difficulty is invariably attributed to a lack of accommodation in CL tools for users who are not technically inclined, and indeed, this is largely true.
However, the roots of the problem go far deeper, stemming from two complementary factors: differences in the goals for which the same analyses are applied in each area, and differences in the methodological norms and perspective of the researcher. In this paper, we outline a general methodology towards accomplishing the goal of re-integrating the two fields, and list the requirements on what tools are needed by humanities researchers and archivists. We first review the platform of the LAPPS Grid, and then examine three case studies of how the platform can help the humanities scholar and archivist in their research.

2 The Language Applications (LAPPS) Grid

Over the past ten years, there has been increased activity in efforts to integrate Human Language Technologies (HLT) applications, corpora, as well as development platforms. This stems from an obvious and growing demand for robust language processing capabilities across academic disciplines, education, and industry. However, one of the major problems in this area has been and remains component interoperability, reusability, and integration. This has resulted in much of the field of HLT being fragmented, characterized by a lack of standard practices, few widely usable and reusable tools and resources, and much redundancy of effort. Rapid development and deployment of HLT applications has also been hindered by the lack of ready-made, standardized evaluation mechanisms, especially those which enable evaluation of component performance in applications consisting of a pipeline of processing tools.

To address these problems, we have developed an open, web-based infrastructure, called the LAPPS Grid [10, 21] (funded by the NSF-SI2 initiative), that provides interoperable access to hundreds of HLT component web services, together with facilities for multi-step analyses via tools pipelining, performance evaluation, and resource delivery for a wide range of language resources [11, 13]. As an easy-to-use interface and management system, the LAPPS Grid has adopted the Galaxy framework [6] (funded by NSF awards 0543285 and 0850103), a robust, well-developed platform for workflow configuration and management, and persistence of results. The LAPPS Grid affords the possibility of creating ready-made workflows to perform specific analytic tasks that can be used off-the-shelf or customized to accommodate specific projects, as well as means to compose and evaluate workflows from atomic NLP components [12]. The LAPPS/Galaxy platform can be accessed through a web interface (http://www.lappsgrid.org), deployed locally on any Unix system, or run from the cloud. Figure 1 provides an overview of the LAPPS Grid architecture.

Figure 1: The LAPPS Grid supports discovery, adaptation and composition of language technologies.
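To make workflow composition from atomic components concrete, the sketch below chains three NLP web services over HTTP, feeding each service's output to the next in the way a Galaxy workflow does on the user's behalf. It is a minimal illustration only: the endpoint URLs and the simplified JSON payload are assumptions made for the example, whereas the actual Grid handles service discovery for the user and exchanges data in the LAPPS Interchange Format [21].

```python
import requests  # third-party HTTP client

# Hypothetical service endpoints for illustration; real LAPPS Grid services
# are discovered through the Grid itself and exchange richer LIF documents.
PIPELINE = [
    "https://example.org/services/tokenizer",
    "https://example.org/services/pos-tagger",
    "https://example.org/services/named-entity-recognizer",
]

def run_pipeline(text: str) -> dict:
    """Pass a document through each service in turn, so that every step adds
    its annotations on top of the previous step's output."""
    payload = {"text": text, "views": []}   # simplified stand-in for LIF
    for service in PIPELINE:
        response = requests.post(service, json=payload, timeout=60)
        response.raise_for_status()
        payload = response.json()           # enriched with a new annotation view
    return payload

if __name__ == "__main__":
    result = run_pipeline("WGBH broadcast a documentary about Boston in 1974.")
    print(result)
```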
3 Language Analysis within the HathiTrust Data Capsule

In our first case study, we examine the role that the LAPPS Grid can play in ensuring access for CL tools over copyright-restricted content, specifically, over the HathiTrust Library [17, 5]. The HathiTrust Digital Library comprises the digitized representations of 13.68 million volumes, 6.84 million book titles, 359,528 serial titles, and 4.79 billion pages. Approximately 39% of the items in the HathiTrust corpus are digital representations of print volumes in the public domain. The remaining 61% are works under copyright. Because of copyright restrictions, scholars have come to see this 61% of the HathiTrust collection of volumes as sitting behind a 'copyright wall' that makes it next to impossible for them to have meaningful access to their content.

The HathiTrust Research Center (HTRC) develops software infrastructure, models, and tools to help digital humanities (DH) scholars conduct new computational analyses of works in the HathiTrust corpus, with a focus on analysis of larger datasets than can be done today (what they call "analysis at scale"). One of the key infrastructure components of HTRC is the Data Capsule (DC). Recently, LAPPS/Galaxy has been adopted by a Mellon-funded project at the University of Illinois, which is utilizing the platform to apply sophisticated HLT text mining methods to HTRC's massive digital library (https://www.hathitrust.org). The HTRC's DC Project involves a collaboration between Illinois, Indiana, Brandeis, Oxford, and Waikato Universities [7]. Working with our Illinois and Indiana collaborators, the project is focused on implementing specific LAPPS tools that are most needed by the digital humanities scholar within the HTRC user community.

The HTRC Data Capsule [23], shown below in Figure 2, is a solution for provisioning secure researcher access directly to the raw data objects of the HathiTrust.

Figure 2: HTRC Secure Commons architectural components.

The goals of the present work include: the deployment of tools that enhance search and discovery across the library by complementing traditional volume-level bibliographic metadata with new metadata, using specially-developed LAPPS/Galaxy-based CL applications; the creation of Linked Open Data resources to help scholars find, select, integrate and disseminate a wider range of data as part of their scholarly analysis life-cycle; and a set of exemplar pre-built Data Capsules that incorporate tools commonly used by both the DH and CL communities that scholars can then customize to address their specific needs.

The initial work carried out within the WCSA+DC research involved an integrated effort of studying the needs and requirements of the DC users: that is, identifying those NLP web services that have already been wrapped and integrated into the LAPPS Grid, as well as modules that are not yet available. This is being followed by the integration of document-level and document-collection processing modules (genre and topic identification) into the DC, as well as the most basic low-level processing (sentencization, tokenization, and POS tagging). The next level of processing planned includes the more computationally intensive NLP modules, such as recognizing named entities (cities, countries, people, etc.) and performing various levels of constituent- and dependency-based parsing at the sentence level. This will be followed by a detailed evaluation of the NLP services. This involves: (a) assessing the overall performance of each component service within the Data Capsule; and (b) examining the possible workflow configurations of the different services as configured in distinct pipelines to determine the optimal configuration in terms of performance. The ability to apply a cyclic process of iterative testing, evaluation, and re-configuration is particularly important for rapid development of workflows to suit specific user needs, and is one of the benefits offered by adopting the LAPPS Grid framework.
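As an illustration of the low-level processing layer described above (sentencization, tokenization, and POS tagging), the following sketch uses NLTK as a stand-in for the wrapped services a Data Capsule would actually expose; the library choice and the returned data shape are assumptions made for the example, not the project's configuration.

```python
# Illustrative only: NLTK stands in for the wrapped LAPPS services in a capsule.
# Requires: pip install nltk, plus the 'punkt' and 'averaged_perceptron_tagger'
# data packages (nltk.download(...)).
import nltk

def low_level_analysis(volume_text: str) -> list:
    """Sentencize, tokenize, and POS-tag one volume, returning per-sentence
    lists of (token, tag) pairs derived from the text."""
    sentences = nltk.sent_tokenize(volume_text)
    return [nltk.pos_tag(nltk.word_tokenize(s)) for s in sentences]

if __name__ == "__main__":
    tagged = low_level_analysis(
        "Herman Melville published Moby-Dick in 1851. It sold poorly at first.")
    for sentence in tagged:
        print(sentence)
```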
4 CL-Facilitated Access to Media Collections

While the previous use case looked at how HLT tools can assist in the discovery of content in text-based corpora that are subject to limited access due to copyright restrictions, the second use case we consider examines the role that CL tools can play in broadcast media collections. We are currently examining the content of the American Archive of Public Broadcasting, a collaboration between WGBH and the Library of Congress (americanarchive.org).

The last sixty years of our shared history and culture have been well documented on broadcast media. Many important events, persons, issues, and conflicts have been recorded and discussed in programs at the national and local levels on public television and radio, but much of this material is currently inaccessible and has yet to become part of the historical record. Scholars have long recognized the value of media for the evidence such material can provide about the past as well as the manner in which the public has experienced the news. Likewise, educators have appreciated the ability of audiovisual materials to make history come alive in the classroom setting. Both scholars and educators have been frustrated by the difficulties associated with accessing these materials [8, 14]. From our discussions with archivists and historians, it seems that making historical public broadcasting programs accessible and searchable would be a great enabler for scholarship.

Broadcast media is especially important because of the era it reflects, such as that contained in the American Archive. Broadcast media, once it is made accessible, will add rich archival material to enhance the historical record. Not only is much of our broadcast media history inaccessible, but it is also in danger of degrading and becoming lost to posterity if it remains much longer on station and archival shelves. Reformatting to a digital format is necessary for long-term preservation, but digitizing is only the first step towards improved access to this material for use by scholars and educators. Most materials held in storage by stations contain minimal descriptive information, in many cases only a program title [1]. Once this material is digitized, cataloging becomes an extraordinarily labor-intensive endeavor. Using CL applications for language-based analysis can help extract significantly more descriptive information; however, optimizing these tools for humanities research requires a digital history team working closely with the CL team to map the output to historical events, places, people, and themes; iteratively improving the computation by revising the assumptions the tool makes; and providing historical interpretations of the new data. These are some of the tasks we are currently carrying out with WGBH and their affiliates.

5 A Language Application Toolkit for Archivists

As our final case study, we examine the role that an HLT platform such as the LAPPS Grid can play in helping digital archivists manage their collections [3]. We have recently begun collaborating with several media and archival organizations to explore the applicability of the LAPPS Grid platform to the specific needs of cataloguing, indexing, and retrieving data from media collections. In particular, we have partnered with WGBH of Boston to determine how the configuration and combination of existing computational linguistic (CL) tools can significantly transform the way archivists and librarians describe their media collections.
Using WGBH's corpus of archival video and audio transcripts and metadata as a research data set, we have started to develop a toolkit that will be evaluated on its ability to create and enhance metadata and improve discoverability of large and diverse media collections, substantially reducing the effort and time spent creating and improving records for their collections. This toolkit will be built on top of the Language Application (LAPPS) Grid and will leverage both the tools and the workflow composer environment already present in the LAPPS Grid framework.

Audio and video media are, by definition, not text, and therefore opaque to text search engine capabilities. Finding content relevant to one's research question among thousands of hours of programming is time-consuming, involving watching and/or listening to potentially hours of content in order to zero in on relevant content [15]. The availability of descriptive, structured, textual metadata about the content of these collections and about the items they include radically improves search and browse capabilities for researchers; however, the effort to fully describe and catalog these materials is highly labor-intensive and therefore costly.

Such a toolkit is an excellent example of how current CL tools can be configured and combined into a drag-and-drop toolkit and incorporated into archival accessioning and cataloging workflows to significantly ease the work involved in creating rich, descriptive metadata records for each item. The toolkit extends the current capabilities of LAPPS to include tools (already available in projects such as Alveo [4]) for accessing text content in audio and video, as well as access to the publicly available audio and video materials in the WGBH archives. By coupling these tools with sophisticated CL modules for information retrieval, question answering, and text mining, the goal is to be able to create composite workflows to extract and analyze information gleaned from these resources. Through a process of iterative refinement involving both testing of tools and augmentation of supporting resources, we will develop a set of optimal workflows for information extraction from audio and video and evaluate the results at both the collection and individual item levels, to determine the degree to which the annotation process is facilitated. These ready-made workflows will be made available to archivists, who can customize them for particular domains and applications and augment supporting resources with additional data, either extracted in earlier steps or derived from other sources. It is our expectation that the project will ultimately produce a set of ready-made (but still customizable) workflow 'packages' that will dramatically reduce the time and cost of metadata production for digital archival materials.

Current archival practice involves the need to dedicate many human hours to create, normalize, and catalog collections; however, cataloging is so time-consuming that collections are often put into a queue for cataloging, creating a huge backlog of unprocessed collections. By using such a toolkit, cataloging will be incorporated into the acquisitions workflow and will become a duty of the computer, allowing humans to reallocate their time to tasks that still require a human to perform. Instead of catalogers watching hour-long programs and recording descriptive information about the material, the LAPPS toolkit would automate the creation of metadata: producing speech-to-text transcripts; identifying proper names, locations, organizations, and even dates; and performing metadata clean-up and normalization. The output of the system could then be ingested into the archives' metadata repository automatically. This new workflow could save months, even years, of an archivist's time.
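As a rough sketch of what one such automated cataloging step might produce, the example below turns an already obtained speech-to-text transcript into a draft metadata record by pulling out people, places, organizations, and dates. The use of spaCy and the record's field names are illustrative assumptions rather than the toolkit's actual design; in practice these steps would be LAPPS services composed in a Galaxy workflow.

```python
# Illustrative sketch: turn a program transcript into a draft metadata record.
# spaCy and the fields shown are stand-ins for whatever extraction and
# cataloging services the eventual LAPPS-based toolkit composes.
import json
import spacy

nlp = spacy.load("en_core_web_sm")  # requires the small English model

def draft_record(program_id: str, title: str, transcript: str) -> dict:
    """Extract people, places, organizations, and dates from a transcript
    and package them as a draft catalog record for archivist review."""
    doc = nlp(transcript)

    def unique(label: str) -> list:
        return sorted({ent.text for ent in doc.ents if ent.label_ == label})

    return {
        "id": program_id,
        "title": title,
        "people": unique("PERSON"),
        "places": unique("GPE") + unique("LOC"),
        "organizations": unique("ORG"),
        "dates": unique("DATE"),
        "needs_review": True,   # a human cataloger still approves the record
    }

if __name__ == "__main__":
    record = draft_record("aapb-0001", "Evening Report",
                          "Mayor Kevin White spoke in Boston on March 4, 1974.")
    print(json.dumps(record, indent=2))
```

The needs_review flag reflects the division of labor described above: the toolkit drafts the record, and the archivist verifies and enriches it before ingest.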
6 What the Digital Humanities Need from CL

The example uses of CL technologies for DH research outlined above demonstrate some of the ways in which the use of CL tools differs between the two fields. In broad terms, the goal of CL is to achieve some level of automatic understanding or interpretation of human language data for sophisticated applications such as question answering, machine translation, information retrieval, or summarization [2, 16]. Tool chains for end-to-end performance of this kind of task are developed and tested for their efficacy; the focus is on the final result of applying the tool chain comprising the application, with a relatively high tolerance for error or "noise". In contrast, for DH the focus is more often on finding information that may then be subjected to further human analysis, and may require what in CL are considered to be relatively low-level, enabling tasks, such as tokenization, sentence boundary detection, part-of-speech tagging, named entity recognition, or gross-level dependency analysis. Furthermore, DH deals with data from vastly varying domains and genres, while CL at present tends to focus on a somewhat more limited range of data. Thus for DH the availability of robust, highly accurate tools for low-level tasks that are applicable or configurable to handle vastly different domains is perhaps the highest priority, rather than the overall performance of high-level sophisticated NLP applications. The LAPPS Grid, which provides easy-to-use access to a wide variety of customizable low-level CL tools together with means to evaluate performance on datasets from different domains, already addresses these needs to a large extent. Its adaptation to accommodate the projects described in the previous sections should continue to augment its capabilities to serve the needs of DH research.
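The kind of domain-specific evaluation mentioned here can be as simple as scoring a tool's output against a small hand-checked sample from the new domain. The sketch below computes exact-match precision, recall, and F1 over entity spans; the (start, end, label) span representation is an assumption made for the example, not a Grid interface.

```python
# Illustrative sketch: compare predicted entity spans against a hand-checked
# gold sample and report precision, recall, and F1.
def score(gold: set, predicted: set) -> tuple:
    """Exact-match scoring over (start, end, label) entity spans."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

if __name__ == "__main__":
    gold = {(0, 11, "PERSON"), (31, 37, "GPE"), (41, 55, "DATE")}
    predicted = {(0, 11, "PERSON"), (31, 37, "ORG"), (41, 55, "DATE")}
    p, r, f = score(gold, predicted)
    print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```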
References

[1] Howard Besser. The next stage: Moving from isolated digital collections to interoperable digital libraries. First Monday, 7(6), 2002.

[2] Julian Brooke, Adam Hammond, and Graeme Hirst. GutenTag: an NLP-driven tool for digital humanities research in the Project Gutenberg corpus. In Proceedings of the Fourth Workshop on Computational Linguistics for Literature, pages 42–47. Association for Computational Linguistics, 2015.

[3] Karen Cariani and Casey Davis. Let the computer do the work. In Presentations from the FIAT/IFTA 2016 World Conference, Warsaw, Poland, http://fiatifta.org/, October 2016.

[4] Steve Cassidy, Dominique Estival, Timothy Jones, Denis Burnham, and Jared Burghold. The Alveo Virtual Laboratory: A web-based repository API. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland, May 2014. European Language Resources Association (ELRA).

[5] J. Stephen Downie, Kirstin Dougan, Sayan Bhattacharyya, and Colleen Fallaw. The HathiTrust corpus: A digital library for musicology research? In Proceedings of the 1st International Workshop on Digital Libraries for Musicology, pages 1–8. ACM, 2014.

[6] B. Giardine, C. Riemer, R. C. Hardison, R. Burhans, L. Elnitski, P. Shah, Y. Zhang, D. Blankenberg, I. Albert, J. Taylor, W. Miller, W. J. Kent, and A. Nekrutenko. Galaxy: a platform for interactive large-scale genome analysis. Genome Research, 15(10):1451–55, 2005.

[7] HathiTrust. The Workset Creation for Scholarly Analysis + Data Capsules. https://www.hathitrust.org/2016-spring-update.

[8] Annika Hinze, Craig Taube-Schock, David Bainbridge, Rangi Matamua, and J. Stephen Downie. Improving access to large-scale digital libraries through semantic-enhanced search and disambiguation. In Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries, pages 147–156. ACM, 2015.

[9] Susan Hockey. The history of humanities computing. A Companion to Digital Humanities, pages 3–19, 2004.

[10] Nancy Ide, James Pustejovsky, Christopher Cieri, Eric Nyberg, Di Wang, Keith Suderman, Marc Verhagen, and Jonathan Wright. The Language Application Grid. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland, May 2014. European Language Resources Association (ELRA).

[11] Nancy Ide, James Pustejovsky, Keith Suderman, and Marc Verhagen. The Language Application Grid Web Service Exchange Vocabulary. In Proceedings of the Workshop on Open Infrastructures and Analysis Frameworks for HLT (OIAF4HLT), Dublin, Ireland, 2014.

[12] Nancy Ide, Keith Suderman, James Pustejovsky, Eric Nyberg, Christopher Cieri, and Marc Verhagen. The Language Application Grid and Galaxy. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia, 2016. European Language Resources Association (ELRA).

[13] Nancy Ide, Keith Suderman, Marc Verhagen, and James Pustejovsky. The Language Application Grid Web Service Exchange Vocabulary. In Revised Selected Papers of the Second International Workshop on Worldwide Language Service Infrastructure - Volume 9442, pages 18–32. Springer-Verlag New York, Inc., 2016.

[14] Peter Leonard. Mining large datasets for the humanities. IFLA WLIC, pages 16–22, 2014.

[15] Johan Oomen, Riste Gligorov, and Michiel Hildebrand. Waisda?: making videos findable through crowdsourced annotations. Crowdsourcing our Cultural Heritage, pages 161–184, 2014.

[16] Bolette Sandford Pedersen, Sussi Olsen, and Lars Borin. Proceedings of the Workshop on Semantic Resources and Semantic Annotation for Natural Language Processing and the Digital Humanities at NODALIDA 2015. Northern European Association for Language Technology, 2015.

[17] Beth Plale, Robert McDonald, Yiming Sun, Inna Kouper, Ryan Cobine, J. Stephen Downie, Beth Sandore Namachchivaya, and John Unsworth. HathiTrust Research Center: computational access for digital humanities and beyond. In Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries, pages 395–396. ACM, 2013.

[18] Susan Schreibman, Ray Siemens, and John Unsworth. A Companion to Digital Humanities. John Wiley & Sons, 2008.

[19] Patrik Svensson. The landscape of digital humanities. Digital Humanities, 2010.

[20] Edward Vanhoutte. The gates of hell: History and definition of digital|humanities|computing. Defining Digital Humanities: A Reader, pages 119–56, 2013.
[21] Marc Verhagen, Keith Suderman, Di Wang, Nancy Ide, Chunqi Shi, Jonathan Wright, and James Pustejovsky. The LAPPS Interchange Format. In Revised Selected Papers of the Second International Workshop on Worldwide Language Service Infrastructure - Volume 9442, pages 33–47. Springer-Verlag New York, Inc., 2016.

[22] Christopher Welty and Nancy Ide. Using the right tools: enhancing retrieval from marked-up documents. Computers and the Humanities, 33(1-2):59–84, 1999.

[23] Jiaan Zeng, Guangchen Ruan, Alexander Crowell, Atul Prakash, and Beth Plale. Cloud computing data capsules for non-consumptive use of texts. In Proceedings of the 5th ACM Workshop on Scientific Cloud Computing, pages 9–16. ACM, 2014.