Towards Semantic Enrichment of Newspapers: A Historical Ecology Use Case

Marieke van Erp¹, Thomas van Goethem², Katrien Depuydt³, and Jesse de Does³

¹ Vrije Universiteit Amsterdam
marieke.van.erp@vu.nl
² Radboud University Nijmegen
tgoethem@science.ru.nl
³ Instituut voor de Nederlandse Taal
{Katrien.Depuydt,Jesse.Dedoes}@ivdnt.org

Abstract. Historical ecology research relies on historical accounts of human-animal interactions to study these interactions through space and time. Newspaper archives are a rich source of information, but require careful querying and filtering to collect the relevant information. Traditionally, this is a laborious manual task. In this position paper, we describe our ongoing work on semantically enriching a newspaper collection to create a knowledge base to support historical ecological work.

Keywords: text enrichment, historical ecology, lexicology

1 Introduction

Historical research often relies on manual inspection of documents. Historical ecology investigates, among other things, the occurrence of particular animals in a specific region over time. When using a large newspaper corpus, this means sifting through thousands of documents to decide whether each one is relevant, before the researcher can even begin to analyse the content in more detail. In this position paper, we present ongoing work on the creation of a knowledge base that provides a semantic enrichment layer over a large newspaper corpus. We highlight the challenges we discovered in our first data analysis as well as the solutions we intend to implement to resolve them.

2 Background and Related Work

Historically, humans have had an ambivalent relationship with animals, perceiving animals not only as sources of food, tools or totems, but also as threats and nuisances. Many birds, small mammals and insects were believed to carry diseases or to be harmful to crops or livestock. Furthermore, large predatory species (e.g.
wolf) or venomous species (e.g. viper) were feared for injuring or killing humans [1]. These perceived threats have led to a ‘cultural fear’ of pest species, which has been reinforced through storytelling and mythology [2]. Recently, our relationship to many of these so-called “vermin” species has changed. Some species are now valued as key species in nature rehabilitation, while others are being reintroduced to our country. It is therefore becoming increasingly relevant to understand how these historical relationships between man and nature relate to the present time.

A. Adamou, E. Daga and L. Isaksen (eds.)

A comprehensive historical study on pest and nuisance species is lacking for the Netherlands. Newspapers reporting on interactions with pest and nuisance species may be an important source of information for such a study. Currently, the majority of such newspaper analyses are done manually. They involve sending a query to the newspaper interface, opening every article link, reading the article and recording whether it is relevant to the research question. We propose to automate the classification of newspaper articles and to store the results in a knowledge base that contains structured, semantic information about and extracted from the articles, along with a link to the original articles,⁴ to help researchers focus their time on a deeper analysis of the relevant articles.

3 Resources

We use a mix of unstructured and structured resources. The newspaper corpus is the main information source, while structured resources are used to query the newspapers systematically and to inform the language technology tools.
National Library Newspaper Corpus

The Dutch National Library has made available the original texts from 1.3 million newspapers, 1.5 million magazine pages and 320,000 books from the 15th to the 21st century through the Delpher portal.⁵ We focus on newspaper articles published between 1800 and 1940, for two reasons: 1) the OCR quality of these is most likely better than that of the older material, and 2) this period also saw the “biological reveil”, a reawakening of interest in biology in the Netherlands, which may also be reflected in mentions of animals in newspapers [3].

Taxonomic Resources and Lexicons

We use a list of pest and nuisance species compiled in the ATHENA project,⁶ which provides the Latin name of each species and its common vernacular name. However, due to the local and temporal variance in animal names, we also employ diachronic lexicons that each contain Dutch language variations across time and dialects⁷ ⁸ [4].

⁴ Due to copyright restrictions it is not possible to include all article texts in the knowledge base, but the articles are freely accessible through the Dutch National Library newspaper portal.
⁵ http://www.delpher.nl
⁶ http://www.athena-research.org/
⁷ http://ivdnt.org/onderzoek-a-onderwijs/projecten/gigant
⁸ http://ivdnt.org/onderzoek-a-onderwijs/projecten/diamant

2nd Workshop on Humanities in the Semantic Web (WHiSe 2017)

Fig. 1. SERPENS Workflow

4 Design of the Knowledge Base

We aim to create a knowledge base containing information about animal reports. We start by broadly querying the National Library Newspaper Corpus through the taxonomic and diachronic resources (Step (1) in Figure 1). This returns many articles from the newspaper collection that are not necessarily about animals, for example articles mentioning persons with the last name “De Wolf” (2). We therefore employ document classification to filter out irrelevant documents (3). Our initial analysis on this is presented in Section 5.
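As a minimal sketch of why the broad query in Steps (1) and (2) over-generates, the Python fragment below matches the ‘bunzing’ variants from Table 1 against article text. The variant list comes from Table 1; the function name `find_mentions`, the prefix-matching rule and the sample sentences are illustrative assumptions, not the SERPENS implementation or the Delpher query interface.

```python
import re

# Spelling and morphological variants of 'bunzing' as listed in Table 1.
VARIANTS = ["bunzing", "bunsing", "bonzing", "bunzel", "bunsel"]

# One case-insensitive pattern. \b anchors each match at a word start;
# the trailing \w* also catches compounds such as 'bunzingklemmen'.
PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(v) for v in VARIANTS) + r")\w*",
    re.IGNORECASE,
)


def find_mentions(text):
    """Return every token in `text` that starts with one of the variants."""
    return PATTERN.findall(text)


if __name__ == "__main__":
    # Hypothetical sample sentences, not taken from the corpus.
    print(find_mentions("De Bunsing en twee bunzingklemmen."))
    print(find_mentions("Brief van de heer Bunzing."))
```

A surname hit such as ‘de heer Bunzing’ is matched in exactly the same way as a genuine animal mention, which is why string matching alone is not enough and the classification step (3) is needed.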
Simply obtaining a set of relevant documents (4) is already useful to ecological historians, but we would like to dive further into the documents and classify what type of animal report each document is (5). We will investigate what level of specificity the tools can handle. This results in a knowledge base that contains document classifications, links to the Delpher sources, animal mentions and their spelling variations, document metadata such as publication date and article length, and factual information extracted from the documents (6). New mentions will be fed back to enrich the lexicons. The knowledge base enables humanities and biology researchers to study pest and nuisance species across species, space and time (7).

5 European polecats and lynxes

As a first use case, we chose to investigate documents mentioning ‘bunzing’ (European polecat) and ‘lynx’ (lynx). These two species serve as a first query on the database, returning relatively modest result sets (2,515 and 5,530 documents respectively) that show a wide variety of topics in the documents.

Language variation

As our research covers a relatively long period of time, as well as a corpus that contains a considerable number of local newspapers, we expect to find a fair amount of language variation. Indeed, through the diachronic lexicons as well as the returned hits, we find a reasonable set of terms to expand our query with. Table 1 lists the query terms employed for “bunzing” (European polecat), the type of language variation they express and, in the last column, the number of hits in the corpus.

Table 1.
Bunzing variations found through the lexicon

Term         Type                      Hits
Bunzing      base name                 2,515
Bunsing      spelling variation        1,319
Bonzing      form variation            47
Bunzel       morphological variation   617
Bunsel       morphological variation   67
Ulk          synonym                   21,993⁹
Ulling       synonym                   1,153
Fret¹⁰       related                   25,830¹¹
Eierdief¹²   hyperonym                 98

Document categories

The hypothesis to be tested is that the negative perception of native species has subsided with the passing of time, while it has grown for invasive (non-native) species. Categorization helps to dissect the newspaper corpus, making it easier to measure perception and understand its determinants. The categories have been chosen to be broad enough to apply to a wide range of species over an extended period of time, but specific enough for a meaningful analysis.

Initially, we set out to classify whether a document is about the animal or not. However, upon inspecting the document sets, we discovered that documents that are not directly about the animal may also be informative and useful to include in the research on species perceptions. For example, an uptick in ads for ‘bunzinghonden’ (dogs used in the hunt for polecats) or ‘bunzingklemmen’ (polecat traps) can indicate that the species was a nuisance and that people wanted to get rid of it. Furthermore, figurative language use that has a positive or negative connotation (‘eyes like a lynx’ or ‘stinks like a polecat’) also says something about the public perception of a species.

After annotating about 100 examples with three annotators (a historical ecologist, a lexicologist and a computational linguist), we arrived at the following categories, which largely correspond to categories described in [1, 5].

Natural history: General articles about the animal, e.g. that it subsists on birds, or that x number were stuffed and became part of a museum collection
Nuisance, material damages: The article mentions the animal as causing material damages, e.g.
beetles damaging crops or lynxes killing chickens
Nuisance, immaterial damages: The article mentions the animal as a nuisance without material damages, e.g. polecats found to walk over someone’s face whilst they were in bed, or (possibly irrational) fear of a certain animal
Pest control: Organised hunt to bring down the number of pest species, e.g. an ad for hunting dogs
Hunt for economic reasons: Hunting to use the fur, meat or other parts of the animal, e.g. an article mentioning that the hunting season has started again
Prevention: Non-lethal actions against pest species, e.g. advice in the newspaper on which plants keep away pest species
Accidents: Mention of an unintentional encounter with the animal, e.g. roadkill
Figurative: Figurative language featuring the animal, e.g. eyes like a lynx
Other: Articles not pertaining to the animal, e.g. a ship named ‘Lynx’ or a person whose last name is ‘Bunzing’

⁹ ‘Ulk’ was a German satirical magazine whose cartoons were sometimes republished in Dutch newspapers.
¹⁰ Fret (ferret) is a domesticated polecat.
¹¹ We found OCR errors that map ‘het’ (‘the’) to ‘fret’, further stressing the need for OCR correction and automatic document classification to yield relevant documents.
¹² ‘egg thief’

Fact and Fiction

Another interesting dimension of the dataset is that it does not only cover ‘news’ but also other types of texts. Currently, the National Library corpus distinguishes four types of documents in its newspaper corpus: article, ad, announcement and illustration. In our result set, we also encounter crossword puzzles, feuilletons, poems and cartoons, which are all classified as ‘article’ in the metadata. This is understandable as the newspaper corpus has been processed largely automatically, but for our purposes it makes sense to distinguish at least between text with an ‘imaginative’ primary aim (e.g.
fiction) and text with an informative primary aim (non-fiction), where we put crossword puzzles in the imaginative category. We are annotating the articles with these classes, and intend to train a classifier on these annotations to detect the categories automatically. We expect crosswords to be the easiest to detect (using features such as the occurrence of the words ‘horizontal’ and ‘vertical’ and of numbers), whereas jokes are more difficult to identify automatically, e.g.:

Guest: “Could you perhaps bring me a ferret?”
Waitress: “Why would you want one?”
Guest: “Perhaps it could find the hare that is hidden in this jugged hare”¹³

¹³ Arnhemsche Courant, 09-01-1926, http://resolver.kb.nl/resolve?urn=MMKB08:000106191:mpeg21:a0117

Document quality

It is well known that Optical Character Recognition is not perfect, especially on older documents [6]. As a first attempt to identify which documents are of highest and lowest quality, we compare each OCRed text to a historical lexicon of known words and return the percentage of words recognised (cf. also [7, 8] for a lexical and a geometrical approach to quality assessment).

6 Discussion and Future Work

Semantic Web research revolves around structured data, but for humanities researchers, text is often the core source of investigation. We argue that some of the manual data collection work for humanities researchers can be alleviated through semantic enrichment of the texts. We propose to use a combination of language technology and structured resources to create a knowledge base as a more sophisticated entry point to collections. However, working with historical textual collections is not without challenges; in this contribution, we have identified historical language variation, document classification, and document quality as major problems to overcome.
By bringing together the knowledge of historical ecology, (historical) lexicography and computational linguistics, we believe we are in the best position to address these issues.

The knowledge base we are creating will be published through Timbuctoo.¹⁴ Besides data access via SPARQL through an API, it gives humanities researchers a way to access the data without programming. Currently, ports to visualisation tools such as Gephi¹⁵ are being built. The annotated data and experiments thus far can be found at: http://www.github.com/clariah/serpens.

Acknowledgements

The research for this paper was made possible by the CLARIAH-CORE project financed by NWO: http://www.clariah.nl

References

1. Lenders, H.J.R.: Ten a penny? Deadly viper bites in the Netherlands in a socio-economic perspective. Litteratura Serpentium 34 (2014) 290–316
2. Lenders, H.J.R., Janssen, I.A.W.: The grass snake and the basilisk: From pre-Christian protective house god to the Antichrist. Environment and History 20 (2014) 319–346
3. van Berkel, K.: Vóór Heimans en Thijsse: Frederik van Eeden sr. en de natuurbeleving in negentiende-eeuws Nederland. Volume 63. Koninklijke Nederlandse Akademie van Wetenschappen (2006)
4. Maks, I., van Erp, M., Vossen, P., Hoekstra, R., van der Sijs, N.: Integrating diachronous conceptual lexicons through linked open data. Presented at DHBenelux 2016 (9-10 June 2016)
5. Dirke, K.: Where is the big bad wolf? Notes and narratives on wolves in Swedish newspapers during the eighteenth and nineteenth centuries. In Masius, P., Sprenger, J., eds.: A fairy tale in question. Historical interactions between humans and wolves. The White Horse Press, Cambridge (2015) 101–118
6. Reynaert, M.: Non-interactive OCR post-correction for giga-scale digitization projects. In: CICLing, Springer (2008) 617–630
7. Springmann, U., Fink, F., Schulz, K.U.: Automatic quality evaluation and (semi-) automatic improvement of mixed models for OCR on historical documents.
CoRR abs/1606.05157 (2016)
8. Gupta, A., Gutierrez-Osuna, R., Christy, M., Capitanu, B., Auvil, L., Grumbach, L., Furuta, R., Mandell, L.: Automatic assessment of OCR quality in historical documents. In: Proc. AAAI. Volume in press. (2015)

¹⁴ https://github.com/HuygensING/timbuctoo
¹⁵ https://gephi.org/