=Paper=
{{Paper
|id=Vol-2526/short2
|storemode=property
|title=Challenging Knowledge Extraction to Support the Curation of Documentary Evidence in the Humanities (short paper)
|pdfUrl=https://ceur-ws.org/Vol-2526/short2.pdf
|volume=Vol-2526
|authors=Enrico Daga,Enrico Motta
|dblpUrl=https://dblp.org/rec/conf/kcap/DagaM19a
}}
==Challenging Knowledge Extraction to Support the Curation of Documentary Evidence in the Humanities (short paper)==
Challenging knowledge extraction to support the curation of documentary evidence in the humanities

Enrico Daga (enrico.daga@open.ac.uk), Enrico Motta (enrico.motta@open.ac.uk)
The Open University, Milton Keynes, United Kingdom

ABSTRACT

The identification and cataloguing of documentary evidence from textual corpora is an important part of empirical research in the humanities. In this position paper, we ponder the applicability of knowledge extraction techniques to support the data acquisition process. Initially, we characterise the task by analysing the end-to-end process occurring in the data curation activity. After that, we examine general knowledge extraction tasks and discuss their relation to the problem at hand. Considering the case of the Listening Experience Database (LED), we perform an empirical analysis focusing on two roles: the listener and the place. The results show, among other things, how the entities are often mentioned many paragraphs away from the evidence text or are not in the source at all. We discuss the challenges that emerged from the point of view of scientific knowledge acquisition.

CCS CONCEPTS

• Information systems → Information extraction; • Computing methodologies → Information extraction; • Applied computing → Arts and humanities.

KEYWORDS

documentary evidence, knowledge extraction, named entity recognition, DBpedia

Copyright ©2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). In: Proceedings of the Third International Workshop on Capturing Scientific Knowledge (Sciknow), November 19th, 2019. Collocated with the tenth International Conference on Knowledge Capture (K-CAP), Los Angeles, CA, USA.

1 INTRODUCTION

The identification and cataloguing of documentary evidence from textual corpora is an important part of empirical research in the humanities. An increasing number of recent initiatives in the digital humanities have as a primary objective the curation of a database collecting text excerpts augmented with fine-grained metadata, mentioned entities, and their relations, often in the form of knowledge graphs developed adopting the linked data paradigm. These databases are developed following controlled processes, in the spirit of digital library management, where the identification and onboarding of relevant information is substantially entrusted to research students, librarians, and similar domain experts. The Listening Experience Database Project (LED, https://led.kmi.open.ac.uk/), for example, is an initiative aimed at collecting accounts of people's private experiences of listening to music [4]. Since 2012, the LED community has explored a wide variety of sources, collecting over 10,000 unique experiences. These are catalogued through a sophisticated workflow but, more importantly, by means of a rich ontology covering a variety of aspects related to the experience, for example, the time and place it occurred, the source where the evidence has been retrieved, and the entities involved, such as a performer, a composer, or a creative work [1]. Another example is the UK Reading Experience Database (RED), which includes over 30,000 records of reading experiences sourced from English literature. The curatorial effort required to populate these databases was significant, and their size and quality is a major achievement of these projects.

In this position paper we ponder the applicability of knowledge extraction techniques to support the data curation activity. Initially, we introduce the case study and analyse the data curation activity. After that, we examine general knowledge extraction tasks and discuss their relation to the problem at hand. Considering the case of the Listening Experience Database (LED), we perform an empirical analysis of a portion of the database, focusing on the roles "listener" and "place". Specifically, we elaborate on the hypothesis that the related entities can be automatically retrieved from the source. Finally, we discuss a set of challenges for knowledge extraction related to supporting the curation of this type of evidence database.

2 DATA CURATION ACTIVITY

In general, the discovery and selection of documentary evidence is an activity that may not be conducted systematically. However, in the context of enterprises such as the LED project, there is an attempt to objectively select, extract, and curate documentary evidence from texts. From the curator's perspective, it is not about searching archives or repositories but about exploring specific sources of value, for example, specific books. In [8] we developed an approach for retrieving textual excerpts relevant to a certain theme of interest in a book by combining language analysis, entity recognition, and a general-purpose knowledge graph (DBpedia), and showed that many of those pieces of evidence are characterised by implicit information. In addition, once the text is found, populating all the metadata is a long and difficult task.

To illustrate the problem, let's consider two examples from the LED project:

E1: "Music is certainly a pleasure that may be reckoned intellectual, and we shall never again have it in the perfection it is this year, because Mr. Handel will not compose any more! Oratorios begin next week, to my great joy, for they are the highest entertainment to me." [Source: Mary Granville, and Augusta Hall (ed.), Autobiography and Correspondence of Mary Granville, Mrs Delany: with interesting Reminiscences of King George the Third and Queen Charlotte, volume 1 (London, 1861), p. 594. https://led.kmi.open.ac.uk/entity/lexp/1444424772006, accessed 30 September 2019.] The excerpt refers to Mrs Delany's report of a (series of) live performances of Operas and Oratorios by George Frideric Handel, which happened in March 1737.

E2: "I then went to Amsterdam to conduct Oedipus at the Concertgebouw, which was celebrating its fortieth anniversary by a series of sumptuous musical productions. The fine Concertgebouw orchestra, always at the same high level, the magnificent male choruses from the Royal Apollo Society, soloists of the first rank - among them Mme Hélène Sadoven as Jocasta, Louis van Tulder as Oedipus, and Paul Huf, an excellent reader - and the way in which my work was received by the public, have left a particularly precious memory that I recall with much enjoyment." [Source: Igor Stravinsky, Igor Stravinsky: An Autobiography (1936), p. 139. https://led.kmi.open.ac.uk/entity/lexp/1435674909834, accessed 30 September 2019.] Stravinsky, at the beginning of 1928, celebrates the high level of the Concertgebouw orchestra and the singers performing his Oedipus Rex. All of them are listed as entities in the LED database.
In both examples, several of the entities involved are not mentioned in the excerpt and are derived from the curator's knowledge of the source (for example, Mrs Delany is the author of the letter in E1) and of the domain (e.g. the full name of the work in E2 is Oedipus Rex).

Here we focus on the challenge of automatically populating the record and supporting an expert in identifying, collecting and inputting the relevant information. In other words, we aim at automatically populating (as many as possible of) the roles of the ontology. For instance, a listening experience specification can be derived from the graph available on data.open.ac.uk [7]. The type ListeningExperience includes the following properties, among others (we omit namespaces for readability):

• agent (who is the listener)
• time (when the listening event occurred)
• place (where it occurred)
• subject (what was listened to)
• is_reported_in (a link to the source)
• has_environment (e.g. was it a public or a private place, indoor or outdoor)

A ListeningExperience is related to other relevant items, notably Performance, WrittenWork, MusicArtist, and Country. The knowledge extraction system should be able to derive the requirements from the ontology specification, primarily the data values and roles involved. For example, it should derive the requirement to find the agent of the ListeningExperience, its place and time, and that there may be a specific musical work to be identified and, eventually, the author of the musical work, filling the roles associated to the path subject -> ?a Performance -> performance_of -> ?a MusicExpression.

3 KNOWLEDGE EXTRACTION

Knowledge extraction is a branch of artificial intelligence covering a variety of tasks related to the automatic or semi-automatic derivation of formal symbolic knowledge from unstructured or semi-structured sources (for a general introduction, see [16]). The area comprises research on a variety of problems related to lifting an unstructured or semi-structured source into an output described using a knowledge representation formalism. Entity extraction and classification are two related tasks referring to the location of mentions of entities in an input text and their categorization, as in the following example: "We went to the rehearsal of Joshua[Person] last Tuesday[Time]". Entity linking, instead, refers to finding mentions of entities from a database in a natural language resource or, similarly, to appropriately disambiguating words by associating them with a knowledge base identifier. Often, the three tasks are performed together and labelled Named Entity Recognition and Classification (NERC) [12]. Linked Data and NER together have been extensively employed in a number of knowledge extraction and data mining tasks (e.g., the work of H. Paulheim [21]). Relation extraction refers to the identification of n-ary relations (for n ≥ 2) within the source, usually addressed with a combination of NLP and machine learning techniques [22]. The relations Composer(Oedipus Rex, Stravinsky) and Performed(Oedipus Rex, Concertgebouw, 1928) are two examples. Event extraction is a special case of relation extraction where the focus is on identifying an event, usually an action being performed by an agent in a certain setting. This task is extensively studied in domains such as Biomedicine [5], Finance and Politics [15], and Science [26]. Approaches dedicated to the detection and extraction of historical and biographical events are designed in [25, 29]. The notion of event is generally considered as something happening at a specific time and place which constitutes an incident of substantial relevance [14]. Therefore, the objective is to identify the action triggering the event (e.g. the verb perform) and then the associated roles. Data-driven approaches usually involve statistical reasoning or probabilistic methods such as machine learning techniques. In contrast, knowledge-based methods are generally top-down and based on pre-defined templates, for example, lexico-semantic patterns [15]. The two approaches can be combined, with machine learning methods used to learn such patterns [23]. However, the notion of event is still ill-defined in NLP research, and this makes it hard to develop methods which are effectively portable to multiple domains [14]. Research in open-domain event extraction focuses essentially on social media data [24], where the task is the extraction of statements for summarisation purposes, similar to that of key-phrase extraction [28]. Ontology-based information extraction (OBIE) uses formal ontologies to guide the extraction process [17, 27]. Relevant work in the area is surveyed in [9, 19]. In 2013, Gangemi provided an introduction to and comparison of fourteen tools for knowledge extraction over unstructured corpora, where the task is defined as general-purpose machine reading [10]. A machine reader transforms a natural language text into formal knowledge, according to a shared semantics. State-of-the-art methods include FRED [11] and PIKES [6]. These approaches are based on a frame-based semantics that is at the same time domain- and task-independent. Instead, a domain-oriented solution would identify knowledge components of interest in the text, similarly to what was explored, for example, in the work of Alani [3]. This task is also considered as automatic ontology instantiation [2] or semi-automatic creation of metadata [13]. A suitable approach should be able to detect the requirements from a domain-specific ontology and, having as input the text excerpt, the source metadata, and potentially other knowledge bases, generate suitable hypotheses of values and entities for any relevant role.
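To make the entity extraction, classification, and linking tasks above concrete, the following is a minimal gazetteer-based sketch in Python. It is illustrative only: production systems such as DBpedia Spotlight add statistical disambiguation, and the DBpedia URIs in the lookup table are assumptions made for the example, not curated links.

```python
# Minimal gazetteer-based sketch of NERC + entity linking: locate
# entity mentions, assign a type, and attach a knowledge-base
# identifier. The DBpedia URIs below are illustrative assumptions.

# Gazetteer: surface form -> (entity type, knowledge-base identifier)
GAZETTEER = {
    "Stravinsky": ("Person", "http://dbpedia.org/resource/Igor_Stravinsky"),
    "Concertgebouw": ("Place", "http://dbpedia.org/resource/Concertgebouw"),
    "Amsterdam": ("Place", "http://dbpedia.org/resource/Amsterdam"),
}

def annotate(text):
    """Return (mention, type, uri, offset) for every gazetteer hit, in text order."""
    annotations = []
    for surface, (etype, uri) in GAZETTEER.items():
        start = text.find(surface)
        while start != -1:  # collect every occurrence of this surface form
            annotations.append((surface, etype, uri, start))
            start = text.find(surface, start + len(surface))
    return sorted(annotations, key=lambda a: a[3])

sentence = "I then went to Amsterdam to conduct Oedipus at the Concertgebouw."
for mention, etype, uri, offset in annotate(sentence):
    print(f"{offset:3d}  {mention:13s}  {etype:6s}  {uri}")
```

A dictionary lookup like this finds and classifies mentions but cannot disambiguate them (which "Amsterdam"?), which is exactly the gap that knowledge-base-driven linkers fill.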
Listing 1: Detect the location of an excerpt in a source.

  input: excerpt, Source
  best[t, b, e, s]                      // text, begin, end, score
  words[] = tokenize(excerpt)
  words[] = sortByLengthDesc(words[])   // longest on top
  Foreach word in words[]:
      occurrences[][b, e] = find(word, Source)
      position[b, e] = find(word, excerpt)
      Foreach occurrence[b, e] in occurrences[][b, e]:
          begin = occurrence.b - position.b
          end = occurrence.e + len(excerpt) - position.e
          possible = substring(Source, begin, end)
          score = levenshtein(excerpt, possible)
          if (score < best[s]):
              best[t, b, e, s] = [possible, begin, end, score]
      End
  End
  return best

Figure 1: Statistics on the LED database to illustrate the coverage of DBpedia entities and the scope of our analysis. [Figure omitted.]

Figure 2: Distance of entity mention, in paragraphs. (a) Places. (b) Agents. [Figure omitted.]

4 EMPIRICAL ANALYSIS

To discuss the feasibility and difficulty of the task, we relax the problem and verify to what extent the entities that are part of the curated metadata could potentially be automatically derived from the sources. Specifically, we want to answer the following questions: (Q1) Could a system find the target entities in the excerpt? (Q2) Could a system find the target entities in the text surrounding the excerpt? (Q3) How far from the excerpt is the entity? (Q4) Could it be found in the metadata of the source?

We consider the case of the LED database and focus on two relations and roles: the listener and the place of the listening event. The LED curation workflow reuses entities from DBpedia and MusicBrainz, and also defines new entities in the Linked Data. Our analysis is limited to books from archive.org annotated with a listener or place from DBpedia. We use DBpedia Spotlight [20] as the entity recognition and linking system.
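For reference, the pseudocode of Listing 1 can be transcribed into runnable Python roughly as follows. This is a sketch under stated assumptions: whitespace tokenisation, a plain dynamic-programming Levenshtein function in place of an edit-distance library, and a cap on the number of anchor words (`max_anchors` is a parameter we introduce for the example).

```python
# Runnable transcription of Listing 1: locate an (edited) excerpt in a
# larger source text. The longest words of the excerpt act as anchors;
# each anchor occurrence in the source opens a candidate window of the
# excerpt's length, and candidates are ranked by Levenshtein distance.

def levenshtein(a, b):
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def locate_excerpt(excerpt, source, max_anchors=5):
    """Return (text, begin, end, score) for the best-matching window."""
    words = sorted(set(excerpt.split()), key=len, reverse=True)  # longest on top
    best = (None, -1, -1, float("inf"))
    for word in words[:max_anchors]:
        pos = excerpt.find(word)               # anchor offset inside the excerpt
        start = source.find(word)
        while start != -1:                     # every occurrence in the source
            begin = max(0, start - pos)        # align the window with the excerpt
            candidate = source[begin:begin + len(excerpt)]
            score = levenshtein(excerpt, candidate)
            if score < best[3]:
                best = (candidate, begin, begin + len(candidate), score)
            start = source.find(word, start + 1)
    return best
```

Ranking by edit distance is what makes the method robust to the omissions and rephrasings described in the next section: an exact-match search would fail, while the best-scoring window still aligns with the original passage.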
First, we need to find the position of the evidence text in the original source. Identifying the position of LED items in the original book is not an easy task. In fact, the process of reporting an excerpt from a book involves a number of modifications in the format, which makes it very rare for a precise text match to work. In addition, the reported text often includes omissions or rephrasing in order to include co-references derived from previous paragraphs. To solve the problem, we developed the algorithm presented in Listing 1. The method is based on using the longest words as locators. The algorithm selects the occurrences of the longest words and isolates the surrounding portion of text, using the length of the excerpt as a heuristic. The resulting candidates are then ranked according to their similarity to the excerpt using the well-known Levenshtein distance [18]. The candidate with the lowest score is elected as the original text.

Figure 1 illustrates the features of the corpus. Of the 9059 listening experiences in the database with a textual excerpt reported, 7999 include a place (88.3%), and in 7222 of them the place points to DBpedia (79.9%). The agent is specified in 8258 of them (91.2%), but only 2996 refer to a DBpedia entity (33.1%). In all other cases the listener is created as a novel entity. 64.8% of the listeners are also the authors of the text (5874 cases). This is not surprising, as among the most researched types of resources were memoirs, diaries, and collections of letters. In addition, this answers our Q4 and shows how important it could be to intelligently derive information from the source metadata. However, less than half of the agents exist in DBpedia (2130 times, 23.5% of the total). Finally, only 11.3% of the sources could be retrieved as open texts, covering 1026 of the documentary evidence records in the database. Of these, 7.3% include DBpedia entities as place or agent: 690 excerpts from 26 books. These are the objects of our analysis.

Results are summarised in Figure 2. The charts display the distance of the entity mentions, measured in number of paragraphs (text segmentation is itself a difficult task: in our analysis we measured distances in number of characters, considering one word to be 5 characters, the approximate average word length in English, and one paragraph to amount to 200 words). This analysis is partial, as it only covers DBpedia entities used as places or agents (listeners) in relation to books whose sources we could retrieve from the Web. However, the answers to the remaining questions are quite interesting. (Q1) The DBpedia place was mentioned in the textual excerpt only in 25.9% of the observed cases (179). The listener was mentioned in the excerpt in only 13 cases, 13.4% of the observed population (97). (Q2) In 10% of the cases the place mention is less than 5 paragraphs from the evidence text. The agent is mentioned within 5 paragraphs of the evidence in 4% of the observed cases. (Q3) In 83.2% of the cases the DBpedia place was explicitly mentioned at least once in the source (574). In 79 cases (11.4%) the place was found neither in the excerpt nor anywhere else in the source. A similar result is observable for agents. Overall, there is a good chance that the entity is somewhere far away from the evidence text.
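The paragraph-distance measurement described above is easy to reproduce; a minimal sketch of the heuristic, using the two constants stated in the analysis (one word taken as 5 characters, one paragraph as 200 words):

```python
# Paragraph-distance heuristic used in the analysis: distances are
# measured in characters, then converted assuming one word is about
# 5 characters and one paragraph about 200 words (i.e. one paragraph
# is about 1000 characters).

CHARS_PER_WORD = 5        # approximate average word length in English
WORDS_PER_PARAGRAPH = 200

def paragraph_distance(offset_a, offset_b):
    """Distance between two character offsets, in approximate paragraphs."""
    chars_per_paragraph = CHARS_PER_WORD * WORDS_PER_PARAGRAPH
    return abs(offset_a - offset_b) / chars_per_paragraph

# A mention 4500 characters before or after the evidence text is
# about 4.5 paragraphs away.
print(paragraph_distance(120_000, 124_500))  # 4.5
```

The conversion is deliberately rough: it avoids segmenting the source into real paragraphs, which is itself a difficult task, at the cost of treating all prose as uniformly dense.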
5 CHALLENGES

There are several aspects that make the task of automatically supporting the acquisition of knowledge about documentary evidence particularly interesting from the point of view of scientific knowledge acquisition.

An important characteristic is the amount of implicit information, necessary to characterise the documentary evidence, that is not derivable from the reference text. As a result, a typical knowledge extraction approach may fail at performing an inference that is normally the result of the user's expertise. A domain-independent machine reader could produce a formal representation of the text with entities and roles linked together. Theoretically, processing a text through a machine reading system would reduce the problem to one of ontology alignment. However, as we have seen, the needed entities may not be mentioned in the text excerpt at a reasonable proximity. In addition, having to deal with an ontology alignment problem does not necessarily reduce the distance to the goal.

Crucially, metadata about the sources should be used to derive information such as the time span of the documentary material or information about the author(s). Determining who is the person reporting the event could contribute to populating the agent role (for first-person reports) but also to deriving more contextual information, for example, related to the historical period or the interests of the author. Linking an author to a knowledge graph (such as DBpedia) could provide insight into the validity of the hypotheses for assigning certain roles, for example, by deriving that Stravinsky is the author of Oedipus Rex (E2). Therefore, a general solution should be able to reason upon contextual knowledge. Intuitively, the system should be capable of fitting within the constraints of the domain-specific ontology and exploit it to tailor the approach. The ontology specification would provide information about the main types and relations of interest, and those can be used to derive contextual information from existing common-sense knowledge bases (e.g. ConceptNet, http://conceptnet.io/).

Although it may seem that these databases have a limited domain of interest, there is little chance that the variety of useful types and entities could be found in a single, encyclopedic knowledge base. In the case of the LED project, part of the Linked Open Data, the documentary evidence links to a variety of external resources (e.g. MusicBrainz, https://musicbrainz.org/, and GeoNames, https://www.geonames.org/). The system should be able to work across distributed and heterogeneous datasets in search of relevant resources. These may include common-sense knowledge and linguistic resources, textual corpora, gazetteers, thesauri, and specialised digital libraries. The system should be able to recognise entities and their roles despite the fact that they can be linked to any reference database.

Ultimately, cultural studies like the ones performed in the LED and RED projects often coin novel concepts, such as Listening Experience, whose structure and features cannot be found in pre-existing databases. In fact, the definition of a concept of interest is itself a scientific output, for which the database constitutes the empirical proof of relevance to scholarship in the related field. It is an open question to what extent learning from one such database could help in supporting a new one.

REFERENCES

[1] Alessandro Adamou, Mathieu d'Aquin, Helen Barlow, and Simon Brown. 2014. LED: curated and crowdsourced linked data on music listening experiences. In Proceedings of the ISWC 2014 Posters & Demonstrations Track.
[2] Harith Alani, Sanghee Kim, David E. Millard, Mark J. Weal, Wendy Hall, Paul H. Lewis, and Nigel Shadbolt. 2003. Web based knowledge extraction and consolidation for automatic ontology instantiation.
[3] Harith Alani, Sanghee Kim, David E. Millard, Mark J. Weal, Wendy Hall, Paul H. Lewis, and Nigel R. Shadbolt. 2003. Automatic ontology-based knowledge extraction from web documents. IEEE Intelligent Systems 18, 1 (2003), 14–21.
[4] Helen Barlow and David Rowland. 2017. Listening to music: people, practices and experiences.
[5] Jari Björne, Filip Ginter, Sampo Pyysalo, Jun'ichi Tsujii, and Tapio Salakoski. 2010. Complex event extraction at PubMed scale. Bioinformatics 26, 12 (2010).
[6] Francesco Corcoglioniti, Marco Rospocher, and Alessio Palmero Aprosio. 2016. Frame-based ontology population with PIKES. IEEE Transactions on Knowledge and Data Engineering 28, 12 (2016), 3261–3275.
[7] Enrico Daga, Mathieu d'Aquin, Alessandro Adamou, and Stuart Brown. 2016. The Open University Linked Data - data.open.ac.uk. Semantic Web 7, 2 (2016).
[8] Enrico Daga and Enrico Motta. 2019. Capturing themed evidence, a hybrid approach. In 10th International Conference on Knowledge Capture (K-CAP). ACM, to appear.
[9] Dejing Dou, Hao Wang, and Haishan Liu. 2015. Semantic data mining: A survey of ontology-based approaches. In Proceedings of the 2015 IEEE 9th International Conference on Semantic Computing (IEEE ICSC 2015). IEEE, 244–251.
[10] Aldo Gangemi. 2013. A comparison of knowledge extraction tools for the semantic web. In Extended Semantic Web Conference. Springer, 351–366.
[11] Aldo Gangemi, Valentina Presutti, Diego Reforgiato Recupero, Andrea Giovanni Nuzzolese, Francesco Draicchio, and Misael Mongiovì. 2017. Semantic web machine reading with FRED. Semantic Web 8, 6 (2017), 873–893.
[12] Archana Goyal, Vishal Gupta, and Manish Kumar. 2018. Recent named entity recognition and classification techniques: a systematic review. Computer Science Review 29 (2018), 21–43.
[13] Siegfried Handschuh, Steffen Staab, and Fabio Ciravegna. 2002. S-CREAM: semi-automatic creation of metadata. In International Conference on Knowledge Engineering and Knowledge Management. Springer, 358–372.
[14] Frederik Hogenboom, Flavius Frasincar, Uzay Kaymak, Franciska De Jong, and Emiel Caron. 2016. A survey of event extraction methods from text for decision support systems. Decision Support Systems 85 (2016), 12–22.
[15] Wouter IJntema, Jordy Sangers, Frederik Hogenboom, and Flavius Frasincar. 2012. A lexico-semantic pattern language for learning ontology instances from text. Web Semantics: Science, Services and Agents on the World Wide Web 15 (2012).
[16] Jing Jiang. 2012. Information extraction from text. In Mining Text Data. Springer.
[17] Vangelis Karkaletsis, Pavlina Fragkou, Georgios Petasis, and Elias Iosif. 2011. Ontology based information extraction from text. In Knowledge-driven Multimedia Information Extraction and Ontology Evolution. Springer, 89–109.
[18] Vladimir I. Levenshtein. 1966. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet Physics Doklady, Vol. 10. 707–710.
[19] Jose L. Martinez-Rodriguez, Aidan Hogan, and Ivan Lopez-Arevalo. 2018. Information extraction meets the semantic web: a survey. Semantic Web (2018).
[20] Pablo N. Mendes, Max Jakob, Andrés García-Silva, and Christian Bizer. 2011. DBpedia Spotlight: shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems. ACM, 1–8.
[21] Heiko Paulheim. 2013. Exploiting Linked Open Data as Background Knowledge in Data Mining. DMoLD 1082 (2013).
[22] Sachin Pawar, Girish K. Palshikar, and Pushpak Bhattacharyya. 2017. Relation extraction: A survey. arXiv preprint arXiv:1712.05191 (2017).
[23] Jakub Piskorski, Hristo Tanev, and Pinar Oezden Wennerberg. 2007. Extracting violent events from on-line news for ontology population. In International Conference on Business Information Systems. Springer, 287–300.
[24] Alan Ritter, Oren Etzioni, Sam Clark, et al. 2012. Open domain event extraction from Twitter. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1104–1112.
[25] Roxane Segers, Marieke Van Erp, Lourens Van Der Meij, Lora Aroyo, Jacco van Ossenbruggen, Guus Schreiber, Bob Wielinga, Johan Oomen, and Geertje Jacobs. 2011. Hacking history via event extraction. In Proceedings of the Sixth International Conference on Knowledge Capture. ACM.
[26] Maria Vargas-Vera and David Celjuska. 2004. Event recognition on news stories and semi-automatic population of an ontology. In IEEE/WIC/ACM International Conference on Web Intelligence (WI'04). IEEE, 615–618.
[27] Daya C. Wimalasuriya and Dejing Dou. 2010. Ontology-based information extraction: An introduction and a survey of current approaches.
[28] Ian H. Witten, Gordon W. Paynter, Eibe Frank, Carl Gutwin, and Craig G. Nevill-Manning. 2005. KEA: Practical automated keyphrase extraction. In Design and Usability of Digital Libraries: Case Studies in the Asia Pacific. IGI Global, 129–152.
[29] Kalliopi Zervanou, Ioannis Korkontzelos, Antal Van Den Bosch, and Sophia Ananiadou. 2011. Enrichment and structuring of archival description metadata. In Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities. ACL.