=Paper= {{Paper |id=Vol-2526/short2 |storemode=property |title=Challenging Knowledge Extraction to Support the Curation of Documentary Evidence in the Humanities (short paper) |pdfUrl=https://ceur-ws.org/Vol-2526/short2.pdf |volume=Vol-2526 |authors=Enrico Daga,Enrico Motta |dblpUrl=https://dblp.org/rec/conf/kcap/DagaM19a }} ==Challenging Knowledge Extraction to Support the Curation of Documentary Evidence in the Humanities (short paper)== https://ceur-ws.org/Vol-2526/short2.pdf
                Challenging knowledge extraction to support
           the curation of documentary evidence in the humanities

Enrico Daga (enrico.daga@open.ac.uk), The Open University, Milton Keynes, United Kingdom
Enrico Motta (enrico.motta@open.ac.uk), The Open University, Milton Keynes, United Kingdom

Copyright ©2019 for this paper by its authors. Use permitted under Creative Commons
License Attribution 4.0 International (CC BY 4.0).
In: Proceedings of the 3rd International Workshop on Capturing Scientific Knowledge
(Sciknow), November 19th, 2019. Collocated with the tenth International Conference on
Knowledge Capture (K-CAP), Los Angeles, CA, USA.

ABSTRACT
The identification and cataloguing of documentary evidence from textual corpora is an
important part of empirical research in the humanities. In this position paper, we ponder
the applicability of knowledge extraction techniques to support the data acquisition
process. Initially, we characterise the task by analysing the end-to-end process occurring
in the data curation activity. After that, we examine general knowledge extraction tasks
and discuss their relation to the problem at hand. Considering the case of the Listening
Experience Database (LED), we perform an empirical analysis focusing on two roles: the
listener and the place. The results show, among other things, how the entities are often
mentioned many paragraphs away from the evidence text or are not in the source at all. We
discuss the challenges that emerged from the point of view of scientific knowledge
acquisition.

CCS CONCEPTS
• Information systems → Information extraction; • Computing methodologies → Information
extraction; • Applied computing → Arts and humanities.

KEYWORDS
documentary evidence, knowledge extraction, named entity recognition, DBpedia

1     INTRODUCTION
The identification and cataloguing of documentary evidence from textual corpora is an
important part of empirical research in the humanities. An increasing number of recent
initiatives in the digital humanities have as their primary objective the curation of a
database collecting text excerpts augmented with fine-grained metadata, mentioned entities,
and their relations, often in the form of knowledge graphs developed adopting the linked
data paradigm. These databases are developed following controlled processes, in the spirit
of digital library management, where the identification and onboarding of relevant
information is substantially entrusted to research students, librarians, and similar domain
experts. The Listening Experience Database Project (LED)1, for example, is an initiative
aimed at collecting accounts of people’s private experiences of listening to music [4].
Since 2012, the LED community has explored a wide variety of sources, collecting over
10,000 unique experiences. These are catalogued through a sophisticated workflow but more
importantly by means of a rich ontology covering a variety of aspects related to the
experience, for example, the time and place it occurred, the source where the evidence has
been retrieved, and the entities involved, such as a performer, a composer, or a creative
work [1]. Another example is the UK Reading Experience Database (RED). UK RED includes over
30,000 records of reading experiences sourced from English literature. The curatorial
effort required to populate these databases was significant and the size and quality of
these databases are a major achievement of these projects.
    In this position paper we ponder the applicability of knowledge extraction techniques
to support the data curation activity. Initially, we introduce the case study and analyse
the data curation activity. After that, we examine general knowledge extraction tasks and
discuss their relation to the problem at hand. Considering the case of the Listening
Experience Database (LED), we perform an empirical analysis of a portion of the database,
focusing on the roles "listener" and "place". Specifically, we elaborate on the hypothesis
that the related entities can be automatically retrieved from the source. Finally, we
discuss a set of challenges for knowledge extraction related to supporting the curation of
this type of evidence database.
1 https://led.kmi.open.ac.uk/

2    DATA CURATION ACTIVITY
In general, the discovery and selection of documentary evidence is an activity that may not
be conducted systematically. However, in the context of enterprises such as the LED
project, there is an attempt to objectively select, extract, and curate documentary
evidence from texts. From the curator’s perspective, it is not about searching archives or
repositories but exploring specific sources of value, for example, specific books. In [8]
we developed an approach for retrieving textual excerpts relevant for a certain theme of
interest in a book by combining language analysis, entity recognition, and a general
purpose knowledge graph (DBpedia) and showed that many of those pieces of evidence are
characterised by implicit information. In addition, once the text is found, populating all
the metadata is a long and difficult task.
    To illustrate the problem, let’s consider two examples from the LED project:
    E1 "Music is certainly a pleasure that may be reckoned intellectual, and we shall never
again have it in the perfection it is this year, because Mr. Handel will not compose any
more! Oratorios begin next week, to my great joy, for they are the highest entertainment to
me." 2 The
2 Source: Mary Granville, and Augusta Hall (ed.), Autobiography and Correspondence of Mary
Granville, Mrs Delany: with interesting Reminiscences of King George the Third and Queen
Charlotte, volume 1 (London, 1861), p. 594.
https://led.kmi.open.ac.uk/entity/lexp/1444424772006 accessed: 30 September, 2019.
excerpt refers to Mrs Delany’s report of a (series of) live performances of Operas and
Oratorios by George Frideric Handel, which took place in March 1737.
    E2 "I then went to Amsterdam to conduct Oedipus at the Concertgebouw, which was
celebrating its fortieth anniversary by a series of sumptuous musical productions. The fine
Concertgebouw orchestra, always at the same high level, the magnificent male choruses from
the Royal Apollo Society, soloists of the first rank - among them Mme Hélène Sadoven as
Jocasta, Louis van Tulder as Oedipus, and Paul Huf, an excellent reader - and the way in
which my work was received by the public, have left a particularly precious memory that I
recall with much enjoyment." 3 Stravinsky, in the beginning of 1928, celebrates the high
level of the Concertgebouw orchestra and singers performing his Oedipus Rex. All of them
are listed as entities in the LED database.
    In both examples, several of the entities involved are not mentioned in the excerpt and
are derived from the curator’s knowledge of the source (for example, Mrs Delany is the
author of the letter in E1) and the domain (e.g. the full name of the work is Oedipus Rex
in E2).
    Here we focus on the challenge of automatically populating the record and supporting an
expert in identifying, collecting and inputting the relevant information. In other words,
we aim at automatically populating as many roles of the ontology as possible. For instance,
a listening experience specification can be derived from the available graph on
data.open.ac.uk [7]. The type ListeningExperience includes the following properties, among
others (we omit namespaces for readability):
     • agent (who is the listener)
     • time (when the listening event occurred)
     • place (where it occurred)
     • subject (what was listened to)
     • is_reported_in (a link to the source)
     • has_environment (e.g. was it a public or a private place, indoor or outdoor)

A ListeningExperience is related to other relevant items, notably Performance, WrittenWork,
MusicArtist, and Country. The knowledge extraction system should be able to derive the
requirements from the ontology specification, primarily the data values and roles involved.
For example, it should derive the requirement to find the agent of the ListeningExperience,
its place and time, and that there may be a specific musical work to be identified and,
eventually, the author of the musical work, filling the roles associated to the path
subject -> ? a Performance -> performance of -> ? a MusicExpression.

3     KNOWLEDGE EXTRACTION
Knowledge extraction is a branch of artificial intelligence covering a variety of tasks
related to the automatic or semi-automatic derivation of formal symbolic knowledge from
unstructured or semi-structured sources4.
    The area comprises research on a variety of problems related to lifting an unstructured
or semi-structured source into an output described using a knowledge representation
formalism. Entity extraction and classification are two related tasks referring to the
location of mentions of entities in an input text and their categorization, as in the
following example: "We went to the rehearsal of Joshua[Person] last Tuesday[Time]". Entity
Linking, instead, refers to finding mentions of entities from a database in a natural
language resource or, similarly, to appropriately disambiguating words by associating a
knowledge base identifier. Often, the three tasks are performed together and labelled Named
Entity Recognition and Classification (NERC) [12]. Linked Data and NER together have been
extensively employed in a number of knowledge extraction and data mining tasks (e.g., the
work of H. Paulheim [21]). Relation extraction refers to the identification of n-ary
relations (for n ≥ 2) within the source, usually addressed with a combination of NLP and
machine learning techniques [22]. The relations Composer(Oedipus Rex, Stravinsky) and
Performed(Oedipus Rex, Concertgebouw, 1928) are two examples. Event extraction is a special
case of relation extraction where the focus is on identifying an event, usually an action
being performed by an agent in a certain setting. This task is extensively studied in
domains such as Biomedicine [5], Finance and Politics [15], and Science [26]. Approaches
dedicated to the detection and extraction of historical and biographical events are
proposed in [25, 29]. The notion of event is generally considered as something happening at
a specific time and place, which constitutes an incident of substantial relevance [14].
Therefore, the objective is to identify the action triggering the event (e.g. the verb
perform) and then the associated roles. Data-driven approaches usually involve statistical
reasoning or probabilistic methods like Machine Learning techniques. In contrast,
knowledge-based methods are generally top-down and based on pre-defined templates, for
example, lexico-semantic patterns [15]. The two approaches can be combined and machine
learning methods used to learn such patterns [23]. However, the notion of event is still
ill-defined in NLP research and this makes it hard to develop methods that are effectively
portable to multiple domains [14]. Research in open domain event extraction focuses
essentially on social media data [24] where the task is the extraction of statements for
summarization purposes, similar to the one of key-phrases extraction [28]. Ontology-based
information extraction (OBIE) uses formal ontologies to guide the extraction process
[17, 27]. Relevant work in the area is surveyed in [9, 19]. In 2013, Gangemi provided an
introduction and comparison of fourteen tools for knowledge extraction over unstructured
corpora, where the task is defined as general purpose machine reading [10]. A machine
reader transforms a natural language text into formal knowledge, according to a shared
semantics. State-of-the-art methods include FRED [11] and PIKES [6]. These approaches are
based on a frame-based semantics that is at the same time domain- and task-independent.
Instead, a domain-oriented solution would identify knowledge components of interest in the
text, similarly to what is explored, for example, in the work of Alani [3]. This task is
also considered as an automatic ontology instantiation [2] or semi-automatic creation of
metadata [13]. A suitable approach should be able to detect the requirements from a
domain-specific ontology and, having as input the text excerpt, the source metadata, and
potentially other knowledge bases, generate suitable hypotheses of values and entities on
any relevant role.
3 Igor Stravinsky, Igor Stravinsky: An Autobiography (1936), p. 139.
https://led.kmi.open.ac.uk/entity/lexp/1435674909834 accessed: 30 September, 2019.
4 For a general introduction, see [16].
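As a rough illustration of the requirement that a system derive the roles to be filled from the ontology, the ListeningExperience properties listed in Section 2 can be modelled as a record whose empty slots are the hypotheses still to be generated. This is a minimal sketch and not the LED data model: the Python class, the helper missing_roles, and the plain string values are assumptions made for illustration, whereas in LED the roles are RDF properties defined by the ontology.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ListeningExperience:
    """Hypothetical record mirroring the roles listed in Section 2."""
    agent: Optional[str] = None            # who is the listener
    time: Optional[str] = None             # when the listening event occurred
    place: Optional[str] = None            # where it occurred
    subject: Optional[str] = None          # what was listened to
    is_reported_in: Optional[str] = None   # a link to the source
    has_environment: Optional[str] = None  # e.g. public/private, indoor/outdoor


def missing_roles(record: ListeningExperience) -> List[str]:
    """Roles that still need a hypothesis, in declaration order."""
    return [f.name for f in fields(record) if getattr(record, f.name) is None]


# Values taken from the discussion of example E2; everything else is unknown.
e2 = ListeningExperience(time="1928", subject="Oedipus Rex")
print(missing_roles(e2))
```

For E2 the remaining roles include the agent, which the text attributes to the curator's knowledge of the source rather than to the excerpt itself.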
4    EMPIRICAL ANALYSIS
To discuss the feasibility and difficulty of the task, we relax the problem and verify to
what extent the entities that are part of the curated metadata could potentially be
automatically derived from the sources. Specifically, we want to answer the questions: (Q1)
Could a system find the target entities in the excerpt? (Q2) Could a system find the target
entities in the text surrounding the excerpt? (Q3) How far from the excerpt is the entity?
(Q4) Could it be found in the metadata of the source?
    We consider the case of the LED database and focus on two roles: the listener and the
place of the listening event. The LED curation workflow reuses entities from DBpedia,
MusicBrainz, and also defines new entities in the Linked Data. Our analysis is limited to
books from archive.org annotated with a listener or place from DBpedia. We use DBpedia
Spotlight [20] as entity recognition and linking system.

Figure 1: Statistics on the LED database to illustrate the coverage of DBpedia entities and
the scope of our analysis.

Figure 2: Distance of entity mention, in paragraphs. (a) Places. (b) Agents.

Listing 1: Detect the location of an excerpt in a source.
    excerpt, Source;
    best[t, b, e, s]; // text, begin, end, score
    words[] = tokenize(excerpt)
    words[] = sortByLengthDesc(words[]) // Longest on top
    Foreach word in words[]:
      occurrences[][b, e] = find(word, Source)
      position[b, e] = find(word, excerpt)
      Foreach occurrence[b, e] in occurrences[][b, e]:
        begin = occurrence.b - position.b
        end   = occurrence.e + len(excerpt) - position.e
        possible = substring(Source, begin, end)
        score = levenshtein(excerpt, possible)
        if (score < best[s])
          best[t, b, e, s] = [possible, begin, end, score]
        fi
      End
    End
    return best
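The pseudocode of Listing 1 can be sketched in Python roughly as follows. This is an illustrative reading under stated assumptions, not the authors' implementation: the function names, the regular-expression tokenizer, the clamping of offsets to the source boundaries, and the max_words cut-off (Listing 1 iterates over all words) are choices made here to keep the example small and runnable.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with the standard two-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (0 if equal)
            ))
        previous = current
    return previous[-1]


def locate_excerpt(excerpt: str, source: str, max_words: int = 5):
    """Return (text, begin, end, score) for the best candidate, or None.

    The longest words of the excerpt act as locators; max_words is an
    assumption of this sketch that limits how many locators are tried.
    """
    words = sorted(set(re.findall(r"\w+", excerpt)),
                   key=lambda w: (-len(w), w))  # longest first, stable ties
    best = None
    for word in words[:max_words]:
        position = excerpt.find(word)  # offset of the locator in the excerpt
        for match in re.finditer(re.escape(word), source):
            # Align the excerpt on this occurrence of the locator.
            begin = max(0, match.start() - position)
            end = min(len(source), begin + len(excerpt))
            candidate = source[begin:end]
            score = levenshtein(excerpt, candidate)
            if best is None or score < best[3]:
                best = (candidate, begin, end, score)
    return best


source = ("It was a quiet morning in Vienna. We attended a magnificent "
          "performance of the oratorio, and the audience applauded warmly.")
excerpt = "we attended a magnificent performance of the oratorio"
print(locate_excerpt(excerpt, source))
```

Each candidate costs one edit-distance computation over the full excerpt length, so a real corpus would likely need a cap on locators or occurrences, which is what max_words stands in for here.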
    First, we need to find the position of the evidence text back in the original source.
Identifying the position of LED items in the original book is not an easy task. In fact,
the process of reporting an excerpt from the book involves a number of modifications in the
format that make an exact text match very unlikely to work. In addition, the reported text
often includes omissions or rephrasing in order to include co-references derived from
previous paragraphs. To solve the problem, we developed the algorithm presented in Listing
1. The method is based on using the longest words as locators. The algorithm selects the
occurrences of the longest words and isolates the surrounding portion of text using the
length of the excerpt as heuristic. The resulting candidates are then ranked according to
their similarity against the excerpt using the well-known Levenshtein distance [18]. The
candidate with the lowest score is elected as the original text.
    Figure 1 illustrates the features of the corpus. Of the 9059 listening experiences in
the database with a textual excerpt reported, 7999 include a place (88.3%) and in 7222 of
them the place points to DBpedia (79.9%). The agent is specified in 8258 of them (91.2%)
but only 2996 refer to a DBpedia entity (33.1%). In all other cases the listener is created
as a novel entity.
    64.8% of the listeners are also the authors of the text - 5874 cases. This is not
surprising, as memoirs, diaries, and collections of letters were among the most researched
types of resources. In addition, this answers our Q4 and shows how important it could be to
intelligently derive information from the source metadata. However, fewer than half of the
agents exist in DBpedia (2130 times, 23.5% of the total). Finally, only 11.3% of the
sources could be retrieved as open texts, referring to 1026 of the pieces of documentary
evidence in the database. Of these, 7.3% include DBpedia entities as place or agent, 690
excerpts from 26 books. These are the objects in our analysis.
    Results are summarised in Figure 2. Charts display the distance of the entity mentions,
measured in number of paragraphs5. This analysis is partial as it only covers DBpedia
entities being used as places or agents (listeners) with relation to books whose sources we
could retrieve from the Web. However, the answers to the remaining questions are quite
interesting. (Q1) The DBpedia place was mentioned in the textual excerpt only in 25.9% of
the observed cases (179). The listener was mentioned in the excerpt only in 13 cases, 13.4%
of the observed population (97). (Q2) 10% of the times the place mention is less than 5
paragraphs from the evidence text. The agent is mentioned within 5 paragraphs from the
evidence in 4% of the observed cases. (Q3) 83.2% of the times the DBpedia place was
explicitly mentioned at least once in the source (574). In 79 cases (11.4%) the place
hasn’t been found either in the excerpt or anywhere else in the source. A similar result is
observable for agents. Finally, there is a good chance the entity is somewhere away from
the evidence text.

5     CHALLENGES
There are several aspects that make the task of automatically supporting the acquisition of
knowledge about documentary evidence particularly interesting from the point of view of
scientific knowledge acquisition.
5 Text segmentation is itself a difficult task. In our analysis, we measured distances in
number of characters, considered one word to be 5 characters (the approximate average
length in English) and one paragraph to amount to 200 words.
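The distance heuristic of footnote 5 amounts to a fixed conversion from characters to paragraphs, which the following sketch spells out. The two constants come from the footnote; the function name and the choice to measure the gap between the nearest edges of the two spans (with overlapping spans counting as zero) are assumptions, since the text does not say how the character distance itself was taken.

```python
CHARS_PER_WORD = 5          # footnote 5: approximate average English word length
WORDS_PER_PARAGRAPH = 200   # footnote 5: one paragraph amounts to 200 words


def paragraph_distance(excerpt_begin: int, excerpt_end: int,
                       mention_begin: int, mention_end: int) -> float:
    """Estimated number of paragraphs between an excerpt and an entity mention.

    All arguments are character offsets into the same source text.
    """
    if mention_end <= excerpt_begin:      # mention precedes the excerpt
        gap = excerpt_begin - mention_end
    elif mention_begin >= excerpt_end:    # mention follows the excerpt
        gap = mention_begin - excerpt_end
    else:                                 # mention lies inside the excerpt
        gap = 0
    return gap / (CHARS_PER_WORD * WORDS_PER_PARAGRAPH)


# A mention ending 3990 characters before the excerpt starts.
print(paragraph_distance(5000, 5600, 1000, 1010))
```

Under these constants one paragraph corresponds to 1000 characters, so the 5-paragraph threshold used for Q2 is roughly 5000 characters on either side of the excerpt.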
    An important characteristic is the amount of implicit informa-                       [2] Harith Alani, Sanghee Kim, David E Millard, Mark J Weal, Wendy Hall, Paul H
tion necessary to characterise the documentary evidence that is                              Lewis, and Nigel Shadbolt. 2003. Web based knowledge extraction and consoli-
                                                                                             dation for automatic ontology instantiation. (2003).
not derivable from the reference text. As a result, a typical knowl-                     [3] Harith Alani, Sanghee Kim, David E Millard, Mark J Weal, Wendy Hall, Paul H
edge extraction approach may fail at performing an inference that                            Lewis, and Nigel R Shadbolt. 2003. Automatic ontology-based knowledge extrac-
                                                                                             tion from web documents. IEEE Intelligent Systems 18, 1 (2003), 14–21.
is normally the result of user’s expertise. A domain-independent                         [4] Helen Barlow and David Rowland. 2017. Listening to music: people, practices
machine reader could produce a formal representation of the text                             and experiences.
with entities and roles linked together. Theoretically, processing a                     [5] Jari Björne, Filip Ginter, Sampo Pyysalo, Jun’ichi Tsujii, and Tapio Salakoski.
                                                                                             2010. Complex event extraction at PubMed scale. Bioinformatics 26, 12 (2010).
text through a machine reading system would reduce the problem                           [6] Francesco Corcoglioniti, Marco Rospocher, and Alessio Palmero Aprosio. 2016.
to one of ontology alignment. However, as we have seen, the needed                           Frame-based ontology population with PIKES. IEEE Transactions on Knowledge
entities may not be mentioned in the text excerpt at a reasonable                            and Data Engineering 28, 12 (2016), 3261–3275.
                                                                                         [7] Enrico Daga, Mathieu d’Aquin, Alessandro Adamou, and Stuart Brown. 2016.
proximity. In addition, having to deal with an ontology alignment                            The Open University Linked Data - data.open.ac.uk. Semantic Web 7, 2 (2016).
problem does not necessarily reduce the distance to the goal.                            [8] Enrico Daga and Enrico Motta. 2019. Capturing themed evidence, a hybrid
                                                                                             approach. In 18th Int. Conference on Knowledge Capture (K-CAP). ACM, To appear.
    Crucially, metadata about the sources should be used to derive                       [9] Dejing Dou, Hao Wang, and Haishan Liu. 2015. Semantic data mining: A survey
information such as the time span of the documentary material or                             of ontology-based approaches. In Proceedings of the 2015 IEEE 9th international
information about the author(s). Determining who is the person                               conference on semantic computing (IEEE ICSC 2015). IEEE, 244–251.
                                                                                        [10] Aldo Gangemi. 2013. A comparison of knowledge extraction tools for the semantic
reporting the event could contribute to populating the agent (for first-                     web. In Extended semantic web conference. Springer, 351–366.
person reports) but also to deriving more contextual information,                       [11] Aldo Gangemi, Valentina Presutti, Diego Reforgiato Recupero, Andrea Giovanni
for example, related to the historical period or the interests of the                        Nuzzolese, Francesco Draicchio, and Misael Mongiovì. 2017. Semantic web
                                                                                             machine reading with FRED. Semantic Web 8, 6 (2017), 873–893.
author. Linking an author to a knowledge graph (such as DBpedia)                        [12] Archana Goyal, Vishal Gupta, and Manish Kumar. 2018. Recent named entity
could provide insight on the validity of the hypotheses for assigning                        recognition and classification techniques: a systematic review. Computer Science
                                                                                             Review 29 (2018), 21–43.
certain roles, for example, by deriving that Stravinsky is the author                   [13] Siegfried Handschuh, Steffen Staab, and Fabio Ciravegna. 2002. S-CREAM—semi-
of Oedipus Rex (E 2 ). Therefore, a general solution should be able                          automatic creation of metadata. In International Conference on Knowledge Engi-
to reason upon contextual knowledge. Intuitively, the system                                 neering and Knowledge Management. Springer, 358–372.
                                                                                        [14] Frederik Hogenboom, Flavius Frasincar, Uzay Kaymak, Franciska De Jong, and
should be capable of fitting within the constraints of the domain                            Emiel Caron. 2016. A survey of event extraction methods from text for decision
specific ontology and exploit it to tailor the approach. The ontology                        support systems. Decision Support Systems 85 (2016), 12–22.
specification would provide information about the main types and                        [15] Wouter IJntema, Jordy Sangers, Frederik Hogenboom, and Flavius Frasincar. 2012.
                                                                                             A lexico-semantic pattern language for learning ontology instances from text.
relations of interest, and those can be used to derive contextual                            Web Semantics: Science, Services and Agents on the World Wide Web 15 (2012).
information from existing commons sense knowledge bases (e.g.                           [16] Jing Jiang. 2012. Information extraction from text. In Mining text data. Springer.
                                                                                        [17] Vangelis Karkaletsis, Pavlina Fragkou, Georgios Petasis, and Elias Iosif. 2011. On-
ConceptNet6 ).                                                                               tology based information extraction from text. In Knowledge-driven multimedia
    Although it may seem that these databases have a limited domain                          information extraction and ontology evolution. Springer, 89–109.
of interest, there are few chances that the variety of types and                        [18] Vladimir I Levenshtein. 1966. Binary codes capable of correcting deletions,
                                                                                             insertions, and reversals. In Soviet physics doklady, Vol. 10. 707–710.
entities useful could be found in a single, encyclopedic, knowledge                     [19] Jose L Martinez-Rodriguez, Aidan Hogan, and Ivan Lopez-Arevalo. 2018. Infor-
base. In the case of the LED project, part of the Linked Open Data,                          mation extraction meets the semantic web: a survey. Semantic Web (2018).
the documentary evidence links to a variety of external resources                       [20] Pablo N Mendes, Max Jakob, Andrés García-Silva, and Christian Bizer. 2011.
                                                                                             DBpedia spotlight: shedding light on the web of documents. In Proceedings of the
(e.g. MusicBrainz7 and Geonames8 ). The system should be able to                             7th international conference on semantic systems. ACM, 1–8.
work across distributed and heterogeneus datasets in search for                         [21] Heiko Paulheim. 2013. Exploiting Linked Open Data as Background Knowledge
                                                                                             in Data Mining. DMoLD 1082 (2013).
relevant resources. These may include common-sense knowledge                            [22] Sachin Pawar, Girish K Palshikar, and Pushpak Bhattacharyya. 2017. Relation
and linguistic resources, textual corpora, gazetteeres, thesauri, and                        extraction: A survey. arXiv preprint arXiv:1712.05191 (2017).
specialised digital libraries. Ultimately, the system should be able                    [23] Jakub Piskorski, Hristo Tanev, and Pinar Oezden Wennerberg. 2007. Extract-
                                                                                             ing violent events from on-line news for ontology population. In International
to recognise entities and their roles despite the fact that they can                         Conference on Business Information Systems. Springer, 287–300.
be linked to any reference database.                                                    [24] Alan Ritter, Oren Etzioni, Sam Clark, et al. 2012. Open domain event extraction
    Ultimately, cultural studies like the ones performed in the LED                          from twitter. In Proceedings of the 18th ACM SIGKDD international conference on
                                                                                             Knowledge discovery and data mining. ACM, 1104–1112.
and RED projects often coin novel concepts, such as Listening                           [25] Roxane Segers, Marieke Van Erp, Lourens Van Der Meij, Lora Aroyo, Jacco van
Experience, whose structure and features cannot be found in pre-                             Ossenbruggen, Guus Schreiber, Bob Wielinga, Johan Oomen, and Geertje Jacobs.
                                                                                             2011. Hacking history via event extraction. In Proceedings of the sixth international
existing databases. In fact, the definition of a concept of interest                         conference on Knowledge capture. ACM.
is itself a scientific output for which the database constitutes the                    [26] Maria Vargas-Vera and David Celjuska. 2004. Event recognition on news stories
empirical proof of relevance to scholarship in the related field. It is                      and semi-automatic population of an ontology. In IEEE/WIC/ACM International
                                                                                             Conference on Web Intelligence (WI’04). IEEE, 615–618.
an open question to what extent learning from one of such databases                     [27] Daya C Wimalasuriya and Dejing Dou. 2010. Ontology-based information ex-
could help in supporting a new, coming one.                                                  traction: An introduction and a survey of current approaches.
                                                                                        [28] Ian H Witten, Gordon W Paynter, Eibe Frank, Carl Gutwin, and Craig G Nevill-
                                                                                             Manning. 2005. Kea: Practical automated keyphrase extraction. In Design and
REFERENCES                                                                                   Usability of Digital Libraries: Case Studies in the Asia Pacific. IGI Global, 129–152.
 [1] Alessandro Adamou, Mathieu d’Aquin, Helen Barlow, and Simon Brown. 2014.           [29] Kalliopi Zervanou, Ioannis Korkontzelos, Antal Van Den Bosch, and Sophia
     LED: curated and crowdsourced linked data on music listening experiences.               Ananiadou. 2011. Enrichment and structuring of archival description metadata.
     Proceedings of the ISWC 2014 Posters & Demonstrations Track (2014).                     In Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural
                                                                                             Heritage, Social Sciences, and Humanities. ACL.
6 http://conceptnet.io/
7 https://musicbrainz.org/
8 https://www.geonames.org/