BiographyNet: Managing Provenance at Multiple Levels and from Different Perspectives

Niels Ockeloen, Antske Fokkens, Serge ter Braake, Piek Vossen, Victor de Boer, Guus Schreiber, and Susan Legêne

The Network Institute, VU University Amsterdam
De Boelelaan 1081, 1081 HV Amsterdam, The Netherlands
{niels.ockeloen,antske.fokkens,s.ter.braake,piek.vossen,v.de.boer,guus.schreiber,s.legene}@vu.nl
http://wm.cs.vu.nl

Abstract. The BiographyNet project aims at inspiring historians when setting up new research projects. The goal is to create a semantic knowledge base by extracting links between people, historic events, places and time periods from a variety of Dutch biographical dictionaries. A demonstrator will be developed providing visualization and browsing techniques for the knowledge base. In order to establish its credibility as a serious research tool, keeping track of provenance information is crucial. This paper describes a schema that models provenance from different perspectives and at multiple levels within BiographyNet. We will present a concrete model for the BiographyNet demonstrator that uses elements from the Europeana Data Model [6], PROV-DM [17] and P-PLAN [11].

Keywords: eHumanities, Linked Data, PROV-DM, P-PLAN, ORE, EDM

1 Introduction

E-humanities investigates what can be done in the humanities with modern techniques that we could not do before, or could only do with a great deal of effort. E-history is a subdomain of e-humanities which offers a way of linking pieces of information and discovering relationships that would otherwise be difficult to trace. It generally aims at improving methods of existing historical research rather than introducing a whole new way of doing historical research [22]. It creates pathways through information, rather than being the closing factor or end result in historical research [1].
Efforts in e-humanities often concentrate on how to mine ‘big data’, which we define as data that is very difficult to handle manually for a traditional researcher. More challenging, and in general also more interesting, are projects which aim to go beyond simple data mining and endeavor to answer difficult research questions: the similarity between and interdependence of two or three texts, tracing and defining subjective elements and descriptions, or signaling traces of political or cultural influences from a society during a given period. These new ways of mining historical data lead to new questions on the provenance of information. It is imperative for historians to keep a good oversight of the sources which were used to produce a certain output. How reliable are the sources which were used and what do they tell about the significance of the outcome? What differences are found in the information that individual sources provide? When information differs, how are specific points of view distributed over different sources? How can results be manipulated by adjusting queries for a more accurate result? For these reasons, the historian needs to have an aggregated view of the process from query to output and, if necessary, be able to inspect the whole process step by step to learn which additional sources and heuristics were involved.

1.1 Use Case: BiographyNet

The BiographyNet project is an e-history project bringing together researchers from history, computational linguistics and computer science. The project uses data from the Biography Portal of the Netherlands (BP), which contains approximately 125,000 biographies from a variety of Dutch biographical dictionaries, describing around 76,000 individuals. The aim of BiographyNet is to develop a demonstrator which supports the discovery of interrelations between people, events, places and time periods in biographical descriptions.
Through a combination of data enrichment, quantitative analysis, visualization and browsing techniques, the demonstrator should provide leads and insights that may be hard to discover using traditional methods. As such, it may inspire historians to investigate more ambitious research questions.

The BP links biographies written by thousands of authors with very different temporal and academic backgrounds. This results in many levels of reliability among the 125,000 entries in this melting pot of Dutch biographies. Provenance information is therefore an important factor. It must however be noted that provenance information on the original sources does not go beyond the information that is provided by the BP, such as author, publisher or the book from which a text was taken.

2 Motivation

The demonstrator should help historians do their research. This goal can only be met if the validity of the demonstrator's results can be verified. To this end, information needs to be available on performed operations as well as on used sources. According to Groth et al. [12], "data can only be meaningfully reused if the collection processes are exposed to users. This enables the assessment of the context in which the data was created, its quality and validity, and the appropriate conditions for use". Hence, provenance plays an important role in establishing the demonstrator's credibility.

Provenance needs to be modelled from different perspectives and at multiple levels for BiographyNet. These different perspectives include 1) the perspective of the information used to produce the results provided by the demonstrator, e.g. which original sources contributed to the outcome, 2) the perspective of the processes involved in creating the results and 3) the perspective of the people that were involved in setting up the pipeline of processes.
The various levels include 1) provenance at component level, recording each aspect of the processing steps involved, such as tool name, version, etc., and 2) an aggregated view of the provenance information for the interlinked processes as a whole. The latter is targeted at the end user of the system, in this case the historian, while the former is needed by the computer scientist in case the outcome of an aggregated process is called into question.

In the next two sections, we address the requirements for provenance modelling specific to BiographyNet. First, we address the point of view of historians, who are primarily interested in the reliability of the system. We explain how the requirements for historians relate to the categories for provenance on the web defined by Groth et al. [12]. Section 4 outlines BiographyNet from the point of view of the system developers, whose primary interest is to improve the technology behind the demonstrator. Section 6 describes the BiographyNet schema devised to accommodate the required provenance information as described in the preceding sections.

3 Requirements for Historians

There are two main requirements for the historian regarding provenance when using the demonstrator: a trace back to the text and metadata in the original source, and insight into the processes manipulating and selecting the original data.

We will explain the first requirement through a research question on the background of the 71 governors-general of the Dutch Indies between 1610 and 1949. If, for instance, we run a query to find out what the average age of these individuals was at the time of their appointment, provenance information of different granularity should be present: a) an overview of the sources (in our case biographical dictionaries) that were used for the overall outcome and how often each individual source was consulted, b) an overview of potentially relevant data that was excluded from the end result.
This is important in case of conflicting data, where one source generally considered more reliable was used rather than another, and c) the sources that were used for a specific result (i.e. the age of a specific governor at the time of his appointment).

One can assume that few historians will have the background to completely (or even partly) understand the finer technical details of how data are processed in order to answer a query. Even when a new generation of ‘e-historians’ is trained, one cannot expect them to be computer scientists. Therefore, provenance of data manipulation should be modelled as simply as possible and focus on aspects that may directly influence the outcome of research questions. First and foremost, it should always be indicated whether information is directly extracted from the metadata or the result of automatic interpretation of text. Complete accuracy in automatic text interpretation cannot be guaranteed. Information extracted from text should therefore always include a direct link to the original source. Provenance should also indicate the overall performance of the system that interpreted the text; depending on the kind of question, the historian may want results that aim for high recall or high precision. Finally, a global description of the heuristics used when interpreting data should be provided. While resolving ambiguous location names, for instance, a strategy that always prefers locations in or near the Netherlands is likely to lead to good results within the BiographyNet project. However, if the historian wants to investigate the ties between officials in the former Dutch colonies (where cities with Dutch names can be found), this strategy would have a direct undesirable influence on the results. The historian should thus be able to check whether the interpretation process used any strategies that may introduce a bias that influences results.
If we translate this to the categories outlined in [12], this leads to the following requirements.1 The objects for which we need to model provenance are texts from several sources, metadata and statements extracted from the text. Texts and metadata are attributed to the publishers and authors of this data. Extracted information should also indicate the author or publisher of the original text and, in addition, point to the system used to extract the information. There is thus a tight link between the process and the attribution when modelling provenance of automatically extracted text. Attribution plays a significant role in establishing the reliability of information, and this includes the reliability of the methods that were used to extract information from text.

Information on the process should include detailed indications of the system's overall performance: i.e. it should indicate the precision and recall of the system for specific categories. Furthermore, the version, publication date and person responsible for generating the output should be indicated in case the historian wants to replicate their results at a later stage. Finally, provenance should include justifications for decisions made in the extraction process, in particular concerning techniques used to disambiguate terms or resolve entities. The historian may need such information to check whether the information extraction used heuristics or forms of entailment that may interfere with the outcome of the research question addressed by the demonstrator, as illustrated by the location disambiguation example above.

In order to address the aspects of trust and accountability as outlined above, it must be crystal clear which information comes directly from original sources, and which information is the result of the processing or interpretation of these sources. Hence, the schema for BiographyNet should accommodate this.
The distinction should be marked prominently, because automatic processes add a dimension to reliability that not all historians will be familiar with. One of the main challenges is therefore that technical processes should be explained in terms that are understandable to researchers who generally do not have a technical background. Strong collaboration between the historians and system designers is thus required when designing this part of the provenance modelling throughout the project. At this level, an indication of responsibility is necessary so that historians can contact the persons who designed the interpretation pipeline in case of an unexpected outcome, or if questions arise on the assumptions made or heuristics used.

1 Concepts that are addressed in [12] will be marked in bold font.

4 Requirements for Computer Scientists

Researchers working on demonstrators are mainly interested in provenance because it helps to make experiments replicable and it supports research to improve existing technologies. We use the term replication to refer to the process of following the exact same procedure as in the original work and thereby obtaining the exact same output. This is different from reproduction, where the same question is answered using different means (e.g. a new implementation or evaluation set). The validity of research results increases when they can be reproduced, whereas replication only verifies that an outcome was valid under specific conditions [8]. Within our setup, replication matters for two reasons. First, we need to be able to create the exact same dataset for historians if they want to compare new results to previous results. Second, when results cannot be reproduced, it is almost impossible to find the cause without being able to replicate the original results [18]. It is well known that both replicating and reproducing results is challenging when computer programs are involved.
This especially holds if the code is not available [19, 18], but even if code is present [21, 10]. Fokkens et al. [10] define five categories that may influence results in pipelines that involve Natural Language Processing (NLP): preprocessing (e.g. tokenization, cleaning up data), experimental setup (e.g. splitting folds for 10-fold cross validation, evaluation set), versioning (e.g. versions of resources such as WordNet [9], or tools such as Mallet [15] for machine learning), system output (e.g. the exact features for specific tokens, intermediate output of the system in a pipeline) and system variation (e.g. treatment of ties, thresholds). This information must be explicit in order to replicate results.

Information on influential factors immediately contributes to the second use of provenance for computer scientists: improving existing technologies. Individual tools and datasets interact in different ways with each other. Systematic testing of influential parameters, exchanging tools for subtasks and combining the output of different tools can lead to significant improvements in performance. The interaction between the performance of subtasks and the overall performance of the system is not always straightforward. The output of the sentence splitter, for instance, influences the output of the parser. However, even if the parser's output for the utterance as a whole is incorrect, we may still obtain the grammatical relations we need to identify the participants of an event.

The objects for which we need to model provenance are thus the data at the various stages of the processing pipeline. These data are attributed to a specific tool that has taken data from the previous stage, and possibly one or more external resources, as input. Again, attribution is tightly linked to the process. Modelling the process is the most complex aspect of modelling provenance for the NLP pipeline.
It requires registering detailed information on all tools and data sets involved, including preprocessing steps, steps to generate features and the process of creating training data for machine learning. For all tools and resources, the version should be indicated. A detail in implementation, or a small step or setting, can make a significant difference in the results. It should therefore be registered who is responsible, so that differences can be traced when third parties do not manage to reproduce results. Finally, documentation should clearly describe the decisions made in the setup, which both serves as a justification of the approach and as a way to indicate any form of entailment that may be relevant to the historian.

5 Retrieving Information from Text

One of the main challenges of building a demonstrator lies in creating tools that can automatically interpret text and extract information from it. The design of the system that is responsible for automatic text interpretation is work in progress. We therefore provide a description of what this process is likely to look like, based on the work carried out so far as well as on systems used in related work. The main purpose of this section is to give an indication of the different steps involved in automatic text interpretation.

We start by identifying linguistic information in text, where we distinguish two processes: named entity recognition and concept identification. Named entity recognizers identify names of persons, organizations and locations. Some also identify dates. We will use an off-the-shelf named entity recognizer for Dutch, for instance LingPipe2. Concept identification involves linking words in text to a set of concepts of interest. We will use revisions of the tools described in [20] and [5]. Their approach is based on McCarthy et al.'s [16] observation that words tend to have a predominant sense within a specific genre or domain. The approach involves two steps.
Concepts of interest are first identified in the corpus, after which an execution step is performed in which these concepts are labeled in the text. We briefly describe the two steps below.

– First, candidate terms are identified in the text. In a basic system, these may be verbs and nouns co-occurring in a sentence. We thus start by running a sentence identifier, tokenizer, part-of-speech tagger and lemmatizer over the entire corpus.
– Next, we link all these terms to WordNet entries and create hypernym chains. This process results in an overview of the hypernym chains identified in the text. For each hypernym, the set of hyponyms occurring in the text is given. We manually select a set of hypernyms from this overview. This set of hypernyms constitutes our concepts of interest.

As soon as we have created a set of concepts of interest, we can tag these concepts in the text. First, we create a corpus by running a tokenizer, part-of-speech tagger and lemmatizer over the text. For each lemma in the corpus, we check whether one of its senses is a hyponym of one of the concepts of interest. If so, we associate the lemma with this concept of interest. Lemmas are thus only linked with selected concepts of interest, and the senses that are related to these concepts constitute their predominant sense within our domain.

2 http://alias-i.com/lingpipe/web/demo-ne.html

Together, named entity recognition and concept identification provide a corpus in which persons, organizations, times, locations and concepts are labelled. Consequently, we can apply two strategies to extract useful information from text: a rule-based approach and a machine learning (ML) approach. We can define basic mapping rules that directly map the resulting labels within this corpus to usable metadata.
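The hyponym check at the heart of the tagging step can be sketched as follows. This is a minimal illustration only: a hand-built toy hypernym map stands in for WordNet, and the lemmas, senses and concepts of interest are invented for the example, not taken from the project's data.

```python
# Sketch of the concept-tagging step: a lemma is linked to a manually
# selected "concept of interest" if one of its senses has that concept
# in its hypernym chain. Toy data stands in for WordNet lookups.

# sense -> its hypernym chain (most specific first); illustrative only
HYPERNYM_CHAINS = {
    "governor.n.01": ["governor.n.01", "official.n.01", "person.n.01"],
    "painter.n.01": ["painter.n.01", "artist.n.01", "person.n.01"],
    "calvinism.n.01": ["calvinism.n.01", "religion.n.01"],
}

# lemma -> candidate senses (a lemma may be ambiguous)
SENSES = {
    "governor": ["governor.n.01"],
    "painter": ["painter.n.01"],
    "calvinism": ["calvinism.n.01"],
}

# manually selected hypernyms constituting the concepts of interest
CONCEPTS_OF_INTEREST = {"official.n.01", "religion.n.01"}

def tag_concepts(lemmas):
    """Associate each lemma with a concept of interest when one of its
    senses is a hyponym of that concept (i.e. the concept occurs in the
    sense's hypernym chain)."""
    tagged = {}
    for lemma in lemmas:
        for sense in SENSES.get(lemma, []):
            for hypernym in HYPERNYM_CHAINS.get(sense, []):
                if hypernym in CONCEPTS_OF_INTEREST:
                    tagged[lemma] = hypernym
    return tagged
```

In this toy run, "governor" would be tagged with the concept official.n.01 and "calvinism" with religion.n.01, while "painter" remains untagged because none of its hypernyms was selected as a concept of interest.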
If, for instance, we encounter a person name identified by the named entity recognizer in close proximity to a profession tagged by our concept identifier, we assign this profession to the person. The ML strategy uses existing metadata to discover similar information in biographies for which that metadata is missing, using named entities and concepts as features. The biographies obtained from the BP are accompanied by metadata that includes information on the subject of the biography. The completeness of this metadata varies significantly from source to source. Biographies with rich metadata can be used to learn to identify information in text and hence find this information in biographies with poorer metadata. We have created a corpus in which information from the metadata is tagged in the original text of the biography. This corpus can be used as a training set for machine learning to discover information in texts that is missing from the metadata. For example, we found that the metadata field ‘religion’ was available for only 6 out of the 71 governors-general in our use case. However, using ML we found this information in the text for 20 governors.

Together, these strategies form the core of our system for text interpretation. It should be noted that the descriptions provided above illustrate a basic system that is currently under development. Throughout the project, we will incrementally improve the system by adding more linguistic information.

6 The BiographyNet Schema

Having outlined the main concerns and requirements for the BiographyNet demonstrator, the following section describes the schema devised to manage the data used and produced for the demonstrator. It describes how data from both original sources and enrichments is stored, how provenance information is handled for the involved processes and how this ties into the formulated requirements. An impression of the schema can be found at: http://www.biographynet.nl/schema/.
The following subsections are best read with the schema alongside; the mentioned concepts and relations can then be traced and followed in the schema. Description of the various parts of the schema generally proceeds from left to right. Please note that this impression includes the various aspects described in this section in order to provide a general overview of the schema for BiographyNet. It does not include every aspect of the biographical data and provenance information, in order to maintain overview. Information on individual Activities, Entities etc., such as start times and version numbers, is left out, and qualified relations are only modelled where needed to illustrate the ideas behind the schema.

6.1 Foundations of the BiographyNet Schema

The collection of biographies is made available to the BiographyNet project as a collection of XML files. Each XML file contains a ‘Biographical Description’, which in turn contains three different types of data: a ‘File Description’ that contains the metadata on the original source, a ‘Person Description’ that contains limited metadata on the depicted person, and the actual biographical description. Currently, the available biographical data is not linked to any other sources. To be more flexible when it comes to linking to external sources in the near future, and in order to reason over the data, the BiographyNet demonstrator will be based on Linked Data [2] principles. Therefore, the collection of XML files is converted to RDF [4]. How this conversion was done in detail is out of scope for this paper, but a similar conversion process is described in [3]. When data needs to be converted, it is advisable to stay as close as reasonably possible to the original schema, in this case defined by the structure of the XML files. Any altering of the schema involves interpretation, and as interpretation can change over time, such a process has the potential for information loss.
For this reason we started out with a schema for BiographyNet that closely follows the structure of the original XML files; it contains a resource that represents a ‘Biographical Description’ (BioDes) that has connections with resources that represent a ‘File Description’ (FileDes), a ‘Person Description’ (PersonDes) and a resource for ‘Biographical Parts’ (BioParts). In the illustration, these are the blue outlined ovals, starting with the second leftmost.

Within the provided collection, multiple biographical descriptions are often available for the same person, originating from different sources. While these are represented as separate XML files in the provided collection, they need to coexist within the created Linked Data corpus. To this end, the BioDes objects are tied together using a resource representing the depicted person. This is the leftmost blue outlined oval. However, this means that (through the BioDes objects) a person can have multiple PersonDes objects containing possibly conflicting sets of metadata. In order to make the semantics of this more clear, we used parts of the Open Archives Initiative's ‘Object Re-use & Exchange’ ontology (OAI-ORE) [13, 14] in a way similar to how the Europeana Data Model (EDM) [7] uses concepts from that ontology. By defining the PersonDes objects as a subclass of the ore:Proxy class, defining the depicted person as an edm:ProvidedCHO (Cultural Heritage Object) and incorporating the associated predicate relations, the model becomes compatible with the Europeana Data Model while still staying true to the original data structure. The depicted person can now be viewed as a ‘Cultural Heritage Object’, of which multiple sets of metadata are made available through proxies, indicating that these sets of metadata represent different ‘views’ of that person.
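This proxy construction might look as follows in Turtle. The fragment is an illustrative sketch only: the bgn: namespace, the URIs and the class names for BioDes and PersonDes are assumed for the example and are not taken from the actual BiographyNet schema.

```turtle
# Sketch (assumed URIs): one depicted person with two possibly conflicting
# metadata views, each carried by a PersonDes acting as an ore:Proxy.
@prefix ore: <http://www.openarchives.org/ore/terms/> .
@prefix edm: <http://www.europeana.eu/schemas/edm/> .
@prefix bgn: <http://example.org/biographynet/> .   # assumed namespace

bgn:person-123    a edm:ProvidedCHO .               # the depicted person

bgn:persondes-A   a bgn:PersonDes , ore:Proxy ;     # view from dictionary A
                  ore:proxyFor bgn:person-123 ;
                  ore:proxyIn  bgn:biodes-A .

bgn:persondes-B   a bgn:PersonDes , ore:Proxy ;     # view from dictionary B
                  ore:proxyFor bgn:person-123 ;
                  ore:proxyIn  bgn:biodes-B .
```

Each proxy thus scopes one source's metadata to its own biographical description, so conflicting statements about the same person can coexist without being merged.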
This solution also allows for adding a new BioDes object for a person that ‘aggregates’ multiple other sources (BioDes objects) through the ore:Aggregates and edm:AggregatedCHO predicates. Besides the original biographical descriptions and an aggregated version of them, the model can also be used to accommodate enrichments. In that sense, an enrichment is a ‘new’ biographical description which was derived from original sources. A FileDes object will not be available for the enrichment, as the enrichment itself does not directly come from an original source, i.e. a biographical dictionary. Similarly, a BioDes object for an enrichment will most likely not contain a BioParts object, as it represents a set of metadata resulting from the enrichment process, but does not contain actual biographical texts. By modelling the was derived from relation, the enrichment can be traced back to the biographical description it was derived from and to its original source, a hard requirement formulated in Section 3.

6.2 Extending the Schema with Provenance

PROV-DM [17] is the logical candidate for modelling provenance, since the W3C3 made it a recommendation, promoting its widespread use. Furthermore, PROV concepts can be modelled in RDF, making it suitable for use in the BiographyNet schema. Besides relations such as was derived from, the PROV ontology can be used to model Entities, Agents and Activities that played a role during the enrichment process and during the creation of the pipeline of processes itself, including their mutual relations. Additionally, concepts from the new P-PLAN ontology [11] are integrated in the BiographyNet schema to specify the plans made for the actual activities involved in the enrichment process. Specifying planning information is useful in that it provides a way of verifying to what extent actions were performed according to plan.
Hence, integrating this information makes it easier to identify errors in individual processes of the aggregated enrichment process. It also makes replication of results more feasible, as the plans provide a description of what the input and output of activities should look like. As such, the combined use of these ontologies ties into the requirements of the historian to be able to trace which original sources were used to obtain a result and to gather additional information on possible heuristics and biases. It also ties into the requirement of the computer scientist to be able to replicate results.

In order to fulfill the requirements of the historian and computer scientist to have both an aggregated view on provenance (i.e. which original sources contributed to an enrichment) and a detailed view (i.e. specified information for all processing steps involved), these two levels are modelled separately in the schema. In the illustration, the aggregated level is represented by the orange outlined ovals (and the green one for the plan) between the two blue biographical structures. The detailed view is made up of the remainder of the schema. Clearly visible in the schema is how these activities and plans are parts and steps of the aggregated enrichment activity and its associated plan. These two views are described in more detail in the subsections below.

3 http://www.w3.org/

6.3 Aggregated Provenance Information

A prov:wasDerivedFrom relation is made between the BioDes object of the enrichment and the BioDes object of the original source in order to model the information on an enrichment process as a whole. Furthermore, a prov:Activity to represent the aggregated process, and its relations to the BioDes objects, are specified. That activity has a prov:Agent associated with it. This Agent is the aggregated set of tools used for the enrichment, otherwise known as a ‘pipeline’.
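The aggregated relations described here might be expressed as follows. Again, this is only a sketch under assumptions: the bgn: namespace and all resource URIs are invented for illustration, and the use of a qualified association to attach the plan is one possible PROV-O encoding, not necessarily the one used in the actual schema.

```turtle
# Sketch (assumed URIs): an enrichment BioDes derived from a source BioDes,
# with the aggregated enrichment activity and the pipeline agent.
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix bgn:  <http://example.org/biographynet/> .   # assumed namespace

bgn:biodes-enriched-1
        prov:wasDerivedFrom bgn:biodes-source-1 ;
        prov:wasGeneratedBy bgn:enrichment-activity-1 .

bgn:enrichment-activity-1
        a prov:Activity ;
        prov:used bgn:biodes-source-1 ;
        prov:wasAssociatedWith bgn:pipeline-1 ;      # the aggregated set of tools
        prov:qualifiedAssociation [
            a prov:Association ;
            prov:agent   bgn:pipeline-1 ;
            prov:hadPlan bgn:enrichment-plan-1       # desired behavior of the pipeline
        ] .
```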
The desired behavior of the integrated process is described by a prov:Plan object, which has its own provenance information: the plan for the enrichment process is attributed to an Agent, e.g. a computer scientist, and can be derived from an earlier version of that plan or from another enrichment. This aggregated provenance view allows the end user to identify which enrichments were used to produce a final aggregated view of information. The end user can determine the original sources through the various provenance relations. Furthermore, the end user knows who to contact in case an enrichment process seems to have produced questionable results. The aggregated plan can provide an overview of the input variables used in the underlying processes, as they are referenced through p-plan:isVariableOfPlan relations. This information allows for possible adjustments in order to adjust the output of the overall process.

6.4 Detailed Provenance Information

The detailed provenance information on individual processes is modelled as a chain of Activities which all have their own input and output Entities, associated Agents and Plan. The Agents are specific tools, such as a tokenizer or part-of-speech tagger. The plan describes what the specific tool should do. Each Plan has its own provenance information. These plans are plans in their own right, but are also designated as a p-plan:Step to indicate that they are a step of the aggregated plan for the enrichment as a whole. As such, these steps have input and output variables that describe the input and output of the related Activity. These variables correspond to the entities used and generated by the related activities. An Activity, together with its used and generated Entities, can be seen as a ‘bundle’ of objects that together are derived from the Plan for that activity. Each individual Activity is designated as a part of the aggregated enrichment Activity using the Dublin Core ‘hasPart’ predicate.
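Because each Activity records which Entities it used and which Entities it generated, the execution order of the chain does not need to be stored separately; it can be recovered from those relations alone. A minimal sketch, with invented activity and entity names standing in for the real pipeline steps:

```python
# Recover the execution order of a chain of activities from their
# prov:used / prov:wasGeneratedBy relations: activity B follows activity A
# when B used an entity that A generated. Names are illustrative only.

def execution_order(used, generated_by):
    """used: activity -> set of entities it used;
    generated_by: entity -> activity that generated it.
    Returns the activities in a valid execution order (topological sort)."""
    successors = {a: set() for a in used}
    indegree = {a: 0 for a in used}
    for activity, entities in used.items():
        for entity in entities:
            producer = generated_by.get(entity)
            if producer is not None and producer != activity:
                if activity not in successors[producer]:
                    successors[producer].add(activity)
                    indegree[activity] += 1
    # Kahn's algorithm over the dependency edges
    ready = sorted(a for a, d in indegree.items() if d == 0)
    order = []
    while ready:
        current = ready.pop(0)
        order.append(current)
        for nxt in sorted(successors[current]):
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    return order

# Toy three-step chain: tokenizer -> tagger -> extractor
used = {
    "tokenize": {"source-biodes"},
    "pos-tag": {"tokens"},
    "extract": {"tagged-tokens"},
}
generated_by = {
    "tokens": "tokenize",
    "tagged-tokens": "pos-tag",
    "enriched-metadata": "extract",
}
```

For the toy chain above, the recovered order is tokenize, pos-tag, extract; the same traversal also shows which intermediate Entities would have to be inspected when an aggregated result is questioned.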
The order in which the individual Activities are executed can be derived from the prov:used and prov:wasGeneratedBy relations that tie the individual Activities to the Entities representing intermediate results. Besides these intermediate results, other Entities may be used by a specific Activity, e.g. a list of cities for Named Entity Recognition. For both the intermediate results and these ‘external sources’, the data format is unknown. An intermediate result could be a collection of RDF triples, an XML file or a plain text file. An external source could be one of those, or basically any type of document. In order to cope with this variety, these Entities are represented by a prov:Entity of subclass bgn:IntermediateResult or bgn:ExternalSource, which can point to the actual document or serve as a Named Graph to contain RDF data.

The aggregated view and the detailed view of provenance information are related by the fact that all Activities in the detailed view are parts of the aggregated Activity, all Plans of the individual Activities are Steps in the aggregated Plan, and the biographical description of the ‘Source’ BioDes object is the actual Entity that is used by the first individual Activity, whereas the Entity produced by the last individual Activity is the resulting set of metadata of the enrichment BioDes object. Any form of pre- or post-processing of input data or results, needed to relate to those objects, needs to be viewed as a separate individual step in the overall plan, for without provenance information on those steps, replicability is not ensured.

7 Conclusion

Keeping track of provenance information is essential for the BiographyNet demonstrator to be viewed as a valid research tool for historians.
In this paper we described why this is the case, what the requirements are to model provenance from multiple perspectives, and which existing ontologies we used to devise a schema for BiographyNet that meets those requirements. We presented a first version of the BiographyNet schema that not only models provenance of what has taken place, but also models plans to compare against. The next step is to build a first version of the demonstrator. We will then have to evaluate how the schema holds up in practice, and use the outcome of that evaluation to further improve the schema.

Acknowledgements

This work was supported by the BiographyNet project (Nr. 660.011.308), funded by the Netherlands eScience Center (http://esciencecenter.nl/). Partners in this project are the Netherlands eScience Center, the Huygens/ING Institute of the Royal Dutch Academy of Sciences and VU University Amsterdam.

References

1. Arthur, P.: Exhibiting history. The digital future. Recollections 1(1) (2008)
2. Bizer, C., Heath, T., Berners-Lee, T.: Linked Data - the story so far. International Journal on Semantic Web and Information Systems 5(3), 1–22 (2009)
3. de Boer, V., Wielemaker, J., van Gent, J., Hildebrand, M., Isaac, A., van Ossenbruggen, J., Schreiber, G.: Supporting Linked Data production for cultural heritage institutes: The Amsterdam Museum case study. In: ESWC, Lecture Notes in Computer Science, vol. 7295, pp. 733–747. Springer Berlin / Heidelberg (2012)
4. Carroll, J.J., Klyne, G.: Resource Description Framework (RDF): Concepts and abstract syntax. W3C recommendation, W3C (Feb 2004), http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/
5. Cybulska, A., Vossen, P.: Using semantic relations to solve event coreference in text. In: Mititelu, V., Popescu, O., Vossen, P. (eds.) Proceedings of the Workshop Semantic Relations-II. pp. 60–67. Istanbul, Turkey (2012)
6.
Doerr, M., Gradmann, S., Hennicke, S., Isaac, A., Meghini, C., van de Sompel, H.: The Europeana Data Model (EDM). In: World Library and Information Congress: 76th IFLA General Conference and Assembly. Gothenburg, Sweden (2010)
7. Doerr, M., Gradmann, S., Hennicke, S., Isaac, A., Meghini, C., van de Sompel, H.: The Europeana Data Model (EDM). In: World Library and Information Congress: 76th IFLA General Conference and Assembly. pp. 10–15 (2010)
8. Drummond, C.: Replicability is not reproducibility: Nor is it good science. In: Workshop on Evaluation Methods for Machine Learning IV (2009)
9. Fellbaum, C. (ed.): WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA, USA (1998)
10. Fokkens, A., van Erp, M., Postma, M., Pedersen, T., Vossen, P., Freire, N.: Offspring from reproduction problems: What replication failure teaches us. In: Proceedings of the 51st ACL. Sofia, Bulgaria (2013)
11. Garijo, D., Gil, Y.: The P-PLAN ontology (2013), http://www.opmw.org/model/p-plan/
12. Groth, P., Gil, Y., Cheney, J., Miles, S.: Requirements for provenance on the web. International Journal of Digital Curation 7(1) (2012)
13. Lagoze, C., van de Sompel, H.: Open Archives Initiative Object Re-Use & Exchange (2007), http://www.openarchives.org/ore/documents/ore-jcdl2007.pdf
14. Lagoze, C., van de Sompel, H., Nelson, M.L., Warner, S., Sanderson, R., Johnston, P.: Object Re-Use & Exchange: A resource-centric approach. Tech. rep. (2008)
15. McCallum, A.K.: MALLET: A Machine Learning for Language Toolkit. http://mallet.cs.umass.edu (2002)
16. McCarthy, D., Koeling, R., Weeds, J., Carroll, J.: Finding predominant word senses in untagged text. In: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics. p. 279. Association for Computational Linguistics (2004)
17.
Moreau, L., Missier, P., Belhajjame, K., B'Far, R., Cheney, J., Coppens, S., Cresswell, S., Gil, Y., Groth, P., Klyne, G., Lebo, T., McCusker, J., Miles, S., Myers, J., Sahoo, S., Tilmes, C.: PROV-DM: The PROV Data Model. Tech. rep. (2012), http://www.w3.org/TR/prov-dm/
18. Neylon, C., Aerts, J., Brown, C.T., Coles, S.J., Hatton, L., Lemire, D., Millman, K.J., Murray-Rust, P., Perez, F., Saunders, N., Shah, N., Smith, A., Varoquaux, G., Willighagen, E.: Changing computational research. The challenges ahead. Source Code for Biology and Medicine 7(2) (2012)
19. Pedersen, T.: Empiricism is not a matter of faith. Computational Linguistics 34(3), 465–470 (2008)
20. Vossen, P., Bosma, W., Rigau, G., Agirre, E., Soria, A., Aliprandi, C., de Jonge, J., Hielkema, F., Monachini, M., Bartolini, R., Frontini, F.: KyotoCore: Integrated system for knowledge mining from text (2011)
21. Vanschoren, J., Blockeel, H., Pfahringer, B., Holmes, G.: Experiment databases. Machine Learning 87(2), 127–158 (2012)
22. Zaagsma, G.: Doing history in the digital age: History as a hybrid practice (2013), http://gerbenzaagsma.org/blog/16-03-2013/doing-history-digital-age-history-hybrid-practice