=Paper=
{{Paper
|id=Vol-2065/paper09
|storemode=property
|title=A Holistic Approach to Scientific Reasoning Based on Hybrid Knowledge Representations and Research Objects
|pdfUrl=https://ceur-ws.org/Vol-2065/paper09.pdf
|volume=Vol-2065
|authors=Jose Manuel Gomez-Perez,Ronald Denaux,Andres Garcia,Raul Palma
|dblpUrl=https://dblp.org/rec/conf/kcap/Gomez-PerezDGP17
}}
==A Holistic Approach to Scientific Reasoning Based on Hybrid Knowledge Representations and Research Objects==
Jose Manuel Gomez-Perez, Expert System, Madrid, Spain, jmgomez@expertsystem.com

Ronald Denaux, Expert System, Madrid, Spain, rdenaux@expertsystem.com

Andres Garcia, Expert System, Madrid, Spain, rdenaux@expertsystem.com

Raul Palma, PSNC, Poznan, Poland, rpalma@man.poznan.pl

K-CAP2017 Workshops and Tutorials Proceedings, 2017. © Copyright held by the owner/author(s).

===1 MOTIVATION AND GOALS===
In light of current developments in AI, it appears the time is ripe for a shared partnership with machines, whereby humans can benefit from augmented reasoning and information management capabilities, provided that machines are endowed with the necessary intelligence to assist with such tasks. This seems to be particularly the case in the scientific domain, where some envision the development of an AI that can make major scientific discoveries and that eventually becomes worthy of a Nobel Prize [9]. This vision may still be far from realization, but it is not completely new.

NLP technologies based on well-formed, logically sound structured knowledge representations (knowledge graphs, ontologies) leverage expressive and actionable descriptions of the domain of interest through logical deduction and inference, and can provide logical explanations of reasoning outcomes. Closely related to this family of approaches, project Halo [7] aimed to develop a Digital Aristotle able to answer novel questions in scientific domains with expertise equivalent to Advanced Placement competence level. Halo enabled subject matter experts (SMEs) to model complex scientific knowledge from textbooks and related questions, based on an underlying logical formalism and a knowledge modeling workbench to assist SMEs in the task. The resulting system achieved an unprecedented question answering performance level for SME-entered knowledge, but it also had a number of severe drawbacks, including brittleness (coverage, precision or granularity gaps), scalability issues, and the need for a considerable force of well-trained human labor to manually encode large amounts of scientific knowledge.

On the other hand, the last decade has witnessed a shift towards statistical methods due to the increasing availability of raw data and cheap computing power. These have proved to be powerful and convenient in many linguistic tasks, such as part-of-speech tagging or dependency parsing. However, they are also limited: for example, humans seek causal explanations, which are hard to provide based on statistical induction rather than logical deduction. Recent results in the field of distributional semantics [10] have shown promising ways to learn features from text that can complement the knowledge already captured explicitly in structured representations. Embeddings provide a compact and portable representation of words and their meaning that stems directly from a document corpus. In this scenario, a notion of semantic portability [3] emerges that refers to the capability to capture as an information artifact (a vector) the semantics of a linguistic unit (a word) from its occurrences in the corpus, and to how such an artifact enables that meaning to be merged with other forms of knowledge representation.

Furthermore, scientific knowledge is heterogeneous and can present itself in many forms. During its analysis phase, Halo produced an inventory of the different types of knowledge identified. Such knowledge types include, among others: factual, procedural, classification, mathematical, diagrammatic, tabular and experimental knowledge. It is therefore clear that successfully reading and understanding scientific knowledge (either by humans or machines) requires addressing the different knowledge types in a holistic way, which remains a challenging task. We argue that addressing this challenge requires generalizing the notion of semantic portability from a text understanding scenario to a broader one where other modalities, such as diagrams, processes, experiments and related artifacts like scientific workflows and their execution provenance, are also involved. This can be achieved by learning individual models for each modality in the form of concept embeddings following a distributional semantics approach [8, 12] and learning the corresponding transformations between each vector space. The result will be a shared, hybrid formalism that encompasses the different modalities involved in scientific knowledge. Using embeddings to represent not only words but arbitrary features has been recently popularized by Chen and Manning in [2].

At this point, the question remains where to obtain the cross-modal data required to learn such models and the necessary transformations between them. We argue that the growing collections of research objects from different scientific disciplines available in repositories like ROHub.org [11] will play a key role in this regard. Conceptually speaking, a research object [1] is a container of scientific knowledge, a semantically rich aggregation of all the materials involved in a scientific investigation, such as papers and bibliography, numerical data, hypotheses, methods, experiments, workflows encoding such experiments and the provenance of their executions. A research object thus becomes the carrier of the scientific knowledge associated with a specific investigation. Research objects also bring together all the necessary information to preserve scientific work against potential decay [13] and can be shared, reused and cited in scholarly communications. As scholars move away from paper towards digital content, research objects have a key role to play in the way scientific results are communicated and validated by the communities, given the need for mechanisms that support the production of self-contained publications involving not only text but also data, methods and software implementations.

In [6], we show how research objects are key pieces of a human-machine scientific partnership. Building on that, we aim at furthering the role of research objects in such a partnership, leveraging research object corpora of cross-modal scientific knowledge to develop hybrid models for scientific reasoning and question answering. During the workshop, we aim to share and discuss these ideas, explore related lines of work and establish areas of common interest and collaboration with the participants. Key topics and research questions we wish to address include: approaches for hybrid reasoning, question answering and explanation; methods to build portable knowledge representations of multimodal data; how to combine the knowledge extracted from each modality in the research objects to recompose a coherent, more complete view of the scientific facts documented by them; and how the modalities interplay with each other in doing so.

===2 ABOUT THE AUTHORS===
This research is conducted by a team of researchers at Expert System’s COGITO Lab and the Poznan Supercomputing and Networking Center. Through the years, we have developed a body of work at the intersection of several areas of AI that converge in the ideas discussed in this document, including NLP, Knowledge Discovery, Representation and Reasoning, and new ways of scholarly communication and preservation of scientific knowledge (as research objects). This work aims at enabling machines to understand text and other modalities in which knowledge can be expressed in a way similar to how humans read, bridging the gap between both through semantically rich knowledge representations and human-machine interfaces. We believe that this vision is best served through a combination of structured knowledge and probabilistic approaches. The main author of this document participated in project Halo as a member of the DarkMatter team, focused on process knowledge acquisition from textbooks and question answering by domain experts [4, 5]. He is also one of the founders and key personnel behind ROHub.org, the reference platform for research object management. ROHub currently hosts almost 2,500 research objects and 180 scientists in a variety of experimental and observational scientific disciplines like Biology, Astrophysics and Earth Science.

===ACKNOWLEDGMENTS===
This research is funded by the EU H2020 and national research projects EVER-EST (674907), xLiMe-ES (20160805) and DANTE (700367).

===REFERENCES===
[1] S Bechhofer, I Buchan, D De Roure, P Missier, J Ainsworth, J Bhagat, P Couch, D Cruickshank, M Delderfield, I Dunlop, M Gamble, D Michaelides, S Owen, D Newman, S Sufi, and C Goble. 2013. Why linked data is not enough for scientists. Future Generation Computer Systems 29, 2 (2013), 599–611. https://doi.org/10.1016/j.future.2011.08.004 Special section: Recent advances in e-Science.

[2] Danqi Chen and Christopher Manning. 2014. A Fast and Accurate Dependency Parser using Neural Networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, 740–750. http://www.aclweb.org/anthology/D14-1082

[3] Ronald Denaux and Jose M Gomez-Perez. 2017. Towards a Vecsigrafo: Portable Semantics in Knowledge-based Text Analytics. In Proceedings of the 2017 workshop on Hybrid Statistical Semantic Understanding and Emerging Semantics (HSSUES). CEUR Workshop Proceedings, Held in Conjunction with the 16th International Semantic Web Conference, Vienna, Austria.

[4] Jose Manuel Gomez-Perez, Michael Erdmann, Mark Greaves, and Oscar Corcho. 2013. A Formalism and Method for Representing and Reasoning with Process Models Authored by Subject Matter Experts. IEEE Trans. on Knowl. and Data Eng. 25, 9 (Sept. 2013), 1933–1945. https://doi.org/10.1109/TKDE.2012.127

[5] Jose Manuel Gomez-Perez, Michael Erdmann, Mark Greaves, Oscar Corcho, and V. Richard Benjamins. 2010. A framework and computer system for knowledge-level acquisition, representation, and reasoning with process knowledge. 68 (10 2010), 641–668.

[6] Jose M Gomez-Perez, Andres Garcia-Silva, and Raul Palma. 2017. Towards a Human-Machine Scientific Partnership Based on Semantically Rich Research Objects. In eScience. IEEE Computer Society, 1–9.

[7] David Gunning, Vinay K Chaudhri, Peter E Clark, Ken Barker, Shaw-Yi Chaw, Mark Greaves, Benjamin Grosof, Alice Leung, David D McDonald, Sunil Mishra, and Others. 2010. Project Halo Update—Progress Toward Digital Aristotle. AI Magazine 31, 3 (2010), 33–58.

[8] Zellig S. Harris. 1981. Distributional Structure. Springer Netherlands, Dordrecht, 3–22. https://doi.org/10.1007/978-94-009-8467-7_1

[9] Hiroaki Kitano. 2016. Artificial Intelligence to Win the Nobel Prize and Beyond: Creating the Engine for Scientific Discovery. AI Magazine 37, 1 (2016), 39–49. http://www.aaai.org/ojs/index.php/aimagazine/article/view/2642

[10] Tomáš Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In NIPS. https://doi.org/10.1162/jmlr.2003.3.4-5.951 arXiv:1310.4546

[11] Raul Palma, Piotr Hołubowicz, Oscar Corcho, Jose M Gomez-Perez, and Cezary Mazurek. 2014. ROHub—A Digital Library of Research Objects Supporting Scientists Towards Reproducible Science. In Semantic Web Evaluation Challenge. Springer, 77–82.

[12] Magnus Sahlgren. 2008. The distributional hypothesis. Italian Journal of Linguistics 20, 1 (2008), 33–54.

[13] J Zhao, JM Gomez-Perez, K Belhajjame, G Klyne, E García-Cuesta, A Garrido, KM Hettne, M Roos, D De Roure, and C Goble. 2012. Why workflows break - Understanding and combating decay in Taverna workflows. In eScience. IEEE Computer Society, 1–9. http://dblp.uni-trier.de/db/conf/eScience/eScience2012.html#ZhaoGBKGGHRRG12