=Paper= {{Paper |id=Vol-2065/paper09 |storemode=property |title=A Holistic Approach to Scientific Reasoning Based on Hybrid Knowledge Representations and Research Objects |pdfUrl=https://ceur-ws.org/Vol-2065/paper09.pdf |volume=Vol-2065 |authors=Jose Manuel Gomez-Perez,Ronald Denaux,Andres Garcia,Raul Palma |dblpUrl=https://dblp.org/rec/conf/kcap/Gomez-PerezDGP17 }} ==A Holistic Approach to Scientific Reasoning Based on Hybrid Knowledge Representations and Research Objects== https://ceur-ws.org/Vol-2065/paper09.pdf
    A Holistic Approach to Scientific Reasoning Based on Hybrid
         Knowledge Representations and Research Objects
                      Jose Manuel Gomez-Perez                                                 Ronald Denaux
                            Expert System                                                     Expert System
                             Madrid, Spain                                                    Madrid, Spain
                       jmgomez@expertsystem.com                                         rdenaux@expertsystem.com

                               Andres Garcia                                                    Raul Palma
                             Expert System                                                        PSNC
                             Madrid, Spain                                                    Poznan, Poland
                       rdenaux@expertsystem.com                                           rpalma@man.poznan.pl

K-CAP2017 Workshops and Tutorials Proceedings, 2017
©Copyright held by the owner/author(s).

1    MOTIVATION AND GOALS
In light of current developments in AI, the time appears ripe for a shared partnership
with machines, whereby humans can benefit from augmented reasoning and information
management capabilities, provided that machines are endowed with the necessary
intelligence to assist with such tasks. This seems to be particularly the case in the
scientific domain, where some envision the development of an AI that can make major
scientific discoveries and that eventually becomes worthy of a Nobel Prize [9]. This
vision may still be far from realization, but it is not completely new.
   NLP technologies based on well-formed, logically sound structured knowledge
representations (knowledge graphs, ontologies) leverage expressive and actionable
descriptions of the domain of interest through logical deduction and inference, and can
provide logical explanations of reasoning outcomes. Closely related to this family of
approaches, project Halo [7] aimed to develop a Digital Aristotle able to answer novel
questions in scientific domains with expertise equivalent to Advanced Placement
competence level. Halo enabled subject matter experts (SMEs) to model complex scientific
knowledge from textbooks and related questions, based on an underlying logical formalism
and a knowledge modeling workbench to assist SMEs in the task. The resulting system
achieved an unprecedented question answering performance level for SME-entered
knowledge, but it also had a number of severe drawbacks, including brittleness (coverage,
precision or granularity gaps), scalability issues, and the need for a considerable force
of well-trained human labor to manually encode large amounts of scientific knowledge.
   On the other hand, the last decade has witnessed a shift towards statistical methods
due to the increasing availability of raw data and cheap computing power. These have
proved to be powerful and convenient in many linguistic tasks, such as part-of-speech
tagging or dependency parsing. However, they are also limited: for example, humans seek
causal explanations, which are hard to provide based on statistical induction rather than
logical deduction. Recent results in the field of distributional semantics [10] have
shown promising ways to learn features from text that can complement the knowledge
already captured explicitly in structured representations. Embeddings provide a compact
and portable representation of words and their meaning that stems directly from a
document corpus. In this scenario, a notion of semantic portability [3] emerges that
refers to the capability to capture as an information artifact (a vector) the semantics
of a linguistic unit (a word) from its occurrences in the corpus, and to how such an
artifact enables that meaning to be merged with other forms of knowledge representation.
   Furthermore, scientific knowledge is heterogeneous and can present itself in many
forms. During its analysis phase, Halo produced an inventory of the different types of
knowledge identified. Such knowledge types include, among others, factual, procedural,
classification, mathematical, diagrammatic, tabular and experimental knowledge. It is
therefore clear that successfully reading and understanding scientific knowledge (either
by humans or machines) requires addressing the different knowledge types in a holistic
way, which remains a challenging task. We argue that addressing this challenge requires
generalizing the notion of semantic portability from a text understanding scenario to a
broader one where other modalities, such as diagrams, processes, experiments and related
artifacts like scientific workflows and their execution provenance, are also involved.
This can be achieved by learning individual models for each modality in the form of
concept embeddings following a distributional semantics approach [8, 12] and learning the
corresponding transformations between the resulting vector spaces. The result will be a
shared, hybrid formalism that encompasses the different modalities involved in scientific
knowledge. Using embeddings to represent not only words but arbitrary features was
recently popularized by Chen and Manning [2].
   At this point, the question remains where to obtain the cross-modal data required to
learn such models and the necessary transformations between them. We argue that the
growing collections of research objects from different scientific disciplines available
in repositories like ROHub.org [11] will play a key role in this regard. Conceptually
speaking, a research object [1] is a container of scientific knowledge, a semantically
rich aggregation of all the materials involved in a scientific investigation, such as
papers and bibliography, numerical data, hypotheses, methods, experiments, workflows
encoding such experiments and the provenance of their executions. A research object thus
becomes the carrier of the scientific knowledge associated with a specific investigation.
Research objects also bring together all the necessary information to preserve scientific
work against potential decay [13] and can be shared, reused and cited in scholarly
communications. As scholars move away from paper towards digital content, research
objects have a key role to play in the way scientific results are communicated and
validated by the communities, given the need for mechanisms that support the production
of self-contained publications involving not only text but also data, methods and
software implementations.
   In [6], we show how research objects are key pieces of a human-machine scientific
partnership. Building on that, we aim at furthering the role of research objects in such
a partnership, leveraging research object corpora of cross-modal scientific knowledge to
develop hybrid models for scientific reasoning and question answering. During the
workshop, we aim at sharing and discussing these ideas, exploring related lines of work
and establishing areas of common interest and collaboration with the participants. Key
topics and research questions we wish to address include: approaches for hybrid
reasoning, question answering and explanation; methods to build portable knowledge
representations of multimodal data; how to combine the knowledge extracted from each
modality in the research objects to recompose a coherent, more complete view of the
scientific facts documented by them; and how the modalities interplay with one another in
doing so.

2    ABOUT THE AUTHORS
This research is conducted by a team of researchers at Expert System's COGITO Lab and the
Poznan Supercomputing and Networking Center. Through the years, we have developed a body
of work at the intersection of several areas of AI that converge in the ideas discussed
in this document, including NLP, Knowledge Discovery, Representation and Reasoning, and
new ways of scholarly communication and preservation of scientific knowledge (as research
objects). This work aims at enabling machines to understand text and other modalities in
which knowledge can be expressed in a way similar to how humans read, bridging the gap
between both through semantically rich knowledge representations and human-machine
interfaces. We believe that this vision is best served through a combination of
structured knowledge and probabilistic approaches. The main author of this document
participated in project Halo as a member of the DarkMatter team, focused on process
knowledge acquisition from textbooks and question answering by domain experts [4, 5]. He
is also one of the founders and key personnel behind ROHub.org, the reference platform
for research object management. ROHub currently hosts almost 2,500 research objects and
180 scientists in a variety of experimental and observational scientific disciplines like
Biology, Astrophysics and Earth Science.

ACKNOWLEDGMENTS
This research is funded by the EU H2020 and national research projects EVER-EST (674907),
xLiMe-ES (20160805) and DANTE (700367).

REFERENCES
 [1] S Bechhofer, I Buchan, D De Roure, P Missier, J Ainsworth, J Bhagat, P Couch,
     D Cruickshank, M Delderfield, I Dunlop, M Gamble, D Michaelides, S Owen, D Newman,
     S Sufi, and C Goble. 2013. Why linked data is not enough for scientists. Future
     Generation Computer Systems 29, 2 (2013), 599–611.
     https://doi.org/10.1016/j.future.2011.08.004 Special section: Recent advances in
     e-Science.
 [2] Danqi Chen and Christopher Manning. 2014. A Fast and Accurate Dependency Parser
     using Neural Networks. In Proceedings of the 2014 Conference on Empirical Methods in
     Natural Language Processing (EMNLP). Association for Computational Linguistics,
     Doha, Qatar, 740–750. http://www.aclweb.org/anthology/D14-1082
 [3] Ronald Denaux and Jose M Gomez-Perez. 2017. Towards a Vecsigrafo: Portable Semantics
     in Knowledge-based Text Analytics. In Proceedings of the 2017 workshop on Hybrid
     Statistical Semantic Understanding and Emerging Semantics (HSSUES). CEUR Workshop
     Proceedings, Held in Conjunction with the 16th International Semantic Web
     Conference, Vienna, Austria.
 [4] Jose Manuel Gomez-Perez, Michael Erdmann, Mark Greaves, and Oscar Corcho. 2013. A
     Formalism and Method for Representing and Reasoning with Process Models Authored by
     Subject Matter Experts. IEEE Trans. on Knowl. and Data Eng. 25, 9 (Sept. 2013),
     1933–1945. https://doi.org/10.1109/TKDE.2012.127
 [5] Jose Manuel Gomez-Perez, Michael Erdmann, Mark Greaves, Oscar Corcho, and V. Richard
     Benjamins. 2010. A framework and computer system for knowledge-level acquisition,
     representation, and reasoning with process knowledge. 68 (10 2010), 641–668.
 [6] Jose M Gomez-Perez, Andres Garcia-Silva, and Raul Palma. 2017. Towards a
     Human-Machine Scientific Partnership Based on Semantically Rich Research Objects. In
     eScience. IEEE Computer Society, 1–9.
 [7] David Gunning, Vinay K Chaudhri, Peter E Clark, Ken Barker, Shaw-Yi Chaw, Mark
     Greaves, Benjamin Grosof, Alice Leung, David D McDonald, Sunil Mishra, and Others.
     2010. Project Halo Update—Progress Toward Digital Aristotle. AI Magazine 31, 3
     (2010), 33–58.
 [8] Zellig S. Harris. 1981. Distributional Structure. Springer Netherlands, Dordrecht,
     3–22. https://doi.org/10.1007/978-94-009-8467-7_1
 [9] Hiroaki Kitano. 2016. Artificial Intelligence to Win the Nobel Prize and Beyond:
     Creating the Engine for Scientific Discovery. AI Magazine 37, 1 (2016), 39–49.
     http://www.aaai.org/ojs/index.php/aimagazine/article/view/2642
[10] Tomáš Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013.
     Distributed Representations of Words and Phrases and their Compositionality. In
     NIPS. https://doi.org/10.1162/jmlr.2003.3.4-5.951 arXiv:1310.4546
[11] Raul Palma, Piotr Hołubowicz, Oscar Corcho, Jose M Gomez-Perez, and Cezary Mazurek.
     2014. ROHub—A Digital Library of Research Objects Supporting Scientists Towards
     Reproducible Science. In Semantic Web Evaluation Challenge. Springer, 77–82.
[12] Magnus Sahlgren. 2008. The distributional hypothesis. Italian Journal of Linguistics
     20, 1 (2008), 33–54.
[13] J Zhao, JM Gomez-Perez, K Belhajjame, G Klyne, E García-Cuesta, A Garrido, KM Hettne,
     M Roos, D De Roure, and C Goble. 2012. Why workflows break - Understanding and
     combating decay in Taverna workflows. In eScience. IEEE Computer Society, 1–9.
     http://dblp.uni-trier.de/db/conf/eScience/eScience2012.html#ZhaoGBKGGHRRG12