A newborn development insights mining and recommendation system from scientific literature and clinical guidelines*

Sergio Consoli (1, 2), Kees Wouters (1), Renée Otte (1), and Adrienne Heinrich (1)

(1) Philips Research, High Tech Campus 34, 5656 AE Eindhoven, The Netherlands.
(2) European Commission, Joint Research Centre, Directorate A-Strategy, Work Programme and Resources, Scientific Development Unit, Via E. Fermi 2749, I-21027 Ispra (VA), Italy.
sergio.consoli@ec.europa.eu

Abstract. In this short contribution we describe a method that is able to automatically retrieve relevant newborn development content from scientific documents, including scientific papers and standard guidelines, and then automatically recommend to parents the most relevant pieces of personal advice, also known as insights, from the extracted knowledge. The approach cannot replace specialist advice; rather, it provides quick information from reliable sources with a certain degree of specificity for the parents and the child. The system builds on recent technological developments in big data, knowledge engineering, and cognitive computing, in particular related to the task of extracting relations between conceptual entities in the data sources.

Keywords: Generation and aggregation of health semantics; Ontologies; Recommendations for health data; Information Extraction; Insights.

1 Introduction

The availability of abundant computing and storage resources, combined with the evolution of analytics, has made it affordable to use cognitive computing technology to deliver industrial solutions of all kinds [11]. Cognitive computing systems depend on various aspects of artificial intelligence (AI), such as machine learning, reasoning, natural language processing, speech and vision, human-computer interaction, dialogue and narrative generation, and more. The machine learning algorithms learn and acquire knowledge from the massive amount of data fed into them [10, 11].
Nowadays there is a lot of interest in adopting cognitive computing technologies in healthcare (footnote 3), which is particularly characterized by a vast amount of data coming from different sources [7, 4]. Through the application of natural language processing (NLP), data mining, and advanced text analytics, cognitive systems can assist doctors in diagnosing and faster decision making [19, 4]. They optimize patient selection for clinical trials through intelligent matching. In oncology, these systems can assist in the creation of individualized treatment plans that enhance patient trust and experience [7, 19, 4]. The idea is that a machine can process more information than a doctor and potentially discover links and patterns not immediately visible at a first glance or that would require a complete overview of all possible interventions. An example of this is Watson (footnote 4), the popular question-answering (Q&A) machine by IBM, which has been recently employed to provide diagnosis and treatments to cancer patients, enabling faster and better care for patients (footnote 5) [10]. It can analyse the meaning and context of structured and unstructured data coming from a variety of inputs including handwritten documents [6, 8], and derive data from various sources including curated literature and rationales, as well as medical journals and textbooks (footnote 6).

* Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Footnote 3: https://www2.deloitte.com/us/en/pages/deloitte-analytics/articles/cognitive-technology-for-health-care.html

Currently, there is an increasing number of new data-driven solutions in the market which provide pregnant women and new parents with suggestions and personal advice, also known as insights, which address their needs and wishes, on the basis of behavioural and contextual data, scientific literature and clinical guidelines.
It is important to underline that insights cannot replace specialist advice; rather, they provide quick information from reliable sources with a certain degree of specificity for the parents and the child. More precisely, insights refer to small pieces of text that are suggested to parents on the basis of a technical rule [16]. This rule analyses parent-tracked data and the available scientific literature and guidelines, and defines when an insight text is presented to the user. For example, if a mother is keeping track of her newborn’s breastfeeds, an insight could be the following: “Recently, your tracked breast feeds with Sara have taken around 14 minutes. In general, newborns feed for 10-30 minutes at a time – occasionally even longer” [16]. In this way, insights provide personal advice, tips, and information tailored to the unique situation of the parent and the newborn [16].

Currently these insights and articles are selected manually by curators, who need to grasp and exploit a large number of scientific documents and select from them the most relevant content as candidate insights or articles for users. However, manual generation of insights takes time and a lot of effort, because all the scientific content needs to be read and the most relevant insights or articles extracted. In addition, an optimal selection is hard for a human to establish manually when relying only on intuition, which in most cases leads to non-optimal decisions. Moreover, linking all the relationships among the pregnancy or newborn concepts reported in the scientific documents still remains a challenge.

Footnote 4: https://www.ibm.com/watson/
Footnote 5: http://pulse.embs.org/may-2017/cognitive-computing-and-the-future-of-health-care/
Footnote 6: https://mihin.org/wp-content/uploads/2015/06/The-Impact-of-Cognitive-Computing-on-Healthcare-Final-Version-for-Handout.pdf
The aim of the system proposed in this contribution is to support the process of insight or article generation by automatically recommending a set of the most important facts coming from the scientific literature and clinical guidelines given as input. The system leverages state-of-the-art technologies in Cognitive Computing, Natural Language Processing (NLP), Ontology Engineering, and Big Data, producing an advanced AI system for semantically mining information from a repository of scientific documents in the pregnancy and newborn development domains.

2 Description of the method

The algorithm leverages the recent AI developments in cognitive computing and applies them to automatically mining information from scientific literature and guidelines. In this way, new knowledge on the pregnancy or newborn development domains can be extracted, automatically providing applications with a recommendation of the most relevant insights or articles to give to users in a personalized way. The schematic workflow of the system is depicted in Figure 1.

Fig. 1: Pipeline of the system illustrating its main elements.

The system is able to parse, extract, transform and load the unstructured information coming from clinical guidelines and scientific papers. It is then able to structure the free-format text using machine reading from natural language processing for extracting RDF (Resource Description Framework) / OWL (Web Ontology Language) graphs that are linked to the Linked Open Data cloud and compliant with Semantic Web and Linked Data patterns [14]. In this way the information is translated into machine-readable semantic information in RDF/OWL format, which is a W3C standard for exchanging semantic data. Machine reading is typically much less accurate than human reading, but it can process massive amounts of text in reasonable time, can detect regularities hardly noticeable by humans, and its results can be reused by machines for applied tasks [13].
The system recognizes and resolves named entities, links them to the existing knowledge base, and gives them a type by using different cognitive computing functionalities [1, 13, 14, 15, 18]: frame detection, topic extraction, named entity recognition, resolution and co-reference, terminology extraction, sense tagging, word-sense disambiguation, taxonomy induction, semantic-role labelling, and type induction. In this way the algorithm recognizes the main entities and concepts and, most importantly, it performs relation extraction to derive the main relationships among them [17]. The main focus is indeed to extract the relationships among the obtained conceptual entities and link them together to allow interoperability among the information contained in the scientific documents.

By using named entity recognition (NER) and resolution (a.k.a. entity linking) with standard biomedical ontologies in the pregnancy or newborn development domains, the algorithm restricts the extraction of structured information to these specific domains. Biomedical ontologies ensure both syntactic and semantic interoperability among all heterogeneous data coming into the system [9].

The extracted data populate the pregnancy and newborn development knowledge-base, which is updated periodically by the system with new, updated information coming from its input sources. Each concept in the system is linked to other concepts in the knowledge-base by using ontologies, which allows interoperability. The stored relationships may include, for example, sensitive user status/conditions (e.g. pregnant, parent, infant, etc.) and diseases (asthma, allergies, etc.) linked to other conditions and information. The knowledge-base is constantly maintained, updated, and integrated in the ontology model.

The knowledge-base contains the semantic model with the updated information coming from the scientific literature and clinical guidelines.
In order to recommend the most relevant insights, a ranking is produced as follows:

1. Consider each sentence with all its extracted relationships. Sum the absolute frequency scores, over all documents, associated with these relationships. The result is the score of the sentence.
2. Rank the extracted sentences with respect to their scores and identify the insight(s) having the largest score.
3. Put this set of sentence(s) in the list of the insights to recommend.
4. Iteratively exclude from the remaining sentences the relationships that were already considered in the previous insight selections, and recalculate the scores of the sentences accordingly.
5. Re-order the sentences with respect to the recalculated scores, and identify the insight(s) having the largest recalculated score.
6. Go to Step 3, and continue until no sentences remain.
7. The output is the final list of the insights to recommend.

The output of the system is a document with the list of top recommended insights, which needs to be checked and validated by a curator; the next step would be to automatically provide those top insights directly to end users. The iterative recalculation of the ranking is aimed at increasing the diversification of the top insights that are finally recommended.

Based on a specific user profile, the system may also be able to personalize the insights that are recommended (footnote 9). In this way the algorithm could provide relevant insights to a pregnant woman or parent, linking validated information on pregnancy and newborn development to user-specific conditions. By storing the specific insights in the system for later usage, the method would be able to reuse this information in, and share it with, different products of the user.
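The ranking steps listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the prototype: the function name `rank_insights`, the representation of a relationship as a (subject, relation, object) tuple, the dictionary of absolute frequencies, and the alphabetical ordering of tied sentences are all hypothetical choices made for the example.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical representation: a relationship is a (subject, relation, object) tuple.
Relation = Tuple[str, str, str]


def rank_insights(
    sentence_relations: Dict[str, Set[Relation]],
    frequency: Dict[Relation, int],
) -> List[Tuple[str, int]]:
    """Return (sentence, score) pairs in recommendation order.

    sentence_relations maps each sentence to its extracted relationships;
    frequency maps each relationship to its absolute frequency over all documents.
    """
    remaining = {s: set(rels) for s, rels in sentence_relations.items()}
    covered: Set[Relation] = set()  # relationships already used by selected insights
    ranked: List[Tuple[str, int]] = []

    while remaining:
        # Steps 1 and 4: score each sentence, ignoring relationships already covered.
        scores = {
            s: sum(frequency.get(r, 0) for r in rels - covered)
            for s, rels in remaining.items()
        }
        best = max(scores.values())
        # Steps 2 and 5: take the sentence(s) with the largest score
        # (ties are sorted alphabetically, an assumption made for determinism).
        winners = sorted(s for s, score in scores.items() if score == best)
        # Step 3: add them to the recommendation list and mark their relationships.
        for s in winners:
            ranked.append((s, best))
            covered |= remaining.pop(s)
    # Step 7: the final list of insights to recommend.
    return ranked
```

With three hypothetical sentences whose relationships carry the frequencies 8, 6 and 2 from Table 1, a sentence that shares its most frequent relationship with the top insight drops from a raw score of 10 to a recalculated score of 2, which is the diversification effect the recalculation is meant to achieve.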
The standardized RDF/OWL format of the information produced in the knowledge-base guarantees standard communication among the different products, overcoming any incompatibility and interoperability issues.

Footnote 9: Functionality not yet implemented, under current development.

Summarizing, the described system is able to identify novel insights that go beyond clinical guidelines, and provide relationships which can help parents and pregnant women.

3 A controlled experiment

To illustrate the system in detail on an example business scenario, a controlled prototype experiment on relation extraction in the pregnancy and newborn development domains was carried out, using a small set of scientific papers and clinical guidelines as input (18 in total). In this experiment the system produced 38878 relationships among the concepts extracted from the 18 documents in the input repository.

Table 1 shows some examples of the extracted relationships among some concepts and their frequencies. Each entry in the subject and object columns represents the label associated with the related concept from a standard biomedical ontology. For example, “Child” is the label of the concept http://purl.bioontology.org/ontology/HL7/C0008059 belonging to the Health Level 7 (HL7) ontology. Similarly, “Asthma” is the label of the concept http://purl.bioontology.org/ontology/MESH/D001249 from the Medical Subject Headings (MeSH) ontology. Each concept in the ontology is uniquely identified by its corresponding URI, which conveys other undirected information coming from the ontology, relations to other concepts and its position in the hierarchy, and, most importantly, allows disambiguation among terms and linking concepts together to ensure syntactic as well as semantic interoperability.
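The role of URIs described above can be illustrated with a short sketch: once subjects and objects are resolved to ontology URIs, counting the absolute frequency of a relationship over all documents reduces to counting identical triples. The two concept URIs are the ones quoted in the text; the predicate URI under `http://example.org/`, the function names, and the N-Triples rendering are assumptions introduced only for illustration, since the paper does not specify them.

```python
from collections import Counter
from typing import Iterable, Tuple

Triple = Tuple[str, str, str]  # (subject URI, predicate URI, object URI)

# Concept URIs quoted in the text.
CHILD = "http://purl.bioontology.org/ontology/HL7/C0008059"
ASTHMA = "http://purl.bioontology.org/ontology/MESH/D001249"
# Hypothetical predicate URI: the paper does not give URIs for relations.
HAS_ATTRIBUTE = "http://example.org/relation/hasAttribute"


def count_relations(per_document: Iterable[Iterable[Triple]]) -> Counter:
    """Count how often each URI triple occurs over all documents."""
    counts: Counter = Counter()
    for triples in per_document:
        counts.update(triples)
    return counts


def to_ntriples(triple: Triple) -> str:
    """Render one triple of URIs as a line in N-Triples syntax."""
    subject, predicate, obj = triple
    return f"<{subject}> <{predicate}> <{obj}> ."
```

Because the key is the URI rather than the surface label, two documents that mention the same concept with different wording would still contribute to the same count, provided entity linking resolves both mentions to the same URI.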
Table 1: Example of extracted relationships

subject               relation         object            frequency
“Infant”              “hasAttribute”   “Infection”       8
“Infant”              “have”           “Breastfeeding”   6
“Child”               “hasAttribute”   “Fever”           5
“Child”               “hasAttribute”   “Asthma”          5
“Physician”           “cure”           “Allergy”         5
“Night”               “related”        “Sleep”           4
“Milk”                “related”        “Metabolism”      4
“Parent”              “encourage”      “Infant”          4
“Individual”          “do”             “Feed”            3
“Mother”              “express”        “Milk”            3
“Breastfeeding”       “protect”        “Disease”         2
“Immunoglobulin E”    “protect”        “Asthma”          2
...                   ...              ...               ...

Figure 2 shows some relationships in the prototype example among the “Human milk” concept (http://purl.jp/bio/4/id/200906042293515949) and some of the other concepts extracted from the input data sources related to the pregnancy and newborn development domains of interest.

As an example of the process of extracting relationships from the input source, consider the following sentence: “Colostrum contains low concentrations of both lactose and fat in comparison to mature milk”. The concepts detailed in Table 2 are extracted and related to each other.

Table 2: Example of concepts and relations extracted for the sentence: “Colostrum contains low concentrations of both lactose and fat in comparison to mature milk”.

subject                     relation     object
“Colostrum”                 “contain”    “Decreased concentration”
“Colostrum”                 “contain”    “Lactose”
“Decreased concentration”   “related”    “Lactose”
...                         ...          ...

In particular, the reported concepts stand for:
– “Colostrum” is the concept http://purl.obolibrary.org/obo/UBERON_0001914 from the Uber-anatomy ontology (UBERON);
– “Lactose” is the concept http://purl.bioontology.org/ontology/HL7/C1696723 from the Health Level 7 (HL7) ontology;
– “Decreased concentration” is the concept http://purl.obolibrary.org/obo/PATO_0001163 from the Phenotype And Trait Ontology (PATO).

Fig. 2: View of the relationships among the “Human milk” concept (http://purl.jp/bio/4/id/200906042293515949) and other concepts extracted from the input data sources related to the pregnancy domain.

The relationships can also be explored visually through interactive chord diagrams [12, 2]. A chord diagram is a graphical method of displaying the inter-relationships between data in a matrix. The data is arranged radially around a circle, with the relationships between the points typically drawn as arcs connecting the data together. When a specific concept is selected interactively, only its relationships are visualized in the diagram, helping users to grasp and understand more intuitively the inter-relationships among the different entities. The format of chord diagrams is aesthetically pleasing, making it a popular choice in the world of data visualization [12, 3, 5]. For example, Figure 3 shows the relationships of the extracted concept “Breast”, i.e. concept http://purl.obolibrary.org/obo/UBERON_0000310 from the UBERON ontology, with the other concepts.

Fig. 3: Chord visualization of the relationships of the “Breast” (http://purl.obolibrary.org/obo/UBERON_0000310) concept with the other extracted concepts.

In the controlled prototype experiment with the 18 scientific papers and clinical guidelines as input, a total of 2573 different sentences were extracted. Figure 4 shows a chart with the ranked sentences extracted from the input pregnancy and newborn development repository for the controlled experiment.

If in Figure 4 we apply a threshold to cut the tail of the curve so as to retain 95% of the area under the curve (AUC) of the chart (corresponding to a minimum scoring threshold equal to 10), we can reduce the total number of recommended insights to 1523. If we aim for a stricter selection and cut at a score of 100, only the 53 most relevant insights remain to look at.
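The tail cut described above can be sketched as follows. This is one possible reading, stated as an assumption: the area under the ranked curve is approximated by the plain sum of the sentence scores, and the threshold is the score of the last sentence needed to reach the requested fraction of that sum. The function names `cutoff_for_area` and `keep_at_or_above` are hypothetical, and the scores in the usage note are made up, not the experimental data.

```python
from typing import List, Sequence


def cutoff_for_area(scores: Sequence[float], fraction: float = 0.95) -> float:
    """Return the score threshold that retains `fraction` of the total score mass."""
    ordered = sorted(scores, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("scores must contain at least one positive value")
    running = 0.0
    for score in ordered:
        running += score
        # Stop at the first sentence where the retained area reaches the target.
        if running >= fraction * total:
            return score
    return ordered[-1]


def keep_at_or_above(scores: Sequence[float], threshold: float) -> List[float]:
    """Keep only the scores that are not below the threshold."""
    return [s for s in scores if s >= threshold]
```

For the made-up scores 100, 50, 30, 12, 5 and 3 (total 200), the first four sentences already account for 192 of the total, so the cut falls at a score of 12 and the two lowest-scoring sentences are dropped.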
For example, the following are the top-ranked insights:

“Demonstrating the bioactivity of breast milk, a study on shed epithelial cells in the faeces of infants has shown that gene expression in the neonatal gastrointestinal tract is influenced by breastfeeding, with differential expression found between formula fed and breast fed infants in genes regulating intestinal cell proliferation, differentiation and barrier function.”

“Protein concentration is highest in breast milk of mothers aged 20-30, however, maternal age does not seem to influence either lipid or lactose concentrations, and maternal age does not have a large impact on breast milk composition.”

“Infants and young children have a higher resting metabolic rate and rate of oxygen consumption per unit body weight than adults because they have a larger surface area per unit body weight and because they are growing rapidly.”

“The total protein content of human breast milk consists of 13% casein, the lowest casein concentration of any studied species, corresponding to the slow growth rate of human infants.”

“Exposure to tobacco smoke in utero was associated with an increased risk of stillbirth (odds ratio = 2.0, 95% confidence interval: 1.4, 2.9), and infant mortality was almost doubled in children born to women who had smoked during pregnancy compared with children of nonsmokers (odds ratio = 1.8, 95% confidence interval: 1.3, 2.6).”

“Human milk oligosaccharides (HMO) also make up a significant fraction of breast milk carbohydrate, but are indigestible by the infant, their function instead is to nourish the gastrointestinal microbiota.”

Fig. 4: Chart diagram showing the ranked sentences extracted from the input pregnancy repository for the controlled experiment.

A preferred, ideal deployment of the system, not yet implemented but at a prototype stage, is shown in Figure 5. It schematically depicts a recommendation device (e.g. a smartphone) comprising one or more communication units and a user interface, and a processing unit embedded in a remote server that controls the suggestion of the most relevant insights to the device. The processing unit comprises a cognitive system, of which the pipeline is described in Figure 5, and is connected to the recommendation device via the communication unit. In various embodiments, the cognitive system in the processing unit may be connected to different input data sources, including, but not limited to, a repository of scientific papers and clinical guidelines.

The output information interface could be any device able to provide useful insights to a pregnant woman or a parent, for example a smartphone hosting an appropriate application. It might comprise a user interface for data input/output and a data display. Through the user interface, users might enter their profile details, e.g. sex, age, any health conditions (like allergies, asthma, etc.), and others, and then save these details for later usage. The recommendation device sends the user details to the processing unit via the communication unit. Alternatively, the system may also include automatic trackers, i.e. either devices embedded into the main system that track and store users’ activities automatically, or special tracking devices that are external to the main system but directly linked to it.

The remote server is connected to the knowledge-base, which is a semantic triplestore containing the semantic information in the W3C standard format RDF/OWL, as described previously, and enabling semantic interoperability, reasoning and inferencing. It contains the relationships among conceptual pregnancy and newborn development actors involved in specific situations, including health conditions and sensitivity information (manually set or derived by the system).
The cognitive system in the processing unit receives the user profile details and combines this information with the information in the knowledge-base. The system uses this knowledge to finally provide personalized insights to the recommendation device, where they are displayed to the user.

Fig. 5: Schematic illustration showing the ideal deployment of the system.

4 Conclusion

This short contribution has described a method that is able to automatically retrieve relevant clinical content on newborn development from scientific papers and standard guidelines. The system may be used to feed a connected application in real time, providing pregnant women and new parents with feedback and suggestions of useful insights derived from scientific literature and clinical guidelines. This method has the potential to be used in real pregnancy or newborn development recommendation systems. In addition, the method may be generalizable and applicable to other domains after choosing the relevant ontologies and information sources.

References

[1] S. Consoli and D. Reforgiato Recupero. Using FRED for named entity resolution, linking and typing for knowledge base population. In Gandon F., Cabrio E., Stankovic M., and Zimmermann A., editors, Communications in Computer and Information Science, volume 548, pages 40–50. Springer-Verlag, New York, 2015.
[2] S. Consoli and N.I. Stilianakis. A quartet method based on variable neighborhood search for biomedical literature extraction and clustering. International Transactions in Operational Research, 24(3):537–558, 2017.
[3] S. Consoli, K. Darby-Dowman, G. Geleijnse, J. Korst, and S. Pauws. Heuristic approaches for the quartet method of hierarchical clustering. IEEE Transactions on Knowledge and Data Engineering, 22(10):1428–1443, 2010.
[4] S. Consoli, D. Reforgiato Recupero, and M. Petkovic, editors. Data Science for Healthcare: Methodologies and Applications. Springer Nature, 2019.
[5] S. Consoli, J. Korst, S. Pauws, and G. Geleijnse. Improved metaheuristics for the quartet method of hierarchical clustering. Journal of Global Optimization, 78(2):241–270, 2020.
[6] D. Dessì, G. Fenu, D. Reforgiato Recupero, and S. Consoli. Exploration of IBM Watson for healthcare applications. Technical report, Philips Research Europe Technical Note, PR-TN 2017/00115, Eindhoven, The Netherlands, 2017.
[7] D. Dessì, D. Reforgiato Recupero, G. Fenu, and S. Consoli. Exploiting cognitive computing and frame semantic features for biomedical document clustering. In CEUR Workshop Proceedings, volume 1948, pages 20–34, 2017.
[8] D. Dessì, D. Reforgiato Recupero, G. Fenu, and S. Consoli. A recommender system of medical reports leveraging cognitive computing and frame semantics. Intelligent Systems Reference Library, 149:7–30, 2019.
[9] A. Gangemi. Ontology design patterns for Semantic Web content. In Lecture Notes in Computer Science, volume 3729, pages 262–276, 2005.
[10] R.E. Gantenbein. Watson, come here! The role of intelligent systems in health care. In World Automation Congress Proceedings, art num 6935748, pages 165–168, 2014.
[11] J.O. Gutierrez-Garcia and E. López-Neri. Cognitive computing: A brief survey and open research challenges. In 3rd International Conference on Applied Computing and Information Technology and 2nd International Conference on Computational Science and Intelligence (ACIT-CSI), art num 7336083, pages 328–333, 2015.
[12] D. Holten. Hierarchical edge bundles: Visualization of adjacency relations in hierarchical data. IEEE Transactions on Visualization and Computer Graphics, 12(5):741–748, 2006.
[13] M. Mongiovì, D. Reforgiato Recupero, A. Gangemi, V. Presutti, A.G. Nuzzolese, and S. Consoli. Semantic reconciliation of knowledge extracted from text through a novel machine reader. In Proceedings of the 8th International Conference on Knowledge Capture, K-CAP 2015, 2015.
[14] M. Mongiovì, D. Reforgiato Recupero, A. Gangemi, V. Presutti, and S. Consoli. Merging open knowledge extracted from text with MERGILO. Knowledge-Based Systems, 108:155–167, 2016.
[15] A.G. Nuzzolese, A. Gangemi, and V. Presutti. Gathering lexical linked data and knowledge patterns from FrameNet. In KCAP 2011 - Proceedings of the 2011 Knowledge Capture Conference, pages 41–48, 2011.
[16] R.A. Otte, A.J.E. van Beukering, and L.-M. Boelens-Brockhuis. Tracker-based personal advice to support the baby’s healthy development in a novel parenting app: Data-driven innovation. JMIR mHealth and uHealth, 7(7, art num e12666), 2019.
[17] V. Presutti, A.G. Nuzzolese, S. Consoli, A. Gangemi, and D. Reforgiato Recupero. From hyperlinks to Semantic Web properties using Open Knowledge Extraction. Semantic Web, 7(4):351–378, 2016.
[18] D. Reforgiato Recupero, A.G. Nuzzolese, S. Consoli, V. Presutti, S. Peroni, and M. Mongiovì. Extracting knowledge from text using SHELDON, a semantic holistic framEwork for LinkeD ONtology data. In WWW 2015 Companion - Proceedings of the 24th International Conference on World Wide Web, pages 235–238, 2015.
[19] M. Van Hartskamp, S. Consoli, W. Verhaegh, M. Petkovic, and A. Van De Stolpe. Artificial intelligence in clinical health care applications: Viewpoint. Journal of Medical Internet Research, 21(4), 2019.