Revealing disease similarities by text mining Alberto Calderone1, Luana Licata1, Elisa Micarelli1, Livia Perfetto1, Gianni Cesareni1 1 Bioinformatics and Computational Biology Unit, Department of Biology, University of Rome ‘Tor Vergata’, Rome, 00133, Italy Abstract. Texts written in human language contain structured information that is not easily parsable by computers. Text mining relies on large text corpora to derive rules which can be used by automatic means to extract automatically such information. Scientific literature represents the main source of information to study any biological phenomenon. While some phenomenon are studied to the point that corpora can actually be build, scientific literature describing rare dis- eases is scarce implying an even bigger challenge for automatic approach- es. In order to tackle this problem the ELIXIR infrastructure is supporting various initiatives for data integration in different field of life sciences, in- cluding rare diseases, which will pave the way to the development of ded- icated pieces of software. In this work we present a tool which applies a text-mining strategy to mul- tiple text sets and merges individual results in order to infer not explicitly written connections. Keywords: Text mining, rare diseases, data integration. 1 Overview In order to get the current understanding of a research topic a scientist needs to read through scientific literature. Studying many articles is the key to making mental con- nections and come up with hypotheses that are not yet explicitly reported. This mental process is far from being a trivial task even for experts. Automatic text analyses can support and facilitate information extraction by speeding up tasks such as keywords identification. Text mining (TM) uses statistical and computational approaches to derive text pat- terns which can in turn extract useful information. Some TM tools aim at highlighting keywords or phrases usually relying on training corpora. While some topics are stud- ied to the point that corpora can actually be build, scientific literature describing rare diseases is scarce implying an even bigger challenge for automatic means. ELIXIR is an European initiative which sustains bioinformatics resources across member states. ELIXIR aims at making Europe’s science institutes and organizations come together under the same hood to manage the increasing amount of data being generated in the field of life science research. It supports various initiatives for data integration and dissemination which will pave the way to the development of dedicat- ed pieces of software. These data can be used to instruct machine learning systems in 2 conjunction with text mining tools in order to extract information from this scarcely explored field of life sciences. While keywords identification in texts is useful when reading articles to spot im- portant words and phrases, much information is usually scattered in many articles and possibly in articles from different domains or topics which can often only be derived by mental reasoning. To support mental reasoning and linking of idea, we developed a tool which aims at analyzing and integrating multiple articles in order to extract not explicitly written information. Publication abstracts and titles mentioning a disease are retrieved from the struc- tured data returned by PubMed. These documents are preprocessed to remove unnec- essary terms, lemmatized and tokenized. There exist several TM approaches to extract keywords from text. For instance, term-frequency, term frequency–inverse document frequency [1], Parts-Of-Speech (POS) tagging algorithms [2], and others. We are currently investigating the best ap- proach to extract gene names from scientific literature. At the moment, we are relying on a method based on statistics using exact word matching. Relevant terms are ranked according to a p-value calculated against a random set of articles and then compared versus a second query results, for instance, a second disease. The most relevant terms identified are represented as vectors of real values whose distances can be calculated as a representation of their semantic distance. Us- ing these distances we build a graph which links diseases according to their similari- ties (see Fig. 1). In particular, we applied two approaches, one based on MeSH terms [3] vocabulary and one on full text analysis. As a preliminary analysis we processed articles about diseases in general, including rare diseases. This preliminary analysis allowed us to cluster diseases according to similarities. In particular, rare diseases turn out to be associated with other well studied conditions triggering possible connections. Fig. 1. A small subgraph of the disease similarity graph built from literature using symptoms extracted from MeSH terms. Noonan, Costello and Leopard syndromes are connected as ex- pected. In particular, we reported the top-five diseases similar to Noonan Syndrome. 3 2 Future Developments Recently we published a database focused on diseases DISNOR [4] . Since one of our main goals is to increase the coverage on rare diseases, we are planning to integrate TM in our data duration pipeline. We plan to improve our TM software whose results will also benefit from two ELIX- IR case studies. 1) The dedicated APIs which are being developed by the ELIXIR nodes involved in the RD-connect [5] will contribute to set-up a common interface to different data sources about rare diseases. 2) The integration of various data sources will pave the way to the development of dedicated TM tools to extract information about rare diseases and eventual- ly to a more precise terms and concepts extraction. References 1. Rajaraman, A., Ullman, J.D.: Mining of Massive Datasets. Cambridge University Press, Cambridge (2011). 2. Brill, E., Eric: A simple rule-based part of speech tagger. In: Proceedings of the third conference on Applied natural language processing -. p. 152. Association for Computational Linguistics, Morristown, NJ, USA (1992). 3. ROGERS, F.B.: Medical subject headings. Bull. Med. Libr. Assoc. 51, 114–6 (1963). 4. Lo Surdo, P., Calderone, A., Iannuccelli, M., Licata, L., Peluso, D., Castagnoli, L., Cesareni, G., Perfetto, L.: DISNOR: a disease network open resource. Nucleic Acids Res. (2017). 5. Thompson, R., Johnston, L., Taruscio, D., Monaco, L., Béroud, C., Gut, I.G., Hansson, M.G., ’t Hoen, P.-B.A., Patrinos, G.P., Dawkins, H., Ensini, M., Zatloukal, K., Koubi, D., Heslop, E., Paschall, J.E., Posada, M., Robinson, P.N., Bushby, K., Lochmüller, H.: RD-Connect: An Integrated Platform Connecting Databases, Registries, Biobanks and Clinical Bioinformatics for Rare Disease Research. J. Gen. Intern. Med. 29, 780–787 (2014).