<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Revealing disease similarities by text mining</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alberto Calderone</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luana Licata</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elisa Micarelli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Livia Perfetto</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gianni Cesareni</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Bioinformatics and Computational Biology Unit, Department of Biology, University of Rome 'Tor Vergata'</institution>
          ,
          <addr-line>Rome, 00133</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Texts written in human language contain structured information that is not easily parsed by computers. Text mining relies on large text corpora to derive rules that can be used to extract such information automatically. Scientific literature is the main source of information for studying any biological phenomenon. While some phenomena are studied to the point that corpora can actually be built, scientific literature describing rare diseases is scarce, which poses an even bigger challenge for automatic approaches. To tackle this problem, the ELIXIR infrastructure is supporting various initiatives for data integration in different fields of life sciences, including rare diseases, which will pave the way to the development of dedicated pieces of software. In this work we present a tool that applies a text-mining strategy to multiple text sets and merges the individual results in order to infer connections that are not explicitly written.</p>
      </abstract>
      <kwd-group>
        <kwd>Text mining</kwd>
        <kwd>rare diseases</kwd>
        <kwd>data integration</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>To grasp the current understanding of a research topic, a scientist needs to read
through the scientific literature. Studying many articles is the key to making mental
connections and coming up with hypotheses that are not yet explicitly reported. This mental
process is far from trivial, even for experts. Automatic text analyses can
support and facilitate information extraction by speeding up tasks such as keyword
identification.</p>
      <p>Text mining (TM) uses statistical and computational approaches to derive text
patterns, which can in turn be used to extract useful information. Some TM tools aim at
highlighting keywords or phrases, usually relying on training corpora. While some topics are
studied to the point that corpora can actually be built, scientific literature describing rare
diseases is scarce, which poses an even bigger challenge for automatic approaches.</p>
      <p>ELIXIR is a European initiative that sustains bioinformatics resources across
member states. ELIXIR aims to bring Europe’s science institutes and organizations
together under the same roof to manage the increasing amount of data being
generated in life science research. It supports various initiatives for data
integration and dissemination, which will pave the way to the development of
dedicated pieces of software. These data can be used to train machine learning systems in
conjunction with text mining tools in order to extract information from this scarcely
explored field of life sciences.</p>
      <p>While keyword identification is useful for spotting important words and phrases
when reading articles, much information is scattered across many articles,
possibly from different domains or topics, and can often only be derived
by mental reasoning. To support mental reasoning and the linking of ideas, we developed a
tool that analyzes and integrates multiple articles in order to extract information that is
not explicitly written.</p>
      <p>Publication abstracts and titles mentioning a disease are retrieved from the
structured data returned by PubMed. These documents are preprocessed to remove
unnecessary terms, lemmatized and tokenized.</p>
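      <p>The preprocessing step can be illustrated with a minimal sketch. It assumes the
titles and abstracts are already available as plain strings. The stop-word list and the
suffix-stripping function are hypothetical stand-ins introduced here for illustration; in
particular, the toy suffix rule is not a real lemmatizer and is not the method used by the
tool.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import re

# Hypothetical, deliberately tiny stop-word list (assumption, not from the paper).
STOP_WORDS = {"a", "an", "and", "are", "by", "in", "is", "of", "the", "to", "with"}


def naive_lemma(token: str) -> str:
    """Toy stand-in for lemmatization: strip a few common English plural suffixes."""
    for suffix, replacement in (("ies", "y"), ("es", "e"), ("s", "")):
        # Only strip when at least three characters of the stem remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: len(token) - len(suffix)] + replacement
    return token


def preprocess(text: str) -> list[str]:
    """Lowercase, tokenize, drop stop words, and reduce tokens to a crude base form."""
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    return [naive_lemma(t) for t in tokens if t not in STOP_WORDS]


if __name__ == "__main__":
    print(preprocess("Mutations in the DMD gene are associated with muscular dystrophies."))
```
]]></preformat>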
      <p>There are several TM approaches to extract keywords from text, for instance
term frequency, term frequency–inverse document frequency (TF-IDF) [1], and
part-of-speech (POS) tagging algorithms [2]. We are currently investigating the best
approach to extract gene names from scientific literature. At the moment, we rely
on a statistical method based on exact word matching.</p>
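      <p>As an illustration of two of the approaches mentioned above, the following sketch
computes TF-IDF weights for tokenized documents and counts exact matches against a
vocabulary of gene symbols. The function names and the example vocabulary are assumptions
made for this sketch, which uses the plain logarithmic form of the inverse document
frequency; it is not necessarily the weighting used by the tool.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math
from collections import Counter


def tf_idf(documents: list[list[str]]) -> list[dict[str, float]]:
    """Return one {term: tf-idf weight} mapping per tokenized document."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            # Relative term frequency times log of inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


def count_exact_matches(tokens: list[str], vocabulary: set[str]) -> Counter:
    """Count tokens that exactly match an entry of the vocabulary (e.g. gene symbols)."""
    return Counter(t for t in tokens if t in vocabulary)


if __name__ == "__main__":
    docs = [["dmd", "gene", "muscle"], ["gene", "heart"]]
    print(tf_idf(docs))
    print(count_exact_matches(docs[0], {"dmd", "ttn"}))
```
]]></preformat>
      <p>A term that occurs in every document, such as "gene" in the example, receives a
weight of zero, which is why TF-IDF favours terms that are specific to a subset of
articles.</p>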
      <p>Relevant terms are ranked according to a p-value calculated against a random set
of articles and then compared with the results of a second query, for instance, a second
disease. The most relevant terms identified are represented as vectors of real values
whose distances can be calculated as a representation of their semantic distance.
Using these distances we build a graph that links diseases according to their
similarities (see Fig. 1).</p>
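      <p>The ranking and linking steps can be sketched as follows. The text does not state
which statistical test or which distance measure is used, so this sketch assumes an
upper-tail hypergeometric test for term enrichment and the cosine distance between sparse
term vectors. All function names, the distance threshold, and the example data are
hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


def enrichment_p_value(k: int, n: int, K: int, N: int) -> float:
    """Upper-tail hypergeometric p-value P(X >= k).

    N: articles in total (disease set plus random background)
    K: articles in total that contain the term
    n: articles in the disease set
    k: articles in the disease set that contain the term
    """
    total = math.comb(N, n)
    upper = min(n, K)
    return sum(math.comb(K, i) * math.comb(N - K, n - i) for i in range(k, upper + 1)) / total


def cosine_distance(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine distance (1 - cosine similarity) between two sparse term vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # Treat an empty vector as maximally distant.
    return 1.0 - dot / (norm_u * norm_v)


def similarity_graph(vectors: dict[str, dict[str, float]], max_distance: float):
    """Return edges (disease_a, disease_b, distance) for pairs within max_distance."""
    names = sorted(vectors)
    edges = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            d = cosine_distance(vectors[a], vectors[b])
            if d <= max_distance:
                edges.append((a, b, d))
    return edges


if __name__ == "__main__":
    vectors = {"A": {"g1": 1.0}, "B": {"g1": 2.0}, "C": {"g2": 1.0}}
    print(enrichment_p_value(2, 2, 2, 4))
    print(similarity_graph(vectors, max_distance=0.5))
```
]]></preformat>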
      <p>In particular, we applied two approaches, one based on the MeSH terms [3] vocabulary
and one on full-text analysis. As a preliminary analysis we processed articles about
diseases in general, including rare diseases. This analysis allowed us to
cluster diseases according to their similarities. Notably, rare diseases turn out to be
associated with other well-studied conditions, suggesting possible connections.
We recently published DISNOR [4], a database focused on diseases. Since one of our
main goals is to increase the coverage of rare diseases, we are planning to integrate
TM into our data curation pipeline.</p>
      <p>We plan to improve our TM software, whose results will also benefit from two
ELIXIR case studies.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Rajaraman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ullman</surname>
            ,
            <given-names>J.D.</given-names>
          </string-name>
          : Mining of Massive Datasets. Cambridge University Press, Cambridge (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Brill</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>A simple rule-based part of speech tagger</article-title>
          .
          <source>In: Proceedings of the third conference on Applied natural language processing</source>
          . p.
          <fpage>152</fpage>
          . Association for Computational Linguistics, Morristown, NJ, USA (
          <year>1992</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Rogers</surname>
            ,
            <given-names>F.B.</given-names>
          </string-name>
          :
          <article-title>Medical subject headings</article-title>
          .
          <source>Bull. Med. Libr. Assoc.</source>
          <volume>51</volume>
          ,
          <fpage>114</fpage>
          -
          <lpage>116</lpage>
          (
          <year>1963</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Lo Surdo</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Calderone</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Iannuccelli</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Licata</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peluso</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Castagnoli</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cesareni</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perfetto</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>DISNOR: a disease network open resource</article-title>
          .
          <source>Nucleic Acids Res</source>
          . (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Thompson</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Johnston</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Taruscio</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Monaco</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Béroud</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gut</surname>
            ,
            <given-names>I.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hansson</surname>
            ,
            <given-names>M.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>'t Hoen</surname>
            ,
            <given-names>P.-B.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Patrinos</surname>
            ,
            <given-names>G.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dawkins</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ensini</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zatloukal</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koubi</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heslop</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paschall</surname>
            ,
            <given-names>J.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Posada</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Robinson</surname>
            ,
            <given-names>P.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bushby</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lochmüller</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>RD-Connect: An Integrated Platform Connecting Databases, Registries, Biobanks and Clinical Bioinformatics for Rare Disease Research</article-title>
          .
          <source>J. Gen. Intern. Med</source>
          .
          <volume>29</volume>
          ,
          <fpage>780</fpage>
          -
          <lpage>787</lpage>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>