<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Biomedical Question Answering using Extreme Multi-Label Classification and Ontologies in the Multilingual Panorama</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andre Neves</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andre Lamurias</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francisco M. Couto</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>LASIGE, Faculdade de Ciências, Universidade de Lisboa</institution>
          ,
          <country country="PT">Portugal</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <abstract>
        <p>Deep learning models achieve state-of-the-art results in Natural Language Processing (NLP) tasks, such as Question Answering (QA), across different domains, largely thanks to pre-trained language models such as BERT [1]. However, there is a lack of models designed for NLP tasks in the multilingual panorama, especially in specific domains such as the biomedical sciences, mostly due to the scarcity of datasets available in non-English languages. In this short paper, we propose the development of a QA system that uses state-of-the-art deep learning models and combines them with a deep learning Extreme Multi-Label Classification (XMLC) solution and ontologies, in order to improve the results achieved by the model. The proposed model shall be able to answer biomedical questions in English, Spanish and Portuguese.</p>
      </abstract>
      <kwd-group>
        <kwd>Deep Learning</kwd>
        <kwd>Question Answering</kwd>
        <kwd>Multilingual</kwd>
        <kwd>Extreme Multi-Label Classification</kwd>
        <kwd>Ontologies</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        QA aims at the automatic retrieval of the most relevant and informative answers to
questions posed by the user. For these answers to be precise, the system requires the
capability to understand and process natural language, which can be achieved through
deep learning techniques that automatically discover patterns in raw text given as
input [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>The choice of datasets and corpora is essential when training the model.
However, most deep learning models and corpora are designed exclusively for the English
language, making them difficult to apply to QA and other NLP tasks in other
languages. This can be even more significant when dealing with specific domains, such as
the biomedical sciences, where structured domain-specific corpora may not exist in
non-English languages.</p>
      <p>
        Although there are multilingual deep learning models such as BERT [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], and
biomedical models such as BioBERT [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and SciBERT [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], multilingual biomedical QA still remains a challenge.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Objectives</title>
      <p>
        However, it might be possible to increase the quality of the results in multilingual
question answering by combining it with deep learning XMLC. This technique assigns
multiple labels to a text, drawn from a set of candidate labels that can reach
thousands or even millions [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The labels can come from any language and any domain; thus, with the
right choice of labels, XMLC can be applied to multilingual biomedical QA.
We propose to develop a deep learning QA model designed for the biomedical and
multilingual panorama, namely for the Spanish, Portuguese and English languages.
      </p>
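      <p>As a minimal sketch of the XMLC idea described above, the following Python snippet assigns the highest-scoring labels from a candidate set to one document. The helper function and the toy label scores are hypothetical; in the proposed system the scores would come from a trained deep learning model over thousands of candidate labels:</p>
      <preformat>
```python
# Hypothetical sketch: assigning the top-k labels to one document from a
# candidate label set, as in extreme multi-label classification (XMLC).
# Scores are toy values; a real system would obtain them from a model.

def top_k_labels(scores, k=3):
    """Return the k highest-scoring candidate labels for one document."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [label for label, _ in ranked[:k]]

# Toy label scores for a single abstract (illustrative only).
scores = {
    "Humans": 0.92,
    "Neoplasms": 0.15,
    "Pandemics": 0.81,
    "Influenza, Human": 0.77,
    "Plant Roots": 0.02,
}

print(top_k_labels(scores, k=3))  # ['Humans', 'Pandemics', 'Influenza, Human']
```
      </preformat>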
      <p>Since XMLC algorithms can classify documents with labels related to their
contents, and the first step of a QA system is usually to retrieve documents
relevant to the question, the hypothesis is that, by using these labels, the QA
system will be able to give more accurate answers, since it has a set of multilingual
labels with which to identify the documents most related to the question. This label
classification can be adapted to the multilingual biomedical panorama thanks to the
biomedical terms chosen as labels, which can come from controlled vocabularies that
exist in multiple languages, such as DeCS (Descriptores en Ciencias de la Salud) or
ICD (International Classification of Diseases), in addition to the dataset and
pre-trained language model used to train the algorithm.</p>
      <p>An additional objective consists in incorporating biomedical ontologies into the
deep learning model, which is expected to improve the understanding of the context of
the lexical terms in the text, since ontologies provide a structured representation of the
knowledge about a given domain.</p>
    </sec>
    <sec id="sec-3">
      <title>Methods</title>
      <p>
        To achieve these objectives, X-BERT, a deep learning approach that scales the
BERT model to XMLC [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], will be adapted to the
multilingual biomedical panorama. Our approach will use the multilingual pre-trained
BERT model in combination with three datasets of scientific articles, one per
language (English, Spanish and Portuguese), retrieved from PubMed and the Virtual
Health Library databases.
      </p>
      <p>Then, since the algorithm uses labels to classify the data, one can use as labels the
DeCS terms, a hierarchy of terms developed from MeSH (Medical Subject Headings) to
label biomedical articles in Portuguese, Spanish and English. Since almost every DeCS
term has a corresponding MeSH term, it is possible to convert between MeSH and DeCS
terms and thus label multilingual data. For example, the MeSH term D006801, which
corresponds to the label “Humans”, corresponds to the DeCS term 21034, which has the
same label and a corresponding description in English, Portuguese and Spanish. Thanks
to this conversion between MeSH and DeCS terms, it is also possible to access the
relations between the terms and their ancestors, and even to incorporate ontologies in
this multilingual solution, as long as they have MeSH or DeCS terms associated.</p>
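      <p>The MeSH-to-DeCS conversion described above can be sketched as follows. The single mapping entry reproduces the example from the text (MeSH D006801 corresponds to DeCS 21034, “Humans”); the lookup table, the helper function and the Portuguese/Spanish labels are illustrative assumptions, not an existing API:</p>
      <preformat>
```python
# Illustrative sketch of a MeSH-to-DeCS mapping with multilingual labels.
# The one entry comes from the example in the text; the rest is hypothetical.

MESH_TO_DECS = {
    "D006801": {
        "decs_id": "21034",
        "labels": {"en": "Humans", "pt": "Humanos", "es": "Humanos"},
    },
}

def decs_label(mesh_id, language="en"):
    """Map a MeSH descriptor to its DeCS label in the requested language."""
    entry = MESH_TO_DECS.get(mesh_id)
    if entry is None:
        return None
    return entry["labels"].get(language)

print(decs_label("D006801", "pt"))  # Humanos
```
      </preformat>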
      <p>
        Finally, the QA part will use a deep learning model trained with pre-trained language
models for scientific text, such as BioBERT [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and SciBERT [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], along with a QA
dataset that gives scientific documents as answers to the questions. The adapted
X-BERT model will also be used to label the answers of the dataset beforehand. This
way, the QA model is expected to retrieve more accurate results.
      </p>
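      <p>One way the pre-assigned labels could support retrieval is sketched below: documents are ranked by how many labels they share with the labels predicted for the question. The helper, the document identifiers and the label sets are hypothetical; a deployed system would combine such label signals with the deep learning QA model rather than count overlaps alone:</p>
      <preformat>
```python
# Hypothetical sketch: ranking pre-labelled documents by label overlap
# with the labels predicted for a question. All names are invented.

def rank_by_label_overlap(question_labels, documents):
    """Rank document ids by the number of labels shared with the question."""
    q = set(question_labels)
    scored = [
        (len(q.intersection(labels)), doc_id)
        for doc_id, labels in documents.items()
    ]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]

# Toy pre-labelled documents (illustrative only).
docs = {
    "doc_en_1": {"Humans", "Pandemics"},
    "doc_pt_1": {"Plant Roots"},
    "doc_es_1": {"Humans"},
}

print(rank_by_label_overlap(["Humans", "Pandemics"], docs))
# ['doc_en_1', 'doc_es_1', 'doc_pt_1']
```
      </preformat>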
    </sec>
    <sec id="sec-4">
      <title>Discussion</title>
      <p>Deep learning QA models have achieved state-of-the-art results across different domains.
However, there is a lack of models designed for the multilingual panorama, especially
in specific domains such as the biomedical sciences. With this proposal, it is expected
that a reliable QA model can be developed for the biomedical and multilingual
panorama by combining question answering and XMLC.</p>
      <p>
        We have previously explored two types of English biomedical QA models. The first
consisted of classifying question pairs and question-answer pairs according to their
similarity [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], while the other consisted of retrieving documents relevant to each
question based on posts from a Q&amp;A website [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. We intend to adapt these two approaches
to the multilingual panorama and, together with the XMLC approach proposed herein,
consolidate them into a complete multilingual biomedical QA system.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
          </string-name>
          , M.-W.,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          (
          <year>2018</year>
          ).
          <article-title>BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</article-title>
          . Retrieved from http://arxiv.org/abs/1810.04805
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Lecun</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Hinton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          (
          <year>2015</year>
          ).
          <article-title>Deep learning</article-title>
          .
          <source>Nature</source>
          . https://doi.org/10.1038/nature14539
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yoon</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>So</surname>
            ,
            <given-names>C. H.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Kang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2019</year>
          ).
          <article-title>BioBERT: a pre-trained biomedical language representation model for biomedical text mining</article-title>
          .
          <source>Bioinformatics</source>
          . https://doi.org/10.1093/bioinformatics/btz682
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Beltagy</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lo</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Cohan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2019</year>
          ).
          <article-title>SciBERT: A Pretrained Language Model for Scientific Text</article-title>
          . Retrieved from http://arxiv.org/abs/1903.10676
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Bhatia</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jain</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kar</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Varma</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Jain</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          (
          <year>2015</year>
          ).
          <article-title>Sparse local embeddings for extreme multi-label classification</article-title>
          .
          <source>Advances in Neural Information Processing Systems</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>W. C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>H. F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhong</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Dhillon</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          (
          <year>2019</year>
          ).
          <article-title>X-BERT: eXtreme Multi-label Text Classification with BERT</article-title>
          . Retrieved from http://arxiv.org/abs/1905.02331
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Lamurias</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Couto</surname>
            ,
            <given-names>F. M.</given-names>
          </string-name>
          (
          <year>2019</year>
          ). LasigeBioTM at MEDIQA 2019:
          <article-title>Biomedical Question Answering using Bidirectional Transformers and Named Entity Recognition</article-title>
          . https://doi.org/10.18653/v1/w19-5057
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Lamurias</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sousa</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Couto</surname>
            ,
            <given-names>F. M.</given-names>
          </string-name>
          (
          <year>2020</year>
          ).
          <article-title>Generating Scientific Question Answering Corpora from Q&amp;A forums</article-title>
          . Retrieved from https://arxiv.org/abs/2002.02375
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>