<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multilabel-classification task for Medline abstracts</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nelson Quiñones</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cesar Canales</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Javier Torres</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dietrich Rebholz-Schuhmann</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leyla Jael Castro</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrés Aristizábal</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Leibniz University of Hannover</institution>
          ,
          <addr-line>Welfengarten 1, Hannover, 30167</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universidad ICESI</institution>
          ,
          <addr-line>CL 18 122-135, Cali, 760031</addr-line>
          ,
          <country country="CO">Colombia</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Cologne</institution>
          ,
          <addr-line>Albertus-Magnus-Platz, Cologne, 50923</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>ZB MED Information Centre for Life Sciences</institution>
          ,
          <addr-line>Gleueler Str 60, Cologne, 50931</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>s, particularly from Medline. Here we propose a multilabel-classification approach to assign major topics to biomedical literature with the purpose of later applying transfer learning to cover conference papers and preprints, as well as the agricultural domain. In this short paper, we present some preliminary results.</p>
      </abstract>
      <kwd-group>
        <kwd>Multilabel-classification</kwd>
        <kwd>literature categorization</kwd>
        <kwd>MeSH topics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Materials and Methods</title>
      <p>
        We worked with the titles, abstracts, and MeSH descriptors of a subset of the PubMed Central
Open Access [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] articles, retrieved with the Biopython library [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. From the initial set of 7.4 million
abstracts published from 2015 to 2022, we retained only the 2.8 million for which all three
elements, i.e., abstract, title, and MeSH descriptors, were available in machine-processable form. The data were
further cleaned and transformed to create word embeddings. We then translated the MeSH terms to
UMLS Semantic Types (STYs) to (i) reduce the number of prediction classes, from 348,860 in MeSH to 127 in UMLS
STY, and (ii) prioritize those types that could be more meaningful to biomedical researchers. The
dataset creation process took 2 days on an AMD Ryzen 5 3400G. Our method consists of
fine-tuning transformer models.
      </p>
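      <p>The filtering and label-translation steps described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the MESH_TO_STY lookup, the record fields (title, abstract, mesh), and the helper names are hypothetical, and the toy mapping covers only a handful of descriptors rather than the full MeSH-to-STY translation.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from typing import Dict, List

# Hypothetical MeSH-descriptor -> UMLS Semantic Type (STY) lookup. A real
# mapping would be derived from UMLS resources and is not reproduced here.
MESH_TO_STY: Dict[str, List[str]] = {
    "Neoplasms": ["Neoplastic Process"],
    "Aspirin": ["Organic Chemical", "Pharmacologic Substance"],
    "Humans": ["Human"],
}

# Fixed label order. The paper reports 127 STYs; this toy list has 4.
STY_LABELS: List[str] = sorted({s for stys in MESH_TO_STY.values() for s in stys})


def is_complete(record: Dict[str, object]) -> bool:
    """Keep only records with a title, an abstract, and MeSH descriptors."""
    return all(bool(record.get(key)) for key in ("title", "abstract", "mesh"))


def encode_labels(mesh_terms: List[str]) -> List[int]:
    """Translate MeSH descriptors to STYs and return a multi-hot vector."""
    active = {sty for term in mesh_terms for sty in MESH_TO_STY.get(term, [])}
    return [1 if label in active else 0 for label in STY_LABELS]


records = [
    {"title": "Aspirin in cancer", "abstract": "We study aspirin use.",
     "mesh": ["Aspirin", "Neoplasms"]},
    {"title": "No abstract here", "abstract": "", "mesh": ["Humans"]},
]

# Incomplete records are dropped; the rest become (text, multi-hot label) pairs.
dataset = [
    {"text": f"{r['title']} {r['abstract']}", "labels": encode_labels(r["mesh"])}
    for r in records
    if is_complete(r)
]
```
]]></preformat>
      <p>Each retained record ends up with one binary entry per STY, which is the target format a multilabel classification head typically expects.</p>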
      <p>
        First, we initialized the models with pre-trained parameters, and then we fine-tuned these
parameters using labeled data from the downstream tasks. A new layer on top of the base model
captures the knowledge contained in our dataset. Preliminary exploration was done on the Hugging Face
platform. We then performed hyperparameter optimization with an algorithm called Hyperband [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] to
find the best configuration for training the final model. We tracked several metrics: Hamming
score, accuracy score, macro F1, micro F1, and Hamming loss. Our main metric for guiding the
hyperparameter optimization process was micro F1, because our corpus exhibits a high class
imbalance. The Mobster algorithm (available in the Syne Tune library) was run on over 10% of the
articles in the corpus to find the best hyperparameter configuration. We
allowed the algorithm to run for four days on a virtual machine in the de.NBI Cloud platform,
equipped with an RTX6000 GPU and 128 GB of RAM. The optimal configuration was then used to train our
model on our dataset. In addition, we created a proof-of-concept web application<sup>1</sup> that uses the model
to display the predicted STYs for a given PubMed identifier. We used vanilla JS for the web application
and uploaded the model to Hugging Face<sup>2</sup>.
      </p>
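      <p>To make the tracked metrics concrete, the following sketch computes them for a toy set of multi-hot predictions. The definitions are assumptions, since the text does not spell them out: accuracy is taken as exact-match (subset) accuracy, and the Hamming score as the per-sample ratio of correctly predicted labels to the union of true and predicted labels. All function names are illustrative.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[int]]  # rows = samples, columns = labels (0 or 1)


def hamming_loss(y_true: Matrix, y_pred: Matrix) -> float:
    """Fraction of individual label decisions that are wrong."""
    total = sum(len(row) for row in y_true)
    wrong = sum(t != p for rt, rp in zip(y_true, y_pred) for t, p in zip(rt, rp))
    return wrong / total


def subset_accuracy(y_true: Matrix, y_pred: Matrix) -> float:
    """Fraction of samples whose full label vector is predicted exactly."""
    return sum(list(rt) == list(rp) for rt, rp in zip(y_true, y_pred)) / len(y_true)


def hamming_score(y_true: Matrix, y_pred: Matrix) -> float:
    """Mean per-sample |true AND pred| / |true OR pred| (assumed definition)."""
    scores: List[float] = []
    for rt, rp in zip(y_true, y_pred):
        inter = sum(1 for t, p in zip(rt, rp) if t and p)
        union = sum(1 for t, p in zip(rt, rp) if t or p)
        scores.append(1.0 if union == 0 else inter / union)
    return sum(scores) / len(scores)


def _f1(tp: int, fp: int, fn: int) -> float:
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


def f1_micro_macro(y_true: Matrix, y_pred: Matrix) -> Tuple[float, float]:
    """Micro F1 pools counts over all labels; macro F1 averages per-label F1."""
    n_labels = len(y_true[0])
    counts = [[0, 0, 0] for _ in range(n_labels)]  # [tp, fp, fn] per label
    for rt, rp in zip(y_true, y_pred):
        for j, (t, p) in enumerate(zip(rt, rp)):
            if t and p:
                counts[j][0] += 1
            elif p:
                counts[j][1] += 1
            elif t:
                counts[j][2] += 1
    micro = _f1(sum(c[0] for c in counts), sum(c[1] for c in counts),
                sum(c[2] for c in counts))
    macro = sum(_f1(*c) for c in counts) / n_labels
    return micro, macro


# Toy data: label 0 is frequent, labels 1 and 2 are rare.
y_true = [[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 0, 0]]
y_pred = [[1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 0, 1]]

micro_f1, macro_f1 = f1_micro_macro(y_true, y_pred)
```
]]></preformat>
      <p>In this toy case the frequent label is always predicted correctly while both rare labels are missed, so micro F1 (8/11) stays well above macro F1 (1/3). This illustrates how the two averages diverge on an imbalanced label set.</p>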
    </sec>
    <sec id="sec-3">
      <title>3. Results and Discussion</title>
      <p>The hyperparameters explored in the optimization process were the following: learning rate (LR) in the range
[5e-6, 1e-4], dropout rate (DR) in [0, 1], model selection among [biobert-v1.1,
distilbert-base-uncased, scibert_scivocab_uncased, Bio_ClinicalBERT, bert-base-uncased], the
maximum length of input tokens (L) in [100, 512], batch size (BS) in [4, 64], and the
number of threads (NT) used for processing, between 1 and 8. The best-performing configuration had an LR
of 2.0e-05, a DR of 0.0, the scibert_scivocab_uncased model, an L of 403, a BS of 23, and an
NT of 5. After training this model on the training dataset, we obtained the
following results on the validation dataset: a micro F1 of 0.489, an accuracy score of 0.196, a
macro F1 of 0.416, a Hamming score of 0.389, and a Hamming loss of 0.016. Although these scores are
not high, our approach can still be further developed and improved. Multi-class
and multi-label classification with the number of classes, i.e., STY labels, that we are dealing with does not
commonly reach scores as high as those seen in binary classification. Still, pre-trained models open new
possibilities for this sort of task.</p>
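      <p>The search space listed above can be written down as a simple sampler, shown here with plain successive halving, the resource-allocation routine that Hyperband builds on. This is a hedged sketch only: it does not use Syne Tune or Mobster, the log-uniform sampling of the learning rate is an assumption, and toy_objective is a made-up stand-in for the validation micro F1 of a real fine-tuning run.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math
import random
from typing import Any, Callable, Dict, List

MODELS = [
    "biobert-v1.1",
    "distilbert-base-uncased",
    "scibert_scivocab_uncased",
    "Bio_ClinicalBERT",
    "bert-base-uncased",
]

Config = Dict[str, Any]


def sample_config(rng: random.Random) -> Config:
    """Draw one configuration from the ranges reported in the text."""
    return {
        # Assumption: learning rate sampled log-uniformly in [5e-6, 1e-4].
        "lr": math.exp(rng.uniform(math.log(5e-6), math.log(1e-4))),
        "dropout": rng.uniform(0.0, 1.0),
        "model": rng.choice(MODELS),
        "max_length": rng.randint(100, 512),
        "batch_size": rng.randint(4, 64),
        "num_threads": rng.randint(1, 8),
    }


def successive_halving(
    configs: List[Config],
    evaluate: Callable[[Config, int], float],
    min_budget: int = 1,
    eta: int = 3,
) -> Config:
    """Keep the top 1/eta configurations each round, with eta times more budget."""
    survivors = list(configs)
    budget = min_budget
    while len(survivors) > 1:
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        survivors = ranked[: max(1, len(ranked) // eta)]
        budget *= eta
    return survivors[0]


def toy_objective(config: Config, budget: int) -> float:
    """Made-up score that improves with budget; NOT a real training run."""
    penalty = abs(math.log10(config["lr"]) - math.log10(2e-5)) + config["dropout"]
    return (1.0 - 1.0 / (1 + budget)) / (1.0 + penalty)


rng = random.Random(0)
candidates = [sample_config(rng) for _ in range(27)]  # 27 -> 9 -> 3 -> 1
best = successive_halving(candidates, toy_objective)
```
]]></preformat>
      <p>In practice, evaluate would fine-tune the sampled model for the given budget and return the validation micro F1. Hyperband additionally runs several such brackets with different trade-offs between the number of configurations and the budget per configuration.</p>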
    </sec>
    <sec id="sec-4">
      <title>4. Acknowledgements</title>
      <p>This work was partially supported by the BMBF-funded de.NBI Cloud within the German Network
for Bioinformatics Infrastructure (de.NBI) (031A532B, 031A533A, 031A533B, 031A534A,
031A535A, 031A537A, 031A537B, 031A537C, 031A537D, 031A538A).</p>
      <p><sup>1</sup> https://github.com/zbmed-semtec/topic-categorization-system</p>
      <p><sup>2</sup> https://wandb.ai/javtor/huggingface and https://huggingface.co/datasets/Javtor/biomedical-topic-categorization</p>
    </sec>
    <sec id="sec-5">
      <title>5. References</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Dhammi</surname>
            <given-names>IK</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kumar</surname>
            <given-names>S.</given-names>
          </string-name>
          <article-title>Medical subject headings (MeSH) terms</article-title>
          .
          <source>Indian J Orthop</source>
          .
          <year>2014</year>
          Sep;
          <volume>48</volume>
          (
          <issue>5</issue>
          ):
          <fpage>443</fpage>
          -
          <lpage>4</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.4103/0019-5413.139827</pub-id>
          . PMID:
          <pub-id pub-id-type="pmid">25298548</pub-id>
          ; PMCID:
          <pub-id pub-id-type="pmcid">PMC4175855</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <publisher-name>National Library of Medicine (US)</publisher-name>
          ;
          <year>2009</year>
          Sep-. Available from: https://www.ncbi.nlm.nih.gov/books/NBK9676/
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <source>PMC Open Access Subset</source>
          [Internet].
          <publisher-loc>Bethesda (MD)</publisher-loc>
          :
          <publisher-name>National Library of Medicine</publisher-name>
          .
          <year>2003</year>
          - [cited 2022 11 20]. Available from: https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Cock</surname>
            <given-names>PJA</given-names>
          </string-name>
          , et al.
          <article-title>Biopython: freely available Python tools for computational molecular biology and bioinformatics</article-title>
          .
          <source>Bioinformatics</source>
          .
          <year>2009</year>
          Jun 1;
          <volume>25</volume>
          (
          <issue>11</issue>
          ):
          <fpage>1422</fpage>
          -
          <lpage>3</lpage>
          . https://doi.org/10.1093/bioinformatics/btp163. PMID:
          <pub-id pub-id-type="pmid">19304878</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Lisha</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Kevin</given-names>
            <surname>Jamieson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Giulia</given-names>
            <surname>DeSalvo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Afshin</given-names>
            <surname>Rostamizadeh</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ameet</given-names>
            <surname>Talwalkar</surname>
          </string-name>
          .
          <article-title>Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization</article-title>
          .
          <source>arXiv</source>
          , June 18,
          <year>2018</year>
          . https://doi.org/10.48550/arXiv.1603.06560.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>