<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Recognition of Biodiversity-related Named Entities by Fine-tuning General-domain BERT-based Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Geilah T. Tabanao</string-name>
          <email>geilahtabanao67@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrew Miguel V. Pagdanganan</string-name>
          <email>avpagdanganan@up.edu.ph</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Riza Batista-Navarro</string-name>
          <email>riza.batista@manchester.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roselyn S. Gabud</string-name>
          <email>rsgabud@up.edu.ph</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>15th International SWAT4HCLS Conference</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science, University of Manchester</institution>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Department of Computer Science, University of the Philippines Diliman</institution>
          ,
          <addr-line>Quezon City</addr-line>
          ,
          <country country="PH">Philippines</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Institute of Computer Science, University of the Philippines Los Baños</institution>
          ,
          <addr-line>Laguna</addr-line>
          ,
          <country country="PH">Philippines</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <abstract>
        <p>Named Entity Recognition (NER) is crucial for various Natural Language Processing (NLP) tasks, including uncovering insights from vast textual datasets. We evaluated Bidirectional Encoder Representations from Transformers (BERT) models pre-trained on general data, fine-tuning them on the COPIOUS dataset for biodiversity NER. Our DeBERTa NER model, which achieved the best performance, was employed in a biodiversity Information Extraction pipeline, which was applied to the forestry compendium of the Centre for Agricultural and Biosciences International Digital Library. We demonstrate that the pipeline enables the enrichment of descriptive information on reproductive conditions and habitats of tree species.</p>
      </abstract>
      <kwd-group>
        <kwd>Named Entity Recognition</kwd>
        <kwd>Biodiversity</kwd>
        <kwd>Transformers</kwd>
        <kwd>Information Extraction</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <label>1</label>
      <title>Named Entity Recognition Models</title>
      <p>that this is the one dataset where pre-training a BERT model on domain-specific data did not
lead to any improved performance, thus prompting the question of whether other BERT-based
models could perform better, even when pre-trained on general-domain data only. Amongst
our fine-tuned models, DeBERTa obtained the best performance, with an F1-score of 84.18%.
This is notable, considering that this model was not pre-trained on domain-specific data.</p>
    </sec>
    <sec id="sec-2">
      <label>2</label>
      <title>Knowledge Graph Curation</title>
      <p>A popular application of NER is the extraction of fine-grained information from text, which
can then be leveraged to populate or curate structured databases. In this vein, we set out
to explore the extent to which an Information Extraction pipeline underpinned by NER and
relation extraction (RE) can curate a biodiversity-focused database, based on information buried
within textual descriptions of various tree species in the Centre for Agricultural and Biosciences
International (CABI) Digital Library. Specifically, we integrated our best-performing NER
model into the pipeline, and applied an existing RE model to extract information on the habitats
and reproductive conditions of species in the CABI Library forestry compendium.</p>
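      <p>Integrating a token-classification NER model into such a pipeline typically involves grouping its token-level predictions into entity mentions. The following is a minimal sketch of that step, assuming a BIO tagging scheme. The function name bio_to_spans, the label names and the example sentence are illustrative assumptions, not details reported in this paper.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from typing import List, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group token-level BIO tags into (mention_text, entity_type) pairs."""
    spans = []
    current_tokens: List[str] = []
    current_type = None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new mention begins; close any mention that is still open.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the open mention of the same type.
            current_tokens.append(token)
        else:
            # An "O" tag or an inconsistent I- tag closes any open mention.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


# Hypothetical model output for one sentence (labels are assumed names).
tokens = ["Flowering", "occurs", "in", "lowland", "forest",
          "from", "March", "to", "May", "."]
tags = ["O", "O", "O", "B-Habitat", "I-Habitat",
        "O", "B-TemporalExpression", "I-TemporalExpression",
        "I-TemporalExpression", "O"]
mentions = bio_to_spans(tokens, tags)
```
]]></preformat>
      <p>In this sketch, an I- tag that does not continue an open mention of the same type is discarded rather than repaired. This is one possible design choice, and other decoding strategies are equally defensible.</p>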
      <p>
        Taking a corpus of CABI textual descriptions, our pipeline: (1) applies NER to extract
mentions of geographic locations, habitats and temporal expressions; (2) applies RE to identify
related habitats and geographic locations (i.e., habitat-geographic location relations) and
related reproductive conditions and temporal expressions (i.e., reproductive condition-temporal
expression relations); and (3) populates a graph database to store the related entities, allowing
for querying and visualisation.
      </p>
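      <p>The three steps above can be sketched as follows. This is an illustrative outline under stated assumptions, not the pipeline itself: the relation step is replaced by a simple same-sentence pairing rule rather than a trained RE model, the graph database is replaced by a small in-memory store, and all names (RELATION_SCHEMA, extract_relations, TripleStore), entity labels, the ReproductiveCondition mention type and the example data are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import defaultdict
from itertools import product

# Relation types named in the text, as (head entity type, tail entity type).
# The entity type labels are assumed names.
RELATION_SCHEMA = {
    "habitat-geographic location": ("Habitat", "GeographicalLocation"),
    "reproductive condition-temporal expression": (
        "ReproductiveCondition", "TemporalExpression"),
}


def extract_relations(mentions):
    """Stand-in for the RE step: pair same-sentence mentions by type."""
    triples = []
    for relation, (head_type, tail_type) in RELATION_SCHEMA.items():
        heads = [text for text, etype in mentions if etype == head_type]
        tails = [text for text, etype in mentions if etype == tail_type]
        triples.extend((h, relation, t) for h, t in product(heads, tails))
    return triples


class TripleStore:
    """Tiny in-memory substitute for a graph database."""

    def __init__(self):
        self.edges = defaultdict(set)

    def add(self, species, triples):
        for head, relation, tail in triples:
            self.edges[species].add((head, relation, tail))

    def query(self, species, relation):
        return sorted((h, t) for h, r, t in self.edges[species] if r == relation)


# Step 1: hypothetical entity mentions for one sentence about one species.
mentions = [
    ("flowering", "ReproductiveCondition"),
    ("March to May", "TemporalExpression"),
    ("lowland forest", "Habitat"),
    ("Luzon", "GeographicalLocation"),
]

# Steps 2 and 3: relate the mentions, then store them for querying.
store = TripleStore()
store.add("Example species", extract_relations(mentions))
habitats = store.query("Example species", "habitat-geographic location")
```
]]></preformat>
      <p>Keying the store by species keeps the lookup of habitats or reproductive periods for a given tree species to a single query. A real graph database would offer the same access pattern together with visualisation support.</p>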
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</article-title>
          ,
          <year>2019</year>
          . URL: http://arxiv.org/abs/1810.04805. doi:10.48550/arXiv.1810.04805, arXiv:1810.04805 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>K. S.</given-names>
            <surname>Kalyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rajasekharan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sangeetha</surname>
          </string-name>
          ,
          <article-title>AMMUS: A Survey of Transformer-based Pretrained Models in Natural Language Processing</article-title>
          ,
          <year>2021</year>
          . URL: http://arxiv.org/abs/2108.05542. doi:10.48550/arXiv.2108.05542, arXiv:2108.05542 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N. T.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Gabud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ananiadou</surname>
          </string-name>
          ,
          <article-title>COPIOUS: A gold standard corpus of named entities towards extracting species occurrence from biodiversity literature</article-title>
          ,
          <source>Biodiversity Data Journal</source>
          <volume>7</volume>
          (
          <year>2019</year>
          )
          <elocation-id>e29626</elocation-id>
          . URL: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6351503/. doi:10.3897/BDJ.7.e29626.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N.</given-names>
            <surname>Abdelmageed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Löffler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>König-Ries</surname>
          </string-name>
          ,
          <article-title>BiodivBERT: a Pre-Trained Language Model for the Biodiversity Domain</article-title>
          , in:
          <string-name>
            <given-names>A.</given-names>
            <surname>Yamaguchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Splendiani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Marshall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Baker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. T.</given-names>
            <surname>Bolleman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Burger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. J.</given-names>
            <surname>Castro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Eigenbrod</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Österle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Romacker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Waagmeester</surname>
          </string-name>
          (Eds.),
          <source>14th International Conference on Semantic Web Applications and Tools for Health Care and Life Sciences (SWAT4HCLS 2023), Basel, Switzerland, February 13-16, 2023</source>
          , volume
          <volume>3415</volume>
          of CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2023</year>
          , pp.
          <fpage>62</fpage>
          -
          <lpage>71</lpage>
          . URL: https://ceur-ws.org/Vol-3415/paper-7.pdf.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>