<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>with Terms from Scientific Publications: Extended Abstract</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Umayer Reza</string-name>
          <email>a.reza@maine.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Torsten Hahmann</string-name>
          <email>torsten.hahmann@maine.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Named Entity Recognition, Cellulose Ontology, Knowledge Graph, Scientific Publication</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Computing and Information Science, University of Maine</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Cellulose is a highly versatile biopolymer with numerous applications, such as paper and paperboard production, textiles, packaging, biofuels, and biomedical applications. Though, the scattered nature of cellulose knowledge with ambiguous terms and datasets presents significant obstacles to its optimal utilization. This project seeks to address these challenges by systematically accumulating scattered knowledge about cellulose, enabling it to be modifiable, extensible, and reusable. The objective of the project is to develop an automated system to extract relevant cellulosic terms from scientific publications which will show an improved performance in named entity classification by taking additional context and disambiguous information from an existing cellulose ontology. An incremental training process will be utilized to train a ScispaCy language model, which is specifically designed for analyzing scientific, clinical, and biomedical texts, in order to accomplish this task. The system will also generate new terms for the ontology by taking the existing ontology into account. Therefore, the proposed system will facilitate the extension of the ontology, while simultaneously benefiting from the ontology to enhance performance in named entity classification. By meeting these objectives, the project aims to contribute to the development of a sustainable bioproduct-based society by providing a resource of state-of-the-art knowledge in cellulose materials that can facilitate material science research.</p>
      </abstract>
      <kwd-group>
        <kwd>Publications</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>Motivation</title>
      <p>
        nEvelop-O
LGOBE
(T. Hahmann)
Cellulose, the most abundant and versatile biopolymer on earth found in plant cells and some
bacteria, is the building block of cellulosic materials, which have many applications in various
domains because of their sustainable, renewable, and biodegradable nature. One of the most
significant applications of cellulosic materials is the production of paper and paperboard. The
unique dimensions and characteristics of cellulose nanofibrils (CNFs) make them crucial in
papermaking for enhancing the strength properties of paper [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In addition to their use in
paper production, several nanocelluloses (NCs) are alternatives for the textile industry because
of their higher mechanical resistance [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and as a substitute for petroleum-based packaging [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],
FOIS 2023 Early Career Symposium (ECS), held at FOIS 2023, co-located with 9th Joint Ontology Workshops (JOWO
CEUR
Workshop
Proceedings
      </p>
      <p>© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
and as a natural polymer with low toxicity, high crystallinity, biocompatibility, and biosafety
for biomedical applications [4].</p>
      <p>However, the knowledge on cellulose, much of which is stored only in scientific publication
and, even when available in digital formats like PDF or HTML, is not easily processed at a large
scale due to the scattered and ambiguous nature of text. Information extraction (IE) approaches
are needed to extract this information from scientific publications and make it accessible
in structured formats. By putting the information into an ontology, the knowledge can be
subsequently queried and reasoned with more eficiently. One of the key steps in IE is NER,
or Named Entity Recognition, which is the task of extracting nouns and noun chunks, called
entities or named entities, from text. Identification and classification of named entities is the
key part of the information extraction process. Though, the current NER approaches are limited
in their ability to recognize cellulosic named entities and to handle variations in the naming
conventions of them. ChemSpot [5] is a hybrid chemical named entity recognizer that utilized a
CRF (Conditional Random Field) model to identify chemical named entities in natural language
texts. In the biomedical domain, tmChem [6] employed a model combination approach using
two diferent CRF models to recognize chemical mentions, properties, and their relationships.
Akkasi et al. (2016) introduced ChemTok [7], a rule-based tokenizer specifically designed for
chemical named entity recognition. Swain et al. (2016) presented ChemDataExtractor [8], a
toolkit capable of extracting chemical entities along with their properties, measurements, and
relationships. Corbett and Boyle (2018) developed Chemlistem [9], a chemical named entity
recognizer based on recurrent neural networks. Although there are other methods available
for extracting material entities from text, they also struggle in recognizing the diverse range of
entities encountered in the cellulosic domain. Zhao et al. (2021) introduced a fine-tuned BERT
model [10] specifically designed for materials named entity recognition. Similarly, Miah and
Sulaiman (2023) proposed a deep neural network-based model [11] tailored for materials named
entity recognition. Shetty et al. (2023) presented an alternative approach [12] for extracting
material property data. Furthermore, Weston et al. (2019) presented a comprehensive approach
[13] that not only extracts material properties but also captures their applications and mentions
of inorganic materials.</p>
      <p>In order to comprehensively capture cellulosic knowledge, it is crucial to extract a wide range
of relevant entities beyond just chemicals and materials. This includes extracting properties
associated with materials and chemicals, manufacturing processes, as well as names of products
and equipment. Therefore, existing methods often fail to accurately identify a significant portion
of cellulosic entities due to their limited familiarity with cellulosic data. The ultimate goal of
the project is to contribute in growing an ontology-guided knowledge body about cellulose
by extracting relevant terms from the scientific literatures which are the preferred source of
knowledge. Initially, a manually made cellulose ontology will play a significant role in NER by
providing a structured representation of the knowledge and relationships among entities. It
can also assist to improve NER performance by providing additional context and disambiguous
information, as well as enabling more sophisticated reasoning and inference. Later on the
cellulose ontology itself will grow by adding newly identified cellulosic entities to speed up
ontology development. The manual amendment of ontologies can be a time-consuming and
costly process which limits their usefulness in practice. In contrast, an automated process can
help to overcome these limitations and enable more eficient and efective use of ontologies in
NER and other NLP tasks. Additionally, the automatic amendment of ontologies can help to
ensure that they are up-to-date and reflect the latest developments in the domain, but there is
little work in leveraging the synergies between NER and ontologies: in (1) utilizing ontologies
for NER and (2) using NER to amend and populate ontologies.</p>
      <p>In scientific domains where accurate organization of terms is important to avoid
misrepresentation of knowledge, relying solely on an automatic ontology construction method that builds
the ontology from scratch may not be efective. Instead, a semi-automated method, where
domain experts contribute initial concepts and relationships to establish a core domain ontology,
can be utilized to amend the ontology with additional terms. The purpose of incorporating an
ontology in the named entity recognition process is to improve its performance in the cellulosic
domain. This integration will enable the system to identify the named entities that align with
the concepts and relationships of the ontology and classify them accordingly.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Research Questions</title>
      <p>The proposed dissertation aims to leverage the synergies between ontologies and NER by
specifically addressing the following three research questions:
1. Under what conditions can pre-trained language models be incrementally re-trained for
improved named entity recognition of terms in the cellulosic domain?
2. How can cellulose-related terms that are identified by such NER approaches be categorized
more efectively and precisely by leveraging a small hand-curated ontology of cellulose
materials?
3. What methods are suitable for determining whether a particular identified term refers to
a concept that already exists in the ontology, or a new concept that requires amending
the ontology?</p>
    </sec>
    <sec id="sec-4">
      <title>3. Objective</title>
      <p>The objective of the project is to develop an automated system that will identify cellulosic terms
from given text with a higher accuracy. Additionally, the system aims to classify these identified
terms, as much as possible, with the most relevant concepts available in a cellulose ontology
which is currently being developed. To accomplish this, the proposed system will establish
internal communication with the ontology, enabling the identification and classification of new
cellulosic terms that are not yet incorporated in the ontology. The resulting set of new terms
will be shared with domain experts for review, allowing them to assess the relevance of those
terms and determine their appropriate placement within the taxonomy. If a new term is found
to carry a more refined semantic meaning than an existing term in the ontology, the ontology
will be amended accordingly. Furthermore, the identification and exclusion of irrelevant terms
in every NER process will accelerate the continual assessment process for recognized cellulosic
terms and their association with the ontology over time.</p>
    </sec>
    <sec id="sec-5">
      <title>4. Research Methodology</title>
      <p>A ScispaCy [14] language model will be selected considering its current performance on the
cellulosic data. This performance will be measured based on the number of correctly recognized
cellulosic terms from a curated corpus of evaluation data. The best performing model will go
through an incremental training process using spaCy [15] model training pipeline. In the first
phase of training, the selected model will be introduced with CHEMDNER corpus [16] which is
currently the largest corpus for chemical terms. Then the model will undergo training using
various corpus of training data, including materials, properties, processes, and more, across
diferent training phases. This iterative process will enable the model to progressively enhance
its understanding and knowledge on diverse terms of the cellulosic domain. After each phase
of training the performance of the model will be evaluated using standard metrics such as
precision, recall, and F1-score. To assess the improvement, a comparison will also be conducted
between the performance of the model in the current training phase and the performance of
the models in the previous training phases using an identical evaluation dataset.</p>
      <p>Finally, the improved language model will be employed to extract cellulosic named entities
from a vast collection of text documents. The extracted cellulosic terms will be comprehensively
compared to the existing terms in the ontology, generating a set of candidate terms to be
forwarded for further verification by domain experts. The domain experts will assess the
suitability of these terms and make informed decisions regarding their rejection or integration
into the ontology. This collaborative process will ensure that the ontology is enriched with
relevant and accurate terms, enhancing its overall efectiveness and comprehensiveness. The
efectiveness of the approach will be measured by calculating the percentage of terms accepted
by domain experts. This metric will provide valuable insights into the success and acceptance
of the proposed method in enriching the ontology with relevant and authoritative terms.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This research was supported in part by the U.S. Department of Agriculture:
• Forest Service, Project 20-JV-11111124-055
• USDA Agricultural Research Service (ARS), Project 0204-41510-001-98S
• National Institute of Food and Agriculture (NIFA), Award 2021-67022-34366
[4] S. Gopi, P. Balakrishnan, D. Chandradhara, D. Poovathankandy, S. Thomas, General
scenarios of cellulose and its use in the biomedical field, Materials Today Chemistry 13
(2019) 59–78. doi:10.1016/j.mtchem.2019.04.012.
[5] T. Rocktäschel, M. Weidlich, U. Leser, Chemspot: a hybrid system for chemical named entity
recognition, Bioinformatics 28 (2012) 1633–1640. doi:10.1093/bioinformatics/bts183.
[6] R. Leaman, C.-H. Wei, Z. Lu, tmchem: a high performance approach for chemical named
entity recognition and normalization, Journal of Cheminformatics 7 (2015) S3. doi:10.
1186/1758-2946-7-S1-S3.
[7] A. Akkasi, E. Varoğlu, N. Dimililer, Chemtok: A new rule based tokenizer for chemical
named entity recognition, BioMed Research International 2016 (2016) 1–9. doi:10.1155/
2016/4248026.
[8] M. C. Swain, J. M. Cole, Chemdataextractor: A toolkit for automated extraction of chemical
information from the scientific literature, Journal of Chemical Information and Modeling
56 (2016) 1894–1904. doi:10.1021/acs.jcim.6b00207.
[9] P. T. Corbett, J. Boyle, Chemlistem: chemical named entity recognition using recurrent
neural networks, Journal of Cheminformatics 10 (2018) 59. doi:10.1186/s13321-018-0313-8.
[10] X. Zhao, J. Greenberg, Y. An, X. T. Hu, Fine-tuning bert model for materials named entity
recognition, in: 2021 IEEE International Conference on Big Data (Big Data), IEEE, 2021,
pp. 3717–3720. doi:10.1109/BigData52589.2021.9671697.
[11] M. S. U. Miah, J. Sulaiman, Material named entity recognition (mner) for knowledge-driven
materials using deep learning approach, in: M. S. Kaiser, S. Waheed, A. Bandyopadhyay,
M. Mahmud, K. Ray (Eds.), Proceedings of the Fourth International Conference on Trends
in Computational and Cognitive Engineering, Springer Nature Singapore, Singapore, 2023,
pp. 199–208. doi:10.1007/978-981-19-9483-8_17.
[12] P. Shetty, A. C. Rajan, C. Kuenneth, S. Gupta, L. P. Panchumarti, L. Holm, C. Zhang,
R. Ramprasad, A general-purpose material property data extraction pipeline from large
polymer corpora using natural language processing, npj Computational Materials 9 (2023)
52. doi:10.1038/s41524-023-01003-w.
[13] L. Weston, V. Tshitoyan, J. Dagdelen, O. V. Kononova, A. Trewartha, K. A. Persson, G. Ceder,
A. Jain, Named entity recognition and normalization applied to large-scale information
extraction from the materials science literature, Journal of chemical information and
modeling 59 (2019) 3692–3702. doi:10.1021/acs.jcim.9b00470.
[14] M. Neumann, D. King, I. Beltagy, W. Ammar, Scispacy: Fast and robust models for
biomedical natural language processing, in: Proceedings of the 18th BioNLP Workshop
and Shared Task, Association for Computational Linguistics, Florence, Italy, 2019, pp.
319–327. doi:10.18653/v1/W19-5034.
[15] M. Honnibal, I. Montani, S. V. Landeghem, A. Boyd, spacy: Industrial-strength natural
language processing in python (2020). doi:10.5281/zenodo.1212303.
[16] M. Krallinger, O. Rabal, F. Leitner, M. Vázquez, D. Salgado, Z. Lu, R. Leaman, Y. Lu, D.-H. Ji,
D. M. Lowe, R. A. Sayle, R. T. Batista-Navarro, R. Rak, T. Huber, T. Rocktäschel, S. Matos,
D. Campos, B. Tang, H. Xu, T. Munkhdalai, K. H. Ryu, S. V. Ramanan, P. S. Nathan, S. Žitnik,
M. Bajec, L. Weber, M. Irmer, S. A. Akhondi, J. A. Kors, S. Xu, X. An, U. K. Sikdar, A. Ekbal,
M. Yoshioka, T. M. Dieb, M. Choi, K. M. Verspoor, M. Khabsa, C. L. Giles, H. Liu, K. E.
Ravikumar, A. Lamurias, F. M. Couto, H.-J. Dai, R. T.-H. Tsai, C. Ata, T. Can, A. Usie,
R. Alves, I. Segura-Bedmar, P. Martínez, J. Oyarzábal, A. Valencia, The chemdner corpus
of chemicals and drugs and its annotation principles, Journal of Cheminformatics 7 (2015)
S2. doi:10.1186/1758-2946-7-S1-S2.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Jele</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lekha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. B.</given-names>
            <surname>Sithole</surname>
          </string-name>
          ,
          <article-title>Role of cellulose nanofibrils in improving the strength properties of paper: a review</article-title>
          ,
          <source>Cellulose</source>
          <volume>29</volume>
          (
          <year>2021</year>
          )
          <fpage>55</fpage>
          -
          <lpage>81</lpage>
          . doi:
          <volume>10</volume>
          .1007/s10570- 021- 04294- 8.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Felgueiras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. G.</given-names>
            <surname>Azoia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M. A.</given-names>
            <surname>Gonçalves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Dourado</surname>
          </string-name>
          ,
          <article-title>Trends on the cellulose-based textiles: Raw materials and technologies</article-title>
          ,
          <source>Frontiers in Bioengineering and Biotechnology</source>
          <volume>9</volume>
          (
          <year>2021</year>
          ). doi:
          <volume>10</volume>
          .3389/fbioe.
          <year>2021</year>
          .
          <volume>608826</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Lutes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <article-title>Prospects for replacement of some plastics in packaging with lignocellulose materials: A brief review</article-title>
          ,
          <source>Bioresources</source>
          <volume>13</volume>
          (
          <year>2018</year>
          )
          <fpage>4550</fpage>
          -
          <lpage>4576</lpage>
          . doi:
          <volume>10</volume>
          .15376/biores.13.2.Su.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>