Populating and Refining an Ontology of Cellulose Materials with Terms from Scientific
Publications: Extended Abstract

Umayer Reza, Torsten Hahmann
School of Computing and Information Science, University of Maine


Abstract
Cellulose is a highly versatile biopolymer with numerous applications, such as paper and
paperboard production, textiles, packaging, biofuels, and biomedical applications. However,
cellulose knowledge is scattered across sources that use ambiguous terms and datasets, which
presents significant obstacles to its optimal utilization. This project seeks to address these
challenges by systematically accumulating scattered knowledge about cellulose in a form that is
modifiable, extensible, and reusable. The objective of the project is to develop an automated
system that extracts relevant cellulosic terms from scientific publications and improves named
entity classification by drawing additional context and disambiguating information from an
existing cellulose ontology. To accomplish this task, an incremental training process will be
used to train a ScispaCy language model, which is specifically designed for analyzing
scientific, clinical, and biomedical texts. The system will also generate new terms for the
ontology by taking the existing ontology into account. The proposed system will thus facilitate
the extension of the ontology while simultaneously benefiting from the ontology to enhance
performance in named entity classification. By meeting these objectives, the project aims to
contribute to the development of a sustainable bioproduct-based society by providing a resource
of state-of-the-art knowledge in cellulose materials that can facilitate materials science
research.

Keywords
Named Entity Recognition, Cellulose Ontology, Knowledge Graph, Scientific Publication


FOIS 2023 Early Career Symposium (ECS), held at FOIS 2023, co-located with 9th Joint Ontology
Workshops (JOWO 2023), 19-20 July, 2023, Sherbrooke, Québec, Canada
Email: a.reza@maine.edu (U. Reza); torsten.hahmann@maine.edu (T. Hahmann)
URL: https://umaine.edu/scis/people/rezaumayer (U. Reza);
https://umaine.edu/scis/people/torsten-hahmann (T. Hahmann)
ORCID: 0000-0003-4013-3513 (U. Reza); 0000-0002-5331-5052 (T. Hahmann)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073


1. Motivation
Cellulose, the most abundant and versatile biopolymer on Earth, is found in plant cells and some
bacteria. It is the building block of cellulosic materials, which have many applications in
various domains because of their sustainable, renewable, and biodegradable nature. One of the
most significant applications of cellulosic materials is the production of paper and paperboard.
The unique dimensions and characteristics of cellulose nanofibrils (CNFs) make them crucial in
papermaking for enhancing the strength properties of paper [1]. In addition to their use in
paper production, several nanocelluloses (NCs) serve as alternatives in the textile industry
because of their higher mechanical resistance [2], as substitutes for petroleum-based packaging
[3], and, as natural polymers with low toxicity, high crystallinity, biocompatibility, and
biosafety, as materials for biomedical applications [4].
   However, much of the knowledge on cellulose is stored only in scientific publications and,
even when available in digital formats like PDF or HTML, is not easily processed at a large
scale due to the scattered and ambiguous nature of text. Information extraction (IE) approaches
are needed to extract this information from scientific publications and make it accessible
in structured formats. By putting the information into an ontology, the knowledge can
subsequently be queried and reasoned with more efficiently. One of the key steps in IE is Named
Entity Recognition (NER), which is the task of identifying and classifying nouns and noun
chunks, called entities or named entities, in text. However, current NER approaches are limited
in their ability to recognize cellulosic named entities and to handle variations in their naming
conventions. ChemSpot [5] is a hybrid chemical named entity recognizer that uses a
CRF (Conditional Random Field) model to identify chemical named entities in natural language
texts. In the biomedical domain, tmChem [6] employs a model combination approach using
two different CRF models to recognize chemical mentions, properties, and their relationships.
Akkasi et al. (2016) introduced ChemTok [7], a rule-based tokenizer specifically designed for
chemical named entity recognition. Swain et al. (2016) presented ChemDataExtractor [8], a
toolkit capable of extracting chemical entities along with their properties, measurements, and
relationships. Corbett and Boyle (2018) developed Chemlistem [9], a chemical named entity
recognizer based on recurrent neural networks. Other methods extract material entities from
text, but they also struggle to recognize the diverse range of entities encountered in the
cellulosic domain. Zhao et al. (2021) introduced a fine-tuned BERT
model [10] specifically designed for materials named entity recognition. Similarly, Miah and
Sulaiman (2023) proposed a deep neural network-based model [11] tailored for materials named
entity recognition. Shetty et al. (2023) presented an alternative approach [12] for extracting
material property data. Furthermore, Weston et al. (2019) presented a comprehensive approach
[13] that not only extracts material properties but also captures their applications and mentions
of inorganic materials.
   In order to comprehensively capture cellulosic knowledge, it is crucial to extract a wide range
of relevant entities beyond just chemicals and materials. This includes properties associated
with materials and chemicals, manufacturing processes, and names of products and equipment.
Because existing methods have had limited exposure to cellulosic data, they often fail to
accurately identify a significant portion of cellulosic entities. The ultimate goal of
the project is to contribute to growing an ontology-guided body of knowledge about cellulose
by extracting relevant terms from the scientific literature, which is the preferred source of
knowledge. Initially, a hand-curated cellulose ontology will play a significant role in NER by
providing a structured representation of the knowledge and relationships among entities. It
can also help improve NER performance by providing additional context and disambiguating
information, as well as enabling more sophisticated reasoning and inference. Later, the
cellulose ontology itself will grow as newly identified cellulosic entities are added, which
speeds up ontology development. The manual amendment of ontologies can be a time-consuming and
costly process, which limits their usefulness in practice. In contrast, an automated process can
help to overcome these limitations and enable more efficient and effective use of ontologies in
NER and other NLP tasks. Additionally, the automatic amendment of ontologies can help to
ensure that they are up to date and reflect the latest developments in the domain. However,
there is little work on leveraging the synergies between NER and ontologies in both directions:
(1) utilizing ontologies for NER and (2) using NER to amend and populate ontologies.
   In scientific domains where accurate organization of terms is important to avoid
misrepresentation of knowledge, relying solely on an automatic ontology construction method
that builds the ontology from scratch may not be effective. Instead, a semi-automated method can
be used, in which domain experts contribute initial concepts and relationships to establish a
core domain ontology that is then amended with additional terms. The purpose of incorporating an
ontology in the named entity recognition process is to improve its performance in the cellulosic
domain. This integration will enable the system to identify the named entities that align with
the concepts and relationships of the ontology and classify them accordingly.


2. Research Questions
The proposed dissertation aims to leverage the synergies between ontologies and NER by
specifically addressing the following three research questions:
   1. Under what conditions can pre-trained language models be incrementally re-trained for
      improved named entity recognition of terms in the cellulosic domain?
   2. How can cellulose-related terms that are identified by such NER approaches be categorized
      more effectively and precisely by leveraging a small hand-curated ontology of cellulose
      materials?
   3. What methods are suitable for determining whether a particular identified term refers to
      a concept that already exists in the ontology, or a new concept that requires amending
      the ontology?


3. Objective
The objective of the project is to develop an automated system that identifies cellulosic terms
in a given text with higher accuracy than existing NER approaches. Additionally, the system aims
to classify these identified terms, as far as possible, with the most relevant concepts
available in a cellulose ontology that is currently being developed. To accomplish this, the
proposed system will interface directly with the ontology, enabling the identification and
classification of new cellulosic terms that are not yet incorporated in the ontology. The
resulting set of new terms will be shared with domain experts for review, allowing them to
assess the relevance of those terms and determine their appropriate placement within the
taxonomy. If a new term is found to carry a more refined semantic meaning than an existing term
in the ontology, the ontology will be amended accordingly. Furthermore, the identification and
exclusion of irrelevant terms in every NER run will accelerate the continual assessment of
recognized cellulosic terms and their association with the ontology over time.
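
One possible way to surface terms that may refine an existing concept is a simple head-noun
heuristic, sketched below. This is an illustrative assumption rather than the method proposed
here: the function name, the concept identifiers, and the sample labels are all hypothetical,
and the heuristic only suggests a parent concept for expert review.

```python
from typing import Dict, Optional


def suggest_parent(term: str, concept_labels: Dict[str, str]) -> Optional[str]:
    """Return the concept whose label is the longest word-level suffix of the term, if any."""
    words = term.lower().split()
    # Try progressively shorter suffixes, skipping the full term itself.
    for start in range(1, len(words)):
        suffix = " ".join(words[start:])
        if suffix in concept_labels:
            return concept_labels[suffix]
    return None


# Hypothetical ontology labels (lowercased) mapped to hypothetical concept identifiers.
concept_labels = {"cellulose": "Cellulose", "pulp": "Pulp", "kraft pulp": "KraftPulp"}

suggestions = {
    term: suggest_parent(term, concept_labels)
    for term in ["bleached kraft pulp", "bacterial cellulose", "tensile strength"]
}
print(suggestions)
```

A term with no suggested parent, such as "tensile strength" in this toy input, would still be
forwarded to the domain experts, who decide where it belongs in the taxonomy.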


4. Research Methodology
A ScispaCy [14] language model will be selected based on its current performance on cellulosic
data. This performance will be measured by the number of correctly recognized cellulosic terms
in a curated corpus of evaluation data. The best-performing model will go through an incremental
training process using the spaCy [15] model training pipeline. In the first phase of training,
the selected model will be trained on the CHEMDNER corpus [16], which is currently the largest
corpus for chemical terms. The model will then undergo training on various corpora of training
data, covering materials, properties, processes, and more, across different training phases.
This iterative process will enable the model to progressively broaden its coverage of the
diverse terms of the cellulosic domain. After each phase of training, the performance of the
model will be evaluated using standard metrics such as precision, recall, and F1-score. To
assess the improvement, the performance of the model in the current training phase will also be
compared with that of the models from the previous training phases using an identical
evaluation dataset.
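
The per-phase evaluation described above can be illustrated with a minimal sketch. It assumes
exact-match scoring over (start, end, label) entity spans, which the text does not specify; the
phase names, labels, and toy spans are hypothetical placeholders rather than project data.

```python
from typing import Dict, Set, Tuple

Span = Tuple[int, int, str]  # (start_char, end_char, label); an assumed representation


def prf1(gold: Set[Span], predicted: Set[Span]) -> Dict[str, float]:
    """Exact-match precision, recall, and F1 over entity spans."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical gold annotations for one fixed evaluation set.
gold_spans = {
    (0, 9, "MATERIAL"),
    (15, 32, "PROPERTY"),
    (40, 52, "PROCESS"),
    (60, 63, "MATERIAL"),
}

# Hypothetical predictions after each training phase, always on the same
# evaluation set so that the scores are comparable across phases.
predictions_by_phase = {
    "phase_1_chemicals": {(0, 9, "MATERIAL"), (15, 30, "PROPERTY")},
    "phase_2_materials": {
        (0, 9, "MATERIAL"),
        (15, 32, "PROPERTY"),
        (60, 63, "MATERIAL"),
        (70, 75, "PROCESS"),
    },
}

scores_by_phase = {
    phase: prf1(gold_spans, predicted) for phase, predicted in predictions_by_phase.items()
}

for phase, scores in scores_by_phase.items():
    print(phase, {name: round(value, 3) for name, value in scores.items()})
```

Keeping the gold set fixed while only the predictions change mirrors the requirement that every
phase be scored against an identical evaluation dataset.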
   Finally, the improved language model will be employed to extract cellulosic named entities
from a vast collection of text documents. The extracted cellulosic terms will be comprehensively
compared to the existing terms in the ontology, generating a set of candidate terms to be
forwarded for further verification by domain experts. The domain experts will assess the
suitability of these terms and make informed decisions regarding their rejection or integration
into the ontology. This collaborative process will ensure that the ontology is enriched with
relevant and accurate terms, enhancing its overall effectiveness and comprehensiveness. The
effectiveness of the approach will be measured by calculating the percentage of terms accepted
by domain experts. This metric will provide valuable insights into the success and acceptance
of the proposed method in enriching the ontology with relevant and authoritative terms.
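
As a rough sketch of the comparison step and the acceptance metric described above, the
following assumes that the ontology is available as a mapping from concept identifiers to
labels and synonyms, and that matching is done by normalized string equality. Both are
simplifying assumptions; the function names, the ontology fragment, and the expert decisions
are hypothetical.

```python
import re
from typing import Dict, Iterable, List, Tuple


def normalize(term: str) -> str:
    """Lowercase, treat hyphens as spaces, collapse whitespace, strip a trailing plural 's'."""
    text = re.sub(r"[-\s]+", " ", term.strip().lower())
    # Naive singularization; an assumption that is adequate for a sketch only.
    if text.endswith("s") and not text.endswith("ss"):
        text = text[:-1]
    return text


def build_label_index(ontology: Dict[str, List[str]]) -> Dict[str, str]:
    """Map every normalized label or synonym to its concept identifier."""
    return {
        normalize(label): concept
        for concept, labels in ontology.items()
        for label in labels
    }


def split_known_and_candidates(
    terms: Iterable[str], index: Dict[str, str]
) -> Tuple[Dict[str, str], List[str]]:
    """Separate extracted terms that match an existing concept from new candidate terms."""
    known: Dict[str, str] = {}
    candidates: List[str] = []
    for term in terms:
        concept = index.get(normalize(term))
        if concept is not None:
            known[term] = concept
        elif term not in candidates:
            candidates.append(term)
    return known, candidates


def acceptance_rate(decisions: Dict[str, bool]) -> float:
    """Percentage of candidate terms that the domain experts accepted."""
    if not decisions:
        return 0.0
    return 100.0 * sum(decisions.values()) / len(decisions)


# Hypothetical ontology fragment: concept identifier -> labels and synonyms.
ontology = {
    "CelluloseNanofibril": ["cellulose nanofibril", "CNF"],
    "Nanocellulose": ["nanocellulose", "NC"],
}
extracted_terms = ["Cellulose nanofibrils", "CNFs", "bacterial cellulose", "tensile strength"]

known, candidates = split_known_and_candidates(extracted_terms, build_label_index(ontology))

# Hypothetical outcome of the expert review of the candidate terms.
expert_decisions = {"bacterial cellulose": True, "tensile strength": False}
print(known, candidates, acceptance_rate(expert_decisions))
```

Exact matching on normalized strings will miss paraphrases and abbreviations that are absent
from the synonym lists, so a fuller system would likely need a softer matching strategy before
terms reach the experts.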


Acknowledgments
This research was supported in part by the U.S. Department of Agriculture:
    • Forest Service, Project 20-JV-11111124-055
    • USDA Agricultural Research Service (ARS), Project 0204-41510-001-98S
    • National Institute of Food and Agriculture (NIFA), Award 2021-67022-34366


References
 [1] T. B. Jele, P. Lekha, B. B. Sithole, Role of cellulose nanofibrils in improving the strength
     properties of paper: a review, Cellulose 29 (2021) 55–81. doi:10.1007/s10570-021-04294-8.
 [2] C. Felgueiras, N. G. Azoia, C. M. A. Gonçalves, M. Gama, F. Dourado, Trends on the
     cellulose-based textiles: Raw materials and technologies, Frontiers in Bioengineering and
     Biotechnology 9 (2021). doi:10.3389/fbioe.2021.608826.
 [3] Y. Su, B. Yang, J. Liu, B. Sun, C. Cao, X. Zou, R. Lutes, Z. He, Prospects for replacement of
     some plastics in packaging with lignocellulose materials: A brief review, BioResources 13
     (2018) 4550–4576. doi:10.15376/biores.13.2.Su.
 [4] S. Gopi, P. Balakrishnan, D. Chandradhara, D. Poovathankandy, S. Thomas, General
     scenarios of cellulose and its use in the biomedical field, Materials Today Chemistry 13
     (2019) 59–78. doi:10.1016/j.mtchem.2019.04.012.
 [5] T. Rocktäschel, M. Weidlich, U. Leser, ChemSpot: a hybrid system for chemical named entity
     recognition, Bioinformatics 28 (2012) 1633–1640. doi:10.1093/bioinformatics/bts183.
 [6] R. Leaman, C.-H. Wei, Z. Lu, tmChem: a high performance approach for chemical named
     entity recognition and normalization, Journal of Cheminformatics 7 (2015) S3.
     doi:10.1186/1758-2946-7-S1-S3.
 [7] A. Akkasi, E. Varoğlu, N. Dimililer, ChemTok: A new rule based tokenizer for chemical
     named entity recognition, BioMed Research International 2016 (2016) 1–9.
     doi:10.1155/2016/4248026.
 [8] M. C. Swain, J. M. Cole, ChemDataExtractor: A toolkit for automated extraction of chemical
     information from the scientific literature, Journal of Chemical Information and Modeling
     56 (2016) 1894–1904. doi:10.1021/acs.jcim.6b00207.
 [9] P. T. Corbett, J. Boyle, Chemlistem: chemical named entity recognition using recurrent
     neural networks, Journal of Cheminformatics 10 (2018) 59. doi:10.1186/s13321-018-0313-8.
[10] X. Zhao, J. Greenberg, Y. An, X. T. Hu, Fine-tuning BERT model for materials named entity
     recognition, in: 2021 IEEE International Conference on Big Data (Big Data), IEEE, 2021,
     pp. 3717–3720. doi:10.1109/BigData52589.2021.9671697.
[11] M. S. U. Miah, J. Sulaiman, Material named entity recognition (MNER) for knowledge-driven
     materials using deep learning approach, in: M. S. Kaiser, S. Waheed, A. Bandyopadhyay,
     M. Mahmud, K. Ray (Eds.), Proceedings of the Fourth International Conference on Trends
     in Computational and Cognitive Engineering, Springer Nature Singapore, Singapore, 2023,
     pp. 199–208. doi:10.1007/978-981-19-9483-8_17.
[12] P. Shetty, A. C. Rajan, C. Kuenneth, S. Gupta, L. P. Panchumarti, L. Holm, C. Zhang,
     R. Ramprasad, A general-purpose material property data extraction pipeline from large
     polymer corpora using natural language processing, npj Computational Materials 9 (2023)
     52. doi:10.1038/s41524-023-01003-w.
[13] L. Weston, V. Tshitoyan, J. Dagdelen, O. V. Kononova, A. Trewartha, K. A. Persson, G. Ceder,
     A. Jain, Named entity recognition and normalization applied to large-scale information
     extraction from the materials science literature, Journal of Chemical Information and
     Modeling 59 (2019) 3692–3702. doi:10.1021/acs.jcim.9b00470.
[14] M. Neumann, D. King, I. Beltagy, W. Ammar, ScispaCy: Fast and robust models for
     biomedical natural language processing, in: Proceedings of the 18th BioNLP Workshop
     and Shared Task, Association for Computational Linguistics, Florence, Italy, 2019, pp.
     319–327. doi:10.18653/v1/W19-5034.
[15] M. Honnibal, I. Montani, S. V. Landeghem, A. Boyd, spaCy: Industrial-strength natural
     language processing in Python (2020). doi:10.5281/zenodo.1212303.
[16] M. Krallinger, O. Rabal, F. Leitner, M. Vázquez, D. Salgado, Z. Lu, R. Leaman, Y. Lu, D.-H. Ji,
     D. M. Lowe, R. A. Sayle, R. T. Batista-Navarro, R. Rak, T. Huber, T. Rocktäschel, S. Matos,
     D. Campos, B. Tang, H. Xu, T. Munkhdalai, K. H. Ryu, S. V. Ramanan, P. S. Nathan, S. Žitnik,
     M. Bajec, L. Weber, M. Irmer, S. A. Akhondi, J. A. Kors, S. Xu, X. An, U. K. Sikdar, A. Ekbal,
     M. Yoshioka, T. M. Dieb, M. Choi, K. M. Verspoor, M. Khabsa, C. L. Giles, H. Liu, K. E.
     Ravikumar, A. Lamurias, F. M. Couto, H.-J. Dai, R. T.-H. Tsai, C. Ata, T. Can, A. Usie,
     R. Alves, I. Segura-Bedmar, P. Martínez, J. Oyarzábal, A. Valencia, The CHEMDNER corpus
     of chemicals and drugs and its annotation principles, Journal of Cheminformatics 7 (2015)
     S2. doi:10.1186/1758-2946-7-S1-S2.