Populating and Refining an Ontology of Cellulose Materials with Terms from Scientific Publications: Extended Abstract

Populating and Refining an Ontology of Cellulose Materials with Terms from Scientific Publications: Extended Abstract UmayerReza School of Computing and Information Science University of Maine TorstenHahmann School of Computing and Information Science University of Maine Populating and Refining an Ontology of Cellulose Materials with Terms from Scientific Publications: Extended Abstract 1613-0073 BA9C9BA5E8D124B8FFF9569D536DD020 GROBID - A machine learning software for extracting information from scholarly documents Named Entity Recognition Cellulose Ontology Knowledge Graph Scientific Publication

Cellulose is a highly versatile biopolymer with numerous applications, such as paper and paperboard production, textiles, packaging, biofuels, and biomedical applications. Though, the scattered nature of cellulose knowledge with ambiguous terms and datasets presents significant obstacles to its optimal utilization. This project seeks to address these challenges by systematically accumulating scattered knowledge about cellulose, enabling it to be modifiable, extensible, and reusable. The objective of the project is to develop an automated system to extract relevant cellulosic terms from scientific publications which will show an improved performance in named entity classification by taking additional context and disambiguous information from an existing cellulose ontology. An incremental training process will be utilized to train a ScispaCy language model, which is specifically designed for analyzing scientific, clinical, and biomedical texts, in order to accomplish this task. The system will also generate new terms for the ontology by taking the existing ontology into account. Therefore, the proposed system will facilitate the extension of the ontology, while simultaneously benefiting from the ontology to enhance performance in named entity classification. By meeting these objectives, the project aims to contribute to the development of a sustainable bioproduct-based society by providing a resource of state-of-the-art knowledge in cellulose materials that can facilitate material science research.

Cellulose, the most abundant and versatile biopolymer on earth found in plant cells and some bacteria, is the building block of cellulosic materials, which have many applications in various domains because of their sustainable, renewable, and biodegradable nature. One of the most significant applications of cellulosic materials is the production of paper and paperboard. The unique dimensions and characteristics of cellulose nanofibrils (CNFs) make them crucial in papermaking for enhancing the strength properties of paper [1]. In addition to their use in paper production, several nanocelluloses (NCs) are alternatives for the textile industry because of their higher mechanical resistance [2], and as a substitute for petroleum-based packaging [3], and as a natural polymer with low toxicity, high crystallinity, biocompatibility, and biosafety for biomedical applications [4].

However, the knowledge on cellulose, much of which is stored only in scientific publication and, even when available in digital formats like PDF or HTML, is not easily processed at a large scale due to the scattered and ambiguous nature of text. Information extraction (IE) approaches are needed to extract this information from scientific publications and make it accessible in structured formats. By putting the information into an ontology, the knowledge can be subsequently queried and reasoned with more efficiently. One of the key steps in IE is NER, or Named Entity Recognition, which is the task of extracting nouns and noun chunks, called entities or named entities, from text. Identification and classification of named entities is the key part of the information extraction process. Though, the current NER approaches are limited in their ability to recognize cellulosic named entities and to handle variations in the naming conventions of them. ChemSpot [5] is a hybrid chemical named entity recognizer that utilized a CRF (Conditional Random Field) model to identify chemical named entities in natural language texts. In the biomedical domain, tmChem [6] employed a model combination approach using two different CRF models to recognize chemical mentions, properties, and their relationships. Akkasi et al. (2016) introduced ChemTok [7], a rule-based tokenizer specifically designed for chemical named entity recognition. Swain et al. (2016) presented ChemDataExtractor [8], a toolkit capable of extracting chemical entities along with their properties, measurements, and relationships. Corbett and Boyle (2018) developed Chemlistem [9], a chemical named entity recognizer based on recurrent neural networks. Although there are other methods available for extracting material entities from text, they also struggle in recognizing the diverse range of entities encountered in the cellulosic domain. Zhao et al. ( 2021) introduced a fine-tuned BERT model [10] specifically designed for materials named entity recognition. Similarly, Miah and Sulaiman (2023) proposed a deep neural network-based model [11] tailored for materials named entity recognition. Shetty et al. (2023) presented an alternative approach [12] for extracting material property data. Furthermore, Weston et al. (2019) presented a comprehensive approach [13] that not only extracts material properties but also captures their applications and mentions of inorganic materials.

In order to comprehensively capture cellulosic knowledge, it is crucial to extract a wide range of relevant entities beyond just chemicals and materials. This includes extracting properties associated with materials and chemicals, manufacturing processes, as well as names of products and equipment. Therefore, existing methods often fail to accurately identify a significant portion of cellulosic entities due to their limited familiarity with cellulosic data. The ultimate goal of the project is to contribute in growing an ontology-guided knowledge body about cellulose by extracting relevant terms from the scientific literatures which are the preferred source of knowledge. Initially, a manually made cellulose ontology will play a significant role in NER by providing a structured representation of the knowledge and relationships among entities. It can also assist to improve NER performance by providing additional context and disambiguous information, as well as enabling more sophisticated reasoning and inference. Later on the cellulose ontology itself will grow by adding newly identified cellulosic entities to speed up ontology development. The manual amendment of ontologies can be a time-consuming and costly process which limits their usefulness in practice. In contrast, an automated process can help to overcome these limitations and enable more efficient and effective use of ontologies in NER and other NLP tasks. Additionally, the automatic amendment of ontologies can help to ensure that they are up-to-date and reflect the latest developments in the domain, but there is little work in leveraging the synergies between NER and ontologies: in (1) utilizing ontologies for NER and (2) using NER to amend and populate ontologies.

In scientific domains where accurate organization of terms is important to avoid misrepresentation of knowledge, relying solely on an automatic ontology construction method that builds the ontology from scratch may not be effective. Instead, a semi-automated method, where domain experts contribute initial concepts and relationships to establish a core domain ontology, can be utilized to amend the ontology with additional terms. The purpose of incorporating an ontology in the named entity recognition process is to improve its performance in the cellulosic domain. This integration will enable the system to identify the named entities that align with the concepts and relationships of the ontology and classify them accordingly.

Research Questions

The proposed dissertation aims to leverage the synergies between ontologies and NER by specifically addressing the following three research questions:

1. Under what conditions can pre-trained language models be incrementally re-trained for improved named entity recognition of terms in the cellulosic domain? 2. How can cellulose-related terms that are identified by such NER approaches be categorized more effectively and precisely by leveraging a small hand-curated ontology of cellulose materials? 3. What methods are suitable for determining whether a particular identified term refers to a concept that already exists in the ontology, or a new concept that requires amending the ontology?

Objective

The objective of the project is to develop an automated system that will identify cellulosic terms from given text with a higher accuracy. Additionally, the system aims to classify these identified terms, as much as possible, with the most relevant concepts available in a cellulose ontology which is currently being developed. To accomplish this, the proposed system will establish internal communication with the ontology, enabling the identification and classification of new cellulosic terms that are not yet incorporated in the ontology. The resulting set of new terms will be shared with domain experts for review, allowing them to assess the relevance of those terms and determine their appropriate placement within the taxonomy. If a new term is found to carry a more refined semantic meaning than an existing term in the ontology, the ontology will be amended accordingly. Furthermore, the identification and exclusion of irrelevant terms in every NER process will accelerate the continual assessment process for recognized cellulosic terms and their association with the ontology over time.

Research Methodology

A ScispaCy [14] language model will be selected considering its current performance on the cellulosic data. This performance will be measured based on the number of correctly recognized cellulosic terms from a curated corpus of evaluation data. The best performing model will go through an incremental training process using spaCy [15] model training pipeline. In the first phase of training, the selected model will be introduced with CHEMDNER corpus [16] which is currently the largest corpus for chemical terms. Then the model will undergo training using various corpus of training data, including materials, properties, processes, and more, across different training phases. This iterative process will enable the model to progressively enhance its understanding and knowledge on diverse terms of the cellulosic domain. After each phase of training the performance of the model will be evaluated using standard metrics such as precision, recall, and F1-score. To assess the improvement, a comparison will also be conducted between the performance of the model in the current training phase and the performance of the models in the previous training phases using an identical evaluation dataset. Finally, the improved language model will be employed to extract cellulosic named entities from a vast collection of text documents. The extracted cellulosic terms will be comprehensively compared to the existing terms in the ontology, generating a set of candidate terms to be forwarded for further verification by domain experts. The domain experts will assess the suitability of these terms and make informed decisions regarding their rejection or integration into the ontology. This collaborative process will ensure that the ontology is enriched with relevant and accurate terms, enhancing its overall effectiveness and comprehensiveness. The effectiveness of the approach will be measured by calculating the percentage of terms accepted by domain experts. This metric will provide valuable insights into the success and acceptance of the proposed method in enriching the ontology with relevant and authoritative terms.

FOIS 20232023Early Career Symposium (ECS), held at FOIS 2023, co-located with 9th Joint Ontology Workshops (JOWO 2023), 19-20 July, 2023, Sherbrooke, Québec, Canada Envelope a.reza@maine.edu (U. Reza); torsten.hahmann@maine.edu (T. Hahmann) GLOBE https://umaine.edu/scis/people/rezaumayer (U. Reza); https://umaine.edu/scis/people/torsten-hahmann (T. Hahmann) Orcid 0000-0003-4013-3513 (U. Reza); 0000-0002-5331-5052 (T. Hahmann)

Acknowledgments

This research was supported in part by the U.S. Department of Agriculture:

• Forest Service, Project 20-JV-11111124-055 • USDA Agricultural Research Service (ARS), Project 0204-41510-001-98S • National Institute of Food and Agriculture (NIFA), Award 2021-67022-34366

Role of cellulose nanofibrils in improving the strength properties of paper: a review TBJele PLekha BBSithole 10.1007/s10570-021-04294-8 Cellulose 29 2021 Trends on the cellulose-based textiles: Raw materials and technologies CFelgueiras NGAzoia CM AGonçalves MGama FDourado 10.3389/fbioe.2021.608826 Frontiers in Bioengineering and Biotechnology 9 2021 Prospects for replacement of some plastics in packaging with lignocellulose materials: A brief review YSu BYang JLiu BSun CCao XZou RLutes ZHe 10.15376/biores.13.2.Su Bioresources 13 2018 General scenarios of cellulose and its use in the biomedical field SGopi PBalakrishnan DChandradhara DPoovathankandy SThomas 10.1016/j.mtchem.2019.04.012 Materials Today Chemistry 13 2019 Chemspot: a hybrid system for chemical named entity recognition TRocktäschel MWeidlich ULeser 10.1093/bioinformatics/bts183 Bioinformatics 28 2012 tmchem: a high performance approach for chemical named entity recognition and normalization RLeaman C.-HWei ZLu 10.1186/1758-2946-7-S1-S3 Journal of Cheminformatics 7 S3 2015 Chemtok: A new rule based tokenizer for chemical named entity recognition AAkkasi EVaroğlu NDimililer 10.1155/2016/4248026 BioMed Research International 2016 2016 Chemdataextractor: A toolkit for automated extraction of chemical information from the scientific literature MCSwain JMCole 10.1021/acs.jcim.6b00207 Journal of Chemical Information and Modeling 56 2016 Chemlistem: chemical named entity recognition using recurrent neural networks PTCorbett JBoyle 10.1186/s13321-018-0313-8 Journal of Cheminformatics 10 59 2018 Fine-tuning bert model for materials named entity recognition XZhao JGreenberg YAn XTHu 10.1109/BigData52589.2021.9671697 2021 IEEE International Conference on Big Data (Big Data) IEEE 2021 Material named entity recognition (mner) for knowledge-driven materials using deep learning approach MS UMiah JSulaiman 10.1007/978-981-19-9483-8_17 Proceedings of the Fourth International Conference on Trends in Computational and Cognitive Engineering MSKaiser SWaheed ABandyopadhyay MMahmud KRay the Fourth International Conference on Trends in Computational and Cognitive Engineering

Singapore, Singapore

Springer Nature 2023 A general-purpose material property data extraction pipeline from large polymer corpora using natural language processing PShetty ACRajan CKuenneth SGupta LPPanchumarti LHolm CZhang RRamprasad 10.1038/s41524-023-01003-w npj Computational Materials 9 52 2023 Named entity recognition and normalization applied to large-scale information extraction from the materials science literature LWeston VTshitoyan JDagdelen OVKononova ATrewartha KAPersson GCeder AJain 10.1021/acs.jcim.9b00470 Journal of chemical information and modeling 59 2019 Scispacy: Fast and robust models for biomedical natural language processing MNeumann DKing IBeltagy WAmmar 10.18653/v1/W19-5034 Proceedings of the 18th BioNLP Workshop and Shared Task, Association for Computational Linguistics the 18th BioNLP Workshop and Shared Task, Association for Computational Linguistics

Florence, Italy

2019 spacy: Industrial-strength natural language processing in python MHonnibal IMontani SVLandeghem ABoyd 10.5281/zenodo.1212303 2020 The chemdner corpus of chemicals and drugs and its annotation principles MKrallinger ORabal FLeitner MVázquez DSalgado ZLu RLeaman YLu D.-HJi DMLowe RASayle RTBatista-Navarro RRak THuber TRocktäschel SMatos DCampos BTang HXu TMunkhdalai KHRyu SVRamanan PSNathan SŽitnik MBajec LWeber MIrmer SAAkhondi JAKors SXu XAn UKSikdar AEkbal MYoshioka TMDieb MChoi KMVerspoor MKhabsa CLGiles HLiu KERavikumar ALamurias FMCouto H.-JDai RT .-HTsai CAta TCan AUsie RAlves ISegura-Bedmar PMartínez JOyarzábal AValencia 10.1186/1758-2946-7-S1-S2 Journal of Cheminformatics 7 S2 2015