Application of Artificial Intelligence to Knowledge Discovery for the Maintenance and Conservation of Botanical Gardens Claudia Aguilar-Rajme1,2 1 GPLSI research group, University of Alicante, Spain 2 Department of Computer Science, Technical Sciences Faculty, Agricultural University of Havana, Cuba Abstract Botanical gardens play a key role in the conservation of biodiversity and generate large amounts of data related to the management and maintenance of plants on a daily basis. In this sense, the Network of Botanical Gardens in Cuba stands out, made up of 12 institutions that, among other things, have in common that they manage these data by dividing them into three main registers: Introduction Register, Living Plant Register and Herbarium. The diversity of formats and the heterogeneity of the information make it difficult to manage by the Center’s specialists, as well as to make it accessible to researchers and students in search of references. In this context, there is an opportunity to use artificial intelligence and natural language processing techniques to facilitate access to the implicit and explicit information contained in these records. To this end, it is proposed to evaluate the results of their use in order to finally obtain a product capable of using the extracted knowledge to provide recommendations and guidelines for the management and conservation of plants. Keywords Natural Language Processing, Information Extraction, Botany, Gardens 1. Justification of the proposed research According to Smith and Harvey-Brown [1], the most widely accepted definition of botanic gardens is that expressed by Jackson in 1999 [2], who stated that they are "institutions containing documented collections of living plants for the purposes of scientific research, conservation, exhibition and education". Among the main functions of these centers are : • proper documentation of collections, including wild origin, • monitoring of collected individuals, • communication and information to and from other Gardens, other institutions, and the public; and • promotion of conservation through extension and environmental education activities [3]. These tasks involve the management of large amounts of data, which then become sources of frequent consultation for researchers, students, and the general public. The work of botanic gardens is essential to the preservation of the planet’s biodiversity, and this is one of the reasons why botanic garden institutions exist all over the world. In Cuba, there is a network of botanical gardens made up of 12 institutions, including the Botanical Garden of Cienfuegos, the oldest in the country, and the National Botanical Garden of Havana, the largest in the Cuban territory. In these institutions, as in the rest of the network, large amounts of data related to the processes that take place in them are generated daily. Of these data, those related to the management and maintenance of the plants are mainly divided into 3 groups, the first of which is the introduction record, where the information about the plant is stored when it arrives at the garden, by whatever means, and is kept there for a period of time after which its destination within the institution Doctoral Symposium on Natural Language Processing, 26 September 2024, Valladolid, Spain. $ claudia.aguilarrajme@gmail.com (C. Aguilar-Rajme)  0000-0002-7447-4458 (C. Aguilar-Rajme) © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings is decided. Those specimens that are selected to become part of the collections are transferred to the Register of Living Plants, where they are tracked for the rest of their lives in the Garden. The third main repository of the Botanical Garden is the Herbarium, defined by the RAE as "a collection of dried and classified plants used as material for the study of botany". [4]. In addition to all this information generated within the institutions, various bibliographic sources are constantly consulted in search of information on the identifying characteristics of the plants, both from the physical point of view for their correct identification in nature, as well as the best practices in the management and conservation of specimens and the different uses that can be given to each variety. All this results in a large amount of information, all related to botany and the plant species present in the garden, but in different formats and very heterogeneous, which makes it difficult to handle by the specialists of the center and even more difficult for researchers and students who are looking for references for consultation and research. This is where artificial intelligence and the various techniques available for handling large amounts of information come into play, since the use of models and algorithms can facilitate access to the implicit and explicit information contained in these records. 2. Background and related work Natural Language Processing (NLP) is a branch of artificial intelligence that focuses on the interaction between computers and human language. In the last decade, NLP techniques have experienced signifi- cant growth and are being applied in a variety of fields, including botany and plant data management. These applications include information extraction (IE) from scientific texts, databases, and other relevant sources to facilitate the integration and analysis of large amounts of data related to plant species. In this research, we will focus on the task of Named Entity Recognition (NER), as it is a fundamental step to identify species names, morphological characteristics, geographic locations, and other key concepts in botanical texts. There are several techniques for extracting information from texts, one of which is the use of rule- based models. These models rely on specific dictionaries and linguistic rules to identify entities. For example, projects such as FloraQuest [5] and Planteome [6] have used dictionaries of botanical terms to perform NER. These approaches are highly accurate in specific domains, but lack scalability and flexibility. Another technique used for information extraction is deep learning based models. In recent years, models such as Recurrent Neural Networks (RNN), especially LSTM (Long Short-Term Memory) architectures, and transformer-based models such as BERT (Bidirectional Encoder Representations from Transformers) have demonstrated superior performance in NER tasks. An example of the use of large language models can be found at [7], where they are being tested for efficiency of different language models for the identification of single and multi-word names of flowers and plants. The analysis was done for both Spanish and English, with a significantly smaller dataset in Spanish, but still obtaining relevant results. Thirteen models were tested for English and four for Spanish, demonstrating in both cases the superiority of the discriminative models over the generative ones in the task to be evaluated. For English the model with the best results was BERT-LARGE-CASED and for Spanish BERT-BASE- MULTILINGUAL-CASED. A similar study [8] was also carried out for the identification of plant names, but in this case metaphorical names, not textual ones, and the superiority of the discriminative models over the generative ones was maintained. Despite significant progress, information extraction in botany presents unique challenges. The diversity and complexity of the language used in botanical texts, the ambiguity of scientific and common names, and the scarcity of annotated datasets limit the effectiveness of traditional approaches. 3. Description of the proposed research The main objective of this research is the identification and application of NLP techniques that contribute to the extraction of knowledge from different scientific sources available in Spanish in the botanical field. A number of specific objectives are planned to achieve the main objective: 1. Determine the state of the art of information extraction in the field of botany. 2. Identify reliable data sources in Spanish in the field of botany. 3. Select NLP techniques for information extraction. 4. Design a knowledge structure where the knowledge will be stored. 5. Apply the identified techniques to the data to extract and store the information. 6. Implement a recommendation system based on the extracted knowledge. At the end of the thesis, it is expected to have identified and evaluated NLP techniques that allow obtaining efficient results in the discovery of knowledge in the field of botany in Spanish. This knowledge will later be used for a better management and conservation of the plant specimens within the collections of each garden, through a product that allows users to make queries about certain elements of the plants and provide recommendations for their conservation and maintenance. 4. Methodology The methodology proposed to achieve the objectives set in the research is based on the fulfillment of tasks planned throughout the years of the doctoral program. We will start with a search and reading of the state of the art to know the techniques currently used to extract information from unstructured texts in the field of botany or similar domains and that can be applied to it. Then, since there are data from different sources (described in the next section), it is necessary to organize, integrate and standardize them to achieve a correct curation process and that the resulting dataset is as complete as possible and of great value in the field of botany. For this purpose, it will be necessary to validate the sources with experts in this field of knowledge and, once the sources to be used have been confirmed, to apply web scraping techniques to obtain the data available online. The next step would be to process this data to extract and represent the knowledge it contains, for which we can use Named Entity Recognition (NER) techniques and semantic representation of the data. We will do this by following the next steps: 1. Text pre-processing: This stage removes unwanted words or characters, performs tokenisation and text normalisation, such as converting everything to lower case. 2. Grammatical tagging: This step involves tagging each word in the text with its appropriate grammatical category, using techniques such as POS (part-of-speech) tagging or dependency analysis. We should also explore ways of working with untagged text, as most of the data sources we have identified are untagged. 3. Entity identification: In this phase, we search for and identify named entities in the text. We will do this using machine learning models, rule systems and a combination of both to see which gives us better results. 4. Entity classification: Once entities have been identified, they are assigned a specific category or classification. For example, entities can be classified as plant name, life cycle, medicinal use, etc. 5. Feature extraction: Relevant features are extracted from the identified entities for storage, analysis and processing. 6. Validation and refinement: This is where the results of the analysis of the named entities are evaluated and adjustments or refinements are made where necessary. This involves verifying the accuracy and quality of the identified and classified entities. This will allow us to build a knowledge structure that we can then use to support decision making. In this way, we will be able to build a recommendation system based on questions and answers that can interact with users and provide them with information and recommendations on various areas of botany and agriculture. This can be very useful for the staff of the botanical garden, as well as for students and researchers, or even for the general public, since it will be a tool to accompany the design, creation and maintenance of a garden, with a scientific basis on the different plant species, soil types, seasons of the year and any other knowledge that we are able to integrate into the knowledge base from the data sources we are working with. 5. Data available The data initially available came from the institutions of the Network of Botanical Gardens of Cuba, mostly records of the specimens present in each center, as an inventory or control. There were no records with textual descriptions of the plants, nor of the norms to be followed for the care and conservation of the different types of plants, nor of the uses that can be given to them, which is the information that a system intended to help the user in the process of creating and maintaining gardens would need. However, this information is available in several online sources that are consulted daily by professionals and that have scientific validity, which guarantees the quality of the knowledge extracted from them. The sources identified in Spanish are: • Revista del Jardín Botánico Nacional (https://revistas.uh.cu/rjbn): aims to disseminate the results of Cuban and foreign scientific work in the field of botany and mycology, and publishes original articles and short communications in Spanish and English. It has 28 volumes available online and is published annually with high international visibility. • Records of adventitious plants in the province of Alicante [9]: This is a reference handbook with the necessary information to design ecological gardens with adventitious plants. It contains a total of 45 plant files with scientific and common names of the species, life cycle, as well as textual descriptions of their physical characteristics (leaves, stems, flowers, fruits), agronomic needs and possible uses (Figure 1a). • Flora Ibérica (http://www.floraiberica.es/): a website that collects definitions and synthesizes current knowledge about the vascular plants that grow spontaneously in the Iberian Peninsula and the Balearic Islands. Its aim is to facilitate the identification of the plants, and for this purpose it has a .pdf file for each species that contains the correct scientific name and its synonyms, a description that highlights the morphological peculiarities, the habitat in which it can be found, its geographical distribution in the world, its flowering period, its chromosome number, etc. • Virtual Herbarium of the Western Mediterranean (http://www.floraiberica.es/): contains informa- tion and an extensive gallery of images of the vascular plants of the countries of the Western Mediterranean. It is structured in tabs or pages for each plant species treated, the main reason for each tab are the images of the plants, but also a brief information in text form about it (which is why it is a valuable source of data in this research) (Figure 1b). • Spanish Invasive Alien Species Catalog - Plants (https://www.miteco.gob.es/es/biodiversidad/ temas/conservacion-de-especies/especies-exoticas-invasoras/ce_eei_flora.html): official website of the Spanish government, where the list of species considered invasive in the territory can be found, and for each of them can be obtained a .pdf file with textual descriptions of the physical appearance of the plant, the main distinguishing characteristics compared to other similar plants, the ecological and health impact and the main routes of entry (Figure 1c). All of these sources have been reviewed by experts in the field of botany and have been found to be a reliable source of reference in this area. They are therefore considered suitable for the purposes of the research and web scraping techniques will be used to extract the information they contain to create the dataset from which the information extraction will be carried out. (a) Records of adventitious plants in the province of Ali- cante (b) Virtual Herbarium of the Western Mediterranean (c) Spanish Invasive Alien Species Catalog - Plants Figure 1: Data Source Examples 6. Specific Issues of Research to be Discussed In this section, after explaining the proposed research, some questions are posed for discussion: Q1. Ambiguity in the data: Since we have different sources of data and the same species may appear in more than one of them, it is possible that we will come across cases where it is necessary to disambiguate the information for one or more characteristics; in these cases, how should this issue be approached? would it be wise to choose one of the different options found or to reflect all of them with the appropriate reference? Q2. Evaluation: What metrics would be most appropriate to evaluate the results of the information extraction task? The conclusions of the debate generated by the questions presented, as well as other aspects that may emerge, would be of great value for the development of research. References [1] P. P. Smith, Y. Harvey-Brown, BGCI Technical Review: Defining the botanic garden, and how to measure performance and success, volume 2 of kkhui, jhhghhgj ed., Botanic Gardens Conservation International, hjhbjh, 2017. Uygfyuu. [2] P. S. W. Jackson, Experimentation on a large scale-an analysis of the holdings and resources of botanic gardens, Botanic Gardens Conservation News (1999). [3] W. IUCN-BGCS, The botanic gardens conservation strategy, IUCN-BGCS, WWF Gland, Swit-zerland (1989). [4] R. A. ESPAÑOLA, Diccionario de la lengua española, 23.ª ed. URL: https://dle.rae.es. [5] N. C. B. Garden, Floraquest (????). [6] L. Cooper, J. Elser, M.-A. Laporte, E. Arnaud, P. Jaiswal, Planteome 2024 update: Reference ontologies and knowledgebase for plant biology, Nucleic Acids Research 52 (2024). doi:10.1093/ nar/gkad1028. [7] D. Premasiri, A. Haddad, T. Ranasinghe, R. Mitkov, Deep learning methods for identification of multiword flower and plant names, Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing (2023). [8] T. Ranasinghe, R. Mitkov, A. Haddad, D. Premasiri, Métodos de aprendizaje profundo para la extracción de nombres metafóricos de flores y plantas, Sociedad Española para el Procesamiento del Lenguaje Natural, Section: Procesamiento del lenguaje natural (2023). [9] J. A. Mateu Brotons, Fichas de plantas adventicias de la provincia de Alicante para su uso en el diseño de jardines ecológicos, Master’s thesis, Universidad Miguel Hernandez de Elche Escuela Politécnica Superior de Orihuela, 2018.