A Global-Scale Plant Identification using Deep Learning: NEUON Submission to PlantCLEF 2022

Sophia Chulif (schulif@swinburne.edu.my), Swinburne University of Technology Sarawak Campus, 93350 Sarawak, Malaysia; Department of Artificial Intelligence, NEUON AI, 94300 Sarawak, Malaysia

Sue Han Lee (shlee@swinburne.edu.my), Swinburne University of Technology Sarawak Campus, 93350 Sarawak, Malaysia

Yang Loong Chang, Department of Artificial Intelligence, NEUON AI, 94300 Sarawak, Malaysia

CLEF 2022: Conference and Labs of the Evaluation Forum, September 5-8, 2022, Bologna, Italy. ISSN 1613-0073

Keywords: plant classification, convolutional neural network, deep learning, machine learning, computer vision

With the increasing knowledge of plants globally, it is becoming difficult for human experts to identify plants manually and systematically. Vascular plants alone are estimated to number more than 300,000 species. However, deep learning methods have recently made progress in automating plant identification. The PlantCLEF 2022 challenge aims to tackle the problems faced in global plant identification. With the aggregation of various data from different sources, it is a real problem to deal with big data consisting of many classes, unbalanced classes, inaccuracies, duplications, and a diversity of visual contents and quality. Given a training dataset of 4 million images and 80,000 species, the task of the challenge was to identify the correct plant species from 26,868 multi-image plant observations. This paper describes the submissions made by our team to PlantCLEF 2022. We trained several deep learning models based on the Inception-v4 and Inception-ResNet-v2 architectures. The types of networks constructed were a single convolutional neural network (CNN) and a triplet network. They were initialised either with weights pre-trained on the ImageNet dataset or with weights pre-trained on the PlantCLEF 2022 dataset. Although we intended to compare the performance of our single CNN and triplet models, we did not manage to obtain the complete results due to resource and time constraints. Nevertheless, we submitted nine runs, and our best submission achieved a Macro Averaged Mean Reciprocal Rank score of 0.6078, placing 4th among the 45 submitted runs. In addition, we have shown that web or noisy data does improve generalisation in the identification. Moreover, an ensemble of models from different network architectures, i.e., Inception-v4 and Inception-ResNet-v2, gives higher accuracy than a single model.

Introduction

The LifeCLEF Plant Identification Challenge (PlantCLEF) [1] is part of the Conference and Labs of the Evaluation Forum (CLEF), which tackles various multilingual and multimodal information access evaluations [2]. This year, the focus of PlantCLEF 2022 was to classify 80,000 plant species. Compared to its past editions, PlantCLEF 2022 offered the largest ever number of classes for training, making it resource-intensive, which is often the case in real-world applications. In the context of global plant identification, the aggregation of various data from different sources poses many challenges. It is a real problem to deal with big data consisting of many classes, unbalanced classes, inaccuracies, duplications, and a diversity of visual contents and quality. The total training datasets provided in this challenge consisted of 4 million images and were grouped as "trusted" or "web". In addition, the metadata included more levels of plant taxonomy, i.e., class, order, family, genus, and species. This paper presents our approach, the network architectures used, the training setup implemented, and the results obtained from our submissions to PlantCLEF 2022.

Methodology

Data

Two training datasets were downloaded: "trusted" and "web". After removing some duplicates with the same name, the trusted dataset totalled 2,885,052 images. It is derived from academic sources and collaborative platforms, signifying a higher certainty of quality. Meanwhile, the web dataset totalled 1,071,627 images. It is based on search engine queries and suffered from notable errors, which were then semi-automatically cleaned by the organisers. In addition, from the trusted dataset, we segregated 63,119 images to serve as our validation dataset. This validation dataset consists of unique species belonging to one plant observation id from our trusted train dataset. Lastly, the test set consisted of 55,306 images from 26,868 plant observations. The details of the datasets used are represented in Table 1.

Network Architecture and Models

The networks implemented in our approach were based on the Inception-v4 and Inception-ResNet-v2 architectures [3]. Two types of networks were constructed: a single convolutional neural network (CNN) and a triplet network. They were initialised either with weights pre-trained on the ImageNet dataset [4] or with weights pre-trained on the PlantCLEF 2022 dataset.

Motivation for Triplet Network

Convolutional Neural Networks (CNNs) have effectively solved classification tasks in various domains and even perform comparably to or better than humans. However, a pre-defined number of classes is required before training. If the classification task is assigned new labels, the network has to be retrained. In addition, CNNs work best when there is sufficient training data.

In a global-scale plant identification task, it is impractical to retrain the network every time a new species is discovered. Furthermore, many plant species, especially those in remote areas, are rarely photographed. More often than not, the available data for training follows a long-tail distribution, resulting in the CNN performing well in classes with many training data and poorer in classes with few or no training data. Considering these concerns, we opted to implement the metric-learning-based triplet network. Its goal is to learn the similarities and dissimilarities between classes instead of directly classifying them. The network accomplishes this by minimising the embedding distance between images of the same species while maximising the embedding distance between images of different species. A small embedding distance indicates the same species, whereas a large embedding distance indicates different species. Besides, it does not require a pre-defined number of classes before training. Moreover, as shown in the previous PlantCLEF editions [5,6], our triplet networks [7,8] can generalise across plant species as well as a conventional CNN, even when less training data is available.

Single CNN

This network resembles a conventional Inception-v4 or Inception-ResNet-v2 neural network. It consists of convolutional layers, pooling layers, dropout layers and fully-connected layers, which return the softmax probabilities of its predictions. Multi-task classification is adopted in this network by utilising the five taxonomy labels: Class, Order, Family, Genus, and Species. Table 2 shows the multi-task classification labels and their number of classes. The network architecture is visualised in Figure 1 (A).
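As an illustration of the multi-task classification described above, the following is a minimal NumPy sketch of a loss summed over the five taxonomy heads. The class counts come from Table 2; the function names and the equal weighting of the five heads are assumptions made for this sketch and are not taken from the authors' implementation.

```python
import numpy as np

# Number of classes per taxonomy level, as listed in Table 2.
NUM_CLASSES = {"class": 8, "order": 84, "family": 483, "genus": 9603, "species": 80000}


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def cross_entropy(logits, labels):
    """Mean softmax cross entropy for a batch of integer labels."""
    probs = softmax(logits)
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.log(picked + 1e-12).mean())


def multitask_loss(head_logits, head_labels):
    """Sum of the per-level losses (equal weighting is an assumption)."""
    return sum(cross_entropy(head_logits[k], head_labels[k]) for k in head_logits)


# Toy usage: a batch of 4 images with uniform (all-zero) logits on every head.
rng = np.random.default_rng(0)
toy_logits = {k: np.zeros((4, n)) for k, n in NUM_CLASSES.items()}
toy_labels = {k: rng.integers(0, n, size=4) for k, n in NUM_CLASSES.items()}
toy_loss = multitask_loss(toy_logits, toy_labels)
```

With uniform logits, each head contributes the logarithm of its class count, so the Species head dominates the untrained loss simply because it has the most classes.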

Triplet Network

This network resembles the single CNN mentioned above. However, it consists of two streams, and instead of using its fully-connected layer for predictions, the layer is used to compute the embedding representation of the plant images. In addition, a batch normalisation layer is added, followed by L2-normalisation, and finally, a triplet loss layer (footnote 1) to train the optimum embedding representation of the plants. Furthermore, instead of the original 1536 features in the fully-connected layer, we reduced the final feature vector to 500. Due to resource limitations, we did not adopt the multi-task classification approach in this training. Only the Species taxonomy label is utilised. The network architecture is visualised in Figure 2 (A).
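To make the embedding objective concrete, the sketch below computes a plain triplet margin loss on L2-normalised embeddings with NumPy. It is a simplified stand-in: the paper trains with the semi-hard variant provided by TensorFlow 1.12, which additionally mines negatives within each batch. The margin value, the squared Euclidean distance, and the function names are assumptions made for illustration.

```python
import numpy as np


def l2_normalise(x, eps=1e-12):
    """Scale each embedding to unit length."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Plain triplet margin loss (not the semi-hard mining variant).

    anchor and positive are embeddings of the same species;
    negative is an embedding of a different species.
    The margin of 1.0 is an assumed value.
    """
    a, p, n = (l2_normalise(v) for v in (anchor, positive, negative))
    d_pos = np.sum((a - p) ** 2, axis=-1)  # distance to the same species
    d_neg = np.sum((a - n) ** 2, axis=-1)  # distance to a different species
    return float(np.maximum(d_pos - d_neg + margin, 0.0).mean())


# Toy usage with 2-D embeddings.
anchor = np.array([[1.0, 0.0]])
same_species = np.array([[2.0, 0.0]])    # same direction after normalisation
other_species = np.array([[-1.0, 0.0]])  # opposite direction
well_separated = triplet_loss(anchor, same_species, other_species)
badly_separated = triplet_loss(anchor, other_species, same_species)
```

The loss is zero once the different-species embedding is farther from the anchor than the same-species embedding by at least the margin, which is the behaviour the text describes as minimising one distance while maximising the other.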

Training Setup

The networks were set up using TensorFlow 1.12 [9] and the TF-Slim [10] library with the hyperparameters described in Table 3. Random cropping, colour distortion, and horizontal flipping were also applied to the images during training. The scripts and lists used are available at https://github.com/NeuonAI/plantclef2022_challenge.

Inference Procedure

The models were evaluated with two main methods: the Argmax function and an embedding dictionary similarity comparison. Argmax is used in the single CNN, while the embedding dictionary similarity comparison is used in the triplet network. The inference procedures for the single CNN and triplet network are visualised in Figures 1 (B) and 2 (B), respectively. The following steps describe the overall inference procedure.

Single CNN

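1. Group the images based on the same observation id (if the test set is used).
2. Augment the validation / test image(s) to 10 variations through cropping and flipping. Note that the 10 different variations include the cropping of the top-right, top-left, bottom-right, bottom-left, and centre of the image. Then, these 5 images are horizontally-flipped to obtain a total of 10 image variations.
3. Feed the augmented validation / test images into the network.
4. Obtain the prediction results of the images. Note that since there are 10 variations of each image, there will be 10 predictions for each image for each label (Class, Order, Family, Genus, Species).
5. Average the 10 prediction probabilities of each image classification.
6. Obtain the Top-1 and Top-5 accuracy (if the validation set is used).
7. Obtain the Top-30 accuracy (if the test set is used).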
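The cropping, flipping, and averaging steps of the single CNN inference can be sketched as follows. This is a minimal NumPy illustration, not the authors' script: the function names are hypothetical, and the per-crop probabilities are assumed to be produced by the trained network.

```python
import numpy as np


def ten_crop(image, size):
    """Return the four corner crops, the centre crop, and their horizontal flips."""
    h, w = image.shape[:2]
    cy, cx = (h - size) // 2, (w - size) // 2
    offsets = [(0, 0), (0, w - size), (h - size, 0), (h - size, w - size), (cy, cx)]
    crops = [image[y:y + size, x:x + size] for y, x in offsets]
    flipped = [crop[:, ::-1] for crop in crops]  # flip along the width axis
    return np.stack(crops + flipped)


def average_predictions(crop_probs):
    """Average the per-crop probabilities, shape (10, n_classes), into one vector."""
    return crop_probs.mean(axis=0)


def top_k(probs, k):
    """Indices of the k most probable classes, best first."""
    return np.argsort(probs)[::-1][:k]


# Toy usage: a 6x6 RGB "image" cropped to 4x4, and made-up probabilities
# standing in for the network output on two of the crops.
toy_image = np.arange(6 * 6 * 3).reshape(6, 6, 3)
toy_crops = ten_crop(toy_image, 4)
toy_probs = np.array([[0.1, 0.7, 0.2], [0.3, 0.5, 0.2]])
toy_mean = average_predictions(toy_probs)
toy_best = top_k(toy_mean, 1)
```

Averaging over the ten views makes the prediction less sensitive to where the plant sits in the frame; the same `top_k` helper covers the Top-1, Top-5, and Top-30 cases by changing `k`.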

Triplet Network

Before inference, a sample from the training dataset (trusted) is randomly chosen to create a dictionary list. The dictionary list totalled 592,258 images covering 80,000 species. A maximum of ten images represents each species in the dictionary.

1. Group the dictionary images based on the same species.
2. Augment the dictionary images to 10 variations through cropping and flipping. Note that the 10 different variations include the cropping of the top-right, top-left, bottom-right, bottom-left, and centre of the image. Then, these 5 images are horizontally-flipped to obtain a total of 10 image variations.
3. Feed the augmented dictionary images into the network.
4. Obtain the image embeddings of each image. Note that since there are 10 image variations, there will be 10 embeddings for each image.
5. Average the 10 embeddings and save them as a dictionary reference.
6. Repeat steps 2 to 5 until all the 80,000 species embeddings are collected.
7. Group the test image(s) based on the same observation id (if the test set is used).
8. Augment the validation / test image(s) to 10 variations through cropping and flipping as previously.
9. Feed the augmented validation / test images into the network.
10. Obtain the image embeddings of each image. Note that since there are 10 image variations, there will be 10 embeddings for each image.
11. Average the 10 embeddings to obtain the single embedding of each validation / test image.
12. Compute the cosine similarity between the single image embedding and the saved dictionary.
13. Obtain the cosine distance by subtracting the computed cosine similarity from the value of 1.
14. Employ inverse distance weighting on the cosine distance.
15. Acquire the probabilities of the single image embedding mapped to the dictionary.
16. The species mapped to the highest probability denotes the class of the species.
17. Obtain the Top-1 and Top-5 accuracy (if the validation set is used).
18. Obtain the Top-30 accuracy (if the test set is used).
19. Repeat steps 8 to 18 until all the images are evaluated.
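Steps 12 to 16 above can be sketched in a few lines of NumPy. The paper does not state the exact form of the inverse distance weighting, so the power of 1, the small `eps` term, and the normalisation by the sum of weights are assumptions made for this illustration, as is the function name.

```python
import numpy as np


def dictionary_probabilities(query, dictionary, power=1.0, eps=1e-8):
    """Map one query embedding to probabilities over dictionary entries.

    query: averaged embedding of a validation / test image, shape (d,).
    dictionary: one averaged embedding per dictionary entry, shape (n, d).
    """
    q = query / np.linalg.norm(query)
    d = dictionary / np.linalg.norm(dictionary, axis=1, keepdims=True)
    cosine_similarity = d @ q                                    # step 12
    cosine_distance = np.clip(1.0 - cosine_similarity, 0, None)  # step 13
    weights = 1.0 / (cosine_distance + eps) ** power             # step 14 (assumed form)
    return weights / weights.sum()                               # step 15


# Toy usage: three dictionary entries in a 2-D embedding space.
toy_dictionary = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
toy_query = np.array([1.0, 0.1])
toy_probs = dictionary_probabilities(toy_query, toy_dictionary)
toy_prediction = int(np.argmax(toy_probs))                       # step 16
```

Because the comparison only needs a stored embedding per species, adding a new species amounts to adding a row to `dictionary`, which is the practical benefit the triplet approach aims for.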

Experiments

The networks we experimented with were variations of single CNN and triplet networks. They differ in their network architecture, training data, and initial weights. Table 4 shows the details of our experimented networks. As seen in Table 4, models 1 to 5 are single CNNs, models 6 to 8 are triplet networks, and models 9 to 12 are single CNNs whose weights were initialised from the triplet network.

Results

From the validation dataset, we computed the Top-1 and Top-5 accuracy of the models and tabulated them in Tables 5 and 6, respectively. Due to the large training dataset and time constraints, most of the models trained were not saturated before we evaluated them. In addition, not all the models experimented with were used in the submissions.

Based on our experiments, most of the models trained with a higher number of iterations performed better on the validation dataset. Nevertheless, a higher number of iterations does not necessarily yield higher performance. Comparing Model 2 (421,517 steps) and Model 4 (522,583 steps), Model 2 performed better with a Top-1 accuracy of 0.462 compared to 0.4545 for Model 4. Since the validation set was built from the trusted dataset, which was what Model 2 was trained on, it may have resulted in this bias. Furthermore, we find that the single CNN initialised from the triplet network (Model 12b) performed slightly better than the single CNN initialised from ImageNet (Model 5b). Since our models did not saturate, we cannot give a definite answer as to whether this initialisation improves the model.

Conclusion

We trained several Inception-v4 and Inception-ResNet-v2 single and triplet deep learning models for the plant identification task in PlantCLEF 2022. Due to the large number of species and training data, the task was indeed resource-intensive to experiment with. Due to resource and time constraints, our models were not fully saturated, and we did not experiment as intended. Therefore, we would like to compare the performance of our single CNN and triplet models in future work. It is worth looking into their performance when they are both saturated and compared with the same evaluation methods and on different validation sets focusing on unbalanced classes. Nevertheless, we submitted nine runs, and our best submission achieved a Macro Averaged Mean Reciprocal Rank score of 0.6078, placing 4th among the 45 submitted runs. In addition, we have shown that web or noisy data does improve generalisation in the identification. Furthermore, an ensemble of models from different network architectures, i.e., Inception-v4 and Inception-ResNet-v2, gives higher accuracy than a single model.


Figure 1: The training and inference procedure of the Single CNN. Figure 1 (A) illustrates the training process of the network. Figure 1 (B) illustrates the inference process of the network.

Figure 2: The training and inference procedure of the Triplet Network. Figure 2 (A) illustrates the training process of the network. Figure 2 (B) illustrates the inference process of the network.


Table 7, recoverable rows:

… Trusted) + Single-IR (Trusted) + Single-I (Trusted + Web) + Single-IR (Trusted + Web): 0.6038
5: Single-I (Trusted + Web) + Single-IR (Trusted + Web) + Triplet-IR (Trusted + Web): …
… (Trusted) + Single-I (Trusted + Web) + Single-IR (Trusted + Web) + Single-T-IR (Trusted + Web) + Single-T-IR (Trusted): 0.6011
9: Single-IR (Trusted) + Single-IR (Trusted) + Single-I (Trusted + Web) + Single-IR (Trusted + Web): 0.603

Table 1: Details of the Train, Validation, and Test datasets.

Dataset                  No. of images   No. of species
Train (Trusted)          2,821,933       80,000
Train (Trusted + Web)    3,893,560       80,000
Validation               63,119          63,119
Test                     55,306          -

Table 2: Details of the multi-task classification labels in the Train dataset.

Label      No. of classes
Class      8
Order      84
Family     483
Genus      9,603
Species    80,000

Table 3: Details of the Network Training Hyperparameters.

Hyperparameter              Single CNN              Triplet Network
Batch Size                  128                     128
Input Image Size            299 × 299 × 3           299 × 299 × 3
Optimizer                   Adam Optimizer [11]     Adam Optimizer [11]
Initial Learning Rate       0.0001                  0.0001
End-layers Learning Rate    0.0001                  0.00001
Weight Decay                0.00004                 0.00004
Dropout Rate                0.2                     0.2
Loss Function               Softmax Cross Entropy   Triplet Loss

Table 7: Performance of our Submissions.

Footnote 1: The triplet loss is computed using the triplet_semihard_loss function provided in TensorFlow 1.12 [9].

Acknowledgments

The resources of this project are supported by NEUON AI SDN. BHD., Malaysia.

When both models are compared at the same training iteration, the model initialised from the triplet network weights (Model 12b) did perform better than the model initialised on ImageNet weights (Model 5b). Among the different networks trained, the triplet network performed the worst, but this is because its evaluation method differs from that of the single CNN (Argmax versus dictionary similarity comparison). Moreover, the triplet networks were trained significantly less than their single CNN counterparts. In addition, the triplet network relies on the image dictionary for inference. The classes with fewer image samples, especially those with a single sample, may not have provided enough features in the dictionary.

Submissions

We submitted nine runs to PlantCLEF 2022, and their details are tabulated in Table 7. Apart from Run 1, our submissions were constructed from ensembles of various models. Run 7, which comprises three single CNNs based on the Inception-v4 and Inception-ResNet-v2 architectures trained on the trusted and web data, performed the best among our submissions.

Results

The test set was evaluated based on the average Mean Reciprocal Rank (MRR) per species, also known as the Macro-Averaged Mean Reciprocal Rank (MA-MRR) score. Our best Run (7) achieved an MA-MRR score of 0.6078 and was the fourth-highest among the 45 submissions. On the other hand, the top submission scored an MA-MRR of 0.6269. The results of the overall submissions are summarised in Figure 3.
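For readers unfamiliar with the metric, the following sketch computes a macro-averaged MRR from ranked predictions in plain Python. It is an illustrative reading of the definition given above (the reciprocal rank is averaged per species first, then across species) and not the organisers' evaluation script; the function name and the data layout are assumptions.

```python
from collections import defaultdict


def macro_averaged_mrr(ranked_predictions, true_species):
    """Average the reciprocal rank per species, then average across species.

    ranked_predictions: observation id -> list of species, best guess first.
    true_species: observation id -> correct species.
    An observation whose true species is absent from its ranking scores 0.
    """
    per_species = defaultdict(list)
    for obs_id, truth in true_species.items():
        ranking = ranked_predictions.get(obs_id, [])
        if truth in ranking:
            reciprocal_rank = 1.0 / (ranking.index(truth) + 1)
        else:
            reciprocal_rank = 0.0
        per_species[truth].append(reciprocal_rank)
    species_mrr = [sum(values) / len(values) for values in per_species.values()]
    return sum(species_mrr) / len(species_mrr)


# Toy usage: species "a" has two observations, species "b" has one.
toy_predictions = {"o1": ["a", "b"], "o2": ["b", "a"], "o3": ["c", "b"]}
toy_truth = {"o1": "a", "o2": "a", "o3": "b"}
toy_score = macro_averaged_mrr(toy_predictions, toy_truth)
```

In the toy example, species "a" averages (1 + 0.5) / 2 = 0.75 and species "b" scores 0.5, giving 0.625 overall. Averaging per species first prevents frequently observed species from dominating the score, which matters for a long-tailed test set.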

Comparing our Run 2 (trusted only) and Run 3 (trusted and web), the models trained on both trusted and web datasets performed better. This once again shows that web or noisy data does improve the generalisation of deep learning, as in PlantCLEF 2017 [12]. Furthermore, combining models of different architectures, i.e., Inception-v4 and Inception-ResNet-v2, did slightly boost the performance from 0.5461 (Run 1: Inception-v4 only) to 0.5536 (Run 2: Inception-v4 and Inception-ResNet-v2). Since our triplet models and the single CNNs initialised on triplet weights did not saturate, they did not help the ensemble models. Consequently, Runs 3 and 7 dropped in performance when the triplet models or the triplet-initialised single CNN models were added, as observed in Runs 5 and 8.

Figure 3: The Official Results of PlantCLEF

References

[1] H. Goëau, P. Bonnet, A. Joly, Overview of PlantCLEF 2022: Image-based plant identification at global scale, in: Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, 2022.
[2] A. Joly, H. Goëau, S. Kahl, L. Picek, T. Lorieul, E. Cole, B. Deneu, M. Servajean, A. Durso, H. Glotin, R. Planqué, W.-P. Vellinga, A. Navine, H. Klinck, T. Denton, I. Eggel, P. Bonnet, M. Šulc, M. Hruz, Overview of LifeCLEF 2022: an evaluation of machine-learning based species identification and species distribution prediction, in: International Conference of the Cross-Language Evaluation Forum for European Languages, Springer, 2022.
[3] C. Szegedy, S. Ioffe, V. Vanhoucke, A. A. Alemi, Inception-v4, Inception-ResNet and the impact of residual connections on learning, in: Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[4] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, ImageNet large scale visual recognition challenge, International Journal of Computer Vision 115 (2015).
[5] H. Goëau, P. Bonnet, A. Joly, Overview of LifeCLEF plant identification task, in: CLEF 2020 - Conference and Labs of the Evaluation Forum, 2020.
[6] H. Goëau, P. Bonnet, A. Joly, Overview of PlantCLEF 2021: cross-domain plant identification, in: Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum, volume 2936, 2021.
[7] S. Chulif, Y. L. Chang, Herbarium-field triplet network for cross-domain plant identification. NEUON submission to LifeCLEF 2020 Plant, in: CLEF (Working Notes), 2020.
[8] S. Chulif, Y. L. Chang, Improved herbarium-field triplet network for cross-domain plant identification: NEUON submission to LifeCLEF 2021 Plant, in: CLEF (Working Notes), 2021.
[9] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, X. Zheng, TensorFlow: Large-scale machine learning on heterogeneous systems, 2015.
[10] S. Guadarrama, N. Silberman, TensorFlow-Slim: A lightweight library for defining, training and evaluating complex models in TensorFlow, 29-June-2019.
[11] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).
[12] H. Goëau, P. Bonnet, A. Joly, Plant identification based on noisy web data: the amazing performance of deep learning (LifeCLEF 2017), in: CLEF: Conference and Labs of the Evaluation Forum, volume 1866, 2017.