A Framework for Biodiversity Image Analysis using Machine Learning and Crowdsourcing Knowledge

Loukas Chatzivasili

Georgia Charalambous

Maria Papoutsoglou

Georgia Kapitsaki

Ioustina Harasim

Eva Chatzinikolaou

evachatz@hcmr.gr 2

Georgia Sarafidou

g.sarafidou@hcmr.gr 2

Ioannis Rallis

i.rallis@hcmr.gr 2

Markos Digenis

m.digenis@hcmr.gr 1 2

0 Department of Computer Science, University of Cyprus, Nicosia, Cyprus

1 Department of Environment, Faculty of Environment, Ionian University, Zakynthos, Greece

2 Hellenic Centre for Marine Research (HCMR), Institute of Marine Biology, Biotechnology and Aquaculture (IMBBC), Heraklion, Crete, Greece

This paper proposes a data-driven methodological framework to enrich existing biodiversity data by collecting crowd-sourced knowledge contributed by citizens in the form of photographs. The framework aims to provide the design of an easy-to-use web interface tool that clusters biodiversity images from multiple data sources. Through this design, users will be able to identify species by uploading photos to the tool.

Keywords: Machine Learning, Big Data, Citizen science, Biodiversity

1. Introduction

“All we need is a smartphone with an internet connection.” In our days, an Internet-connected mobile phone is an indispensable, highly useful tool for all citizens. A crucial advantage of ubiquitous connectivity is the ability to dynamically create and share content. Our daily experience testifies to the widespread sharing of content that users carry out just for fun; many people, however, take the extra step and contribute data they capture around them towards wider community causes and into crowd-sourcing knowledge [1]. In our work, we focus on crowd-sourced knowledge in the form of biodiversity data provided by citizens. More specifically, visual content captured by citizens, such as the photographs they take during visits to different ecosystems, has significant value and can prove beneficial for various environmental purposes, such as understanding and protecting species, particularly endangered ones.

Collecting crowd-sourcing knowledge towards a scientific goal is directly linked to the concept of citizen science, namely the voluntary process by which a person acts as a “human sensor” that collects and/or processes data, mainly for ecological and environmental purposes [2]. In recent years the field has advanced further into using machine learning algorithms to automate image processing and analysis for environmental purposes. In our approach, we introduce a framework that uses an annotated dataset of biodiversity images from coastal areas (acquired through pertinent data sources and from citizen scientists in the field) and proposes the use of best-practice image processing tools to automate key processes. We also describe a dynamic web interface through which users can visualize a categorization of images or upload their own images to be classified into clusters using the images from the aforementioned dataset.

Why is such a framework needed? The proposed biodiversity image processing framework, part of the SocioCoast project,1 belongs to a larger effort that aims to drastically improve participation in environmental issues by motivating users to share their biodiversity-related images captured in visits to the coast through a versatile, user-friendly mobile app developed and made available by the project. By realizing value through sharing of their images, citizens and tourists will be further incentivized to increase the collected crowd-sourcing knowledge and enhance the image database, one of the main components of the project’s overall framework [3]. As more and more images are uploaded into the database, we will face the challenge of handling a very large volume of data. In order to provide accurate and dynamic data analytics as the volume of image storage and processing grows, the proposed framework leverages scalable (big-data) image analytics technologies.

2. Proposed Framework

In this section, we analyze the main components of our proposed framework, which is depicted in Figure 1.

Data Sources

Our proposed framework employs a combination of two types of data sources: (1) institutional (expert) observations and (2) biodiversity data from open-source platforms such as iNaturalist.2

The institutional observations utilized in this framework relate to biodiversity species detected in coastal areas and beaches, as recorded by expert personnel from the Hellenic Centre for Marine Research (HCMR) using established measurement tools. The dataset includes 130 species, represented by 196 photographs, collected from 6 distinct coastal areas in Crete. For a subset of the species, multiple photographs were obtained. Along with the species name, a list of properties (kingdom, phylum, class, category, habitat, and pressures) was also provided for each species. These properties were added as tags/labels to the corresponding images and will be used in later stages of the analysis.

To get an overview of our dataset and its important characteristics, we performed a brief descriptive analysis. For example, we identified that 80.76% of the species belong to the “Animalia” kingdom, while 15.38% and 3.84% belong to the “Plantae” and “Chromista” kingdoms respectively. Species belonging to the kingdom “Animalia” are therefore the most diverse in our dataset. We also found that 39% of our species belong to the “Chordata” phylum, which suggests that this phylum has the most diverse representation of species in our dataset.

The iNaturalist API was used as an additional data source for biodiversity data, as mentioned above. In our analysis, we included only citizen observations with a species photograph. In addition to the photographs, we also collected the taxonomic information of the species, including the kingdom, phylum, and class. Kingdom, phylum, and class are among the variables that are common between the two different datasets we considered.

2 http://www.inaturalist.org/

Image Pre-processing

The second stage in our framework is the image pre-processing. This stage is important for improving the quality of the images we will use in the next phase of the framework, the data processing. We will use a variety of techniques such as image resizing, normalization, enhancement, and filtering to ensure that the images are consistent in size and format, and that any noise or distortions are removed. Additionally, we will perform “RandAugment”, a recent approach which has been implemented to improve accuracy in image data [4].
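The resizing and normalization steps of the image pre-processing stage can be sketched as follows. This is a minimal illustration under stated assumptions, not the framework's implementation: images are represented as plain nested lists of grayscale intensities, and the function names (`resize_nearest`, `normalize`, `preprocess`) and the toy image are hypothetical. Enhancement, filtering, and RandAugment are not shown; in practice these would likely come from an image-processing library.

```python
from typing import List

# Assumption: a grayscale image stored as rows of pixel intensities.
Image = List[List[float]]


def resize_nearest(img: Image, new_h: int, new_w: int) -> Image:
    """Resize with nearest-neighbour sampling so all images share one shape."""
    old_h, old_w = len(img), len(img[0])
    return [
        [img[r * old_h // new_h][c * old_w // new_w] for c in range(new_w)]
        for r in range(new_h)
    ]


def normalize(img: Image) -> Image:
    """Min-max scale intensities to [0, 1]; a flat image maps to all zeros."""
    flat = [p for row in img for p in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        return [[0.0 for _ in row] for row in img]
    return [[(p - lo) / span for p in row] for row in img]


def preprocess(img: Image, size: int = 2) -> Image:
    """Bring an image to a common square size, then to a common value range."""
    return normalize(resize_nearest(img, size, size))


# Hypothetical 4x4 image with intensities in 0..255.
raw = [
    [0, 0, 128, 128],
    [0, 0, 128, 128],
    [64, 64, 255, 255],
    [64, 64, 255, 255],
]
small = preprocess(raw, size=2)
```

Resizing before normalizing keeps the value range tied to the pixels that are actually retained; either order would serve the consistency goal stated in the text.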

Processing

The third phase of our proposed framework is the analysis of the collected data. We will start by grouping our species images into clusters based on their similarity. We will apply different clustering methods and then we will compare their performance. The performance of the examined clustering approaches will be evaluated by comparing the HCMR dataset’s results to the results collected from the iNaturalist dataset. We will use accuracy, precision, and recall metrics to evaluate the performance of the algorithms. Based on the evaluation results, the best clustering method will be selected and will be fine-tuned to improve its performance.
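One way to score a clustering with accuracy, precision, and recall is to tag each cluster with its most frequent expert label and treat that tag as the prediction for every image in the cluster. The sketch below follows this idea; it is an assumption about how the comparison could be set up, not the evaluation protocol of the framework. The cluster assignments, the labels, and the names `method_a` and `method_b` are hypothetical toy data, and the clustering step itself is assumed to have already run.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence


def majority_tags(cluster_ids: Sequence[int], labels: Sequence[str]) -> Dict[int, str]:
    """Tag each cluster with the most frequent expert label among its images."""
    by_cluster: Dict[int, Counter] = defaultdict(Counter)
    for cid, lab in zip(cluster_ids, labels):
        by_cluster[cid][lab] += 1
    return {cid: counts.most_common(1)[0][0] for cid, counts in by_cluster.items()}


def evaluate(cluster_ids: Sequence[int], labels: Sequence[str]) -> Dict[str, float]:
    """Accuracy plus macro-averaged precision and recall of the cluster tags."""
    tags = majority_tags(cluster_ids, labels)
    predicted = [tags[cid] for cid in cluster_ids]
    pairs = list(zip(predicted, labels))
    accuracy = sum(p == t for p, t in pairs) / len(pairs)
    precisions: List[float] = []
    recalls: List[float] = []
    for cls in sorted(set(labels)):
        tp = sum(p == cls and t == cls for p, t in pairs)
        fp = sum(p == cls and t != cls for p, t in pairs)
        fn = sum(p != cls and t == cls for p, t in pairs)
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / len(precisions),
        "recall": sum(recalls) / len(recalls),
    }


# Hypothetical expert labels (kingdom) for six images.
labels = ["Animalia"] * 4 + ["Plantae"] * 2
# Hypothetical cluster assignments produced by two clustering methods.
candidates = {
    "method_a": [0, 0, 0, 0, 1, 1],
    "method_b": [0, 0, 1, 1, 1, 0],
}
scores = {name: evaluate(ids, labels) for name, ids in candidates.items()}
best = max(scores, key=lambda name: scores[name]["accuracy"])
```

Macro averaging is used so that a small class such as “Plantae” weighs as much as a large one; with an imbalanced dataset, accuracy alone could hide a method that never recovers the minority labels.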

Once the image clustering algorithms have been applied to the two datasets, we will use the results to perform tag and image-variable segmentation following these steps: Firstly, we will identify the clusters by comparing the clustering results of the images to the labels provided in the HCMR dataset. Next, we will assign tags to each cluster based on the labels assigned to the images. For example, a cluster of images of fish could be assigned the tag “fish”. Then, we will analyze the features of the images within each cluster and we will perform image segmentation using variables such as color and shape. After that, we will compare the results of the segmentation to the HCMR dataset to see if there are any similarities or differences in the clusters. Finally, we will be able to draw useful conclusions about the accuracy and reliability of the clustering algorithm’s performance.

Classification

As added value to the image clustering, our framework will offer an optional phase for image classification, using the results of the cluster analysis and, optionally, photos uploaded by users. Unsupervised learning techniques such as clustering can be used as a preparatory step for supervised learning techniques such as classification. This can be achieved by using the resulting clusters to train a classifier, e.g. an artificial neural network or a K-nearest neighbor classifier.

Web Interface

As a final step, the proposed framework will provide a dedicated web interface to visualize the results of image clustering. The interface will enable users to easily browse and view the resulting information. We will have a variety of ways of displaying each cluster, including photo galleries, scatter plots, and heatmaps. Users can navigate through the clusters, and they will be able to view the images within each cluster along with their associated labels. Users can also upload photos of species via the web interface, which will be analyzed and assigned to a cluster based on their similarity. Once a photo has been assigned to a cluster, users will be able to identify the species’ category and access additional information based on its cluster’s labels. A mobile version of this interface will also be accessible via the SocioCoast mobile app to facilitate participation by citizen scientists in the field.

3. Conclusions

In this paper we describe a framework proposed within the SocioCoast project that aims to bring together crowd-sourcing knowledge, image sharing, and institutional (expert) image observations and annotation. Through this framework we highlight the value of an annotated dataset from biodiversity experts and the importance that human categorization brings to enhance the value of well-known biodiversity data sources such as iNaturalist. Furthermore, the design of the proposed framework shows, after detailed image pre-processing steps, how the use of machine learning clustering algorithms could group two different image sources and enrich each one with the other’s variables.

Our future research follows four main directions: Firstly, we plan to test different machine-learning clustering algorithms in order to detect the most accurate one for our datasets, not only to cluster the available images but also to connect the variables from experts with the variables from iNaturalist’s database. Secondly, using the results from the cluster analysis, we aim to propose some additional possible variables which could be added to iNaturalist’s database. Additionally, through the implementation of the web interface component, we will give experts and non-expert users the ability to upload their own datasets and categorize their new content together with the currently available one. Finally, another important future aim is to evaluate the implemented framework with a user experience questionnaire during the training sessions that will take place as part of the dissemination of results near the end of the SocioCoast project.

Acknowledgments

Research for this paper was undertaken in the course of the SocioCoast project of the Cooperation Programme Interreg V-A Greece-Cyprus 2014-2020, co-funded by the European Union (ERDF) and national funds of Greece and Cyprus under the grant agreement No 5050709.

References

[1] M. G. Martinez, Solver engagement in knowledge sharing in crowdsourcing communities: Exploring the link to creativity, Research Policy 44 (2015) 1419–1430.

[2] J. Silvertown, A new dawn for citizen science, Trends in ecology & evolution 24 (2009) 467–471.

[3] M. Papoutsoglou, K. Markakis, L. Chatzivasili, G. Kapitsaki, K. Magoutis, L. Katelaris, C. Bekiari, A framework to enhance smart citizen science in coastal areas, in: Companion Proceedings of the Web Conference 2022, WWW ’22, Association for Computing Machinery, New York, NY, USA, 2022, pp. 1260–1265.

[4] E. D. Cubuk, B. Zoph, J. Shlens, Q. V. Le, RandAugment: Practical automated data augmentation with a reduced search space, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, 2020, pp. 702–703.