
ACM Conference on Recommender Systems, Amsterdam, The Netherlands

indre.zliobaite@helsinki.fi (I. Zliobaite); https://www.zliobaite.com/ (I. Zliobaite)

Recommender systems meet species distribution modelling

Indre Zliobaite

Department of Computer Science, University of Helsinki, Finland

2021


Recommender systems techniques can naturally lend themselves to species distribution modelling if biological species are treated as items and the places where they occur are treated as users. In this setting recommendation scores can reflect which habitats are suited for which species. Recommendation scores can also be used for reconstructing relative abundances of species, and for analysing their rises and declines over millions of years in the past. Analysis of such predictions can shed light on the effects of changing environments on the biosphere now and in the past, as well as help to make predictions for the future. The major potential advantage of the recommender systems treatment over many existing solutions is the large spatial and temporal scale at which such analysis can be done within a single model. A single model makes predictions easier to compare globally in space and over time. While the algorithmic application of recommender systems techniques to species distribution modelling is relatively straightforward, model selection and evaluation are particularly challenging, as there is no possibility for online tests or on-demand sampling, since the past worlds are long gone. Explainability is paramount in these tasks. Here we highlight the main challenges and promising directions in the evaluation of such modelling, which is still in the early stages of development. We show how aggregated prediction statistics and constraints may support reliable model selection and evaluation. We illustrate the approaches on a case study of the mammalian fossil record from Europe around 8 to 17 million years ago.

Keywords: matrix factorization, implicit feedback, species distribution modeling, NOW database

1. Introduction

tems. While multispecies distribution models, which model several species at a time, are coming about [7, 8], their primary focus is still on modeling ecological niches across environmental gradients. This relies on high resolution climatic data, which realistically is not available for the record of the distant past.

Collaborative filtering techniques of recommender systems can be used to model preferences of organisms for different habitats without explicitly characterizing those habitats. This approach would extract patterns of species co-occurrence and extrapolate them over large spatial and temporal scales. Modelling the dynamics of an ecosystem with a single model makes the predictions directly comparable across different species and different biodiversity spots. The inferred model can then be used: 1. for identifying species that would do well but are likely to be missing at sites; 2. for reconstructing relative population sizes from species lists (this is similar to predicting product ratings from transactional data); 3. for tracking co-occurrences and co-evolution over time and space, and for analyzing macroevolutionary processes.

Technical research in recommender systems for species distribution modeling is in early stages, but has already shown promising results [ 9 ].

2. Recommender systems for species distribution modelling

The recommender system task naturally lends itself to the analysis of ecosystems if we consider species as items and sites where they occur as users. The relationship is not the other way around, because each user can consume only a limited number of products, but each product can be consumed by a potentially infinite number of users. Similarly, each site can accommodate a limited number of species, but each species can potentially occur at an infinite number of sites. While both species and sites can potentially be described by features, here we consider the simplest scenario where only occurrence information is available. Thus, we are in the collaborative filtering task setting.
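The mapping of sites to users and species to items can be illustrated with a minimal sketch in plain Python. It turns per-site species lists into a binary occurrence matrix. The function name `build_occurrence_matrix`, the site names, and the choice of rows for sites are assumptions made for illustration only; the genus names are borrowed from the case study.

```python
def build_occurrence_matrix(site_lists):
    """Turn {site: [taxa]} into a binary matrix with one row per site
    (user) and one column per taxon (item)."""
    sites = sorted(site_lists)
    taxa = sorted({t for lst in site_lists.values() for t in lst})
    col = {t: j for j, t in enumerate(taxa)}
    D = [[0] * len(taxa) for _ in sites]
    for i, s in enumerate(sites):
        for t in site_lists[s]:
            D[i][col[t]] = 1  # presence; everything else stays 0 (unknown or absent)
    return sites, taxa, D


# Hypothetical toy data: the site names are invented.
toy = {
    "site_A": ["Gomphotherium", "Anchitherium"],
    "site_B": ["Anchitherium", "Micromeryx", "Heteroprox"],
    "site_C": ["Gomphotherium"],
}
sites, taxa, D = build_occurrence_matrix(toy)

# Sparsity is the share of empty cells in the matrix.
sparsity = 1 - sum(map(sum, D)) / (len(sites) * len(taxa))
```

Zeros in this matrix mean "not reported", not "known to be absent", which is what motivates the implicit feedback treatment discussed later.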

Data for such analysis of ecosystems can come from many sources. Many biodiversity databases maintain records of species diversity today, and the majority of them aggregate data from many sources¹ [10]. Fossil databases, describing ecosystems of the past, are also widely available [11]. There are also databases that aggregate biodiversity databases, for instance, GBIF². Some data may come from expert surveys, research expeditions or professional wildlife monitoring projects; another portion may come from citizen observations. Naturally, even within a single database data can be of varying quality, which is not unlike the user data typically used for recommender systems. Most notably, the uncertainty of absences is higher than the uncertainty of presences. If a species has not been reported at a place, it is uncertain whether it does not occur there or has just not been encountered yet. This applies to species occurrence data at present as well as in the fossil record. Similarly, the presence of a transaction in typical data for recommender systems signals that a user preferred a particular product, while the absence of a transaction might mean either that the user did not like the product or has never come across it.

¹ https://en.wikipedia.org/wiki/List_of_biodiversity_databases
² https://www.gbif.org/

Further challenges arise due to the sparsity of data, especially data on past ecosystems. Each fossil site presents only a tiny fraction of all species that have ever lived. Similarly, one user can realistically watch only a small fraction of the movies that have ever been made. The total number of movies can vary from user to user, just as the number of species can vary from site to site.

Last but not least, synonymy is a challenge for modelling user preferences as it is for analysing the fossil record. Just as the same movie may appear under different titles in different contexts (or countries), the same species can appear under different names in different research communities. Recommender system techniques are generally equipped to be robust to these shared challenges and hopefully can lend their perspectives to species distribution modelling.

Sometimes information on abundances of species at sites might be available. This corresponds to availability of user ratings. Yet availability of relative abundances at large scales is rare. Typically, only lists of species that have occurred at sites are given. This corresponds to transactional user data without ratings. The latter setting calls for recommender systems solutions with implicit feedback [ 12, 13 ]. Such solutions may draw on the repetitiveness of a transaction or the certainty associated with a transaction in general. In the species modelling world certainty can be quantified via qualifiers associated with species identification. Fortunately, neither certainty nor presence-absence information has to be complete; recommender systems typically operate on incomplete information and such is the nature of information about species occurrences.

3. Evaluation challenges and evaluation criteria

Latent factor models [12, 14, 15, 16] have largely dominated collaborative filtering research for over a decade due to their simplicity and effectiveness. For our case study we consider a weighted latent factor model (WFM) [12] for collaborative filtering with implicit feedback.

Let $D_{n \times m}$ be a binary matrix of observed presence or absence of taxa at sites. WFM defines a confidence matrix as $C = 1 + \alpha D$, where $\alpha$ is a parameter that accounts for the asymmetry of uncertainties about presence and absence. The higher the $\alpha$, the more certainty is put on presences in contrast to absences.

WFM factorizes the occurrence matrix into two preference matrices, taking into account the confidences of the transactions: $D \xrightarrow{C} X_{n \times k} \times (Y_{m \times k})^{T}$. Here $k$ is a parameter specifying the dimensionality of the projection.

WFM minimizes the cost function

$$\min_{x_{\star}, y_{\star}} \sum_{i,j} c_{ij} \left( d_{ij} - x_i^{T} y_j \right)^2 + \lambda \left( \sum_i \|x_i\|^2 + \sum_j \|y_j\|^2 \right).$$

Here $c_{ij}$ and $d_{ij}$ are elements of matrices $C$ and $D$ (defined earlier), $x_i$ and $y_j$ are rows of matrices $X$ and $Y$, and $\lambda$ is a regularisation parameter.

All in all, WFM requires setting four parameters: $\alpha$, $k$, $\lambda$ and the number of iterations for minimizing the cost function. Next we need to define quantitative evaluation criteria and a testing procedure for choosing the parameter values.
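To make the objective concrete, the following minimal sketch evaluates a confidence-weighted squared error plus an L2 penalty for given factor matrices, in the spirit of the weighted model of [12]. It only computes the cost and does not perform the minimisation. The names `confidence`, `wfm_cost`, `alpha` (the weight on presences) and `lam` (the regularisation weight) are assumptions chosen for illustration, as is the layout with rows of `X` for sites and rows of `Y` for taxa.

```python
def confidence(D, alpha):
    """Confidence weights: 1 for unobserved cells, 1 + alpha for presences."""
    return [[1.0 + alpha * d for d in row] for row in D]


def wfm_cost(D, X, Y, alpha, lam):
    """Weighted squared reconstruction error plus L2 penalty on all factors.

    D: binary matrix (list of rows); X: one factor vector per row of D;
    Y: one factor vector per column of D.
    """
    C = confidence(D, alpha)
    fit = 0.0
    for i, x in enumerate(X):
        for j, y in enumerate(Y):
            pred = sum(a * b for a, b in zip(x, y))  # inner product of the two factor vectors
            fit += C[i][j] * (D[i][j] - pred) ** 2
    reg = sum(v * v for x in X for v in x) + sum(v * v for y in Y for v in y)
    return fit + lam * reg


# Toy example with a single latent dimension.
D = [[1, 0], [0, 1]]
X = [[1.0], [0.0]]
Y = [[1.0], [0.0]]
cost_weighted = wfm_cost(D, X, Y, alpha=10, lam=0.1)
cost_unweighted = wfm_cost(D, X, Y, alpha=0, lam=0.1)
```

In this toy case the only misfit is one missed presence, so raising `alpha` from 0 to 10 raises its penalty from 1 to 11 while the penalty on wrongly predicted absences would stay at 1.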

While many indirect evaluation approaches exist for recommender systems [17], usually the most reliable is online testing, where users are exposed to different recommender solutions at random. Yet, since species occurrence data is almost always exclusively observational, online evaluation is not an option and we are left to evaluate the model fit based on the observational data used for modelling.

If we wanted the model to reconstruct the observational data as closely as possible, the best approach would be to set the number of internal dimensions as high as possible and to set the regularisation parameter to zero. Such a model would memorise and reconstruct underlying data perfectly but it would not have predictive power, since it would overfit.

Cross-validation, which would normally be used in predictive modeling to avoid overfitting, is not an option since there is no easy way to hold out a separate testing set. Variants of cross-validation have been used for testing autoencoder-based collaborative filtering [18, 19]. They leave out some users for testing, which is possible with autoencoders, since these have explicit model inputs and outputs. That does not straightforwardly apply to latent factor models, however.

For latent factor models we can do pseudo-cross-validation, where individual occurrences are nullified at random [16], and check which parameter settings best reproduce the nullified occurrences. Yet, this is not sufficient either. If we were simply to maximise this leave-one-out accuracy, the optimal solution would be to predict everything as ones, that is, to predict all species to occur everywhere. Clearly, this is not an informative outcome either.
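The weakness of this procedure can be seen in a small sketch. It hides a few presences at random and then measures how many of them a score matrix recovers. The helper names `nullify_presences` and `holdout_recall`, the 0.5 threshold and the toy matrix are assumptions for illustration, not part of the original procedure.

```python
import random


def nullify_presences(D, n_holdout, seed=0):
    """Copy D and set n_holdout randomly chosen presences to zero."""
    rng = random.Random(seed)
    presences = [(i, j) for i, row in enumerate(D) for j, d in enumerate(row) if d == 1]
    held_out = rng.sample(presences, n_holdout)
    D_train = [row[:] for row in D]
    for i, j in held_out:
        D_train[i][j] = 0
    return D_train, held_out


def holdout_recall(scores, held_out, threshold=0.5):
    """Share of hidden presences whose score reaches the threshold."""
    hits = sum(1 for i, j in held_out if scores[i][j] >= threshold)
    return hits / len(held_out)


D = [[1, 1, 0, 0], [1, 0, 1, 1], [0, 1, 0, 0]]
D_train, held_out = nullify_presences(D, n_holdout=2)

# A degenerate "model" that predicts every species everywhere.
all_ones = [[1.0] * len(D[0]) for _ in D]
recall_trivial = holdout_recall(all_ones, held_out)
```

The all-ones predictor recovers every hidden presence, so this score alone cannot separate an informative model from a trivial one.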

Ideally, we want the model not only to reproduce observed occurrences but also to identify the species that are most likely to be missing at sites, as well as to flag potential misidentifications. Thus, predictions must deviate to some degree from the training data in order to be meaningful. Our approach [9] is thus to push the model (1) to predict more occurrences than in the original data while at the same time (2) reproducing the occurrences in the original data reasonably accurately.

The first criterion, pushing the model to predict more positives than in the original data, is easy to achieve by increasing $\alpha$. However, at the same time it is important not to overshoot the carrying capacity of the environments. Species-area relationships are restrictive in the sense that environments can accommodate only a limited number of species [20], which is somewhat predictable from the climatic conditions [21]. While a movie recommender system could potentially keep recommending highly scored movies to the user for as long as the user keeps watching them, an informative species distribution model should recommend a finite number of species that can exist at a site, and this number will certainly vary from site to site. An evaluation criterion that can be used to control for model realism from this perspective could be to require that the total number of recommended species does not exceed, say, 20% of the species that are already there.
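One possible reading of this constraint is sketched below: for each site, count the taxa that are recommended but not reported, and flag the site if that count exceeds 20% of its observed richness. The function name `capacity_violations`, the 0.5 score threshold and the exact interpretation of the 20% rule are assumptions for illustration.

```python
def capacity_violations(D, scores, threshold=0.5, max_extra=0.20):
    """Return indices of sites where newly recommended taxa exceed
    max_extra times the number of taxa already observed there."""
    violating = []
    for i, (d_row, s_row) in enumerate(zip(D, scores)):
        observed = sum(d_row)
        # Newly recommended: not reported in the data, but scored above threshold.
        new = sum(1 for d, s in zip(d_row, s_row) if d == 0 and s >= threshold)
        if new > max_extra * observed:
            violating.append(i)
    return violating


# Toy data: site 0 has 10 observed taxa and 1 new recommendation (within 20%),
# site 1 has 5 observed taxa and 2 new recommendations (above 20%).
D = [
    [1] * 10 + [0] * 5,
    [1] * 5 + [0] * 10,
]
scores = [
    [0.9] * 10 + [0.8, 0.1, 0.1, 0.1, 0.1],
    [0.9] * 5 + [0.8, 0.7] + [0.1] * 8,
]
flagged = capacity_violations(D, scores)
```

A parameter setting that flags many sites would be considered unrealistic under this reading, even if its accuracy on held-out presences is high.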

The second criterion, keeping the occurrences in the original data accurately reconstructed, should rely on a subset of data points for which we have high confidence in both positive occurrences and absences. Repetitive presences (if any are reported) can be considered as true positives. Absences outside the time range of a species, that is, when it is known to have been extinct or had not yet originated, can be considered as true negatives. The latter requires temporal information in the metadata and thus is primarily suitable for fossil data.

With these two targets in mind one can aim at maximising a conventional evaluation metric, for example, the area under the ROC curve (AUC), on a subset of the data that only includes true positives and true negatives, while letting the aggregated statistics of positive predictions over sites and species keep the model from deviating too far from the carrying capacity limits.
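As a minimal sketch of the first part, the area under the ROC curve can be computed on trusted cells only, using its rank interpretation: the probability that a randomly chosen true positive is scored above a randomly chosen true negative, with ties counted as one half. The function names and the toy scores are assumptions for illustration.

```python
def auc(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def trusted_auc(scores, true_pos, true_neg):
    """AUC restricted to cells trusted as true presences or true absences.

    true_pos, true_neg: lists of (site index, taxon index) pairs.
    """
    pos = [scores[i][j] for i, j in true_pos]
    neg = [scores[i][j] for i, j in true_neg]
    return auc(pos, neg)


scores = [[0.9, 0.2, 0.6], [0.4, 0.8, 0.1]]
true_pos = [(0, 0), (1, 1), (1, 0)]  # e.g. repeated presences
true_neg = [(0, 1), (1, 2), (0, 2)]  # e.g. absences outside the taxon's time range
value = trusted_auc(scores, true_pos, true_neg)
```

Here eight of the nine positive-negative pairs are ordered correctly, so the value is 8/9. Cells that are neither trusted presences nor trusted absences do not enter the computation at all.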

4. A case study

Our case study shows an application of matrix factorisation with implicit feedback to reconstructing relative abundances of large plant-eating mammals in Europe from about 17 to about 8 million years ago. This time range captures sites assigned to the European Land Mammal biozones from MN4 to MN12 [22], which include a major faunal turnover. Species occurrence data comes from a public fossil mammal database called NOW [23]. The database records sites where fossils have been found. Age information is assigned to sites, not to individual fossils. Each site has a list of species that have been found there. Some identifications of species may be uncertain, and the database records uncertainty qualifiers. The database also records features that characterize each species, but we have not used this information in this study. We aggregated the data at the genus level rather than analyzing it at the species level. There is no difference from the algorithmic perspective, but this way the results are easier to interpret from the ecology perspective. Details of preprocessing can be found in [9]. The preprocessed dataset contains 104 genera (items), 351 sites (users) and 2616 occurrences (transactions). Sparsity is 93%. The preprocessed dataset used for this case study is available on GitHub³.

Following the principles outlined in the previous section, we monitored the following quantitative performance measures:
1. $\hat{p}_{all}$: mean prediction score over all the data (the prior probability over all the data is $p_{all} = 0.063$) [we want $\hat{p}_{all}$ to be slightly higher than $p_{all}$, but not too much higher];
2. MAE$_{all}$: mean absolute error over all the data [we want it to be small, but not zero];
3. MAE$_{animals}$: mean absolute error over the numbers of occurrences for the animals [we want it to be small, but not zero];
4. MAE$_{sites}$: mean absolute error over how many animals each site hosts [this relates to the carrying capacity, we want the error to be small, but not zero];
5. AUC$_{all}$: area under ROC on all training data [we want it to be close to one];
6. $\hat{p}^{+}_{pos}$: mean prediction score over true positives [ideally, we want $\hat{p}^{+}_{pos} = 1$];
7. $\sigma(\hat{p}^{+}_{pos})$: the standard deviation;
8. $\hat{p}^{+}_{neg}$: mean prediction score over true negatives [ideally, we want $\hat{p}^{+}_{neg} = 0$];
9. $\sigma(\hat{p}^{+}_{neg})$: the standard deviation;
10. AUC$^{+}$: area under ROC on selected true positives and true negatives [we want it to be close to one];
11. $\sigma(\mathrm{AUC}^{+})$: the standard deviation.
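The first four measures are simple aggregates of the observed matrix and the score matrix. The sketch below shows one way to compute them, assuming that the predicted number of occurrences of an animal, or the predicted richness of a site, is the sum of the corresponding scores. That definition, the function name `aggregate_stats` and the toy matrices are assumptions for illustration.

```python
def aggregate_stats(D, S):
    """Aggregate measures for an observed binary matrix D and scores S
    (both lists of rows, rows = sites, columns = animals)."""
    n, m = len(D), len(D[0])
    cells = n * m
    p_all = sum(map(sum, D)) / cells      # prior: share of observed presences
    p_hat_all = sum(map(sum, S)) / cells  # mean prediction score
    mae_all = sum(abs(D[i][j] - S[i][j]) for i in range(n) for j in range(m)) / cells
    # Per-site error in how many animals the site hosts.
    mae_sites = sum(abs(sum(D[i]) - sum(S[i])) for i in range(n)) / n
    # Per-animal error in the number of occurrences.
    mae_animals = sum(
        abs(sum(D[i][j] for i in range(n)) - sum(S[i][j] for i in range(n)))
        for j in range(m)
    ) / m
    return {
        "p_all": p_all,
        "p_hat_all": p_hat_all,
        "mae_all": mae_all,
        "mae_sites": mae_sites,
        "mae_animals": mae_animals,
    }


D = [[1, 0], [1, 1]]
S = [[1.0, 0.5], [0.75, 1.0]]
stats = aggregate_stats(D, S)
```

In this toy case the mean score (0.8125) is somewhat above the prior (0.75), which is the direction the first measure asks for.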

We tested around 200 parameter setting variants via a grid search in the 3-dimensional model parameter space $(\alpha, k, \lambda)$. We kept the number of model fitting iterations fixed to 10. Instead of testing on all true positives and true negatives, we randomly selected 10 of each and repeated this 10 times for each model. This saved computational costs and sidestepped the challenge of class imbalance. We initialised the factor matrices by drawing random values from the normal distribution with zero mean and unit variance. It took a couple of minutes to fit one model using an ad hoc implementation in R on a commodity laptop.
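The repeated balanced sampling described above can be sketched as follows: draw 10 trusted positives and 10 trusted negatives, compute the AUC, repeat 10 times, and report the mean and standard deviation. This is an illustrative Python sketch, not the R implementation used in the study; the function names, the seed handling and the toy scores are assumptions.

```python
import random
import statistics


def auc(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def subsampled_auc(pos_scores, neg_scores, n_each=10, n_repeats=10, seed=0):
    """Mean and standard deviation of AUC over repeated balanced subsamples."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_repeats):
        pos = rng.sample(pos_scores, n_each)  # sampling without replacement
        neg = rng.sample(neg_scores, n_each)
        values.append(auc(pos, neg))
    return statistics.mean(values), statistics.stdev(values)


# Toy scores: 30 trusted positives, all scored above the 50 trusted negatives.
pos_scores = [0.60 + 0.01 * i for i in range(30)]
neg_scores = [0.10 + 0.01 * i for i in range(50)]
mean_auc, sd_auc = subsampled_auc(pos_scores, neg_scores)
```

Because each subsample has equal numbers of positives and negatives, the much larger pool of trusted negatives does not dominate the estimate.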

³ https://github.com/zliobaite/fossilrec

Here the preference score for a species to occur at a site comes from the model, and the presence-absence matrix restricts the sum to taxa that are reported to be present at that site. The subtraction of 0.5 from the probability score is an arbitrary cutoff implying that preference scores below 0.5 signal absence. In this study we only analyse the relative abundances of animals that are present, but in principle the recommender systems approach would also allow the estimation of potential relative abundances of animals that are absent. The challenge is how to keep the total number of recommended animals contained and in line with the carrying capacity of the environment, as discussed earlier. This is an open question for further research.
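One plausible reading of this computation is sketched below. For a single site, the score of each reported taxon is reduced by the 0.5 cutoff and the results are normalised to sum to one. The exact formula is not reproduced here, so the normalisation, the clipping of negative values at zero and the function name `relative_abundances` are assumptions.

```python
def relative_abundances(present, scores, cutoff=0.5):
    """Relative abundances at one site from preference scores.

    present: 0/1 flags per taxon (reported at the site or not);
    scores: model preference scores per taxon.
    Only reported taxa contribute; scores at or below the cutoff contribute zero.
    """
    excess = [
        max(s - cutoff, 0.0) if d > 0 else 0.0
        for d, s in zip(present, scores)
    ]
    total = sum(excess)
    if total == 0:
        return [0.0 for _ in excess]
    return [e / total for e in excess]


present = [1, 1, 0, 1]
scores = [1.0, 0.75, 0.9, 0.5]
shares = relative_abundances(present, scores)
```

In this example the third taxon has a high score but is not reported at the site, so it receives no share, and the fourth taxon sits exactly at the cutoff and also receives none.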

Table 1 shows the results. We see that the order from the most abundant to the least abundant animals (genera) is not too far off, but the predictions for rare animals are too high.

Table 1. Columns: animal genus; % fragments found; recommendation score; % predicted. Rows: Gomphotherium, Anchitherium, Prosantorhinus, Tethytragus, Micromeryx, Heteroprox.

5. Concluding remarks

Recommender systems approaches open new perspectives for analysing ecosystems and species distribution modelling. Reliable evaluation of such approaches is an open challenge. Here we outline several evaluation criteria that are based on domain knowledge about ecosystems. Hopefully, similar solutions can be useful in user modelling applications of recommender systems as well. It is curious to learn that the two settings are more similar than they may look at first sight.

Our case study showed that an off-the-shelf matrix factorisation approach already works reasonably well for fossil species distribution modelling, but many methodological challenges remain. Open directions for future research include building time and energy constraints into the models. As not all species are alive at all times, models could take constraints of species being alive into their optimisation criteria. As ecosystems vary in energy (for example, tropical forests produce much more edible biomass than semi-deserts), models could incorporate such constraints as well. In the product world this would correspond to one customer having much more purchasing power than another. At a larger scale, different epochs with different climates may be considered as different contexts, where context-aware recommender systems [26] can offer better treatment. Finally, there is a large potential for blending occurrence information with descriptive features of animals and sites, drawing on recent works in reconstructing past environments [27, 28, 29, 30, 31]. The ultimate purpose is to understand how the living world was in the past, when it is livable, and how it works in general.

6. Acknowledgments

The author is grateful to two anonymous reviewers for insightful feedback. Research leading to these results was partially supported by the Academy of Finland (grants no. 314803, 341623).

References

[1] Elith, Leathwick, Species distribution models: Ecological explanation and prediction across space and time, Annual Review of Ecology, Evolution, and Systematics 40 (2009) 677–697.

[2] Pearson, Species' distribution modeling for conservation educators and practitioners, Lessons in Conservation 3 (2010) 54–89.

[3] Pollock, Tingley, Morris, Golding, R. O'Hara, Parris, Vesk, M. McCarthy, Understanding co-occurrence by modelling species simultaneously with a joint species distribution model (JSDM), Methods in Ecology and Evolution 5 (2014) 397–406.

[4] Tikhonov, Opedal, Abrego, Lehikoinen, M. d. Jonge, Oksanen, Ovaskainen, Joint species distribution modelling with the r-package Hmsc, Methods in Ecology and Evolution 11 (2020) 442–447.

[5] Myers, Stigall, B. Lieberman, PaleoENM: applying ecological niche modeling to the fossil record, Paleobiology 41 (2015) 226–244.

[6] Varela, Lobo, Hortal, Using species distribution models in paleobiogeography: A matter of data, predictors and concepts, Palaeogeography, Palaeoclimatology, Palaeoecology 310 (2011) 451–463.

[7] Dunstan, Foster, Darnell, Model based grouping of species across environmental gradients, Ecological Modelling 222 (2011) 955–963.

[8] Hui, Taskinen, Pledger, Foster, Warton, Model-based approaches to unconstrained ordination, Methods in Ecology and Evolution 6 (2015) 399–411.

[9] I. Zliobaite, Recommender systems for fossil species distribution modelling, under review (2021).

[10] Ball-Damerow, Brenskelle, Barve, Soltis, Sierwald, Bieler, LaFrance, A. Arino, R. Guralnick, Research applications of primary biodiversity databases in the digital age, PLoS ONE 14 (2019) e0215794.

[11] Uhen, Barnosky, Bills, Blois, Carrano, Carrasco, et al., From card catalogs to computers: databases in vertebrate paleontology, Journal of Vertebrate Paleontology 33 (2013) 13–28.

[12] Hu, Koren, Volinsky, Collaborative filtering for implicit feedback datasets, in: Proceedings of the 8th IEEE International Conference on Data Mining, IEEE ICDM, 2008, pp. 263–272.

[13] Verstrepen, Collaborative Filtering with Binary, Positive-only Data, Ph.D. thesis, Universiteit Antwerpen, 2015.

[14] Gopalan, Hofman, Blei, Scalable recommendation with hierarchical poisson factorization, in: Proc. of the Thirty-First Conference on Uncertainty in Artificial Intelligence, 2015, pp. 326–335.

[15] Salakhutdinov, Mnih, Probabilistic matrix factorization, in: Advances in neural information processing systems, 2008, pp. 1257–1264.

[16] Ning, G. Karypis, Slim: Sparse linear methods for top-n recommender systems, in: IEEE International Conference on Data Mining, ICDM, 2011, pp. 497–506.

[17] Herlocker, Konstan, Terveen, Riedl, Evaluating collaborative filtering recommender systems, ACM Trans. Inf. Syst. 22 (2004) 5–53.

[18] D. Liang, R. Krishnan, M. Hoffman, T. Jebara, Variational autoencoders for collaborative filtering, in: Proc. of the 2018 World Wide Web Conference, WWW'18, 2018, pp. 689–698.

[19] H. Steck, Embarrassingly shallow autoencoders for sparse data, in: Proc. of the 2019 World Wide Web Conference, WWW'19, 2019, pp. 3251–3257.

[20] S. Cain, The species-area curve, The American Midland Naturalist 19 (1938) 573–581.

[21] H. Hillebrand, On the generality of the latitudinal diversity gradient, The American Naturalist 163 (2004).

[22] F. Hilgen, L. Lourens, J. van Dam, The neogene period, in: F. Gradstein, J. Ogg, M. Schmitz, G. Ogg (Eds.), The Geologic Time Scale 2012, Elsevier, 2012, pp. 923–978.

[23] The NOW Community, New and old worlds database of fossil mammals (now), Licensed under CC BY 4.0, http://www.helsinki.fi/science/now/, 2020.

[24] M. Domingo, D. Martin-Perea, L. Domingo, E. Cantero, J. Cantalapiedra, B. Garcia-Yelo, A. Gomez-Cano, G. Alcalde, O. Fesharaki, M. Hernandez-Fernandez, Taphonomy of mammalian fossil bones from the debris-flow deposits of somosaguas-north (middle miocene, madrid basin, spain), Palaeogeography, Palaeoclimatology, Palaeoecology 465 (2017) 103–121.

[25] A. K. Behrensmeyer, S. M. Kidwell, R. A. Gastaldo, Taphonomy and paleobiology, Paleobiology 26 (2000) 103–147.

[26] G. Adomavicius, B. Mobasher, F. Ricci, A. Tuzhilin, Context-aware recommender systems, AI Magazine 32 (2011) 67–80.

[27] L. Liu, K. Puolamaki, J. Eronen, M. Mirzaie Ataabadi, E. Hernesniemi, M. Fortelius, Dental functional traits of mammals resolve productivity in terrestrial ecosystems past and present, Proceedings of the Royal Society B 279 (2012) 2793–2799.

[28] M. Fortelius, I. Zliobaite, F. Kaya, F. Bibi, R. Bobe, L. Leakey, et al., An ecometric analysis of the fossil mammal record of the turkana basin, Philosophical Transactions of the Royal Society: Biological Sciences 371 (2016).

[29] A. Barr, Ecomorphology: Reconstructing cenozoic terrestrial environments and ecological communities, in: D. Croft, D. Su, S. Simpson (Eds.), Methods in Paleoecology: Reconstructing Cenozoic Terrestrial Environments and Ecological Communities, Springer, 2018, pp. 339–349.

[30] W. Vermillion, J. Head, P. Polly, J. Eronen, A. Lawing, Ecometrics: A trait-based approach to paleoclimate and paleoenvironmental reconstruction, in: D. Croft, D. Su, S. Simpson (Eds.), Methods in Paleoecology: Reconstructing Cenozoic Terrestrial Environments and Ecological Communities, Springer, Cham, 2018, pp. 373–394.

[31] T. Faith, A. Du, J. Rowan, Addressing the effects of sampling on ecometric-based paleoenvironmental reconstructions, Palaeogeography, Palaeoclimatology, Palaeoecology 528 (2019) 175–185.

Table: Performance evaluation of models with different parameter settings. 10 best results within each evaluation criterion are highlighted in bold. [Table values not recoverable.]