Analyzing water monitoring data with
                RCA-based approaches

    Xavier Dolques1 , Agnès Braud1 , Corinne Grac2,3 , and Florence Le Ber1

(1) Université de Strasbourg, CNRS, ENGEES, ICube UMR 7357, F67000 Strasbourg
     {dolques,agnes.braud}@unistra.fr florence.leber@engees.unistra.fr
     (2) Université de Strasbourg, CNRS, LIVE UMR 7362, F67000 Strasbourg
                           (3) ENGEES, F67000 Strasbourg
                          corinne.grac@engees.unistra.fr


       Abstract. This paper is a short feedback on a collaborative research
       work by computer scentists and hydroecologists. We have applied Rela-
       tional Concept Analysis on complex data about running water character-
       istics (physical, biological and chemical parameters), to answer various
       questions. Two approaches are presented and discussed: the first one ex-
       tracts patterns from temporal data, the second one extracts rules from
       a multi-relational dataset.


1    Introduction

For almost ten years, we have been working on watercourse monitoring data
and, among other approaches, we have exploited Relational Concept Analysis
(RCA) [8]. We have focused on temporal characteristics, that are inherent in
monitoring data, but also on relations betweeen the different parts of the system
being monitored (biological as well as physical and chemical characteristics of
the watercourses). We have developed RCA-based approaches to extract tem-
poral patterns from sequential data, or to extract rules from complex relational
data, results being always assessed by domain experts. This paper is a short
feedback on this collaborative research work. It is organized as follows. Section
2 describes the datasets, Sect. 3 gives a short explanation on RCA functioning.
The approaches used are described in Sect. 4 and discussed in Sect. 5.


2    Data characteristics

Data were collected during the ANR Fresqueau project (2011-2015)1 and or-
ganized into a database (60 million records and 20 gigabytes). The study area
covers 161,100 km² in the east of France, for the 2000-2010 period. We collected
five categories of water data from 21 different sources (mainly public data banks
or research projects): 1) river quality data; 2) sampling site data; 3) hydrographic
network data; 4) human activities data and 5) driving data.
1
  http://dataqual.engees.unistra.fr/fresqueau_presentation_gb
RealDataFCA'2021: Analyzing Real Data with Formal Concept Analysis, June 29,
2021, Strasbourg, France
       Copyright © 2021 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).
    River quality data are temporal data, and the most complex part of the data
as detailed below. They are divided into three sub-categories: physical (e.g.,
the dimensions and shape of the river bed, characteristics of the substrate),
physico-chemical (like pH, nitrate and phosphate, pesticides in the water or
sediments) and biological data (i.e. lists of fauna and flora taxa, and metrics
on these taxa, e.g., total abundance, diversity, and biological indices) for four
groups: macroinvertebrates, diatoms, macrophytes and fishes. Biological indices
are defined according to French standards (e.g., the French macroinvertebrate
index, IBGN [1] for invertebrates). Taxa are associated to their life traits (like
the type of respiration, or the habitat preference).
    The other four categories of data are mostly geographic data. Sampling site
data give the location of the sites where the river quality data were sampled
and their main features. A sampling site is considered as a point. Hydrographic
data comprise the different segments of running waters or waterbodies, the size
of watersheds, administrative regions, and additional information on the hydro-
graphic network. Human activity data allow to estimate anthropogenic pressures
on running waters: land use, impediments to flow, location of discharges. Driving
data concern forcing or context variables such as climate (for instance, average
atmospheric temperature, precipitations), flows, geology or administrative infor-
mation. They allow to characterize the environment of the running waters and
sampling sites.
    Building a database integrating data coming from different sources was not
easy. For example, concerning taxa, although the data are based on a standard,
their identifiers may evolve over time so that it is difficult to combine data from
different years. Besides, geometric data may be imprecise: it may be hard to
determine whether a sample site is placed on a watercourse or another one.
    Then these data are highly heterogeneous in their values (quantitative con-
tinuous or discrete, semi-quantitative or qualitative), temporal variability (fre-
quency and duration of sampling) and topological structure (with a geometry
or not). They may be simple measures of a parameter (e.g., a pH value) or a
complex index using different metrics (e.g., IBGN) or based on expert knowl-
edge. For instance physico-chemical measures are collected 4 to 6 times per year,
while biological samples are done at most once a year, and physical measures
even more rarely. Moreover, some sample sites may require stronger monitoring,
based on more parameters, in particular pesticides, so that some parameters are
less abundant in the database. Let us also notice that depending on the area,
taxa may differ.
    We have proposed some approaches based on RCA in order to deal with
some of these problems when analyzing the data, starting from questions asked
by experts.


3   RCA basics

Relational Concept Analysis [8] is an extension of Formal Concept Analysis [6]
which considers relational data, formalized within a relational context family
                     Table 1. Relational Context Family example.

               object-attribute contexts               object-object contexts
                                                             Atheri- Bithy- Boreob-
            taxons ≤1 year > 1 year                 taxon-
                                                  -Presence -cidae -nia -della
           Athericidae x
                                                  BREI0001                     x
            Bithynia   x      x
                                                   BRUN001     x       x
           Boreobdella        x
                                                   FECH001     x               x
                small   fresh, running phreatic
    stations
             watercourse watercourse stream
    BREI0001      x            x
    BRUN001                               x
    FECH001                    x


(RCF), F = {K, R} where K is a set of object-attribute contexts (each context
corresponding to an object category) and R is a set of object-object contexts
(relations between objects of various categories).
    The principle of RCA consists in integrating object-object relations as new
attributes (called relational attributes) in the formal contexts of K thanks to
scaling quantifiers, such as the existential (exist) or universal strict (exist+forall )
scaling quantifiers. It produces iteratively a set of concept lattices (one lattice per
object category) interconnected through relational information. The concepts
in a given lattice group objects according to the shared attributes and to the
connections they have with objects of another category. The result is a family
of concept lattices where concepts of a lattice are linked to concepts of other
lattices. We illustrate this with a small example (from [4]). Table 1 introduces
two object-attribute contexts, one about taxa and their life traits, one about river
sites and their physical characteristics, and an object-object context linking taxa
to the sites where they have been found.
    First, the RCA process builds lattices on the two object-attribute contexts,
stations and taxons (Fig. 1). Then the stations context is extended by re-
lational attributes linking stations objects to taxons concepts, based on the
taxonPresence context. For example, ∃taxonPresence : Concept 2 means that
at least one object of Concept 2 in the taxons lattice is present on each stations
object that owns this attribute. The lattice built on this extended context is
shown in Fig. 1 (right). In this example, the process stops here since there is
only one (one-way) relation linking the two contexts.


4     Questions and Approaches

For domain experts, the general question is to link physical and physico-chemical
data to biological ones, the first ones giving an instantaneous information, the
second one giving a long term integrative information. It covers more specific
questions, for example, how can values of biological indices be explained by
                                                        Concept_4                                                  Concept_4
                                                                                                           ∃ taxonPresence : Concept_0
                                                                                                           ∃ taxonPresence : Concept_3


           Concept_0                        Concept_6                 Concept_8
                                                                   Concept_0                       Concept_6                        Concept_10
                                      fresh and running water        phreatic stream         fresh and running water     ∃ taxonPresence : Concept_1
                                            FECH001                     BRUN001

   Concept_1              Concept_3            Concept_5                                                                                     Concept_8
                                                           Concept_1            Concept_3      Concept_5           Concept_9
    ≥ 1 year              < 1 year         small watercourse1 year              < 1 year    small watercourse
                                                                                                                                           phreatic stream
                                                                                                                                     ∃ taxonPresence : Concept_2
   Athericidae        Boreobdella              BREI0001Athericidae          Boreobdella        BREI0001            FECH001
                                                                                                                                             BRUN001
            Concept_2                                Concept_7
                                                                   Concept_2                                       Concept_7

               Bithynia
                                                                     Bithynia
                 taxons                                 stations       taxons                                            stations


Fig. 1. Lattices of the two object-attribute contexts of Table 1 – initialization (left)
and first step (right) of RCA – Only stations lattice changes


the preceding successive measures of physico-chemical parameters? What is the
relation between the values of physico-chemical or physical parameters and the
life traits of taxa living in a site? The first question raises the problem of dealing
with temporality and it has been undertaken with a pattern mining approach
[5] and then by RCA [9]. Moreover, working on biological quality requires to
overcome the difficulty of analyzing sites with different taxa. This problem has
been tackled by working with biological indices in the pattern mining approach,
one of their aims being to make biological quality comparable between sites.
For the second question, it has been tackled by working on life traits, and the
question has been undertaken by a RCA-based rule mining approach [3].

Analyzing sequences of physico-chemical and biological measures. In [9], we fo-
cused on sequences of physico-chemical measures (6 measures per year) ending
with biological samples (one per year). The selection and preprocessing of the
data were done under the supervision of domain experts. Physico-chemical mea-
sures were discretized into qualitative scales. Biological samples were synthetized
into qualitative indices (five levels from red to blue, corresponding to quality).
The question was to explain the biological quality wrt the physico-chemical
quality assessed during the last months. Sequences were encoded into an RCF,
according to the schema shown in Fig. 2: each rectangle corresponds to an object-
attribute context, while the arrows correspond to object-object contexts.
    Then data are processed as follows. Firstly, RCA is applied to the RCF in
order to obtain a family of concept lattices. Secondly, the interrelated concepts
from the RCA result, are navigated to extract a set of sequential patterns for each
concept of the BioSamples lattice. A pattern is actually a directed graph, where
the various paths represent sequences of parameter values preceding a biological
sample. Iceberg lattices [11] have been used to select patterns with the highest
support (i.e. represented in many sample sites). Figure 3 shows an example
of an extracted pattern: it summarizes a set of sequences of physico-chemical
      Fig. 2. Modeling sequences of physico-chemical and biological samples [9]


                     Fig. 3. An example of a sequential pattern


parameter values measured before a biological index (IBGN) with red value. The
pattern is read as follows: in all sequences, an orange value for AZOT (nitrogen
except nitrate) and a red value for PEST (pesticides) have been measured before
a green value for NITR (nitrate) and a red value for AZOT occurring at the
same moment. Also, a red value for PHOS (phosphorous) has been measured
after the red value for PEST. According to expert domains, this pattern can
be interpreted as follows: the quality values of physico-chemical parameters are
consistent with the biological value, macroinvertebrates being sensitive to high
rates of pesticides and ammonium.


Extracting rules linking taxa life traits and site physical and physico-chemical
characteristics. In [3], we tried to connect physical characteristics of sample sites,
physico-chemical parameters, and the traits of taxa living in sample sites. We
therefore developed an RCA-based method using AOC-posets to deal with large
datasets. This approach allowed to provide a reasonable number of concepts and
to extract meaningful implication rules (association rules whose confidence is 1).
In order to offer more flexibility on the quantification of taxa on sample sites than
with the existing exist and forall scaling operators, new scaling operators were
defined and experimented, providing different semantics for the rules. Data were
modeled as shown in Fig. 4. Detailed temporal information (physico-chemical
parameters) was aggregated into annual values and then discretized into five
levels. Taxa numbers were also discretized (taxa weakly to highly represented).
An example of rule is given below, where a percent-quantifier is used [3]: such
a quantifier allows to build relational attributes with for example, the form
∃∀>n% r(C); an object owning this attribute is linked to at least n% of C objects
                         Level of Physico-
                      chemical parameter
                                             Physico-chemical
                                               parameters

                           Stream
                            Sites

                                        Macro-                   Life
                      population of invertebrates life traits of Traits
                macro-invertebrates           macro-invertebrates

Fig. 4. Modeling the links between physico-chemical parameters and the life traits of
taxa (macro-invertebrates) living in sample sites [3]


with the r relation [2].

            S>50% high population(∃strong affinity(slow current))
            →     ∃bad state(hydrology)

This rule means that a sample site having more than half of its highly represented
taxons preferring slow current, has a bad hydrological state. According to the
experts, it corresponds to small disconnected phreatic streams, with almost no
current, that are specific of the Alsace plain.


5   Discussion and Conclusion
In the following we describe some problems we have faced with the data and
novelties that stemmed from these works. Note that all these experiments led to
new proposals to make the use of RCA easier [10].
    First of all, ecosystems are very complex entities, influenced by many pa-
rameters. We have tried to integrate many of these parameters in our database,
but handling all of them in a single analysis is not really possible, and even for
experts to fully understand such results involving many information types. We
have thus worked on subproblems, handling a smaller but consistent part of pa-
rameters, based on an expert question. Also, data like biological indices already
integrate several parameters in one measure. Their definition has been proposed
by groups of specialists and it is pertinent to use them when studying water
quality, in particular to overcome the difference of taxa between areas, but at
the same time they are not raw data and represent a kind of bias.
    Regarding the pattern mining approach, given the sequence set of biological
and physico-chemical samples, we found hierarchies of multilevel cpo-patterns
that summarize the impact of physico-chemistry to biology. The hierarchical
representation allows to enhance the analysis of the extracted set of sequential
patterns. Nevertheless, there are too many patterns, and relevant interestingness
measures must be chosen. Another problem is due to the irregular distribution
of biological index values, leading to more or less frequent, complex, and infor-
mative patterns for the different values. Regarding the rule mining approach:
using AOC-posets causes loss of concepts, and some interesting rules will thus
not be found. Other techniques should be explored for processing the complexity
of this relational dataset.
    With respect to classical approaches in the hydroecological domain, our ap-
proaches are original since most work are based on statistical analysis or super-
vised machine learning methods (see e.g., [7, 12]).


References
 1. AFNOR: Qualité de l’eau : détermination de l’Indice Biologique Global Normalisé
    (IBGN). NF T90-350 (1992)
 2. Braud, A., Dolques, X., Huchard, M., Le Ber, F.: Generalization effect of quantifiers
    in a classification based on relational concept analysis. Knowl.-Based Syst. 160,
    119–135 (2018)
 3. Dolques, X., Le Ber, F., Huchard, M., Grac, C.: Performance-friendly rule extrac-
    tion in large water data-sets with AOC posets and relational concept analysis. Int.
    J. General Systems 45(2), 187–210 (2016)
 4. Dolques, X., Le Ber, F., Huchard, M., Nebut, C.: Relational Concept Analysis for
    Relational Data Exploration. Adv. in Know. Disc. and Manag. 5, 55–77 (2016)
 5. Fabrègue, M., Braud, A., Bringay, S., Grac, C., Le Ber, F., Levet, D., Teisseire,
    M.: Discriminant temporal patterns for linking physico-chemistry and biology in
    hydro-ecosystem assessment. Ecological Informatics 24, 210–221 (2014)
 6. Ganter, B., Wille, R.: Formal Concept Analysis: Mathematical Foundations.
    Springer Verlag (1999)
 7. Goethals, P.L., Dedecker, A., Gabriels, W., Lek, S., Pauw, N.: Applications of
    artificial neural networks predicting macroinvertebrates in freshwaters. Aquatic
    Ecology 41(3), 491–508 (2007)
 8. Hacene, M.R., Huchard, M., Napoli, A., Valtchev, P.: Relational concept analysis:
    mining concept lattices from multi-relational data. Ann. Math. Artif. Intell. 67(1),
    81–108 (2013)
 9. Nica, C., Braud, A., Dolques, X., Le Ber, F., Huchard, M.: Exploring temporal
    data using relational concept analysis – an application to hydroecology. In: Concept
    lattices and their applications (CLA 2016). pp. 1–13. CEUR Workshop Proc. (2016)
10. Ouzerdine, A., Braud, A., Dolques, X., Huchard, M., Le Ber, F.: Adjusting the
    exploration flow in Relational Concept Analysis – An experience on a watercourse
    quality dataset. Adv. in Know. Disc. and Manag. 9 (2021)
11. Stumme, G., Taouil, R., Bastide, Y., Pasquier, N., Lakhal, L.: Computing iceberg
    concept lattices with Titanic. Data & Know. Eng. 42(2), 189 – 222 (2002)
12. Villeneuve, B., Souchon, Y., Usseglio-Polatera, P., Ferréol, M., Valette, L.: Can
    we predict biological condition of stream ecosystems? a multi-stressors approach
    linking three biological indices to physico-chemistry, hydromorphology and land
    use. Ecol. Indic. 48, 88–98 (2015)