<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evaluating Formal Concept Analysis Software for Anomaly Detection and Correction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nassif Saab</string-name>
          <email>nassif.saab@lirmm.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marianne Huchard</string-name>
          <email>marianne.huchard@lirmm.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pierre Martin</string-name>
          <email>pierre.martin@cirad.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CIRAD, UPR AIDA</institution>
          ,
          <addr-line>F-34398 Montpellier</addr-line>
          ,
          <country country="FR">France;</country>
          <institution>AIDA, Univ Montpellier, CIRAD</institution>
          ,
          <addr-line>Montpellier</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>LIRMM, Univ Montpellier</institution>
          ,
          <addr-line>CNRS, Montpellier</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <abstract>
        <p>Data cleaning is a process that precedes data mining. Particularly, in our dataset on pesticidal plant use, several types of anomalies were identified, ranging from incorrect values to a lack of data susceptible of causing users to draw wrong conclusions during its exploration. Literature presents three methods based on Formal Concept Analysis (FCA), i.e. implication rules computation, association rules computation, and attribute exploration, that may allow the detection and correction of anomalies. This paper evaluates 30 FCA-based software and their apposite features to the development of an anomaly detection and correction method applicable to our dataset. Results show that only ConExp and its reimplementations provide all three methods. Since the data model on plant use is relational but ConExp only allows formal contexts as input, this paper concludes on the importance of integrating Relational Concept Analysis (RCA) with ConExp in future work.</p>
      </abstract>
      <kwd-group>
        <kwd>Correction</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Data cleaning is a basic operation of the knowledge discovery process [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Particularly, the
Knomana Knowledge Base (KKB) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] includes 45,000 descriptions of pesticidal plant use extracted
from literature. In Knomana, a description of plant use is made of 71 data types, for example,
the pesticidal plant name, the location and part of the plant used to build essential oils. Various
anomalies were observed within a description, such as incorrect spellings of words, e.g. plant
names, and wrong types of values, e.g. integers where strings are expected.
      </p>
      <p>
        Additionally, computing the Duquenne-Guigues basis of implications [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] for Knomana allowed
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] to identify another type of anomaly, i.e. the lack of data that may cause users of KKB to
draw wrong conclusions regarding plant use. For instance, the presence of only one pesticidal
plant to treat a disease may prompt a user to conclude that this disease can only be treated
using this plant. Thereupon, anomalies should be detected and corrected before providing the
user with software to explore KKB.
      </p>
      <p>To this end, we plan to develop an Anomaly Detection and Correction (ADC) software. Our
ADC method will be based on Formal Concept Analysis (FCA) [5]. Since the Knomana data
model employs ternary relationships [6], this method will rely on Relational Concept Analysis
(RCA) [7], an extension of FCA intended for datasets conforming to the entity-relationship
model. A preliminary step in this development process is to assess whether existing FCA-based
software support ADC. Software that fit this description and meet other requirements, including
being cross-platform, will be adapted for RCA in future work.</p>
      <p>The objective of this paper is to identify such software through an evaluation of several
criteria. Section 2 presents FCA-based ADC methods found in literature. Section 3 introduces
the evaluated software, the dataset and the criteria used for the evaluation. Section 4 shows the
results of the evaluation. Section 5 concludes the paper and describes the next step in our work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. FCA-based methods in literature</title>
      <p>This section reviews the literature on ADC and introduces three ADC methods based on FCA.
Literature presents various ADC approaches, including statistics [8], neural networks [9], deep
learning [10], clustering [11], nearest neighbor search [12], and the following three FCA-based
ADC methods.</p>
      <p>The authors in [13] improve Resource Description Framework data by employing implications
rules. Since the confidence percentage of an implication is 100%, the confidence of its inversion
informs the user of its proximity to being a definition. Thus, an inverted rule with a high
confidence suggests a need for additional data to complete the set of triples.</p>
      <p>The method presented in [14] combines FCA and association rules to spot faults during the
software debugging process. Execution traces are first used to compute association rules, which
are then transformed into a formal context (with the source code line numbers as attributes)
and the corresponding concept lattice is computed. Concepts located at the bottom of the lattice
contain rules that are too specific to explain the error, and the ones at the top contain concepts
common to all failed executions.</p>
      <p>In [15], attribute exploration (detailed in [16]) is applied to a human healthcare dataset
in order to show that knowledge acquired through the exploration process and knowledge
provided by domain experts, if combined, can improve the classification accuracy. This is an
interactive method based on the computation of the Duquenne-Guigues basis of implications.
The role of the expert in this process is to validate each rule. When an invalid rule is presented,
the expert provides a counterexample.</p>
      <p>Each of these papers explores a method adapted to managing potential anomalies: [13] and
[14] compute implication and association rules, respectively. The authors in [15] perform
attribute exploration, thus supplementing the computed implication rules with user knowledge
to correct the anomalies. Since we intend to include these three methods in our future ADC
software, they are considered in the evaluation conducted in the sections that follow.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methods</title>
      <p>This section introduces the evaluated software, the dataset and the criteria used for the evaluation.
The evaluated software are 30 of the listed FCA software on U. Priss’ website 1. Table 1 describes</p>
      <sec id="sec-3-1">
        <title>1https://upriss.github.io/fca/fcasoftware.html</title>
        <p>the assessed versions. Having a cross-platform software 2 facilitates its distribution, and therefore,
verifying this criterion is important for the ADC software development described in Section 1.</p>
        <p>The evaluation was conducted using the example of formal context provided by [16] on eight
European countries (Albania, Vatican, Switzerland, Austria, Cyprus, San Marino, Liechtenstein,
and Sweden) and their memberships regarding seven organisations (European Union, Council
of Europe, European Economic Area, European Free Trade Association, EU Custom Union,
Eurozone, and Schengen area). Interest in this dataset derives from the size of the formal context,
small enough not to crash the software while still outputting meaningful results. Additionally,
the implications and the attribute exploration process are fully described by the authors.</p>
        <p>Software were evaluated according to five criteria: the first two, i.e. context editing and lattice
building, consider FCA capabilities. Each of the last three, i.e. computing implication rules,
computing association rules, and performing attribute exploration, corresponds respectively to
an ADC method presented in Section 2, i.e. a method introduced by [13], [14], or [15].
2A software executable on most Linux distributions, macOS, and Windows NT.
3Except the SpringGraph component of the software.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>This section presents the results of the evaluation. Table 2 lists 20 software that were tested, thus
omitting ten software from Table 1 for the following reasons. In March 2022 when the evaluation
was conducted, concepts.py was not available, The Coron System required the purchase of a
license to access its source code, and the license of MIW was not found. FcaBedrock, FCART,
Lattice navigator, and OpenFCA were excluded as they do not appear to be cross-platform.
The three FCA algorithms are command-line tools that compute formal concepts and maximal
frequent itemsets which appear in mining nonredundant association rules. Nonetheless, these
algorithms do not generate rules per se, nor do they allow context editing, lattice building, or
attribute exploring.
4Additional information can be found at https://kodovnash.github.io/FCA_software/</p>
      <p>Software</p>
      <p>Camelis
colibri-concepts
concepts by S. Bank</p>
      <p>ConExp
conexp-fx
conexp-clj
ConExp-NG</p>
      <p>FCA4J
fcaR
FcaStone
GALACTIC</p>
      <p>Galicia
Grif (Sarl3 library)</p>
      <p>In-Close4</p>
      <p>Lattice Miner
Python FCA Tool</p>
      <p>qdfca</p>
      <p>RCAExplore
Tockit (CASS toolkit)</p>
      <p>ToscanaJ
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
introduces the computation of implications with negation and the calculation of triadic
implications. fcaR performs fuzzy formal concept analysis, thus computing implications and semantic
closures of fuzzy sets. Galicia deduces implications by filtering the set of association rules and
only keeping those with a confidence of 100%. Finally, ConExp and its reimplementations, i.e.
-fx, -clj, and -NG, enable attribute exploration. ConExp relies on binary contexts to compute
the implications and associations and perform attribute exploration.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>This paper is a step prior to developing a software intended for the detection and correction of
anomalies in the Knomana Knowledge Base. The evaluation of FCA-based software listed on
U. Priss’ website showed that ConExp and its reimplementations are the only cross-platform
software allowing the computation of implication and association rules, as well as attribute
exploration. In the literature on data cleaning, these three methods are used as basis to develop
dataset-specific methods for anomaly detection and correction.</p>
      <p>In the context of Knomana, anomalies are defined by missing data, as well as incorrect values
that describe plant uses. We need to investigate the extent to which ConExp is appropriate for
the detection and correction of the diferent types of anomalies found in Knomana. Since the
Knomana data model employs ternary relationships to describe plant uses, we plan to mine
this dataset through Relational Concept Analysis (RCA). Accordingly, we plan to integrate RCA
with ConExp to obtain a data cleansing system applicable to Knomana.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work is supported by Montpellier University KIM (Key Initiatives MUSE) DATA &amp; LIFE
SCIENCES through an interdisciplinary internship grant5.</p>
      <sec id="sec-6-1">
        <title>5https://muse.edu.umontpellier.fr/key-initiatives-muse/</title>
        <p>[5] B. Ganter, R. Wille, Formal Concept Analysis, Springer Berlin Heidelberg, 1999. doi:10.</p>
        <p>1007/978-3-642-59830-2.
[6] L. Mahrach, A. Gutierrez, M. Huchard, P. Keip, P. Marnotte, P. Silvie, P. Martin, Combining
implications and conceptual analysis to learn from a pesticidal plant knowledge base, in:
Graph-Based Representation and Reasoning, Springer International Publishing, 2021, pp.
57–72. doi:10.1007/978-3-030-86982-3\_5.
[7] M. Rouane-Hacene, M. Huchard, A. Napoli, P. Valtchev, Relational concept analysis:
mining concept lattices from multi-relational data, Annals of Mathematics and Artificial
Intelligence 67 (2013) 81–108. doi:10.1007/s10472-012-9329-3.
[8] C. C. Aggarwal, Probabilistic and statistical models for outlier detection, in: Outlier
Analysis, Springer International Publishing, 2016, pp. 35–64. doi:10.1007/978-3-319-47578-3\
_2.
[9] V. Hodge, J. Austin, A survey of outlier detection methodologies, Artificial Intelligence</p>
        <p>Review 22 (2004) 85–126. doi:10.1023/b:aire.0000045502.10941.a9.
[10] I. Kakanakova, S. Stoyanov, Outlier detection via deep learning architecture, in:
Proceedings of the 18th International Conference on Computer Systems and Technologies, ACM,
2017. doi:10.1145/3134302.3134337.
[11] E. M. Knorr, R. T. Ng, V. Tucakov, Distance-based outliers: algorithms and applications,
The VLDB Journal The International Journal on Very Large Data Bases 8 (2000) 237–253.
doi:10.1007/s007780050006.
[12] K. Yamanishi, J. ichi Takeuchi, G. Williams, P. Milne, On-line unsupervised outlier detection
using finite mixtures with discounting learning algorithms, Data Mining and Knowledge
Discovery 8 (2004) 275–300. doi:10.1023/b:dami.0000023676.72185.7c.
[13] M. Alam, A. Buzmakov, V. Codocedo, A. Napoli, Mining Definitions from RDF Annotations
Using Formal Concept Analysis, in: International Joint Conference in Artificial Intelligence,
Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence,
Buenos Aires, Argentina, 2015. URL: https://hal.archives-ouvertes.fr/hal-01186204.
[14] P. Cellier, Formal concept analysis applied to fault localization, in: Companion of the 13th
international conference on Software engineering - ICSE Companion '08, ACM Press, 2008.
doi:10.1145/1370175.1370220.
[15] J. Annapurna, A. K. Cherukuri, Exploring attributes with domain knowledge in formal
concept analysis, Journal of Computing and Information Technology 21 (2013) 109. doi:10.
2498/cit.1002114.
[16] B. Ganter, S. Obiedkov, Conceptual Exploration, Springer Berlin Heidelberg, 2016. doi:10.
1007/978-3-662-49291-8.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>U.</given-names>
            <surname>Fayyad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Piatetsky-Shapiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Smyth</surname>
          </string-name>
          ,
          <article-title>The KDD process for extracting useful knowledge from volumes of data</article-title>
          ,
          <source>Communications of the ACM</source>
          <volume>39</volume>
          (
          <year>1996</year>
          )
          <fpage>27</fpage>
          -
          <lpage>34</lpage>
          . doi:
          <volume>10</volume>
          .1145/240455.240464.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Silvie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Huchard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Keip</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gutierrez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sarter</surname>
          </string-name>
          ,
          <article-title>Prototyping a knowledgebased system to identify botanical extracts for plant health in sub-saharan africa</article-title>
          ,
          <source>Plants</source>
          <volume>10</volume>
          (
          <year>2021</year>
          ). doi:
          <volume>10</volume>
          .3390/plants10050896.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Guigues</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Duquenne</surname>
          </string-name>
          ,
          <article-title>Famille minimale d'implications informatives résultant d'un tableau de données binaires</article-title>
          ,
          <source>Math. et Sci. Hum</source>
          .
          <volume>24</volume>
          (
          <year>1986</year>
          )
          <fpage>5</fpage>
          -
          <lpage>18</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Saoud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gutierrez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Huchard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Marnotte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Silvie</surname>
          </string-name>
          , P. Martin,
          <article-title>Explicit versus Tacit Knowledge in Duquenne-Guigues Basis of Implications: Preliminary Results</article-title>
          ,
          <year>2021</year>
          . URL: https://hal.archives-ouvertes.fr/hal-03274757.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>