<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Exploratory Data Analysis of Multi-label Classification Tasks with Formal Context Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Francisco J. Valverde-Albacete</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carmen Pel´aez-Moreno</string-name>
          <email>carmen@tsc.uc3m.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Inma P. Cabrera</string-name>
          <email>ipcabrera@uma.es</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pablo Cordero</string-name>
          <email>pcordero@uma.es</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manuel Ojeda-Aciego</string-name>
          <email>aciego@uma.es</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Depto. Teor ́ıa de Sen ̃al y Comunicaciones, Univ. Carlos III de Madrid</institution>
          ,
          <addr-line>Madrid</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Dpt. Matem ́atica Aplicada, Univ. de Ma ́laga</institution>
          ,
          <addr-line>Ma ́laga</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <fpage>171</fpage>
      <lpage>184</lpage>
      <abstract>
        <p>We introduce a new framework, Formal Context Analysis (FxA), for the exploratory analysis of data tasks cast in the guise of formal contexts. FxA gathers a number of results from Formal Concept Analysis, Formal Independence Analysis and Formal Equivalence Analysis to enhance the establishment and processing of hypothesis about data. We apply this framework to the study of the Multi-label Classification (MLC) task and obtain a number of results of technical nature about how the induction mechanism for MLC classifiers should proceed. The application is based on an analysis of multilabel classification from the standpoint of FxA.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        Exploratory Data Analysis (EDA) was introduced by Tukey as a complement
to Confirmatory Data Analysis, or Predictive Modeling, as is often called
nowadays [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. It preconizes the analysis of data prior to issuing hypotheses, that may
lead to a better model for it, the premise being that by investigating the data
and gaining intuitions about it, better informed hypotheses and models may
arise.
      </p>
      <p>
        Formal Concept Analysis (FCA) is eminently suited for EDA of binary table
data—collected in formal contexts —, as proposed by Wille in his “Landscapes
of Knowledge” (LofK) metaphor [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], but concentrates in the hierarchical type
of knowledge explicited by the concept lattice and, especially, attribute
implications [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. An extensive analysis of LofK affordances vis-`a-vis other metaphors in
Data Mining can be found in [4, § 4.1].
      </p>
      <p>
        Ever since Wille himself cautioned against only reading hierarchical
knowledge from it, there have been attempts at “other readings” from the information
collected in a formal context, e.g. [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ]. More recently, Formal Independence
Analysis (FIA) [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ] and Formal Equivalence Analysis (FEA) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] have tried to
? Corresponding author.
cast representation theorems for formal contexts as lattices of anti-chains of a
poset and partitions of a base set, respectively, in the guise of FCA-like theorems.
Being a representation in terms of anti-chains, FIA describes the independence
between parts of the formal context, while FEA takes a closer look into the
refinements of the standard congruences on objects and attributes imposed by the
polars of the context. These results are collected in Section 2 for future reference.
      </p>
      <p>In this position-paper-with-application we propose Formal Context Analysis
(FxA) as an umbrella term to gather the results and affordances of FCA, FIA,
FEA and other types of formal analysis of tabular data bound to emerge.
Specifically, to show the affordances of such a framework, we interweave it with the
process of designing a valid strategy for carrying out a multi-label classification
(MLC) task.
1.1</p>
      <p>
        The Multi-label Classification Task
MLC is a relatively recently-formalized task in Machine Learning [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] with
applications in text categorization, gene expression analysis, etc. A recent tutorial
explains the progress in methods and concerns [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], while a more up-to-date
exposition with special emphasis on software tools is [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. We describe here a
variant of the MLC task setting to accommodate feature transformations, as
depicted in Fig. 1.
      </p>
      <p>L</p>
      <p>PL</p>
      <p>PX
observe
transform</p>
      <p>PY
classify</p>
      <p>PLb
ˆ
L</p>
      <p>The following way of solving the problem is based on the theory of Statistical
Learning: consider a binary multivariate source L, emitting binary label vectors
or labelsets from a label space L ≡ 2mL , by virtue of the isomorphism between
sets and their characteristic vectors3. Suppose that the labelsets are hidden and
we can only access the result of an observation mechanism of the labelsets in
terms of visible instance or observation in a feature space X ≡ RmX . The
multilabel classification problem is to tag any (feature) vector x ∈ X , with a labelset
l ∈ L .</p>
      <p>The usual way to solve the problem in the simpler, supervised setting is:
1. Model the source of labelsets as random label vectors L ∼ PL and that of
the observations as the feature vectors X ∼ PX over their respective spaces.
2. Collect a sample or task data, D = {(xj , lj )}jn=1 of labelsets and their
observed features so you can induce a classifier from observations to labelsets
h : X → L .
3 We assign to each of the labels a certain “meaning” but this is out of this
mathematical model.
3. Choose the classifier type and induction scheme for it.
4. In order to assess the classifiers, choose an adequate measure of performance,
and implement (any of a number of) schemes of iterated sampling of the
data into a set of training examples DT = {(xj , lj )}jn=T1 and a set of test
examples DE = {(xk, lk)}kn=E1 so that the training data are used to induce
the classifiers and the training data to test them, and validate these results,
e.g. using n-fold validation.</p>
      <p>
        If we were to tackle the tagging of observed features in the considerably
harder unsupervised setting, we might first have to discern what the set of
possible tags would be—e.g. the clustering problem [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]—but even if we knew the
labels, probabilistic methods would have to be invoked to reliably relate them
to observations.
      </p>
      <p>
        The development of Machine Learning has proven that the intricacies of the
above mentioned process can best be undertaken with Statistical Learning [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
For instance, several issues have to be dealt with in an informed way when trying
to solve the task:
– Data preparation. Due to limitations of classifier (theory and) technology
it is better to obtain from the observations another representation better
suited for classification by defining a transformation g : X → Y to improve
classifier performance.
– Classifier type selection. Several competing classifier architectures need to be
chosen and tested in parallel, since demonstrably, no single type of classifier
can best the rest in all possible tasks.
– Performance measure selection. Not all measures are made equal, nor do
they measure the same issues of a classifier, hence it is healthy to collect
several of them in parallel so as not to bias the analysis.
– Classifier construction and evaluation. Again, demonstrably, unless you carry
out validation in a repeatable manner your classifier(s) will be cheating on
the data.
      </p>
      <p>In the present paper we show how some of these problems can be tackled
with the use of FxA. For this purpose, in Section 2 we present the basis of FEA
and FIA, while on Section 3 we apply FxA—including FCA, FIA and FEA—to
the MLC task. We end with a discussion and some avenues of future research.
2
2.1</p>
      <p>Preliminary notions and theoretical basis</p>
      <p>
        Basic Methods in MLC
Since MLC can be considered a strict generalization of the binary and
multiclass classification tasks in that instances may have more than one label (class)
assigned to them [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], most of the techniques for classifier selection and
construction have been imported therefrom—albeit with a limited success, e.g k-Nearest
Neighbors [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. This is the basis of the Algorithm Adaptation approach [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>
        On the other hand, performance measure selection, data preparation and
classifier evaluation have required extensions to cater for the peculiarities of MLC:
since the theory of machine learning is firmly-grounded only for the
mutuallyexclusive labelling cases, dealing with labelsets poses a challenge usually solved
by means of Problem Transformation. Two extreme cases of these
transformations are [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]:
– Binary relevance (BR) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], a method that learns L binary classifiers—
one for each different label in L—and then transforms the original data set
into L data sets Dlj ; j = 1 . . . L that contain all examples of the original
data set, labeled positively if the labelset of the original example contained
lj and negatively otherwise. To classify a new instance BR outputs the union
of the labels lj that are positively predicted by the L classifiers.
– Label Powerset (LP), a simple but effective problem transformation method
that considers each unique set of labels in a multi-label training set as one
of the classes of a new single-label classification task. Given a new instance,
the single-label classifier of LP outputs the most probable class, which is
actually a set of labels.
– Modeling the dependency between labels by adding some labels to the
set of predictors when predicting some other labels and further proceed by
using either BR or LP. The initial and best example of this are Classifier
Chains (CC) [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
2.2
      </p>
      <p>
        FIA: Formal Independence Analysis
For a poset P = hP, ≤i, a set of pairwise incomparable elements of P is called
an anti-chain [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. We denote the set of anti-chains of a poset as A(P). FIA is
based on an analysis of incomparability in terms of anti-chains.
      </p>
      <p>Definition 1. Given hP, ≤i a poset, it is possible to lift the ordering structure
to the powerset 2P by defining</p>
      <p>X ?4 Y ⇐⇒ for all x ∈ X there exists y ∈ Y such that x ≤ y
X 4? Y ⇐⇒ for all y ∈ Y there exists x ∈ X such that x ≤ y
X 4 Y ⇐⇒ X ?4 Y and X 4? Y
(1)
(2)
(3)</p>
      <p>In general ?4 and 4? are both simply preordering relations in h2P , ⊆i.
Observe that in the set of anti-chains A(P), the relations ?4 and 4? are also
antisymmetric.</p>
      <p>There exists a relationship with the inclusion ordering of ideals and filters
since given S, T ⊆ P then S ?4 T ⇐⇒ ↓ S ⊆ ↓ T and S 4? T ⇐⇒ ↑ T ⊆ ↑
S. Because of these equivalences, we will call ?4 the ideal containment relation
and 4? the filter containment relation.</p>
      <p>Definition 2. For a poset hP, ≤i, an anti-chain γ ∈ A(P) is said to be maximal
if every element of P is comparable to some element of γ.</p>
      <p>For any subset Q ⊆ P , the set of elements of P which are comparable to some
element of Q is called the neighborhood of Q and it is denoted by l Q = ↑ Q∪↓ Q.
An anti-chain γ is maximal if and only if l γ = P . The set of maximal anti-chains
of a set is denoted as M A(P), M A(P) = {γ ∈ A(P) | l γ = P } .</p>
      <p>
        It is worth noting that the orderings ?4 and 4? coincide in M A(P).
Proposition 1 ([
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]). If P is a finite poset then hM A(P), ?4i is a lattice.
      </p>
      <p>
        Given an anti-chain A of P , the completion of A to a maximal anti-chain is
not unique but there exists a unique lowest completion [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ].
      </p>
      <p>Definition 3. For a finite partial order hP, ≤i, and A, B ∈ 2P we define the
highest anti-chain complement of A, denoted A−, and the lowest anti-chain
complement of B, denoted B−, as
·− : 2P</p>
      <p>→ 2P
A 7→ A− = max(P r l A)
·− : 2P</p>
      <p>→ 2P
B 7→ B− = min(P r l B) .</p>
      <p>(4)</p>
      <p>
        Due to the universal complete lattice representation capabilities of FCA, we
expect the lattices of anti-chains to be describable as the concept lattice of a
context. In fact, when focusing on maximal anti-chains, we have the following
isomorphism credited by Reuter [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] to Behrendt [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] and Wille [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ].
Proposition 2. Let P = hP, ≤i be a poset. Then hM A(P), 4i =∼ B(P, P, 6&gt;) .
      </p>
      <p>
        The proposition above states that maximal anti-chains can be obtained as a
concept lattice for a certain context. On the other hand, Behrendt’s theorem [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]
is a universal representation theorem for lattices in terms of maximal anti-chains.
Theorem 1 (Behrendt). Let L = hL, ≤i be a finite lattice. Then there exists a
poset P = hP, ≤P i such that |P | = 2|L|, where any chain has at most 2 elements
and such that L =∼ M A(P), i.e., L is isomorphic to the lattice of maximal
antichains of (P, ≤P ).
      </p>
      <p>It is straightforward that the three following structures are equivalent: formal
contexts, bipartite graphs, and posets without chains with length higher than 2.
Definition 4. Given a context (G, M, I), for all α ⊆ G and β ⊆ M we define
α∼ = M r
[ I(g, ·)
g∈α
β∼ = G r
[ I(·, m)
m∈β
(5)
Then a formal tomos is a pair (α, β) ∈ 2G × 2M , such that α∼ = β and β∼ = α.
The set of formal tomoi of the context (G, M, I) will be denoted by A(G, M, I).
It is not difficult to see that there is a bijection between tomoi and maximal
anti-chains in the formal context interpreted as a poset of height 2. That is,
the set of formal tomoi with the supset-subset hierarchical ordering, denoted</p>
      <p>A(G, M, I), is isomorphic to the corresponding lattice of maximal anti-chains.
Moreover, since
α∼ = {m ∈ M | g I\ m for all g ∈ α}
β∼ = {g ∈ G | g I\ m for all m ∈ β} (6)
it turns out that A(G, M, I) = B(G, M, I\)d, which is the “translation” into
FCA terms. These operators reflect the underlying philosophy of FIA and, in
this terminology, we can obtain the following alternative statement of Behrendt’s
theorem in terms of tomoi: every finite lattice is isomorphic to a lattice of tomoi.</p>
      <p>Now, we can state an analogue for tomoi of the basic theorem of FCA as
follows:
Theorem 2 (Basic theorem of formal independence analysis).
1. Analysis phase. Given a formal context K = (G, M, I),
(a) The operators ·∼ : 2G → 2M and ·∼ : 2M → 2G of (5) form a
rightGalois connection (·∼, ·∼) : (2G, ⊆) (*(2M , ⊆) whose formal tomoi are
the pairs (α, β) such that α∼ = β and α = β∼.
(b) The set of formal tomoi A(G, M, I) with the relation</p>
      <p>(α1, β1) ≤ (α2, β2) iff α1 ⊇ α2 iff β1 ⊆ β2
is a complete lattice, which is called the tomoi lattice of (G, M, I) and
denoted A(G, M, I), where infima and suprema are given by:
^ (αt, βt) =
t∈T
[ αt,
t∈T
\ βt
t∈T
∼
∼!
_ (αt, βt) =
t∈T
\ αt ∼ , [ βt
t∈T ∼ t∈T
!
(c) The mappings γ : G → A(G, M, I) and μ : M → A(G, M, I)
g 7→ γ(g) = ({g}∼∼, {g}∼)</p>
      <p>m 7→ μ(m) = ({m}∼, {m}∼∼)
are such that γ(G) is infimum-dense in A(G, M, I) , μ(M ) is
supremumdense in A(G, M, I).
2. Synthesis phase. Given a complete lattice L = hL, ≤i,
(a) L is isomorphic to A(G, M, I) if and only if there are mappings γ : G →
L and μ : M → L such that
– γ(G) is infimum-dense in L , μ(M ) is supremum-dense in L, and
– g I m is equivalent to γ(g) 6≥ μ(m) for all g ∈ G and all m ∈ M .
(b) In particular, L =∼ A(L, L, 6≥) and, if L is finite, L =∼ A(M (L), J (L), 6≥)
where M (L) and J (L) are the sets of meet- and join-irreducibles, resp.,
of L.</p>
      <p>FEA: Formal Equivalence Analysis
Given (G, M, I) define operators (·)I∃ : 2G → 2M and (·)I∀ : 2M
→ 2G as follows
XI∃ = {m ∈ M | m↓ ∩ X 6= ∅}</p>
      <p>
        YI∀ = { g ∈ G | g↑ ⊆ Y }
(7)
These are the span of a set of objects XI∃ as the set of attributes related to some
object g ∈ X, and the content of a set of attributes YI∀ as the set of objects
which can be completely described by the attributes in Y [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>Based on the operators above, we can define a pair of partition-forming
operators, respectively, on the sets of objects and attributes.</p>
      <p>Let us introduce the operators κG(·) : 2G → Π(G) and κM (·) : 2M → Π(M )
defined as follows:</p>
      <p>g1 ≡ g2(κG(X)) iff
m1 ≡ m2(κM (Y ))
iff
(g1 = g2 or
(m1 = m2 or
(g1)I∃ = (g2)I∃ and g1 ∈ X
(m1)I∀ = (m2)I∀ and m1 ∈ Y
where we have used the notation m to refer to the set M r {m}.</p>
      <p>Now, we can define mappings γΠ (·) : G → Π(G) × Π(M ) and μΠ (·) : M →
Π(G) × Π(M ) which assign to each object, respectively, each attribute, a pair
of partitions as follows:
γΠ (g) =
κG (gI∃)I∀ , κM g∃</p>
      <p>I
μΠ (m) =
κG mI∀ , κM (mI∀)I∃</p>
      <p>Now, finally, for partitions of objects G and attributes M let us define the
mappings (·)∃Π : Π(G) → Π(M ) and (·)∀Π : Π(M ) → Π(G):
πΠ∃ = _{σc | c ∈ μΠ (M ), π ≥ πc}</p>
      <p>σΠ∀ = ^{πc | c ∈ γΠ (G), σ ≤ σc} (10)</p>
      <p>We can finally state the:
Theorem 3. (Basic theorem of formal equivalence analysis)
1. Analysis phase. Given a formal context K = (G, M, I),
(a) The operators (·)∃Π and (·)∀Π form a left adjunction whose fixpoints, the
formal partitions, are the pairs (π, σ) such that πΠ∃ = σ and σΠ∀ = π.
(b) The set of formal partitions, denoted P(K), with the relation (π1, σ1) ≤
(π2, σ2) iff π1 ≤ π2 iff σ1 ≤ σ2 is a complete lattice, which is called the
partition lattice of K and denoted P(K), with infima and suprema given
by:
^ (πt, σt) =  ^ πt, </p>
      <p>
t∈T t∈T



^ σt
t∈T
!∃ ∀
!∀ ∃ </p>
      <p> 
Π Π</p>
      <p>
_ (πt, σt) = 
t∈T
_ πt
t∈T
(c) The mappings in (9) γΠ (·) : G → P(K) and μΠ (·) : M → P(K) are
such that γΠ (G) is ∨-dense in P(K) , μΠ (M ) is ∧-dense in P(K).
2. Synthesis phase. Given a complete lattice L = hL, ≤i,
(a) L is isomorphic to P(G, M, I) if and only if there are mappings γ : G →
L and μ : M → L such that
– γ(G) is ∨-dense in L , μ(M ) is ∧-dense in L, and
– g I m is equivalent to γ(g) 6≥ μ(m) for all g ∈ G and all m ∈ M .
(b) In particular, L =∼ P(L, L, 6≥) and, if L is finite, L =∼ P(J (L), M (L), 6≥)
where M (L) and J (L) are the sets of meet- and join-irreducibles, resp.,
of L.
3</p>
      <p>Exploratory Data Analysis of Multilabel Classification
with FxA
General setup. Consider a generic dataset D = {(xi, li)}in=1 capturing the essence
of the MLC task described in Section 1.1 and set the indices over the samples
as formal objects G = {1, . . . , n}. Next build a formal context from it using the
set of labels L as attributes and considering, for each object i ∈ G, its labelset
as a row of the incidence matrix Ii· = li, whence DL = (G, L, I) is the binary
formal context of samples and their labels.</p>
      <p>j n</p>
      <p>
        For observation vectors {x }j=1 their context DF = (G, F, R)—with F the set
of features—will not be binary, in general, but many-valued Ri· = xi. This is the
turf of traditional machine-learning techniques [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], but can be also tackled with
generalizations of FCA for incidences with entries in an idempotent semiring,
e.g. FCA in a fuzzy context [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] or K-FCA where K is a semifield [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ].
      </p>
      <p>With the previous modeling relevant concepts in MLC correspond to relevant
concepts in FxA. For instance, labelsets are object intents, and they can be found
through the polars li = {i}↑ . We would like to use the affordances of FxA (FCA
+ FIA + FEA) on this context D = DL | DF to help in solving the MLC task as
described in Sections 1.1 and 2.1. We next introduce a number of possible ways
to do so.
3.1</p>
      <p>
        Task Subdivision with FIA
From a purely Machine Learning perspective, it is important to reduce the
cardinality of L for LP. Furthermore, using BR only makes sense if the labels exhibit a
great degree of independence. In fact, an extreme case of independence between
labels is when the variation of one of them is obtained when the other is missing
altogether and viceversa. This is the kind of situation that FIA detects [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]:
Proposition 3. If FIA can be carried successfully in the context to obtain
different subcontexts, then DL admits a decomposition of the problem into as many
subtasks of reduced complexity as subcontexts detected by FIA.
      </p>
      <p>D1 ∅
∅ D2
(b) Schematic of DL
Proof. (Sketch.) Suppose that FIA of DL is capable of providing the evidence
that allows to split the data context into—without loss of generality—two
subcontexts as in the toy example of Figure 2, that is, tomoi ({4, 5, 6}, {a, b, c})
and ({1, 2, 3a, 3b}, {d, e, g}). Notice that detecting and reordering the rows of
DL would also require the same reordering of the rows in the feature subcontext
DF . Then we would be tempted to try to solve the subtasks built of D1 and D2
and their respective rows of observations, thereby reducing the complexity of the
MLC induction process.</p>
      <p>However, from the point of view of downstream processing such blind FIA is
risky: we simplify the task by creating independent subtasks, but if we only use
the subcontexts D1 and D2 for solving the subtasks we restrict the evidence for
the negative cases of the labels. To avoid discarding data, then we should use
the subpositions D1/∅ and D2/∅ for training—again, with proper reordering of
rows.
tu</p>
      <p>
        Note that FIA does not detect statistical independence, though, and there
would also be merit in detecting such type of independence for BR, specially
when inducing probability-based classifiers with DF [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. For an example of how
to detect subcontexts with FIA, see [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
3.2
      </p>
      <p>Stratified sampling and FEA
In general, each labelset present in the database will be assigned to many different
formal objects. The concept-forming function γ induces a partition on the set of
objects ker γ on G by equality of object-extents, equivalently labelsets: (i1, i2) ∈
ker γ ⇐⇒ {i1}↑ = {i2}↑ = li1 [3, § 1.5]. On the other hand, we expect the
sampling to be good enough mL n so it is safe to suppose that no two labels are
predicated of the same set of objects, otherwise, they would be indistinguishable
in the sample D and one of them excised. Therefore we expect the partition on
labels induced by μ to be the identity ker μ = ιL.</p>
      <p>
        Example 1. Table 1 shows a selection of low- to middle-complexity datasets used
as testbeds for the MLC task. For instance, the emotions MLC dataset [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] has
n = 593 instances and mL = 6 labels, but only 27 of the 26 = 64 labelsets
have been realized. As many as 4 of them are hapaxes, that is, they occur only
once, while at least one of the labelsets has appeared 81 times. This wildly
imbalanced behaviour is typical of MLC datasets [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. As expected, distinct
labels have distinct extensions.
tu
      </p>
      <p>
        The pair of partitions of objects and attributes (ker γ, ker μ) ∈ Π(G) × Π(M )
are the concern of FEA [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. In the present case, since no refinement of the
label partition is feasible, we concentrate on describing how FEA of the object
partition guides the resampling process of the learning algorithms.
      </p>
      <p>The MLC induction and assessment procedures demand that we generate
train and test resamplings of the original data. i.e. splitting the original context
D into two subposed subcontexts of training DT and testing DE data so that
D = DT /DE . Note that:
1. Since the samples are supposed to be independent and identically distributed,
the order of these contexts in the subposition, as indeed the reordering of
the rows in the incidence, is irrelevant.
2. The resampling of the labelset context is tied to the resampling of the
observations: we decide on the labelset information and this carries over to the
observations.</p>
      <p>Since the data are a formal context we know that an important part of the
information contained in it comes from the concept lattice, hence we state the
following:
Proposition 4. A necessary condition for the resampling of the data D into
training part DT and testing part DE to be meaningful for the MLC task, is that
the concept lattice of all of these must be the same:</p>
      <p>B(D) =∼ B(DT ) =∼ B(DE )
Proof. Due to the identification of object intents and labelsets, we know that to
respect the complexity of the labelset samples in each subcontext, one sufficient
condition is that one of the labelsets associated to each block in the partition
ker γ is accorded to each of the subcontexts.</p>
      <p>If this is the case, then the sampled subcontexts being join- and meet-dense,
will generate isomorphic concept lattices. Since they each are a clarification of
the original context D, their concept lattices are all isomorphic.</p>
      <p>However, if we only retained the meet- and join-irreducibles to obtain these
concept lattices, then the labelsets of reducible attributes would be lost and
this would change the relative importance of the samples (both labels and
observations, remember), which will therefore impact the induction scheme of the
classifiers. Hence not only the labelsets but also their frequencies of occurrence
are important.</p>
      <p>The above proposition suggests that the analogue of stratified sampling in
MLC is a procedure in which the stratification must proceed on a
block-byblock basis with respect to ker γ. However this comes at a price, when there are
hapaxes in the data. If we choose, for instance, to maintain 80% of the data for
training and 20% for testing, regardless of these proportions, stratified sampling
will force us to include all hapaxes with the following deleterious consequences:
– The relative frequency of the hapaxes will be distorted wrt to other labelsets.
– We will be using some data (the hapaxes) both for training and testing,
which is known to obtain too optimistic performance results in whichever
measure of it.</p>
      <p>Furthermore, if we use, e.g. k-fold validation we have to repeat this procedure
and ensure that the resamplings are somehow different. A usual procedure is to
distribute the original dataset into k blocks in order to aggregate k − 1 of them
into the training dataset DT and use the leftover as the testing dataset DE . It is
common to use k = 5 or k = 10. This can only compound the previous problem,
therefore FEA allows us to spot possible problems with the classifier induction
and validation schemes using resampling.</p>
      <p>Example 2 (Continued). Fig. 3 represents the labelset histogram of the emotions
dataset. Recall there are 4 hapaxes and this should appear in any resampling of
the data, so it is subject to the problems detected above, that are ubiquitous.
Table 1 lists the number of hapaxes or single samples for other MLC datasets.
tu
3.3</p>
      <p>
        Label Dependence Modelling with FCA
Actually, whether BR or LP will outperform the other is presently acknowledged
to depend on the degree of dependence on labels among themselves: if labels are
mostly non-dependent, then the BR method is superior to LP, while the contrary
is expected to hold when dependence between labels is commonplace [
        <xref ref-type="bibr" rid="ref16 ref17">17, 16</xref>
        ].
This line of work has evolved from the Classifier Chain approach [17, Chap. 7]
as more and more solutions to MLC try to model explicitly such dependencies.
Furthermore, it has also been recently hinted that performance measures also
presuppose one model of dependence or other [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ], hence explicit modeling of
dependences has become an issue to understand the task.
      </p>
      <p>A crucial step in these solutions is the determination of an order to model
dependencies between attributes adequately. In this context, we believe the
ability of FCA to obtain implications between attributes to be specially relevant,
particularly implications with a single consequent. This study will be left for
future work.
We have sketched in this paper the basics of Formal Context Analysis (FxA),
a framework to exploit the findings of FCA, FIA and FEA under a common
umbrella, in order to help carry out the Exploratory Data Analysis of MLC
tasks.</p>
      <p>We have shown how labelset modelling is straightforward in this framework,
and how FIA helps in the subdivision of the task into easier subtasks, and how
FEA settles the table to understand how “stratified sampling” of MLC tasks
can be made to respect the underlying formal conceptual model. We have also
sketched how FCA could help in the actual design of classifiers that incorporate
the dependencies between labels in the classifier induction step.</p>
      <p>This is work in progress. Next steps will be to provide quantitative support
for this framework by using the recently developed package fcaR, as well as filling
in the details about how to use the probability mass distribution associated to
ker γ, the partition of samples due to labelsets, in conjunction with the purely
FxA methods.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Tukey</surname>
            ,
            <given-names>J.W.</given-names>
          </string-name>
          :
          <article-title>Exploratory data analysis</article-title>
          .
          <source>Addison-Wesley</source>
          (
          <year>1977</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Wille</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Conceptual landscapes of knowledge: a pragmatic paradigm for knowledge processing</article-title>
          . In Mineau, G.,
          <string-name>
            <surname>Fall</surname>
          </string-name>
          , A., eds.
          <source>: Proceedings of the Second International Symposium on Knowledge Retrieval, Use and Storage for Efficiency</source>
          , Vancouver (
          <year>1997</year>
          )
          <fpage>2</fpage>
          -
          <lpage>13</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Ganter</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wille</surname>
          </string-name>
          , R.:
          <source>Formal Concept Analysis: Mathematical Foundations</source>
          . Springer, Berlin, Heidelberg (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Valverde-Albacete</surname>
            ,
            <given-names>F.J.</given-names>
          </string-name>
          ,
          <article-title>Gonz´alez-</article-title>
          <string-name>
            <surname>Calabozo</surname>
            ,
            <given-names>J.M.</given-names>
          </string-name>
          , Pen˜as,
          <string-name>
            <surname>A.</surname>
          </string-name>
          ,
          <article-title>Pela´ez-</article-title>
          <string-name>
            <surname>Moreno</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Supporting scientific knowledge discovery with extended, generalized formal concept analysis</article-title>
          .
          <source>Expert Systems with Applications</source>
          <volume>44</volume>
          (
          <year>2016</year>
          )
          <fpage>198</fpage>
          -
          <lpage>216</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5] Du¨ntsch, I.,
          <string-name>
            <surname>Gediga</surname>
          </string-name>
          , G.:
          <article-title>Modal-style operators in qualitative data analysis</article-title>
          .
          <source>In: Proc. IEEE International Conference on Data Mining</source>
          ,
          <string-name>
            <surname>ICDM</surname>
          </string-name>
          <year>2002</year>
          .
          <article-title>(</article-title>
          <year>2002</year>
          )
          <fpage>155</fpage>
          -
          <lpage>162</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Dubois</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prade</surname>
          </string-name>
          , H.:
          <article-title>Possibility theory and formal concept analysis: Characterizing independent sub-contexts</article-title>
          .
          <source>Fuzzy Sets And Systems</source>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Valverde-Albacete</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <article-title>Pela´ez-</article-title>
          <string-name>
            <surname>Moreno</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cabrera</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cordero</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ojeda-Aciego</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Formal independence analysis</article-title>
          .
          <source>CCIS 853</source>
          (
          <year>2018</year>
          )
          <fpage>596</fpage>
          -
          <lpage>608</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Valverde-Albacete</surname>
            ,
            <given-names>F.J.</given-names>
          </string-name>
          ,
          <article-title>Pela´ez-</article-title>
          <string-name>
            <surname>Moreno</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cabrera</surname>
            ,
            <given-names>I.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cordero</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , OjedaAciego, M.:
          <string-name>
            <given-names>A Data</given-names>
            <surname>Analysis</surname>
          </string-name>
          <article-title>Application of Formal Independence Analysis</article-title>
          .
          <source>In: Concept Lattices and their Applications (CLA'18)</source>
          . (
          <year>2018</year>
          )
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Valverde-Albacete</surname>
            ,
            <given-names>F.J.</given-names>
          </string-name>
          ,
          <article-title>Pela´ez-</article-title>
          <string-name>
            <surname>Moreno</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cordero</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ojeda-Aciego</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Formal equivalence analysis</article-title>
          .
          <source>In: Proc. Conf. Internat. Fuzzy Syst. Assoc. and European Soc. Fuzzy Logic and Technol</source>
          .
          <source>(EUSFLAT</source>
          <year>2019</year>
          ), Atlantis Press (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Boutell</surname>
            ,
            <given-names>M.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shen</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brown</surname>
          </string-name>
          , C.M.
          <article-title>: Learning multi-label scene classification</article-title>
          .
          <source>Pattern Recognition</source>
          <volume>37</volume>
          (
          <issue>9</issue>
          ) (
          <year>2004</year>
          )
          <fpage>1757</fpage>
          -
          <lpage>1771</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Gibaja</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ventura</surname>
            ,
            <given-names>S.:</given-names>
          </string-name>
          <article-title>A Tutorial on Multilabel Learning</article-title>
          .
          <source>ACM Computing Surveys</source>
          <volume>47</volume>
          (
          <issue>3</issue>
          ) (
          <year>2015</year>
          )
          <fpage>1</fpage>
          -
          <lpage>38</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Herrera</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Charte</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rivera</surname>
            ,
            <given-names>A.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>del Jesus</surname>
            ,
            <given-names>M.J.: Multilabel</given-names>
          </string-name>
          <string-name>
            <surname>Classification. Problem Analysis</surname>
          </string-name>
          ,
          <source>Metrics and Techniques</source>
          . Springer (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Mirkin</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Clustering for Data Mining. A Data Recovery Approach</article-title>
          .
          <source>Computer Science and Data Analysis Series. Chapman &amp; Hall</source>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Murphy</surname>
            ,
            <given-names>K.P.:</given-names>
          </string-name>
          <article-title>Machine Learning. A Probabilistic Perspective</article-title>
          . MIT Press (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Wieczorkowska</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Synak</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , Ra´s,
          <string-name>
            <surname>Z.W.:</surname>
          </string-name>
          <article-title>Multi-label classification of emotions in music</article-title>
          . In Klopotek,
          <string-name>
            <surname>M.A.</surname>
          </string-name>
          , Wierzchon´,
          <string-name>
            <given-names>S.T.</given-names>
            ,
            <surname>Trojanowski</surname>
          </string-name>
          , K., eds.
          <source>: Intelligent Information Processing and Web Mining</source>
          , Springer Berlin Heidelberg (
          <year>2006</year>
          )
          <fpage>307</fpage>
          -
          <lpage>315</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Tsoumakas</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Katakis</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vlahavas</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Mining multi-label data</article-title>
          . In
          <string-name>
            <surname>Maimon</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rokach</surname>
          </string-name>
          , L., eds.:
          <article-title>Data Mining and Knowledge Discovery Handbook</article-title>
          . (
          <year>2010</year>
          )
          <fpage>667</fpage>
          -
          <lpage>685</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Read</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Scalable Multi-label Classification</article-title>
          .
          <source>PhD thesis</source>
          , Univ. Waikato,
          <string-name>
            <surname>Australia</surname>
          </string-name>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Davey</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Priestley</surname>
          </string-name>
          , H.:
          <article-title>Introduction to lattices and order</article-title>
          .
          <source>2nd edn</source>
          . Cambridge University Press, Cambridge, UK (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Behrendt</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Maximal antichains in partially ordered sets</article-title>
          .
          <source>Ars Combinatoria 25(C)</source>
          (
          <year>1988</year>
          )
          <fpage>149</fpage>
          -
          <lpage>157</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Reuter</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>The jump number and the lattice of maximal antichains</article-title>
          .
          <source>Discrete Mathematics</source>
          <volume>88</volume>
          (
          <issue>2-3</issue>
          ) (
          <year>1991</year>
          )
          <fpage>289</fpage>
          -
          <lpage>307</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Wille</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Finite distributive lattices as concept lattices</article-title>
          .
          <source>Atti Inc. Logica Mathematica</source>
          <volume>2</volume>
          (
          <year>1985</year>
          )
          <fpage>635</fpage>
          -
          <lpage>648</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22] Bˇelohla´vek, R.:
          <source>Fuzzy Relational Systems. Foundations and Principles. Volume 20 of IFSR International Series on Systems Science and Engineering</source>
          . Kluwer Academic (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <surname>Valverde-Albacete</surname>
            ,
            <given-names>F.J.</given-names>
          </string-name>
          ,
          <article-title>Pela´ez-</article-title>
          <string-name>
            <surname>Moreno</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Extending conceptualisation modes for generalised Formal Concept Analysis</article-title>
          .
          <source>Information Sciences</source>
          <volume>181</volume>
          (
          <year>2011</year>
          )
          <fpage>1888</fpage>
          -
          <lpage>1909</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <surname>Charte</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Charte</surname>
            ,
            <given-names>F.D.</given-names>
          </string-name>
          :
          <article-title>Working with multilabel datasets in R: The mldr package</article-title>
          .
          <source>The R journal 7(5)</source>
          (
          <year>2015</year>
          )
          <fpage>149</fpage>
          -
          <lpage>162</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25] Dembczyn´ski,
          <string-name>
            <given-names>K.</given-names>
            ,
            <surname>Waegeman</surname>
          </string-name>
          , W., Cheng, W., Hu¨llermeier, E.:
          <article-title>Regret analysis for performance metrics in multi-label classification: the case of hamming and subset zero-one loss</article-title>
          .
          <source>In: Proc. European Conf. on Machine Learning</source>
          ,
          <source>(ECML PKDD</source>
          <year>2010</year>
          ).
          <article-title>(</article-title>
          <year>2010</year>
          )
          <fpage>280</fpage>
          -
          <lpage>295</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>