=Paper=
{{Paper
|id=Vol-3151/short6
|storemode=property
|title=Leveraging Formal Concept Analysis to Improve n-fold validation in Multilabel Classification
|pdfUrl=https://ceur-ws.org/Vol-3151/short6.pdf
|volume=Vol-3151
|authors=Francisco J. Valverde-Albacete,Carmen Peláez-Moreno
}}
==Leveraging Formal Concept Analysis to Improve n-fold validation in Multilabel Classification ==
Francisco J. Valverde-Albacete¹⋆ and Carmen Peláez-Moreno¹

¹ Depto. Teoría de Señal y Comunicaciones, Univ. Carlos III de Madrid, Madrid, Spain
fva@tsc.uc3m.es, carmen@tsc.uc3m.es
⋆ Corresponding author.

Abstract. In this application note we revisit previous work on the exploratory analysis of Multilabel Classification (MLC) tasks in Machine Learning. We combine Information Theory and Formal Concept Analysis (FCA) to formalize the intuition that inference of classifiers in MLC tasks can only proceed when the training and testing data concerning the labels define the same concept lattice. We instantiate our procedure on the emotions dataset, but the procedure is independent of the dataset being explored. An R language interactive notebook carrying out the procedure is available upon request.

Keywords: Multilabel Classification · Entropy-based assessment · Data preprocessing · n-fold validation · Formal Context Analysis.

RealDataFCA'2021: Analyzing Real Data with Formal Concept Analysis, June 29, 2021, Strasbourg, France. Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction and Motivation

1.1 The Multilabel Classification Task

MLC is a relatively recently-formalized task in Machine Learning [1] with applications in text categorization, gene expression analysis, etc. A recent tutorial explains the progress in methods and concerns [2], while a more up-to-date exposition with special emphasis on software tools is [3]. We describe here a variant of the MLC task setting that accommodates feature transformations, as depicted in Fig. 1.

Fig. 1: Basic scheme for multi-label classification (a chain Y → X → Z → Ŷ through observe, transform and classify steps, with distributions $P_Y$, $P_X$, $P_Z$ and $P_{\hat{Y}}$).

The following way of solving the problem is based on the theory of Statistical Learning [4]: consider a binary multivariate source $Y$, emitting binary label vectors or labelsets from a label space $\mathcal{Y} \equiv 2^{m_Y}$, by virtue of the isomorphism between sets and their characteristic vectors (we assign to each of the labels a certain "meaning", but this is outside this mathematical model). Suppose that the labelsets are hidden and we can only access the result of an observation mechanism of the labelsets in terms of visible instances or observations in a feature space $\mathcal{X} \equiv \mathbb{R}^{m_X}$. The multi-label classification problem is to tag any (feature) vector $\vec{x} \in \mathcal{X}$ with a labelset $\vec{y} \in \mathcal{Y}$ [3].

The usual way to solve the problem in the simpler, supervised setting is:

1. Model the source of labelsets as random label vectors $Y \sim P_Y$ and their observations as the feature vectors $X \sim P_X$ over their respective spaces.
2. Collect a (task) dataset with $s$ samples, $D = \{(\vec{x}_i, \vec{y}_i)\}_{i=1}^{s}$, of labelsets and their observed features to induce a classifier from observations to labelsets $h : \mathcal{X} \to \mathcal{Y}$.
3. Choose the classifier type and an induction scheme for it.
4. In order to assess the classifiers, choose an adequate measure of performance, and implement (any of a number of) schemes of iterated re-sampling of the data into a set of training examples $D_T = \{(\vec{x}_j, \vec{y}_j)\}_{j=1}^{s_T}$ and a set of test examples $D_E = \{(\vec{x}_k, \vec{y}_k)\}_{k=1}^{s_E}$, so that the training data are used to induce the classifiers and the testing data to test them, and validate these results, e.g. using k-fold validation [4].

Example 1 (emotions [5]). A traditional dataset to test new ideas in MLC, emotions captures $m_X = 72$ features and the emotions they produce—as $m_Y = 6$ labels like amazed-surprised, angry-aggressive, etc.—for a total of $s = 593$ song tracks. As expected, distinct labels have distinct expressions in samples. Note that only 27 of the $2^6 = 64$ possible labelsets occur, and as many as 4 of them are hapaxes—they occur only once—while at least one of the labelsets appears 81 times. This wildly imbalanced behaviour is typical of MLC datasets [6]. □
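The labelset statistics quoted in Example 1 can be reproduced in a few lines of R. The following is a minimal sketch assuming the mldr package [6], whose bundled emotions object exposes the label columns of its data frame through emotions$labels$index; the expected values are the ones quoted above.

```r
# Reproduce the Example 1 statistics from the mldr copy of emotions.
library(mldr)

# Binary label matrix: rows are the s = 593 tracks, columns the m_Y = 6 labels.
Y <- as.matrix(emotions$dataset[, emotions$labels$index])

# Histogram of labelsets: encode each row as a bit string and tabulate.
labelsets <- table(apply(Y, 1, paste, collapse = ""))

length(labelsets)    # distinct labelsets present: 27 of the 2^6 = 64 possible
sum(labelsets == 1)  # hapaxes (labelsets occurring exactly once): 4
max(labelsets)       # count of the most frequent labelset: 81
```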
1.2 An FCA-based Interpretation of MLC

Note that in the standard approach above FCA is all but absent. However, it is easy to see that FCA and its variants are relevant in modelling this ML task.

First, with the labelsets $\{\vec{y}_i\}_{i=1}^{s}$ of the task dataset we can build a formal context using the set of labels $L$ as attributes, with $|L| = m_Y$, and taking for each sample—considered as a formal object $i \in G$ with $|G| = s$—its labelset as a row of the incidence matrix, $I_{i\cdot} = \vec{y}_i$, whence $D_L = (G, L, I)$ is the binary formal context of samples and their labels. This context captures the information in the stochastic source $Y$.

Secondly, for the observation vectors $\{\vec{x}_i\}_{i=1}^{s}$ their context $D_F = (G, F, R)$—with $F$ the set of features—will not be binary, in general, but many-valued, $R_{i\cdot} = \vec{x}_i$. This context captures the information in the observations $X$. (Note that Fig. 1 suggests another context formed by the transformed observations captured by the random vector $Z$, but this will not be dealt with here.)

With the previous modelling, relevant notions in MLC correspond to relevant notions in FCA. For instance, labelsets are object intents, and they can be found through the polars $\vec{y}_i = \{i\}^{\uparrow}$. In this way, we can reason about the sampling of the stochastic variables $Y$ and $X$—the dataset—in terms of the contexts above, and vice versa.

For instance, each labelset present in the dataset—that is, each object intent—may appear with different samples (formal objects). Recall that the concept-forming function $\gamma$ induces a partition $\ker \gamma$ on $G$ by equality of labelsets: $(i_1, i_2) \in \ker \gamma \iff \{i_1\}^{\uparrow} = \{i_2\}^{\uparrow} = \vec{y}_{i_1}$. This partition is crucial for defining the entropy of the label source (see § 2). In the other direction, we expect the sampling to be good enough ($m_Y \ll s$) that it is safe to suppose that no two labels are predicated of the same set of objects. Therefore we expect the partition on labels induced by $\mu$ to be the identity, $\ker \mu = \iota_L$, which holds on standard testing datasets (see e.g. those in [3]).
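To make the constructions of this section concrete, the following sketch builds the label context $D_L = (G, L, I)$ with the fcaR package [10] and materializes the partition $\ker \gamma$ as the blocks of samples sharing an object intent. It reuses the matrix Y from the previous snippet and assumes that fcaR's FormalContext$new accepts a binary matrix with objects as rows and attributes as columns.

```r
# Sketch: the label context D_L and the partition ker(gamma) it induces on G.
library(fcaR)

rownames(Y) <- paste0("track", seq_len(nrow(Y)))  # formal objects G
DL <- FormalContext$new(Y)                        # attributes L = the 6 labels

# ker(gamma): samples are equivalent iff they have the same object intent,
# i.e. the same labelset; block cardinalities are the labelset frequencies.
blocks <- split(rownames(Y), apply(Y, 1, paste, collapse = ""))
head(sort(sapply(blocks, length), decreasing = TRUE))
```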
1.3 The Problem and the FCA-induced Solution

The following problem and its theoretical solution were proposed in [7] as an example of the above-suggested operation: the MLC classifier induction and assessment procedures demand that we generate train and test resamplings of the original data [8, 4]. This amounts to splitting the original context $D_L$ into two subposed subcontexts of training $D_T$ and testing $D_E$ data: $D = D_T / D_E$. Note that:

1. Since the samples are supposed to be independent and identically distributed, the order of these contexts in the subposition—indeed the row order of $I$—is irrelevant.
2. The resampling of the labelset context is tied to that of the observations: we decide on the labelset context and this carries over to the observation context.

Since the data are a formal context, we know that an important part of the information contained in it comes from the concept lattice, hence:

Hypothesis 1.
1. (FCA intuition) A necessary condition for the resampling of the data $D$ into training part $D_T$ and testing part $D_E$ to be meaningful for the MLC task is that the concept lattices of all of these be isomorphic: $\mathfrak{B}(D) \cong \mathfrak{B}(D_T) \cong \mathfrak{B}(D_E)$.
2. (ML intuition) Not only the labelsets but also their occurrence frequencies, as quantified by the cardinalities of the blocks of $\ker \gamma$, are important.

The last consideration follows because if we only retained the meet- and join-irreducibles to obtain these concept lattices, then the labelsets of reducible attributes would be lost, so the relative importance of the samples—both labels and observations—would change, impacting the induction scheme of the classifiers.

The above proposition suggests that the analogue of stratified sampling in MLC is a procedure in which the stratification must proceed on a block-by-block basis with respect to $\ker \gamma$. However, this comes at a price when there are hapaxes in the data. If we choose, for instance, to maintain 80% of the data for training and 20% for testing, then regardless of these proportions stratified sampling will force us to include all hapaxes in both parts, with the following deleterious consequences:

– The relative frequency of the hapaxes will be distorted with respect to other labelsets.
– We will be using some data (the hapaxes) both for training and testing, which is known to yield too-optimistic performance results in whichever measure is used.

Furthermore, if we use, e.g., k-fold validation, we have to repeat this procedure and ensure that the resamplings are somehow different. A usual procedure is to distribute the original dataset into k blocks, aggregate k − 1 of them into the training dataset $D_T$, and use the leftover block as the testing dataset $D_E$; it is common to use k = 5 or k = 10. This can only compound the previous problem. FCA therefore allows us to spot possible problems with classifier induction and validation schemes that use resampling.

In this paper we use FCA-based and Information Theory-based devices side by side to support the claim that they improve the understanding of the task when used in synergy. For that purpose we first review an Exploratory Data Analysis (EDA) technique in Sec. 2 to help in interpreting the experimental results in Sec. 3.

2 Methods: Source Multivariate Entropy Triangles

The Source Multivariate Entropy Triangle (SMET, [9]) is a visual tool based on Information Theory to explore the informational content of multivariate sources. It is best suited to discrete sources—be they binary or multiclass—and optimally suited for analyzing the set of label assignments of MLC tasks.

Due to space limitations, we will only describe the graphic approach to the SMET; we refer the reader to [9] for its theoretical bases. In brief, the information content of a discrete multivariate source can be depicted in the simplex of Fig. 2, where the vertices describe extreme behaviours of the labels. If a label appears at the lowest vertex, it is not stochastic. If it appears at the right vertex, its information can be predicted from that of the rest of the labels, whereas if it appears at the left vertex, it has a lot of private information. The side opposite each vertex represents the opposite behaviour. In general, the behaviour of labels is a mix of these three types (see Fig. 3).

Fig. 2: General Source Multivariate Entropy Triangle from [9].
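For concreteness, here is a base-R sketch of per-label SMET coordinates under one plausible reading of the balance in [9]: for a binary label $Y_i$, the uniform entropy of 1 bit splits as $\Delta H_i + M_i + VI_i$, where $\Delta H_i = 1 - H(Y_i)$ measures non-stochasticity (bottom vertex), $M_i = I(Y_i; Y_{-i})$ is the part predictable from the remaining labels (right vertex), and $VI_i = H(Y_i \mid Y_{-i})$ is the private information (left vertex). The precise construction is developed in [9]; this plug-in estimator is only an illustrative assumption, reusing the matrix Y from the earlier snippets.

```r
# Hedged sketch: empirical per-label entropy-triangle coordinates, in bits.
h <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }  # entropy of a pmf

smet_coords <- function(Y) {
  coords <- t(sapply(seq_len(ncol(Y)), function(i) {
    Hi   <- h(table(Y[, i]) / nrow(Y))                  # H(Y_i)
    rest <- apply(Y[, -i, drop = FALSE], 1, paste, collapse = "")
    VIi  <- h(table(rest, Y[, i]) / nrow(Y)) -
            h(table(rest) / nrow(Y))                    # H(Y_i | Y_-i)
    c(DeltaH = 1 - Hi, M = Hi - VIi, VI = VIi)          # sums to 1 bit
  }))
  rownames(coords) <- colnames(Y)
  coords
}

round(smet_coords(Y), 3)  # one simplex point per label
```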
To assess the effectiveness of stratified sampling schemes for MLC, this paper proposes to use the visualization of the information balance of a formal context of labels $D_L$ using the SMET.

3 Results: Example Data Analysis for an MLC Dataset

SW resources. The following analysis is carried out on the emotions dataset of Example 1 [5], as pre-processed and presented by the mldr R package [6]. The interactive R notebook embodying the analysis is available from the authors upon request.

Basic EDA of the labels. Since we are only considering the set of labels $Y$, we extracted the histogram of the labelsets $\{(\vec{y}_j, n_j)\}_{j \in J}$ from the dataset and, based on it, considered a set of minimal frequencies of occurrence $n_T \in \{0, 1, 4, 9, 16, 25\}$ acting as thresholds. The case $n_T = 0$ actually represents the original dataset, and Fig. 3 shows the information balance of the six labels of emotions as well as the average balance for them all. We see that most labels are rather random, with relaxing-calm completely so. No label is completely specified by the rest of them, nor is any totally independent. This in essence means that the dataset is truly multilabel.

Fig. 3: SMET for the original emotions dataset, i.e. with $n_T = 0$.

Disposing of hapaxes to improve stratified sampling. Previous analyses of the histogram of labelsets made us realize that this dataset is not adequate for resampling, due to hapaxes and, in general, the low counts of many labelsets [7]. This applies to most MLC datasets used at present [3]. To dispose of hapaxes without disposing of samples we must re-assign each to a more frequent labelset. The rationale for this decision is that we consider hapaxes errors in label codification, and assume that the "real" labelset is the closest non-hapax in Hamming distance (recall that the Hamming distance between two sequences of bits of identical length is the number of positions in which they differ). However, this re-assignment changes the histogram of labelsets, which results in an impoverishment of the information balance of the labels and of the dataset in general.

To explore this trade-off, at each threshold $n_T$ a labelset $\vec{y}$ was considered a generalized hapax if $n_{\vec{y}} \le n_T$, so that every surviving labelset has at least $n_T + 1$ instances. For each threshold $n_T$ we calculated the Hamming distance between each generalized hapax and the non-hapaxes, and found the set of those closest to it. Then we re-assigned the generalized hapax to one of them uniformly at random, allowing for repetitions; this procedure is sketched in the code below. (An alternative strategy would have been a scheme considering the original frequencies in the histogram, to simulate a rich-get-richer phenomenon; but such a procedure would decrease the source entropy more than the one we have chosen.)

This reassignment defined a new dataset whose information balance was represented by the SMET, as suggested in Section 2, whence Fig. 4 ensued. What we can see is a general tendency towards an increase of the total correlation as the thresholds increase. But this entails that the individual distinctiveness of each label is diminished. See, for instance, the case of angry-aggressive, which can actually be predicted from the other labels when $n_T = 25$, confirming that too aggressive a threshold will substantively change the relative information content of the labels in the dataset.

Fig. 4: SMET for emotions at several thresholds.
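The reassignment just described can be sketched as follows. The text does not specify whether the random draw is made once per labelset or once per affected sample, so this version draws per sample; it reuses Y from the earlier snippets, and hamming() is a hypothetical helper over the bit-string encoding of labelsets.

```r
# Sketch of generalized-hapax reassignment at threshold n_T.
hamming <- function(a, b) sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])

reassign_hapaxes <- function(Y, n_T) {
  keys   <- apply(Y, 1, paste, collapse = "")
  counts <- table(keys)
  hapax  <- names(counts)[counts <= n_T]   # generalized hapaxes
  keep   <- names(counts)[counts >  n_T]   # surviving labelsets
  for (hx in hapax) {
    d       <- sapply(keep, hamming, b = hx)  # Hamming distances to non-hapaxes
    nearest <- keep[d == min(d)]              # the set of closest ones
    for (r in which(keys == hx)) {            # each affected sample
      tgt <- sample(nearest, 1)               # uniform draw, repetitions allowed
      Y[r, ] <- as.integer(strsplit(tgt, "")[[1]])
    }
  }
  Y
}

Y4 <- reassign_hapaxes(Y, n_T = 4)  # e.g. the threshold for 5-fold validation
```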
Choosing the adequate threshold. Note that a threshold of $n$ is needed to request an $(n+1)$-fold cross-validated estimate of any magnitude of the dataset, since all labelsets will then have at least $n+1$ representatives for the stratified sampling requested by the cross-validation procedure. Is it possible to balance the identical sampling property on train and test, yet avoid too much loss of information content?

Figure 5 depicts a choice of thresholds typically used in validation—1, 4 and 9, corresponding to 2-, 5- and 10-fold validation—for three differently behaving labels—angry-aggressive, quiet-still and relaxing-calm—and the average of the dataset, both for the ensembles of training and testing folds.

– As applied to the estimation of the entropies, the $(n+1)$-fold validation yields the same result in train and test, the sought-for result.
– We can see the general drift towards increased correlation in all labels, but much more in, say, angry-aggressive than in quiet-still.
– For this particular dataset, a threshold of $n_T = 4$ with 5-fold validation seems to be a good compromise between statistical validity and dataset fidelity.

Fig. 5: SMET for emotions with cross-validated entropies (some labels not shown for clarity).

FCA confirmation. To strengthen the validity of the last two conclusions, we calculated the number of concepts of all of the train and test label contexts using the fcaR package [10]. This is possible since such datasets are just binary tables. After creating the contexts, we clarified them and obtained the lists of concepts; then we compared the cardinalities of the training and test concept lattices, both for the unsplit dataset—after reassigning the generalized hapaxes, when needed—and for the $(n+1)$-cross-validated versions. The results are shown in Fig. 6; a sketch of the computation follows below.

Fig. 6: Number of concepts vs. threshold for different n and splits of the dataset.

As expected, for $n_T = 0$ the difference in the number of concepts between the non-sampled and sampled versions of the dataset makes it inadequate for appropriate sampling (it is a fluke of the dataset that both the training and test subcontexts have the same number of concepts, as some of the hapaxes are singletons). The training and test splits had the same number of concepts for every other threshold. For $n_T \in \{1, 4, 16\}$, the number of concepts was constant among folds, but due to the randomness inherent in sampling, for $n_T \in \{9, 25\}$ one of the folds was different (not shown explicitly).
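A sketch of this confirmation step, assuming fcaR's FormalContext$new() / find_concepts() / concepts$size() interface and the matrix Y4 from the reassignment snippet. Clarification is omitted here because the clarified context has an isomorphic concept lattice, so the count is unaffected; the fold assignment is stratified block-by-block on $\ker \gamma$, as argued in § 1.3, and the seed is arbitrary.

```r
# Sketch: compare concept counts of the unsplit context and its 5-fold splits.
library(fcaR)

n_concepts <- function(M) {
  fc <- FormalContext$new(M)
  fc$find_concepts()
  fc$concepts$size()  # cardinality of the concept lattice
}

set.seed(2021)  # arbitrary seed
keys  <- apply(Y4, 1, paste, collapse = "")
folds <- integer(nrow(Y4))
for (ix in split(seq_len(nrow(Y4)), keys))   # stratify within each ker(gamma) block
  folds[ix] <- sample(rep_len(1:5, length(ix)))

n_concepts(Y4)                               # the unsplit dataset
for (k in 1:5)                               # train vs. test lattice sizes, per fold
  print(c(fold  = k,
          train = n_concepts(Y4[folds != k, ]),
          test  = n_concepts(Y4[folds == k, ])))
```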
4 Summary and Discussion

We have tested and strengthened a data hypothesis put forward in previous work [7]. This hypothesis demands that the training and testing splits for solving an MLC task have the same concept lattice associated with their multilabel contexts, and also similar entropies in the induced partition of the sample set.

We first explored the influence of carrying out different degrees of smoothing of the labelset frequencies and found a balance between it and the increase in total correlation of the labels, leading to a loss of their individual information content. We also checked that the recipe for defining an appropriate threshold n to carry out an (n+1)-fold cross-validation actually works, by comparing the informational balances of training and testing splits of the dataset as random estimates of their actual values.

Further work should start applying this rationale to solving the MLC task itself. Preliminary work on using both FCA-based and information-theoretic approaches suggests that present-day solutions are far from satisfactory in this respect. In the future, we plan to use more FCA-induced techniques from [7] to improve on this.

References

[1] Boutell, M.R., Luo, J., Shen, X., Brown, C.M.: Learning multi-label scene classification. Pattern Recognition 37 (2004) 1757–1771
[2] Gibaja, E., Ventura, S.: A Tutorial on Multilabel Learning. ACM Computing Surveys 47 (2015) 1–38
[3] Herrera, F., Charte, F., Rivera, A.J., del Jesus, M.J.: Multilabel Classification. Problem Analysis, Metrics and Techniques. Springer (2016)
[4] Murphy, K.P.: Machine Learning. A Probabilistic Perspective. MIT Press (2012)
[5] Wieczorkowska, A., Synak, P., Raś, Z.W.: Multi-label classification of emotions in music. In Kłopotek, M.A., Wierzchoń, S.T., Trojanowski, K., eds.: Intelligent Information Processing and Web Mining, Springer Berlin Heidelberg (2006) 307–315
[6] Charte, F., Charte, F.D.: Working with multilabel datasets in R: The mldr package. The R Journal 7 (2015) 149–162
[7] Valverde-Albacete, F.J., Peláez-Moreno, C., Cabrera, I.P., Cordero, P., Ojeda-Aciego, M.: Exploratory Data Analysis of Multi-Label Classification Tasks with Formal Context Analysis. In Trnecka, M., Valverde-Albacete, F.J., eds.: Concept Lattices and their Applications (CLA), Tallinn University of Technology (2020) 171–183
[8] Sechidis, K., Tsoumakas, G., Vlahavas, I.P.: On the Stratification of Multi-label Data. In: Lecture Notes in Computer Science, vol. 6913, Springer Berlin Heidelberg (2011) 145–158
[9] Valverde-Albacete, F.J., Peláez-Moreno, C.: The evaluation of data sources using multivariate entropy tools. Expert Systems with Applications 78 (2017) 145–157
[10] López-Rodríguez, D., Mora, A., Domínguez, J., Villalón, A.: fcaR: Formal Concept Analysis. R package version 1.0.7 (2020)